I don't understand this error

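I want to obtain distributed representations (word vectors) with word2vec from text that has been reduced to nouns only and turned into a list.
I'm using gensim's word2vec, and I want to save the vectors to a file on my PC.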

python3
from pymongo import MongoClient
from bs4 import BeautifulSoup
import MeCab
from gensim.models import word2vec

mecab = MeCab.Tagger ('/usr/local/lib/mecab/dic/mecab-ipadic-neologd')
def main():
    recipes = []
    client = MongoClient('localhost', 27017)
    db = client.html.cookpad_html
    collection = db.test_collection
    htmls = list(db.find().limit(5))
    recipes = []
    for num, html in enumerate(htmls):
        soup = BeautifulSoup(html["html"], 'lxml')
        for steps in soup.find_all(attrs={"class": "step_text"}):
            node = mecab.parseToNode(steps.get_text())

            while node:
                feature = node.feature.split(",")
                if feature[0] == "名詞" and feature[1] == "一般":
                    recipes.append(node.feature.split(",")[6])
                node = node.next
                recipes = list(set(recipes))

    print(recipes)

    model = word2vec.Word2Vec(recipes, size=200,min_count=1)

    out = model.wv.most_similar(positive=[u'レモン'])
    for x in out:
        print (x[0],x[1])



if __name__ == '__main__':
    main()

Running this prints

['片栗粉', '砂糖', 'オリーブ油', 'パウダー', 'チョコ', 'レモン', '卵', '感じ', '餅', '分量', 'カテゴリ', 'シリコン', '半', 'カラメル', '他', 'おから', 'ボウル', 'ボール', 'ハチミツ', '人気', 'ラップ', '饅頭', 'レンジ', '餡', 'バニラ', 'カップ', 'お菓子', '絶品', 'レン', '*', 'トレー', 'バット', '水', 'べら', '大福', '様子', 'トップ', '茶こし', '卵白', '真ん中', '生地', '牛乳', '玉子', '卵黄', '縦', 'エッセンス', 'レシピ', '横', 'ポイント', '見た目', 'だま', 'たま', '容器', 'イン', '木', '材料', '姉妹', '電子', 'ツノ', 'もち', 'ヨーグルト']

so the nouns are extracted, but then

python3
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-32946ebb0ac4> in <module>
     35 
     36 if __name__ == '__main__':
---> 37     main()

<ipython-input-31-32946ebb0ac4> in main()
     28     model = word2vec.Word2Vec(recipes, size=200,min_count=1)
     29 
---> 30     out = model.wv.most_similar(positive=[u'レモン'])
     31     for x in out:
     32         print (x[0],x[1])

~/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    551                 mean.append(weight * word)
    552             else:
--> 553                 mean.append(weight * self.word_vec(word, use_norm=True))
    554                 if word in self.vocab:
    555                     all_words.add(self.vocab[word].index)

~/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    466             return result
    467         else:
--> 468             raise KeyError("word '%s' not in vocabulary" % word)
    469 
    470     def get_vector(self, word):
KeyError: "word 'レモン' not in vocabulary"

I get the error above.

What should I do?

hayataka2049

2019/10/08 06:38 (edited)

Please post the error in full, without omitting anything. Please revise the question.

hayataka2049

2019/10/08 06:39

That includes the traceback, from the very beginning.


2 answers

Best answer

out = model.wv.most_similar()(positive=[u'レモン'])

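This part looks wrong.

First, the model.wv.most_similar() part gets called with no arguments, which means default values such as None are passed in, and I think that is what produces the error in the question.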

python
out = model.wv.most_similar(positive=[u'レモン'])

is probably what it should be.

Posted 2019/10/08 06:43

Edited 2019/10/08 06:50

hayataka2049

Total score 30935

kawauso.love

2019/10/08 06:49

Originally I had out = model.most_similar(positive=[u'レモン']), but that gave me an error saying the call to `most_similar` is deprecated (the method will be removed in 4.0.0; use self.wv.most_similar() instead), so I changed it to out = model.wv.most_similar()(positive=[u'レモン']).

kawauso.love

2019/10/08 06:55

Thank you, that fixed it! But now I get the error KeyError: "word 'レモン' not in vocabulary". What should I do?

hayataka2049

2019/10/08 06:56 (edited)

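"Use self.wv.most_similar() instead" does not mean you should write out = model.wv.most_similar()(positive=[u'レモン']). By convention, function and method names are often written with trailing parentheses. When you actually call one in code, you do not stack a second pair of parentheses on top.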
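To make the parentheses point concrete, here is a minimal sketch in plain Python. The `most_similar` function below is a hypothetical stand-in written for this example, not gensim's implementation. Writing `f()(x)` first calls `f` with no arguments and only then tries to call whatever it returned.

```python
def most_similar(positive=None):
    """Hypothetical stand-in for a method that needs at least one query word."""
    if not positive:
        raise ValueError("no query words were given")
    return [(word, 1.0) for word in positive]


# Correct: one pair of parentheses, with the arguments inside it.
result = most_similar(positive=["レモン"])

# Wrong: the empty () runs the function first, with positive=None,
# so it fails before the second pair of parentheses is ever reached.
try:
    most_similar()(positive=["レモン"])
    double_call_error = None
except ValueError as exc:
    double_call_error = str(exc)

print(result)
print(double_call_error)
```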

hayataka2049

2019/10/08 07:01

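> KeyError: "word 'レモン' not in vocabulary" First, a Word2Vec model has to be initialized with a list of lists of tokens, so the usage in the question is wrong. Second, similarity cannot be computed for a word that does not appear in the training data. On top of that, it is hard to train anything realistic on data of that size (my guess is that you need at least about 1 MB before the results start to look reasonable).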
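To illustrate the first point, here is a rough sketch of why a flat list of words goes wrong. `scan_vocab` is a hypothetical, simplified stand-in for a vocabulary pass (it is not gensim's code): it treats each item of the input as one sentence and each item of a sentence as one token. A Python string iterates character by character, so a flat list of strings ends up as a vocabulary of single characters.

```python
from collections import Counter


def scan_vocab(sentences):
    """Count tokens the way a 'list of lists of tokens' API reads its input:
    outer items are sentences, inner items are tokens."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


flat = ["レモン", "砂糖", "卵"]                # the shape the question passes in
nested = [["レモン", "砂糖"], ["卵", "砂糖"]]   # one token list per sentence

flat_vocab = scan_vocab(flat)
nested_vocab = scan_vocab(nested)

print("レモン" in flat_vocab)    # False: the tokens are 'レ', 'モ', 'ン', ...
print("レ" in flat_vocab)        # True
print("レモン" in nested_vocab)  # True
```

Under that reading, 'レモン' never becomes a vocabulary entry when the input is a flat list, which would be consistent with the KeyError in the question.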

kawauso.love

2019/10/08 07:18

Thank you for the detailed explanation. Sorry, I'm a beginner and this is a bit hard for me to follow. First, can I use self.wv.most_similar() by following https://qiita.com/hideki/items/56dc5c4492b351c1925f ? Also, I have plenty of data, so would it work if I changed line 12, htmls = list(db.find().limit(5)), to something like htmls = list(db.find().limit(100000))?

hayataka2049

2019/10/08 07:24

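To both of your "would it work?" questions, my answer is no, because the problem described in the first sentence after the quote in my 07:01 comment is still unresolved. I think the official reference is the best thing to follow. For example, the parameter documentation says: sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, https://radimrehurek.com/gensim/models/word2vec.html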
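As a rough sketch of what that parameter description asks for, the extraction loop could collect one token list per recipe step instead of one flat, de-duplicated list. The helper names `nouns_per_step` and `similar_or_empty` and the stand-in tokenizer are made up for illustration. In the real script the (word, part-of-speech) pairs would come from the MeCab node loop. The `word in model.wv` check assumes the keyed vectors support membership tests, which I believe gensim's do.

```python
def nouns_per_step(step_texts, tokenize):
    """Return one list of nouns per step (a list of lists of tokens).

    `tokenize` is any callable that yields (word, part_of_speech) pairs.
    """
    sentences = []
    for text in step_texts:
        nouns = [word for word, pos in tokenize(text) if pos == "名詞"]
        if nouns:                      # skip steps without any noun
            sentences.append(nouns)    # keep duplicates and word order
    return sentences


def similar_or_empty(model, word):
    """Query most_similar only if the word was seen during training."""
    if word not in model.wv:
        return []
    return model.wv.most_similar(positive=[word])


def toy_tokenize(text):
    """Stand-in for MeCab: split on whitespace and tag every token as a noun."""
    return [(token, "名詞") for token in text.split()]


sentences = nouns_per_step(["レモン 砂糖 卵", "砂糖 牛乳"], toy_tokenize)
print(sentences)  # [['レモン', '砂糖', '卵'], ['砂糖', '牛乳']]

# With gensim installed, `sentences` has the shape Word2Vec expects, e.g.
#   model = word2vec.Word2Vec(sentences, size=200, min_count=1)
#   print(similar_or_empty(model, "レモン"))
```

Keeping duplicates and word order probably matters here: running `list(set(...))` over the tokens throws away the co-occurrence information that word2vec learns from.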

kawauso.love

2019/10/10 05:03

Understood! Thank you. I'll give it a try!
