About the return value of gensim's TfidfModel

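I applied gensim's TfidfModel to some text and got back a list of (word id, tf-idf value) pairs, but one word id (the id of the word 'the') was missing from it.
I checked the reference as well, but could not work out the cause. Any help would be appreciated.
Reference: https://radimrehurek.com/gensim/models/tfidfmodel.html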

python3
# Import libraries
import string

import gensim
import nltk

nltk.download('punkt')

# Text to analyze
corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]


# Tokenizing function 1: lowercases and returns one flat list for all texts
# (not used below)
def tokenize(texts):
    stem = nltk.stem.SnowballStemmer('english')
    stem_list = []

    for text in texts:
        text = text.lower()

        for token in nltk.word_tokenize(text):
            # Skip punctuation tokens
            if token in string.punctuation:
                continue
            stem_list.append(stem.stem(token))
    return stem_list


# Tokenizing function 2: returns the stemmed tokens of a single text
def tokenize2(text):
    stem = nltk.stem.SnowballStemmer('english')
    stem_list = []

    for token in nltk.word_tokenize(text):
        # Skip punctuation tokens
        if token in string.punctuation:
            continue
        stem_list.append(stem.stem(token))

    return stem_list


# Function that computes tf-idf for the first document
def tf_idf_gensim(corpus):
    # Despite its name, this is the list of tokenized documents
    dictionary = [tokenize2(doc) for doc in corpus]
    print('dictionary : ', dictionary)

    lexicon = gensim.corpora.Dictionary(dictionary)
    id2token = lexicon.token2id  # unused; this actually maps token -> id
    # NOTE: despite its name, token2id maps id -> token
    token2id = {v: k for k, v in lexicon.token2id.items()}
    print(" token2id : ", token2id)

    # Bag-of-words representation: list of (token id, count) per document
    corpus_ = [lexicon.doc2bow(tokenize2(doc)) for doc in corpus]
    print('token : id in corpus_[0]', [(token2id[a], b) for a, b in corpus_[0]])

    tfidf = gensim.models.TfidfModel(corpus=corpus_, dictionary=lexicon, normalize=True)

    return [(token2id[token_id], tfidf_value) for token_id, tfidf_value in tfidf[corpus_[0]]]


print(tf_idf_gensim(corpus))

------ Output ------------------------------
dictionary : [['the', 'eleph', 'sneez', 'at', 'the', 'sight', 'of', 'potato'], ['bat', 'can', 'see', 'via', 'echoloc', 'see', 'the', 'bat', 'sight', 'sneez'], ['wonder', 'she', 'open', 'the', 'door', 'to', 'the', 'studio']]

token2id : {0: 'at', 1: 'eleph', 2: 'of', 3: 'potato', 4: 'sight', 5: 'sneez', 6: 'the', 7: 'bat', 8: 'can', 9: 'echoloc', 10: 'see', 11: 'via', 12: 'door', 13: 'open', 14: 'she', 15: 'studio', 16: 'to', 17: 'wonder'}

# Before applying tfidf
(token : id ) in corpus_[0]: [('at', 1), ('eleph', 1), ('of', 1), ('potato', 1), ('sight', 1), ('sneez', 1), ('the', 2)]

### After applying tfidf -> 'the' has disappeared
[('at', 0.4837965208957426), ('eleph', 0.4837965208957426), ('of', 0.4837965208957426), ('potato', 0.4837965208957426), ('sight', 0.17855490118826325), ('sneez', 0.17855490118826325)]

hayataka2049

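2018/06/20 06:52 (edited)

Please open the edit screen for your question, select the code, press <code>, and change "Enter language here" to "python". Ideally the code should be posted in a form that can be copied, pasted, and run as is.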

YuheiFujioka

2018/06/20 06:57

Thank you for pointing that out. I learned something.


1 answer

Best answer

python
tfidf[corpus_[0]]

This ultimately gets converted to

python
tfidf.__getitem__(corpus_[0])

(that is how Python works: when a value is accessed with [], __getitem__ is called internally). Looking up the documentation for gensim.models.TfidfModel.__getitem__ gives:

__getitem__(bow, eps=1e-12)
Get tf-idf representation of the input vector and/or corpus.

bow : {list of (int, int), iterable of iterable of (int, int)}
Input document or copus in BoW format.
eps : float
Threshold value, will remove all position that have tfidf-value less than eps.
Returns:
vector (list of (int, float)) – TfIdf vector, if bow is document OR
TransformedCorpus – TfIdf corpus, if bow is corpus.

gensim: models.tfidfmodel – TF-IDF model

Given that description, eps looked suspicious, so I tried running with eps=-1 (with this value nothing can ever fall under the threshold; I checked the implementation to confirm that passing a negative value is fine).

python
    return [(token2id[token_id],tfidf_value)for token_id, tfidf_value in tfidf.__getitem__(corpus_[0], eps=-1)]

That produced the following result.

python
[('at', 0.4837965208957426), ('eleph', 0.4837965208957426), ('of', 0.4837965208957426), ('potato', 0.4837965208957426), ('sight', 0.17855490118826325), ('sneez', 0.17855490118826325), ('the', 0.0)]
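The two points above (that `obj[x]` is just `obj.__getitem__(x)`, and that a non-default `eps` can only be passed by calling `__getitem__` explicitly) can be illustrated with a toy class. This is only a sketch: `ToyWeights` is a hypothetical stand-in, not gensim's implementation, and the filter rule `abs(weight) > eps` is an assumption about how the threshold is applied.

```python
class ToyWeights:
    """Hypothetical stand-in for a model whose __getitem__ takes an optional eps."""

    def __init__(self, weights):
        # weights: dict mapping term id -> precomputed weight
        self.weights = weights

    def __getitem__(self, bow, eps=1e-12):
        # Look up each term's weight, then drop entries at or below the
        # threshold (assumed rule: keep only abs(weight) > eps).
        vector = [(term_id, self.weights[term_id]) for term_id, _count in bow]
        return [(term_id, w) for term_id, w in vector if abs(w) > eps]


toy = ToyWeights({0: 0.48, 1: 0.18, 2: 0.0})
bow = [(0, 1), (1, 1), (2, 2)]

# Square brackets call __getitem__ with the default eps,
# so the entry whose weight is 0.0 is dropped.
with_default_eps = toy[bow]

# Square brackets cannot carry keyword arguments,
# so a different eps has to be passed by calling the method directly.
with_negative_eps = toy.__getitem__(bow, eps=-1)

print(with_default_eps)
print(with_negative_eps)
```

With a negative `eps` the comparison `abs(w) > eps` is true for every weight, which is why the zero-weight entry reappears.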

So the weight of 'the' is 0.0. That suggested the argument of the log had unluckily become 1, so I went looking around:

gensim: models.tfidfmodel – TF-IDF model　#gensim.models.tfidfmodel.TfidfModel
(see the formula image and the wglobal option)

gensim: models.tfidfmodel – TF-IDF model　#gensim.models.tfidfmodel.df2idf
(the function used as the default wglobal)
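To see why the weight is exactly zero, the idf can be computed by hand. The sketch below assumes the default global weight is `add + log2(totaldocs / docfreq)` with `add=0.0`, which is how the linked `df2idf` documentation describes it. `idf` is a hypothetical helper written for this illustration, and the tokenized documents are copied from the output printed in the question.

```python
import math

# Tokenized documents, copied from the "dictionary" output in the question.
docs = [
    ['the', 'eleph', 'sneez', 'at', 'the', 'sight', 'of', 'potato'],
    ['bat', 'can', 'see', 'via', 'echoloc', 'see', 'the', 'bat', 'sight', 'sneez'],
    ['wonder', 'she', 'open', 'the', 'door', 'to', 'the', 'studio'],
]


def idf(docfreq, totaldocs, add=0.0):
    # Assumed formula: add + log2(totaldocs / docfreq)
    return add + math.log2(totaldocs / docfreq)


# Document frequency: number of documents containing each term at least once.
docfreq = {}
for doc in docs:
    for term in set(doc):
        docfreq[term] = docfreq.get(term, 0) + 1

# Print term, document frequency, and idf for three representative terms.
for term in ['the', 'sight', 'at']:
    print(term, docfreq[term], idf(docfreq[term], len(docs)))
```

'the' occurs in all three documents, so its idf is log2(3/3) = 0, and any term count multiplied by it is 0 as well.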

In any case, add=0.0 being the default is not good, so
Reference (the very last line):
On idf (inverse document frequency)

I changed the code like this.

python
    corpus_ = [lexicon.doc2bow(tokenize2(doc)) for doc in corpus]
    print('token : id in corpus_[0]',[(token2id[a],b) for a, b in corpus_[0]])

    # Same as the default global weight, but with add=1.0 instead of 0.0
    def wglobal(docfreq, totaldocs):
        return gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, add=1.0)

    tfidf = gensim.models.TfidfModel(corpus= corpus_ ,dictionary=lexicon, normalize=True, wglobal=wglobal)

    return [(token2id[token_id],tfidf_value)for token_id, tfidf_value in tfidf[corpus_[0]]]

That took care of it.

python
[('at', 0.4323167186448522), ('eleph', 0.4323167186448522), ('of', 0.4323167186448522), ('potato', 0.4323167186448522), ('sight', 0.26507378242266566), ('sneez', 0.26507378242266566), ('the', 0.3344858724443731)]
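The numbers before and after this change can be reproduced by hand, which makes the effect of `add` easier to see. This is a sketch under assumptions: the global weight is taken to be `add + log2(totaldocs / docfreq)`, the local weight is the raw term count, and the vector is L2-normalized. `tfidf_by_hand` is a hypothetical helper, and the counts and document frequencies are those of the first document in the question.

```python
import math

# (term, count in document 0, number of documents containing the term)
doc0 = [('at', 1, 1), ('eleph', 1, 1), ('of', 1, 1), ('potato', 1, 1),
        ('sight', 1, 2), ('sneez', 1, 2), ('the', 2, 3)]
TOTAL_DOCS = 3


def tfidf_by_hand(doc, totaldocs, add=0.0):
    # Raw weight = term count * (add + log2(totaldocs / docfreq))
    raw = [(term, tf * (add + math.log2(totaldocs / df))) for term, tf, df in doc]
    # L2-normalize so the resulting vector has unit length.
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(term, w / norm) for term, w in raw]


print(tfidf_by_hand(doc0, TOTAL_DOCS, add=0.0))  # 'the' comes out as 0.0
print(tfidf_by_hand(doc0, TOTAL_DOCS, add=1.0))  # 'the' gets a non-zero weight
```

Note that `add` is added to every term's idf, not only to that of 'the', so the other weights shift too once the vector is normalized.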

That was tiring. Or rather, gensim's default settings are awful... (though perhaps nobody minds if a word that appears in every document disappears...)

Posted 2018/06/20 07:57

Edited 2018/06/20 08:08

hayataka2049

Total score 30939

YuheiFujioka

2018/06/20 08:10

Thank you! So I needed to change add from its default so that wglobal does not diverge. I had looked over the wglobal part too, but did not read it carefully enough.

hayataka2049

2018/06/20 08:17

It is not quite divergence. Nothing goes to inf (it is simply that log(1)=0; 'the' appears in every document, so the argument of the log becomes 3/3...).

YuheiFujioka

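2018/06/20 08:31 (edited)

Thank you. I understand the point you made. I will either change eps in __getitem__(bow, eps=1e-12) or change add for wglobal. Which one is changed alters how the tf-idf value should be interpreted (*), so I will choose between them as appropriate. (*) Change eps when the rarity of a word across documents matters most; change add when the frequency of the word matters most.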

hayataka2049

2018/06/20 08:34

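Changing eps only means the tf-idf value of 'the' is 0 in every document, so I don't think it is needed... If you want to disturb the original idf values as little as possible, another option is to pass a small value to add, for example so that a word appearing in every document gets a weight of 0.1. Of course, the value has to be chosen relative to how large the idf of the other words is, so it is fairly tricky.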

hayataka2049

2018/06/20 08:41

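For example, with 100 documents, a word that appears in 99 of them should have an idf of about 0.014 with add=0, so it is enough for a word that appears in all 100 documents to have an idf smaller than that. Setting it to something like 0.01 should have no significant effect overall and should also avoid the problem seen here.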
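That rule of thumb can be written down directly. A minimal sketch, assuming idf = `add + log2(totaldocs / docfreq)`; `smallest_nonzero_idf` is a hypothetical helper written for this illustration, not part of gensim.

```python
import math


def smallest_nonzero_idf(totaldocs):
    # idf (with add=0) of a term that appears in all documents but one.
    return math.log2(totaldocs / (totaldocs - 1))


total_docs = 100
bound = smallest_nonzero_idf(total_docs)
print(bound)  # roughly 0.0145 for 100 documents

# Pick add below that bound, e.g. 0.01, so a term found in every document
# gets a small positive idf that stays below the add=0 idf of any rarer term.
add = 0.01
assert 0 < add < bound
```

Because `add` is applied to every term equally, the ordering of terms by idf is unchanged; only the all-documents case moves from 0 to a small positive value.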

YuheiFujioka

2018/06/20 08:46

Thank you. I will decide the value of add based on the number of documents.
