TF-IDF processing with gensim

When I try to run TF-IDF processing with gensim, I get an error.

The failing line and the error message are shown below.

```python
gensim_dictionary = corpora.Dictionary(doc_nltk)

------------------------------------------------------------
Traceback (most recent call last):
  File "tfidf_gensim_hyouka.py", line 71, in <module>
    tfidf()
  File "tfidf_gensim_hyouka.py", line 34, in tfidf
    gensim_dictionary = corpora.Dictionary(doc_nltk)
  File "C:\Users\AppData\Local\conda\conda\envs\anaconda\lib\site-packages\gensim\corpora\dictionary.py", line 58, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "C:\Users\AppData\Local\conda\conda\envs\anaconda\lib\site-packages\gensim\corpora\dictionary.py", line 119, in add_documents
    self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
  File "C:\Users\AppData\Local\conda\conda\envs\anaconda\lib\site-packages\gensim\corpora\dictionary.py", line 141, in doc2bow
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
```

doc_nltk is a list like this:
['grab', 'briskly', 'slimmer', 'supervisor', 'crowded', ... (rest omitted) ...]

Does this mean that a one-dimensional list like this cannot be used?

What should I do to process a one-dimensional list?


1 answer

Best answer

As I also wrote here, the API documentation says:

"Update dictionary from a collection of documents. Each document is a list of tokens."

In other words, the data passed to corpora.Dictionary() must be a list of token lists, one token list per document.

In your case you are passing the token list of a single document, so wrap it in a list:

```python
dictionary = corpora.Dictionary([doc_nltk])
```

Posted 2017/11/20 10:23

magichan


kohekoh

2017/11/20 11:03

That worked! But the next line then raised an error:

```python
corpus = [gensim_dictionary.doc2bow(sentence) for sentence in doc_nltk]
```

When I changed it to [doc_nltk] in the same way as before, it ran, but the result was []. What does this mean?

magichan

2017/11/21 02:37 (edited)

Sorry for the late reply. For now, how about checking the value of doc_nltk? When I tested it on my side, the following ran without any problem:

```python
from gensim import corpora

doc_nltk = ['grab', 'briskly', 'slimmer', 'supervisor', 'crowded']
dictionary = corpora.Dictionary([doc_nltk])
corpus = [dictionary.doc2bow(sentence) for sentence in [doc_nltk]]
print(corpus)
```

kohekoh

2017/11/21 09:29

At that point the values do seem to be there... After that, I run the following:

```python
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]
```

print(tfidf_model) then shows TfidfModel(num_docs=1, num_nnz=6853), but print(corpus_tfidf) shows [].

kohekoh

2017/11/21 09:31

Correction: it is not print(corpus_tfidf) but this:

```python
for a in corpus_tfidf:
    print(a)
```

magichan

2017/11/22 00:35

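I think that is simply because the model was built from a corpus containing only one document. According to https://radimrehurek.com/gensim/models/tfidfmodel.html, gensim computes TF-IDF as

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

With D = 1, every token has document_freq_{i} = 1, so the logarithm is log_2(1 / 1) = 0 and I don't think you can get any result other than 0.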
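The arithmetic can be checked with a small pure-Python sketch of the formula quoted above. This is not gensim's implementation: the helper name `tfidf_weights` is hypothetical, and it skips the normalization that gensim applies by default. As far as I know, gensim also drops entries whose weight is effectively zero, which would explain why the single-document output comes back empty rather than as a list of zeros.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {token: weight} dict per document, using
    weight = term_frequency * log2(D / document_frequency).
    Hypothetical helper for illustration; no normalization is applied."""
    num_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            token: tf * math.log2(num_docs / doc_freq[token])
            for token, tf in term_freq.items()
        })
    return weights


# One document: D = 1 and every document frequency is 1, so log2(1) = 0.
single = tfidf_weights([["grab", "briskly", "slimmer"]])
print(single)

# Two documents: tokens unique to one document get a non-zero weight,
# while "grab", which appears in both, still gets 0.
pair = tfidf_weights([["grab", "briskly", "slimmer"], ["grab", "crowded"]])
print(pair)
```

To get useful TF-IDF values, the corpus therefore needs more than one document, and a token must be absent from at least one of them to receive a non-zero weight.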

kohekoh

2017/11/22 01:05

I see... Thank you.
