Using tf-idf to remove low-importance words from my data

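I want to use tf-idf to remove low-importance words* from a list, but apparently TfidfVectorizer cannot take my list as it is, and I get an error.
The code in question is below.

* The list contains multiple documents. The aim is to remove words that can be judged unimportant because they occur frequently across the whole collection.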

python3
import janome
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from collections import OrderedDict
from janome.tokenizer import Tokenizer
import re
import time
from tqdm import tqdm
import glob
import collections
import pandas as pd
import MeCab
import math
import pandas as pd
import codecs

with codecs.open("タイトルと冒頭一覧.csv", "r", "Shift-JIS", "ignore") as file:
    df = pd.read_table(file, delimiter=",")

# Function that extracts words from a text
def words(text):
    """
        Extract words from a text
    """
    out_words = []

    tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/')
    tagger.parse('')
    node = tagger.parseToNode(text)

    while node:
        word_type = node.feature.split(",")[0]
        # Keep only verbs, adjectives, and nouns
        if word_type in ["動詞","形容詞","名詞"]:
            out_words.append(node.surface)
        node = node.next
    return out_words

# Titles used for training
training_docs = []
for i,dd in tqdm(df.iterrows()):
    #print(dd["ID"],dd["title"])
    training_docs.append(TaggedDocument(words=words(dd["titlebeginning"]), tags=[dd["ID"]]))


# The problem spot that raises the error
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

for i in range(0,len(df)):
    vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\b\w+\b')
    tf_idfs = vectorizer.fit_transform(training_docs)
    print(tf_idfs)

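After the processing above, I ultimately want to pass training_docs, with tf-idf applied and the low-importance words removed, into the code below. How should I go about this?
Thank you very much in advance.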

python3
# min_count=1: use words that appear at least once for training
# dm=0: training model = DBOW
model = Doc2Vec(documents=training_docs, min_count=1, dm=0)
model.docvecs.similarity(0,1551)

Error output

python
AttributeError                            Traceback (most recent call last)
<ipython-input-32-a49b5702c1c0> in <module>()
      4 for i in range(0,len(df)):
      5     vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\b\w+\b')
----> 6     tf_idfs = vectorizer.fit_transform(training_docs)
      7     print(tf_idfs)

~/anaconda3/envs/kenkyuu/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1379             Tf-idf-weighted document-term matrix.
   1380         """
-> 1381         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1382         self._tfidf.fit(X)
   1383         # X is already a transformed view of raw_documents so

~/anaconda3/envs/kenkyuu/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/anaconda3/envs/kenkyuu/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    790         for doc in raw_documents:
    791             feature_counter = {}
--> 792             for feature in analyze(doc):
    793                 try:
    794                     feature_idx = vocabulary[feature]

~/anaconda3/envs/kenkyuu/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    264 
    265             return lambda doc: self._word_ngrams(
--> 266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:

~/anaconda3/envs/kenkyuu/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    230 
    231         if self.lowercase:
--> 232             return lambda x: strip_accents(x.lower())
    233         else:
    234             return strip_accents

AttributeError: 'TaggedDocument' object has no attribute 'lower'

This is the file in question.

hayataka2049

2018/06/09 09:13

Please add the error output to the question.

James1201

2018/06/09 09:17 (edited)

Sorry. The last line was written incorrectly. I will fix it and add the error output at the same time...!


1 answer

Best answer

The argument to TfidfVectorizer.fit_transform should be a list of strings.

sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn 0.19.1 documentation

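However, the default analyzer is designed for English, so you need to supply a Japanese one through the analyzer argument. It has to be a function that takes a string and returns a list of words. Fortunately, your words function does exactly that, so I think you can use it as is.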
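
As a minimal sketch of passing a callable as the analyzer: the `simple_words` function and the sample texts below are hypothetical stand-ins, used so the example runs without MeCab or its dictionary. In the real code, the question's `words` function would take the place of `simple_words`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def simple_words(text):
    # Hypothetical stand-in for the MeCab-based words() in the question:
    # it takes one string and returns a list of tokens.
    return text.split()


# Hypothetical sample texts, pre-separated by spaces for illustration only.
texts = [
    "猫 は 魚 を 食べる",
    "犬 は 肉 を 食べる",
    "猫 は 犬 を 見る",
]

# When analyzer is a callable, it replaces the built-in preprocessing and
# tokenization, so token_pattern and lowercasing are not applied.
vectorizer = TfidfVectorizer(analyzer=simple_words)
tf_idfs = vectorizer.fit_transform(texts)
vocab = vectorizer.vocabulary_  # token -> column index
```

Because the callable bypasses the default token pattern, single-character tokens such as は are kept as well.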

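If you feel the computation is wasteful because the same processing already runs when you build the TaggedDocument objects for doc2vec, you can pass something like lambda x:x as the analyzer to effectively disable it, and then pass

python
[doc.words for doc in training_docs]

to fit_transform. That should work.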
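
The identity-analyzer idea can be sketched as follows. The `tokenized_docs` list is a hypothetical stand-in for `[doc.words for doc in training_docs]`, so the example does not need gensim or MeCab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents, standing in for
# [doc.words for doc in training_docs].
tokenized_docs = [
    ["猫", "魚", "食べる"],
    ["犬", "肉", "食べる"],
    ["猫", "犬", "見る"],
]

# The identity function hands each token list through unchanged,
# so no tokenization is repeated inside the vectorizer.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
tf_idfs = vectorizer.fit_transform(tokenized_docs)

vocab = vectorizer.vocabulary_
first_row = tf_idfs[0].toarray().ravel()
```

In the first document, 魚 occurs in only one document while 食べる occurs in two, so 魚 receives the higher weight.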

Posted 2018/06/09 09:24

Edited 2018/06/09 09:25

hayataka2049

Total score 30939

James1201

2018/06/09 10:25

Thank you for the answer. I was able to pass the data to fit_transform and get tf-idf values. How should I use them to filter training_docs in the way I described in the question...?

hayataka2049

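2018/06/09 10:43 (edited)

From each document's tf-idf vector, you could discard the bottom n entries, or discard everything at or below a threshold th. With that approach, you can use something like np.argsort and cross-reference the result with the vocabulary_ attribute of TfidfVectorizer to build the list of words to keep, then filter with that list. Once you do this, there is no point at all in running the tf-idf computation inside a loop, and the TaggedDocument objects can be built after filtering, so that step moves further down. The program ends up being rewritten from top to bottom...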
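
A rough sketch of the threshold variant, under some assumptions: `filter_low_tfidf`, the sample documents, and the threshold value 0.45 are all hypothetical, and the token lists stand in for the output of `words`. The bottom-n variant with np.argsort would follow the same structure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def filter_low_tfidf(tokenized_docs, th):
    """Drop tokens whose tf-idf weight in their own document is <= th."""
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    # Fit once on the whole collection; no per-document loop is needed.
    tf_idfs = vectorizer.fit_transform(tokenized_docs)
    vocab = vectorizer.vocabulary_  # token -> column index

    filtered = []
    for row_idx, tokens in enumerate(tokenized_docs):
        row = tf_idfs[row_idx].toarray().ravel()
        filtered.append([t for t in tokens if row[vocab[t]] > th])
    return filtered


# Hypothetical sample data: 食べる appears in every document.
docs = [
    ["猫", "猫", "魚", "食べる"],
    ["犬", "肉", "食べる"],
    ["猫", "犬", "見る", "食べる"],
]
filtered_docs = filter_low_tfidf(docs, th=0.45)
```

With this sample data, 食べる falls below the threshold in every document and is removed. The TaggedDocument objects would then be built from the filtered lists, for example `TaggedDocument(words=tokens, tags=[doc_id])`, instead of before the tf-idf step.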

hayataka2049

2018/06/09 10:45

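Also, unless you set min_df on TfidfVectorizer to a suitable value, you will mostly end up picking words that "appear a lot in just that one document". There are various things like this to think through.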
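
To illustrate the effect of min_df, here is a small sketch with hypothetical sample documents. The value 2 is chosen only for the example and is not a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents.
tokenized_docs = [
    ["猫", "魚", "食べる"],
    ["犬", "肉", "食べる"],
    ["猫", "犬", "見る"],
]

# min_df=2 keeps only tokens that occur in at least two documents,
# so tokens unique to a single document never enter the vocabulary.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens, min_df=2)
vectorizer.fit(tokenized_docs)
kept_tokens = set(vectorizer.vocabulary_)
```

Here 魚, 肉, and 見る each occur in a single document, so they are excluded before any tf-idf weights are computed.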

James1201

2018/06/10 10:33

Thank you. Sorry, but please let me collect a few more answers...!
