Answer edit history

Addendum

2018/07/08 11:26

Posted

hayataka2049

Score 30939

test CHANGED Viewed

@@ -13,3 +13,103 @@
 ```
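 If you want to extract the characteristic words for each document, I think something like this should work.
+### Addendum
+While I'm at it, here is a simple sample.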
+```python
+import numpy as np
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.datasets import fetch_20newsgroups
+news20 = fetch_20newsgroups()
+vectorizer = TfidfVectorizer(min_df=0.03)
+tfidf = vectorizer.fit_transform(news20.data[:1000]).toarray()
+feature_names = np.array(vectorizer.get_feature_names())  # removed in scikit-learn 1.2; use get_feature_names_out() there
+index = tfidf.argsort(axis=1)[:,::-1]
+feature_words = [feature_names[doc] for doc in index]
+n = 5  # number of top words to take per document
+m = 10  # number of articles to show as samples
+targets = np.array(news20.target_names)[news20.target[:m]]
+for fwords, target in zip(feature_words, targets):
+    print(target)
+    print(fwords[:n])
+""" =>
+rec.autos
+['car' 'was' 'this' 'the' 'where']
+comp.sys.mac.hardware
+['washington' 'add' 'guy' 'speed' 'call']
+comp.sys.mac.hardware
+['the' 'display' 'anybody' 'heard' 'disk']
+comp.graphics
+['division' 'chip' 'systems' 'computer' 'four']
+sci.space
+['error' 'known' 'tom' 'memory' 'the']
+talk.politics.guns
+['of' 'the' 'com' 'to' 'says']
+sci.med
+['thanks' 'couldn' 'instead' 'file' 'everyone']
+comp.sys.ibm.pc.hardware
+['chip' 'is' 'fast' 'ibm' 'bit']
+comp.os.ms-windows.misc
+['win' 'help' 'please' 'appreciated' 'figure']
+comp.sys.mac.hardware
+['the' 'file' 'lost' 've' 'it']
+"""
+```
+It seems to be working more or less as expected.
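
The core of the sample above is the row-wise `argsort` followed by a reversal and a slice. As a minimal sketch of just that step, the snippet below runs it on a tiny hand-made weight matrix, so it can be checked without downloading the 20 Newsgroups data. The vocabulary, the weights, and the helper name `top_terms` are all made up for illustration and are not part of the original answer.

```python
import numpy as np

# Hypothetical toy data: 2 documents x 4 vocabulary terms (weights are made up,
# standing in for the dense TF-IDF matrix from the sample above).
vocab = np.array(["car", "disk", "engine", "the"])
weights = np.array([
    [0.7, 0.0, 0.5, 0.1],
    [0.0, 0.8, 0.0, 0.2],
])


def top_terms(weights, vocab, n):
    """Return the n highest-weighted terms for each row, best first."""
    # argsort sorts ascending, so reverse each row to get descending weight.
    order = np.argsort(weights, axis=1)[:, ::-1]
    # Index the vocabulary with the first n column indices of every row.
    return vocab[order[:, :n]]


top = top_terms(weights, vocab, 2)
print(top)
```

Note that when `n` exceeds the number of nonzero weights in a row, terms with weight 0 are returned as well, so for short documents it may be worth filtering those out.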