
Answer edit history

1

Addendum

2018/07/08 11:26

Posted

hayataka2049

Score 30939

answer CHANGED
@@ -5,4 +5,54 @@
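  n = 10 # how many words you want
  feature_words = [feature_names[doc[:n]] for doc in index]
  ```
- If you want to extract the feature words for each document, something like this should work.
+ If you want to extract the feature words for each document, something like this should work.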
+
+ ### Addendum
+ While I'm at it, here is a simple sample.
+
+ ```python
+ import numpy as np
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.datasets import fetch_20newsgroups
+
+ news20 = fetch_20newsgroups()
+ vectorizer = TfidfVectorizer(min_df=0.03)
+
+ tfidf = vectorizer.fit_transform(news20.data[:1000]).toarray()
+ # scikit-learn >= 1.0; older versions use get_feature_names() instead
+ feature_names = np.array(vectorizer.get_feature_names_out())
+ # sort each document's terms by TF-IDF weight, highest first
+ index = tfidf.argsort(axis=1)[:,::-1]
+ feature_words = [feature_names[doc] for doc in index]
+
+ n = 5 # how many top words to take
+ m = 10 # how many articles to print as samples
+ targets = np.array(news20.target_names)[news20.target[:m]]
+
+ for fwords, target in zip(feature_words, targets):
+     print(target)
+     print(fwords[:n])
+
+ """ =>
+ rec.autos
+ ['car' 'was' 'this' 'the' 'where']
+ comp.sys.mac.hardware
+ ['washington' 'add' 'guy' 'speed' 'call']
+ comp.sys.mac.hardware
+ ['the' 'display' 'anybody' 'heard' 'disk']
+ comp.graphics
+ ['division' 'chip' 'systems' 'computer' 'four']
+ sci.space
+ ['error' 'known' 'tom' 'memory' 'the']
+ talk.politics.guns
+ ['of' 'the' 'com' 'to' 'says']
+ sci.med
+ ['thanks' 'couldn' 'instead' 'file' 'everyone']
+ comp.sys.ibm.pc.hardware
+ ['chip' 'is' 'fast' 'ibm' 'bit']
+ comp.os.ms-windows.misc
+ ['win' 'help' 'please' 'appreciated' 'figure']
+ comp.sys.mac.hardware
+ ['the' 'file' 'lost' 've' 'it']
+ """
+ ```
+
+ It seems to be working more or less as intended.
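
For anyone who wants to check what `argsort(axis=1)[:, ::-1]` does without downloading the dataset, here is a minimal sketch on a toy matrix. The vocabulary and the weights are made up for illustration (they are not taken from the 20 Newsgroups run), and only NumPy is assumed.

```python
import numpy as np

# Hypothetical toy data: 2 documents x 4 vocabulary terms.
# These weights are invented for illustration, not real TF-IDF output.
feature_names = np.array(["car", "disk", "engine", "the"])
tfidf = np.array([
    [0.7, 0.0, 0.5, 0.1],
    [0.0, 0.8, 0.2, 0.3],
])

# argsort sorts each row in ascending order of weight;
# reversing the column order turns that into descending order.
index = tfidf.argsort(axis=1)[:, ::-1]

n = 2  # number of top words to keep per document
top_words = [feature_names[doc[:n]].tolist() for doc in index]
print(top_words)  # [['car', 'engine'], ['disk', 'the']]
```

Each row of `index` holds column positions ordered from the heaviest term to the lightest, so indexing `feature_names` with the first `n` entries gives that document's top words.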