Question edit history

Revision

2017/07/24 13:28

Posted by

kohekoh

Score 140

test CHANGED Viewed

@@ -141,3 +141,9 @@
     tfidf()
 ```
+For reference, this text file is about 300,000 lines long.
+CPU usage is around 40% while the script runs.

Revision

2017/07/24 13:28

Posted by

kohekoh

Score 140

test CHANGED Viewed

@@ -25,3 +25,119 @@
 Is there a problem with the code,
 or is there something else that could be causing this?
+```python
+import nltk
+import numpy as np
+import json
+import nltk_exa as nl
+import time
+from multiprocessing import Pool
+#import multiprocessing
+import sys
+def subcalc(word):
+    # word is the token list of one document (one element of the list passed to p.map)
+    subdoc = []
+    collection = nltk.TextCollection(word)  # taken from a website
+    #uniqterms = list(set(collection))  # also from the same website
+    wo = []
+    for term in set(word):
+        score = collection.tf_idf(term, word)  # also from that website; computed once per term
+        if score > 0:
+            wo.append([term, score])
+            #print(wo)
+    wo.sort(key=lambda x: x[1], reverse=True)  # sort by element 1 of each pair (the score), highest first
+    try:
+        slice1 = np.array(wo[:20])  # keep the first 20 [term, score] pairs
+        lists = slice1[:, 0]  # column 0 of every row, i.e. the terms
+        subdoc.append(list(lists))  # convert the NumPy array of terms to a list
+    except IndexError:  # wo was empty, so the array is not 2-D
+        print(wo)
+    return subdoc
+def tfidf():
+    t1 = time.time()
+    word0 = []
+    word = []
+    with open("/Users//Dropbox/prg/dataset/word0_notuseful012.txt") as f:
+        for row in f:
+            word0.append(row.split("]["))
+    # only the first line (word0[0]) is used below; each "][" chunk becomes one document
+    for i in word0[0]:
+        word.append(str(i).replace("[","").replace("]","").replace(",","").replace("'","").replace("\"","").split())
+    #word.pop()
+    ttt = time.time()
+    #a = subcalc(word)  # run on a single core
+    #print(a)
+    with Pool(4) as p:
+        doc = p.map(subcalc, word)  # run on multiple cores
+    t3 = time.time()
+    #print(doc[0][0])
+    #print('processing time (time spent in nltk) (end): ' + str(ttt - tt) + '(sec)')
+    print('processing time2 (end): ' + str(t3 - ttt) + '(sec)')
+if __name__ == "__main__":
+    tfidf()
+```
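
To check whether the pool helps at all, it may be worth timing the single-process and multi-process paths on the same input. The following is only a minimal sketch under stated assumptions: `top_terms` is a hypothetical stand-in for `subcalc` that ranks terms by raw count instead of calling NLTK, the input is synthetic, and `workers=4` and `chunksize=1000` are arbitrary example values. It passes an explicit `chunksize` to `Pool.map` so that many small documents are sent to the workers in batches.

```python
import time
from collections import Counter
from multiprocessing import Pool


def top_terms(tokens, k=20):
    """Hypothetical stand-in for subcalc: the k most frequent terms of one document."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def run_serial(docs):
    """Process every document in the current process."""
    return [top_terms(d) for d in docs]


def run_pool(docs, workers=4, chunksize=1000):
    """Process documents in worker processes, sending them in batches."""
    with Pool(workers) as pool:
        return pool.map(top_terms, docs, chunksize=chunksize)


if __name__ == "__main__":
    # Synthetic input: many short documents (not the data from the question).
    docs = [["w%d" % ((i + j) % 7) for j in range(10)] for i in range(20000)]

    start = time.perf_counter()
    serial = run_serial(docs)
    serial_time = time.perf_counter() - start

    start = time.perf_counter()
    pooled = run_pool(docs)
    pool_time = time.perf_counter() - start

    print("same result:", serial == pooled)
    print("serial: %.3f s, pool: %.3f s" % (serial_time, pool_time))
```

If the pooled run is not faster on a comparison like this, the work per document may be too small relative to the cost of sending data between processes. That is a possibility to measure, not a conclusion about the code in the question.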