Converting to a bytes object

Edit history

Question edit history

2

Code improvement

2019/07/18 11:25

Posted

Score 13

title CHANGED Viewed

File without changes

body CHANGED Viewed

@@ -15,9 +15,53 @@
 ### Relevant source code
 ```python
+import MeCab
+from gensim.corpora.dictionary import Dictionary
+from gensim.models import LdaModel
+from gensim.models import HdpModel
+from collections import defaultdict
+# Create the MeCab object
+mt = MeCab.Tagger('')
+mt.parse('')
+# Set the number of topics
+NUM_TOPICS = 3
+hdp_num_topics = 10
+if __name__ == "__main__":
+    # Load the training data
+    # train_texts is a two-dimensional list
+    # Just tokenize each text record (restricted to nouns, verbs, and adjectives) and store it in train_texts
+    train_texts = []
+    with open('train.txt', 'r',encoding='utf-8') as f:
+        for line in f:
+            text = []
+            node = mt.parseToNode(line.strip())
+            while node:
+                fields = node.feature.split(",")
+                if fields[0] == '名詞':
+                    text.append(node.surface)
+                node = node.next
+            train_texts.append(text)
+    words = Dictionary(train_texts)
+print(words)
+from gensim import corpora
+# words is the word list from earlier
+dictionary = corpora.Dictionary(train_texts)
+print(dictionary.token2id)
+# no_above: ignore tokens that appear in more than a no_above fraction of documents
+dictionary.filter_extremes(no_below=20, no_above=0.3)
+dictionary.save_as_text('train.txt')
 words = bytes('words', 'UTF-8')
-dictionary = corpora.Dictionary.load_from_text('livedoordic.txt')
+dictionary = corpora.Dictionary.load_from_text('train.txt')
 type(words)
 vec = dictionary.doc2bow(words)
 print(vec)
 ```
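
The `TypeError` reported in this question is consistent with `doc2bow` being handed a `bytes` object instead of a list of `str` tokens: iterating over `bytes` yields integers, and decoding an integer as UTF-8 fails. The sketch below reproduces that behaviour with plain Python only. `simple_doc2bow` and the `token2id` mapping are hypothetical stand-ins written for illustration; they are not gensim's implementation.

```python
from collections import Counter


def simple_doc2bow(tokens, token2id):
    """Hypothetical stand-in for a bag-of-words conversion.

    Counts each str token that is present in token2id and returns
    sorted (token_id, count) pairs. Unknown tokens are skipped.
    """
    counts = Counter(tokens)
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


# bytes('words', 'UTF-8') encodes the literal string 'words';
# iterating over the result yields integers, not words.
as_bytes = bytes('words', 'UTF-8')
byte_values = list(as_bytes)
print(byte_values)

# Decoding one of those integers raises a TypeError like the one in the question.
error_message = None
try:
    str(as_bytes[0], 'utf-8')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# A document should be a list of str tokens (assumed example data).
token2id = {'cat': 0, 'dog': 1}
document = ['cat', 'dog', 'cat', 'bird']
bow = simple_doc2bow(document, token2id)
print(bow)
```

Under this reading, the document passed to `dictionary.doc2bow` would be one of the token lists built during tokenization (for example an element of `train_texts`), not the result of `bytes(...)`.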

1

Typo

2019/07/18 11:25

Posted

Score 13

title CHANGED Viewed

File without changes

body CHANGED Viewed

@@ -8,15 +8,6 @@
 TypeError                                 Traceback (most recent call last)
 <ipython-input-55-b3b1a98eecac> in <module>
 ----> 1 vec = dictionary.doc2bow(words)
-      2 print(vec)
-~\Anaconda3\lib\site-packages\gensim\corpora\dictionary.py in doc2bow(self, document, allow_update, return_missing)
-    251         counter = defaultdict(int)
-    252         for w in document:
---> 253             counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
-    254
-    255         token2id = self.token2id
 TypeError: decoding to str: need a bytes-like object, int found
 ```