Edit history

Answer edit history

3

Revision

2019/07/18 12:47

Posted

Score 30939

answer CHANGED

@@ -59,7 +59,9 @@
 **data.txt**
 ```plain
 吾輩は猫である。
-名前はまだ無い。
+名前はまだない。
+にゃーん。
 ```
 **Result**

2

Revision

2019/07/18 12:47

Posted

Score 30939

answer CHANGED

@@ -34,7 +34,7 @@
             train_texts.append(text)
     words = Dictionary(train_texts)
-print(words[0])
+print(words)
 from gensim import corpora
@@ -60,4 +60,11 @@
 ```plain
 吾輩は猫である。
 名前はまだ無い。
+```
+**Result**
+```
+Dictionary(4 unique tokens: ['猫', 'ー', '名前', '吾輩'])
+{'猫': 1, 'ー': 3, '名前': 2, '吾輩': 0}
+[(0, 1), (1, 1)]
 ```
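
To read the result above: `token2id` maps each token to an integer id, and the last line is the bag-of-words vector for the first line of data.txt, a list of `(token id, count)` pairs. The following is a minimal pure-Python sketch of that mapping, not gensim's implementation. The helper name `simple_doc2bow` is hypothetical, and the token list `['吾輩', '猫']` is assumed to be what the noun filter extracts from the first line.

```python
from collections import Counter


def simple_doc2bow(tokens, token2id):
    """Hypothetical helper: count known tokens by id, sorted by id."""
    # Tokens missing from the mapping are skipped rather than added.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Mapping copied from the result above.
token2id = {'猫': 1, 'ー': 3, '名前': 2, '吾輩': 0}

# Assumed noun tokens for the first line of data.txt.
bow = simple_doc2bow(['吾輩', '猫'], token2id)
print(bow)  # [(0, 1), (1, 1)]
```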

1

Addendum

2019/07/18 12:44

Posted

Score 30939

answer CHANGED

@@ -1,10 +1,63 @@
-> Parameters:
+I confirmed that the following code works. Please check it.
-> document (list of str) – Input document.
->
-> [gensim: corpora.dictionary – Construct word<->id mappings](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)
-The documentation says to pass a list of strings, so you need to follow that.
+```python
+import MeCab
-So passing `["words"]`, for example, should work. In that case the input is simply a document consisting of the single token `"words"`.
+from gensim.corpora.dictionary import Dictionary
+from gensim.models import LdaModel
+from gensim.models import HdpModel
+from collections import defaultdict
+# Create the MeCab object
+mt = MeCab.Tagger('')
+mt.parse('')
+# Set the number of topics
+NUM_TOPICS = 3
+hdp_num_topics = 10
+if __name__ == "__main__":
+    # Load the training data
+    # train_texts is a two-dimensional list
+    # Just tokenize each text entry (restricted to nouns, verbs, and adjectives) and store it in train_texts
+    train_texts = []
+    with open('data.txt', 'r',encoding='utf-8') as f:
+        for line in f:
+            text = []
+            node = mt.parseToNode(line.strip())
+            while node:
+                fields = node.feature.split(",")
+                if fields[0] == '名詞':
+                    text.append(node.surface)
+                node = node.next
+            train_texts.append(text)
+    words = Dictionary(train_texts)
+print(words[0])
+from gensim import corpora
+# words is the word list from earlier
+dictionary = corpora.Dictionary(train_texts)
+print(dictionary.token2id)
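-Taking elements one at a time from a bytes object returns values of type int, so the error is probably related to that.
+# no_above: ignore tokens that appear in at least the no_above fraction of documents
+# Commented out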
+# dictionary.filter_extremes(no_below=20, no_above=0.3)
+dictionary.save_as_text('train.txt')
+words = bytes('words', 'UTF-8')
+dictionary = corpora.Dictionary.load_from_text('train.txt')
+type(words)
+vec = dictionary.doc2bow(train_texts[0])
+print(vec)
+```
+**data.txt**
+```plain
+吾輩は猫である。
+名前はまだ無い。
+```
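
The explanation removed in this revision said that taking elements one at a time from a bytes object yields `int` values, while `doc2bow` expects a list of strings. A minimal sketch of that difference, using only the standard library (the variable names `elements` and `tokens` are illustrative assumptions):

```python
# The bytes object from the original answer.
words = bytes('words', 'UTF-8')

# Iterating over bytes yields one int per byte, not str tokens.
elements = [b for b in words]
print(elements)                   # [119, 111, 114, 100, 115]
print(type(elements[0]).__name__)  # int

# A document in the expected shape is a list of str tokens,
# for example the decoded text wrapped in a list.
tokens = [words.decode('UTF-8')]
print(tokens)                     # ['words']
```

A list such as `tokens` matches the "list of str" shape quoted from the gensim documentation, whereas `words` itself does not.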