Converting to a bytes object

Question

### Background / what I want to achieve

I want to get rid of the `int found` error.

### Problem / error message

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in
----> 1 vec = dictionary.doc2bow(words)

TypeError: decoding to str: need a bytes-like object, int found
```

### Relevant source code

```python
import MeCab
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.models import HdpModel
from collections import defaultdict

# Create the MeCab tagger
mt = MeCab.Tagger('')

mt.parse('')

# Number of topics
NUM_TOPICS = 3
hdp_num_topics = 10

if __name__ == "__main__":
    # Load the training data
    # train_texts is a two-dimensional list
    # Tokenize each text (restricted to nouns, verbs, adjectives) and store it in train_texts
    train_texts = []
    with open('train.txt', 'r',encoding='utf-8') as f:
        for line in f:
            text = []
            node = mt.parseToNode(line.strip())
            while node:
                fields = node.feature.split(",")
                if fields[0] == '名詞':
                    text.append(node.surface)
                node = node.next
            train_texts.append(text)
    words = Dictionary(train_texts)

print(words)

from gensim import corpora

# words is the word list from earlier
dictionary = corpora.Dictionary(train_texts)
print(dictionary.token2id)

# no_above: ignore tokens that appear in more than this fraction of documents
dictionary.filter_extremes(no_below=20, no_above=0.3)

dictionary.save_as_text('train.txt')
words = bytes('words', 'UTF-8')
dictionary = corpora.Dictionary.load_from_text('train.txt')
type(words)

vec = dictionary.doc2bow(words)
print(vec)
```

### What I tried

The error said a bytes object was needed, so I converted `words` to bytes on the first line, but that did not fix it.

Accepted Answer

I confirmed that the following code works. Please take a look.


```python
import MeCab
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.models import HdpModel
from collections import defaultdict

# Create the MeCab tagger
mt = MeCab.Tagger('')

mt.parse('')

# Number of topics
NUM_TOPICS = 3
hdp_num_topics = 10

if __name__ == "__main__":
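    # Load the training data
    # train_texts is a two-dimensional list (one list of tokens per line)
    # Tokenize each text and store it in train_texts
    # (the original comment mentions nouns, verbs and adjectives; the check below keeps nouns only)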
    train_texts = []
    with open('data.txt', 'r',encoding='utf-8') as f:
        for line in f:
            text = []
            node = mt.parseToNode(line.strip())
            while node:
                fields = node.feature.split(",")
                if fields[0] == '名詞':
                    text.append(node.surface)
                node = node.next
            train_texts.append(text)
    words = Dictionary(train_texts)

print(words)

from gensim import corpora

# Build the dictionary from the tokenized texts collected above
dictionary = corpora.Dictionary(train_texts)
print(dictionary.token2id)

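# no_above: ignore tokens that appear in more than this fraction of documents
# Commented out (with no_below=20, the three-line sample data would leave no tokens)
# dictionary.filter_extremes(no_below=20, no_above=0.3)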

# Save the dictionary as text, then load it back
dictionary.save_as_text('train.txt')
dictionary = corpora.Dictionary.load_from_text('train.txt')

# Pass one tokenized document (a list of str), not a Dictionary or a bytes object
vec = dictionary.doc2bow(train_texts[0])
print(vec)

```

**data.txt**
```plain
吾輩は猫である。
名前はまだない。
にゃーん。

```

**Result**
```
Dictionary(4 unique tokens: ['猫', 'ー', '名前', '吾輩'])
{'猫': 1, 'ー': 3, '名前': 2, '吾輩': 0}
[(0, 1), (1, 1)]
```
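
A note on why the `bytes` conversion in the question probably did not help: `doc2bow` takes one document as a list of string tokens and walks over its items. Iterating over a `bytes` object yields integers, and, as far as I know, iterating over a gensim `Dictionary` yields integer token ids as well, so either way an `int` ends up where a string is expected. The sketch below uses only the standard library. It assumes (judging from the traceback text, not from reading gensim's source) that the error comes from an attempt to decode each item as UTF-8.

```python
# Stand-in for the question's `words = bytes('words', 'UTF-8')`.
words = bytes('words', 'UTF-8')

# Iterating over a bytes object yields integers, not characters.
items = list(words)
print(items)  # [119, 111, 114, 100, 115]

# Decoding one of those integers as text fails with a TypeError
# whose message matches the one in the question's traceback.
message = None
try:
    str(items[0], 'utf-8')
except TypeError as exc:
    message = str(exc)
print(message)

# What a single tokenized document should look like instead:
# a plain list of str, such as one element of train_texts.
tokens = ['吾輩', '猫']
print(all(isinstance(t, str) for t in tokens))  # True
```

So the fix is not to change the type of `words`, but to pass a different object: one tokenized document such as `train_texts[0]`.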

Answer

Given how gensim processes its input, I think `words` needs to be a list.
Also, does `words` contain any NaN values?
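
Both checks can be run before calling `doc2bow`. The following is only a sketch: `validate_tokens` is a hypothetical helper written for this thread, not part of gensim, and it assumes a document should be a list (or tuple) of `str` with no NaN values.

```python
import math


def validate_tokens(doc):
    """Return doc unchanged if it is a list/tuple of str tokens, else raise."""
    # A single str or bytes object is iterable, but it is not a token list.
    if isinstance(doc, (str, bytes)):
        raise TypeError(
            "expected a list of tokens, got a single %s" % type(doc).__name__
        )
    if not isinstance(doc, (list, tuple)):
        raise TypeError("expected a list of tokens, got %s" % type(doc).__name__)
    for i, tok in enumerate(doc):
        # NaN usually shows up as a float, e.g. from an empty pandas cell.
        if isinstance(tok, float) and math.isnan(tok):
            raise ValueError("token %d is NaN" % i)
        if not isinstance(tok, str):
            raise TypeError("token %d is %s, not str" % (i, type(tok).__name__))
    return doc


print(validate_tokens(['吾輩', '猫']))  # ['吾輩', '猫']
```

With this in place, something like `dictionary.doc2bow(validate_tokens(train_texts[0]))` would fail early with a clearer message if the wrong object is passed.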
