Question about Python code

This is a question about Python.

The code below is from the gensim tutorial.

>>> from gensim import corpora, models, similarities
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>>
>>> # remove words that appear only once
>>> all_tokens = sum(texts, [])
>>> tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokens_once]
>>>          for text in texts]
>>>
>>> print texts
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

This is the sequence of steps I ran in the interpreter, together with the result.
I don't understand the following step:

>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]

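In particular, I can't follow the part that builds the texts list, because it mixes for and if statements.
I would appreciate a detailed explanation.

Also, how would this look if it were written as a .py file instead of being typed into the interpreter?
I'd be grateful if you could show me.

Thank you in advance.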


2 answers

Best answer

Is it easier to understand when written out like this?

Python
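stoplist = ['for', 'a', 'of', 'the', 'and', 'to', 'in']
texts = []
for document in documents:  # documents is the list from the question
    words = []
    # lowercase the document and split it on whitespace
    for word in document.lower().split():
        if word not in stoplist:
            words.append(word)
    texts.append(words)  # one list of words per document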

This lowercases each document, splits it into words, and removes the words that appear in stoplist.
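To make each piece concrete, here is a minimal sketch that applies the same steps to a single document from the question and prints the intermediate values. The names `lowered`, `tokens`, and `kept` are made up for this illustration.

```python
# One document taken from the question's list.
document = "A survey of user opinion of computer system response time"
stoplist = set('for a of the and to in'.split())

lowered = document.lower()   # lowercase the whole string
tokens = lowered.split()     # split on whitespace into a list of words
# keep only the words that are not in stoplist
kept = [word for word in tokens if word not in stoplist]

print(tokens)
print(kept)
```

Here 'a' and both occurrences of 'of' are dropped, and the remaining words keep their original order.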
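A couple of supplementary notes:

The first statement of the original uses set() to remove duplicate words, but a list works essentially the same way here.
The second statement simply expresses the processing above as a list comprehension. It looks a little complicated because it contains a nested loop and a condition, but it becomes easy to follow if you convert the loops above into comprehensions one at a time, starting from the inner loop.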
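The inner-to-outer conversion described above can be sketched in three stages. This is only an illustration using two of the question's documents; the names `texts_loop`, `texts_half`, and `texts_comp` are made up for the comparison.

```python
documents = ["Human machine interface for lab abc computer applications",
             "The EPS user interface management system"]
stoplist = set('for a of the and to in'.split())

# Stage 1: plain nested loops.
texts_loop = []
for document in documents:
    words = []
    for word in document.lower().split():
        if word not in stoplist:
            words.append(word)
    texts_loop.append(words)

# Stage 2: the inner loop and its if become one comprehension.
texts_half = []
for document in documents:
    words = [word for word in document.lower().split() if word not in stoplist]
    texts_half.append(words)

# Stage 3: the outer loop becomes a comprehension wrapped around the inner one.
texts_comp = [[word for word in document.lower().split() if word not in stoplist]
              for document in documents]

print(texts_loop == texts_half == texts_comp)
```

All three stages build the same list of lists, so the final comparison prints True.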

As for turning it into a .py file, I think it works as it is.
For reference, it would look like this:

Python
from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

print(texts)

Posted 2017/06/07 14:11

magichan

Total score 15898

python
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

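In the end, this just builds a list called texts. As for what kind of list it is: for each word in a document, the word is added to that document's inner list if it is not in stoplist, and skipped otherwise. Each inner list then becomes one element of texts.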

It is the same whether you use a .py file or the interpreter. Just copy and paste the code as it is, leaving out the >>> prompts.
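As a sketch of what "the same in a .py file" means in practice: the statements are saved without the `>>>` prompts, and on Python 3 `print` is called as a function. The file name `tokenize_docs.py`, the function name `tokenize`, and the `if __name__ == "__main__":` guard are choices made for this example, not part of the gensim tutorial. The gensim import is left out because this step only uses built-ins.

```python
# tokenize_docs.py (hypothetical file name)

STOPLIST = set('for a of the and to in'.split())


def tokenize(documents, stoplist=STOPLIST):
    """Lowercase each document, split it into words, drop stoplist words."""
    return [[word for word in document.lower().split() if word not in stoplist]
            for document in documents]


if __name__ == "__main__":
    # Two documents from the question, used as sample input.
    documents = ["The intersection graph of paths in trees",
                 "Graph minors A survey"]
    print(tokenize(documents))
```

Saved under that name, it would be run with `python tokenize_docs.py`. Wrapping the comprehension in a function is optional; the top-level statements from the interpreter session work in a file just as well.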

Posted 2017/06/07 12:31

_Victorique__

Total score 1392
