Question about removing stopwords with NLTK

s = [["I", "am", "a", "potato"],["He", "is", "a", "tomato"]]

Suppose I have the nested list above, and I want to remove the stopwords from it.

Python
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = stopwords.words('english') #About 150 stopwords


s = [["I", "am", "a", "potato"],["He", "is", "a", "tomato"]]
[w for w in s if w not in stop_words]

Running the code above outputs:

[['I', 'am', 'a', 'potato'], ['He', 'is', 'a', 'tomato']]

The stopwords have not been removed.

I could not find many sources on this and searching did not help, so I would appreciate it if someone could explain what is going wrong.

1 answer

Best answer

You can simply process one sentence at a time, as follows.

Python
stop_words = ['a']
s = [["I", "am", "a", "potato"],["He", "is", "a", "tomato"]]

ret = []
for line in s:
    ret.append([w for w in line if w not in stop_words])

print(ret) # [['I', 'am', 'potato'], ['He', 'is', 'tomato']]
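
As a supplement, here is a minimal sketch of the same idea written as a nested list comprehension. In the question's code, each `w` is a whole inner list rather than a word, so the membership test never matches and nothing is filtered. The helper name `remove_stopwords` and the short hand-written `stop_words` list are assumptions for illustration; with NLTK you would presumably pass `stopwords.words('english')` instead. That list is lowercase, so the sketch compares lowercased tokens so that "I" and "He" are also matched.

```python
def remove_stopwords(sentences, stop_words):
    """Return a copy of `sentences` (a list of token lists) without stop words."""
    # A set gives fast membership tests; lowercasing makes the match case-insensitive.
    stop_set = {w.lower() for w in stop_words}
    return [
        [w for w in sentence if w.lower() not in stop_set]
        for sentence in sentences
    ]


# Hand-written stand-in for stopwords.words('english'), for illustration only.
stop_words = ["i", "he", "am", "is", "a"]
s = [["I", "am", "a", "potato"], ["He", "is", "a", "tomato"]]

# In the question, the test was applied to a whole inner list, which is never
# an element of a list of strings.
print(s[0] in stop_words)  # False

print(remove_stopwords(s, stop_words))  # [['potato'], ['tomato']]
```

If the original capitalization matters for matching (for example, to keep "US" while dropping "us"), remove the `.lower()` calls and compare the tokens as they are.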