About n-grams in Python

Background / what I want to achieve

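As background: I took a program I found online, ported it from Python 2 to Python 3, and added the import statements.
My goal is a program that processes text from web pages with n-grams (n from 1 to about 10) and finds out which words appear most often. The URLs to fetch are stored in a JSON file, which I have also included below.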
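
To make the goal concrete, here is a minimal sketch of the counting step on its own. It assumes the text has already been split into tokens (in the real program MeCab would do that), so the token list and the name `count_ngrams` are hypothetical and used only for illustration.

```python
from collections import Counter


def count_ngrams(tokens, max_n):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            key = "_".join(tokens[start:start + n])
            counts[n][key] += 1
    return counts


# Hypothetical, already-tokenized input (stands in for MeCab output).
tokens = ["the", "cat", "saw", "the", "cat"]
counts = count_ngrams(tokens, max_n=2)

# Most frequent unigrams and bigrams, highest count first.
print(counts[1].most_common())
print(counts[2].most_common())
```

`Counter.most_common()` returns the entries sorted by count, which is the same ordering the program below produces with `sorted(..., key=itemgetter(1), reverse=True)`.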

Problem / error message

I don't fully understand the __getNavigableStrings part, and it produces the error below. I would appreciate a fix and an explanation.

Traceback (most recent call last):
  File "ngram.py", line 75, in <module>
    text = hp.get(url)
  File "ngram.py", line 49, in get
    text = '\n'.join(self.__getNavigableStrings(soup))
  File "ngram.py", line 58, in __getNavigableStrings
    for g in self.__getNavigableStrings(c):
  File "ngram.py", line 54, in __getNavigableStrings
    if type(soup) not in (Comment, Declaration) and soup.strip():
NameError: name 'Comment' is not defined

I wrote some debugging code and confirmed that the problem is in the __getNavigableStrings function. Please tell me how Comment, Declaration, and so on should be defined.

Relevant source code

python
# coding: utf-8

import sys
import json
import MeCab
import urllib.request, urllib.error, urllib.parse
from collections import defaultdict
from operator import itemgetter
from bs4 import BeautifulSoup
from bs4 import NavigableString


class Ngram():

    def __init__(self, N=3):
        self.N = N
        self.tagger = MeCab.Tagger("-Owakati")

    def get(self, text, ngram=None):
        seq = self.tagger.parse(text).split()

        if ngram is None:
            ngram = [defaultdict(int) for x in range((self.N + 1))]
            ngram[0] = None

        for i in range(len(seq)):
            for n in range(1, self.N + 1):
                idx = i - n + 1  # check ngram is valid range
                if idx >= 0:
                    key_words = []
                    for j in range(idx, i+1):
                        key_words.append(seq[j])
                    key = '_'.join(key_words)
                    ngram[n][key] += 1

        return ngram


class HTMLParser():

    def get(self, url):
        try:
            c = urllib.request.urlopen(url)
        except:
            print("Could not open %s" % url)
            return ""

        soup = BeautifulSoup(c.read(), "lxml")
        text = '\n'.join(self.__getNavigableStrings(soup))
        return text

    def __getNavigableStrings(self, soup):
        if isinstance(soup, NavigableString):
            if type(soup) not in (Comment, Declaration) and soup.strip():
                yield soup
        elif soup.name not in ('script', 'style'):
            for c in soup.contents:
                for g in self.__getNavigableStrings(c):
                    yield g


if __name__ == "__main__":

    f = open("urls.json", "r")
    urls = json.load(f)
    f.close()
    print("Count of urls : " + str(len(urls)))

    N = 10
    hp = HTMLParser()
    ng = Ngram(N)

    ngram = None
    for url in urls:
        text = hp.get(url)
        ngram = ng.get(text, ngram)

    for n in range(1, (N + 1)):
        f = open('outputs/{:02d}.tsv'.format(n), 'w')
        out = ""
        for k, v in sorted(list(ngram[n].items()), key=itemgetter(1), reverse=True):
            out += "{}\t{}\n".format(k, v)
        f.write(out)
        f.close()

json
["https://headlines.yahoo.co.jp/article?a=20180529-00000053-sasahi-life", "https://headlines.yahoo.co.jp/hl?a=20180601-00000031-asahi-soci", "https://headlines.yahoo.co.jp/hl?a=20180601-00010001-biz_shoko-bus_all", "https://news.yahoo.co.jp/byline/fuwaraizo/20180531-00085864/", "https://headlines.yahoo.co.jp/videonews/nnn?a=20180601-00000032-nnn-soci", "https://headlines.yahoo.co.jp/hl?a=20180601-00010002-reutv-bus_all", "https://headlines.yahoo.co.jp/videonews/", "https://news.yahoo.co.jp/profile/login", "https://www.yahoo-help.jp/app/answers/detail/p/575/a_id/60137", "https://news.yahoo.co.jp/pickup/6284651", "https://about.yahoo.co.jp/docs/info/terms/chapter1.html", "https://news.yahoo.co.jp/hl?c=c_life", "https://headlines.yahoo.co.jp/hl?a=20180601-00000051-asahi-spo", "https://news.yahoo.co.jp/byline/oshimakazuto/20180531-00085867/", "https://news.yahoo.co.jp/pickup/6284647", "https://www.yahoo-help.jp/app/answers/detail/a_id/43880/p/533/", "https://headlines.yahoo.co.jp/docs/copyright.html", "https://news.yahoo.co.jp/ranking", "https://headlines.yahoo.co.jp/docs/tokuteisho.html", "https://headlines.yahoo.co.jp/purchase/", "https://news.yahoo.co.jp/photo", "https://headlines.yahoo.co.jp/hl?a=20180601-00000548-san-hlth", "https://news.yahoo.co.jp/hl?c=c_sci", "https://headlines.yahoo.co.jp/hl?a=20180601-00226362-nksports-spo", "https://news.yahoo.co.jp/pickup/6284656", "https://news.yahoo.co.jp/feature", "https://about.yahoo.co.jp/info/msiesp/", "https://headlines.yahoo.co.jp/article?a=20180601-00223078-toyo-soci", "https://about.yahoo.co.jp/docs/info/terms/", "https://news.yahoo.co.jp/ranking/access?ty=v", "https://headlines.yahoo.co.jp/hl?a=20180601-00000108-spnannex-socc", "https://headlines.yahoo.co.jp/hl?a=20180601-00000519-san-pol", "https://news.yahoo.co.jp/hl?c=c_spo", "https://headlines.yahoo.co.jp/hl?a=20180601-00010003-storyfulv-s_ame", "https://news.yahoo.co.jp/flash", "https://news.yahoo.co.jp/hl?c=dom", 
"https://headlines.yahoo.co.jp/hl?a=20180601-00029764-mbcnewsv-l46", "https://news.yahoo.co.jp/ranking/access?ty=t", "https://news.yahoo.co.jp/hl?c=c_ent", "https://www.yahoo-help.jp/app/noscript", "https://headlines.yahoo.co.jp/hl?a=20180601-00010002-fnnprimev-soci", "https://headlines.yahoo.co.jp/videonews/jnn?a=20180601-00000029-jnn-soci", "https://news.yahoo.co.jp/pickup/6284633", "https://news.yahoo.co.jp/hl?c=loc", "http://news.yahoo.co.jp/", "https://news.yahoo.co.jp/byline/kohyoungki/20180531-00085870/", "https://news.yahoo.co.jp/pickup/6284642", "https://news.yahoo.co.jp/profile/settings/", "https://news.yahoo.co.jp/polls/", "https://about.yahoo.co.jp/info/mediastatement/", "https://news.yahoo.co.jp/pickup/6284641", "https://headlines.yahoo.co.jp/article?a=20180601-00000021-sasahi-soci", "https://news.yahoo.co.jp/byline/", "https://news.yahoo.co.jp/pickup/6284638", "https://news.yahoo.co.jp/hl?c=c_int", "https://headlines.yahoo.co.jp/videonews/ann?a=20180601-00000005-ann-soci", "https://news.yahoo.co.jp/list/", "https://news.yahoo.co.jp/zasshi", "https://news.yahoo.co.jp/hl?c=bus", "https://news.yahoo.co.jp/ranking/comment/rate?ty=t", "https://news.yahoo.co.jp/ranking/access?ty=z", "https://news.yahoo.co.jp/ranking/access?ty=b", "https://news.yahoo.co.jp/", "https://feedback.ms.yahoo.co.jp/voc/news-voc/input", "https://news.yahoo.co.jp/pickup/6284649", "https://news.yahoo.co.jp/search/advanced", "https://www.yahoo-help.jp/app/home/p/575/", "https://news.yahoo.co.jp/topics"]

References

巨人の肩の上に登る
http://mayo.hatenablog.com/entry/2013/12/09/012939

hayataka2049

2018/06/01 10:42 (edited)

Could you cite the source (the page you used as a reference)?

rrrrrrrry

2018/06/01 10:44

I have added it.


1 answer

Best answer

I don't know how to use BeautifulSoup at all, but wouldn't that error go away if you simply imported the names?

Something like this.

python
from bs4 import Comment, Declaration
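
For context, here is a small self-contained sketch of the same filtering idea with both names imported. It is not the asker's program: the HTML sample and the function name `iter_visible_strings` are made up for illustration, and it uses the standard-library `html.parser` backend instead of `lxml` so that nothing beyond bs4 is needed. `Comment` and `Declaration` are subclasses of `NavigableString`, which is why the inner check is needed to exclude them.

```python
from bs4 import BeautifulSoup, NavigableString, Comment, Declaration

# Hypothetical sample page, used only for this illustration.
SAMPLE_HTML = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><!-- hidden --><p>Hello</p><p> world </p></body></html>"
)


def iter_visible_strings(node):
    """Yield non-empty text nodes, skipping comments, declarations,
    and anything inside <script> or <style>."""
    if isinstance(node, NavigableString):
        if not isinstance(node, (Comment, Declaration)) and node.strip():
            yield str(node).strip()
    elif node.name not in ("script", "style"):
        for child in node.contents:
            yield from iter_visible_strings(child)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
visible = list(iter_visible_strings(soup))
print(visible)
```

In the question's code the equivalent change should just be adding the import line above next to the existing `from bs4 import NavigableString`.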

Posted 2018/06/01 10:44

hayataka2049

Total score: 30939
