Web scraping code does not work.

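Page 243 of The Self-Taught Programmer (独学プログラマー, by Cory Althoff, Nikkei BP) has code for a web scraper. I copied it exactly and tried running it, but it printed nothing at all. Is the book wrong? Or is something wrong with my MacBook?
Source: http://tinyurl.com/j55s7hm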

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request\
            .urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html,
                           parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()

1 answer

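It looks like none of the URLs in the href attributes of the a tags end in .html, so the `if "html" in url` check never matches and nothing is printed.
When I changed the code as follows, print() output the list of URLs.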

python
import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        soup = BeautifulSoup(r.read(), 'html.parser')

        for tag in soup.find_all('a'):
            print(tag.get('href'))

news = 'https://news.google.com/'
Scraper(news).scrape()

./?hl=en-US&gl=US&ceid=US%3Aen
https://www.google.com/intl/en/options/
https://myaccount.google.com/?utm_source=OGB&utm_medium=app
https://www.google.com/webhp
https://maps.google.com/maps?hl=en
https://www.youtube.com/?gl=US
https://play.google.com/?hl=en
https://news.google.com/nwshp?hl=en
https://mail.google.com/mail/
https://contacts.google.com/?hl=en
https://drive.google.com/
https://www.google.com/calendar
https://translate.google.com/?hl=en
https://photos.google.com/?pageId=none
https://www.google.com/intl/en/options/
http://www.google.com/shopping?hl=en
https://www.google.com/finance
https://docs.google.com/document/?usp=docs_alc
https://books.google.com/bkshp?hl=en
https://www.blogger.com/
https://hangouts.google.com/
https://keep.google.com/
https://www.google.com/save
(The rest is omitted because it is long.)
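
The output above mixes relative links (such as `./?hl=en-US&gl=US&ceid=US%3Aen`) with absolute ones, and some URLs appear more than once. As a rough sketch of one way to tidy such a list, the following resolves each href against the page URL with the standard library's `urllib.parse.urljoin` and drops duplicates. The function name `normalize_links` and the sample list are made up for illustration; they are not from the book or from the code above.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop None values and duplicates.

    normalize_links is a hypothetical helper for illustration only.
    """
    seen = set()
    result = []
    for href in hrefs:
        if href is None:  # <a> tags without an href attribute
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    # Sample values taken from the output listed above.
    sample = [
        "./?hl=en-US&gl=US&ceid=US%3Aen",
        "https://www.google.com/intl/en/options/",
        None,
        "https://www.google.com/intl/en/options/",
    ]
    for link in normalize_links("https://news.google.com/", sample):
        print(link)
```

In the Scraper class, the same idea would apply to the values returned by `tag.get('href')`, with `self.site` as the base URL.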

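The structure of a web site changes frequently, so sample scraping code often does not work as written.
When that happens, open the actual web site in a browser, look at its source to see how it is structured, and write code that retrieves the data accordingly. (In Chrome, you can inspect the structure of a web site via right-click → Inspect.)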
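
A filter that silently matches nothing is hard to tell apart from a broken script, so it can help to print how many links were found and how many passed the filter. The sketch below is only an illustration: it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without extra installs, and it parses a small hard-coded HTML string rather than fetching a page. `LinkCollector`, `summarize_links`, and `SAMPLE_HTML` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:
            self.hrefs.append(href)


def summarize_links(html, keyword):
    """Return (total number of hrefs, list of hrefs containing keyword)."""
    collector = LinkCollector()
    collector.feed(html)
    matched = [h for h in collector.hrefs if keyword in h]
    return len(collector.hrefs), matched


# Made-up sample markup; a real page would come from urlopen(...).read().
SAMPLE_HTML = """
<a href="./?hl=en-US">Home</a>
<a href="https://www.google.com/webhp">Search</a>
<a>no href</a>
"""

if __name__ == "__main__":
    total, matched = summarize_links(SAMPLE_HTML, "html")
    print(f"{total} links found, {len(matched)} contain 'html'")
```

If the first number is positive and the second is zero, the page was fetched and parsed correctly and only the filter condition needs to change.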

Posted 2018/10/08 08:06