Question about fetching data from multiple web pages with Python

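I wrote a script that fetches the headline of a single web news page, but I don't know how to fetch data from multiple pages based on a list.
Would it be easier to keep the list in something like a CSV file?
I want to build a personal breaking-news gadget.
Thank you in advance.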


1 answer

Best answer

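Please include the script you wrote in the question text as far as possible.
Here is sample code, originally written for an answer to someone else's question, that scrapes multiple pages from a list.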

Python
# -*- coding: utf-8 -*-
import time
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import requests


class Downloader:
    def __init__(self):
        # Fetched pages are cached in this directory, one file per URL.
        self.download_dir = Path('./download')
        self.download_dir.mkdir(exist_ok=True)
        # requests takes the timeout in seconds, not milliseconds.
        self.timeout = 10

    @staticmethod
    def hash_value(url):
        # The SHA-512 hex digest of the URL serves as the cache file name.
        h = hashlib.sha512()
        h.update(url.encode('utf-8'))
        return '$6$$' + h.hexdigest()

    def get_content(self, url):
        file = self.download_dir.joinpath(Downloader.hash_value(url))
        # Return the cached copy if this URL was downloaded before.
        if file.exists():
            with file.open('rb') as f:
                return f.read()
        # Wait before each request to avoid hammering the server.
        time.sleep(5)
        res = requests.get(url, timeout=self.timeout)
        # Raise on 4xx/5xx responses so that error pages are not cached.
        res.raise_for_status()
        with file.open('wb') as f:
            f.write(res.content)
        return res.content


def main():
    dl = Downloader()
    urls = ['http://www.example.com/',
            'https://teratail.com/',
            'https://teratail.com/questions/109282',
            'https://www.google.co.jp/']

    # Download with two worker threads and handle results as they finish.
    with ThreadPoolExecutor(max_workers=2) as executor:
        future_to_url = {executor.submit(dl.get_content, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                print(data)
            except Exception as ex:
                print('url:{0} exception:{1}'.format(url, ex))


if __name__ == '__main__':
    main()

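I don't know how many pages you intend to scrape, but to start with, managing the list in CSV or JSON is the easiest to understand.
Once that becomes hard to manage, it is better to move to a database such as SQLite.
Using a scraping framework (Scrapy) is another option.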
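As a rough illustration of the CSV approach, the sketch below reads a URL list with the standard `csv` module. The function name `load_urls`, the file name `urls.csv`, and the `url` column header are all assumptions made for this example, not something taken from the question.

```python
import csv
from pathlib import Path


def load_urls(csv_path):
    """Return the non-empty values of the 'url' column, in file order.

    Assumes the CSV has a header row containing a column named 'url'
    (a hypothetical layout chosen for this sketch).
    """
    urls = []
    with Path(csv_path).open(newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            # Short or blank rows yield None or '' here, so skip them.
            url = (row.get('url') or '').strip()
            if url:
                urls.append(url)
    return urls
```

With a file such as `urls.csv`, the result of `load_urls('urls.csv')` could replace the hard-coded `urls` list in `main()`.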

Posted 2018/01/22 13:04

umyu

Total score: 5846

nossu

2018/01/23 12:36 (edited)

Sorry, it turned out to be simple.

# coding: UTF-8
import urllib.request
from bs4 import BeautifulSoup

urls = ['http://www.example.com/',
        'https://teratail.com/',
        'https://teratail.com/questions/109282',
        'https://www.google.co.jp/']

for url in urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string)
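One caveat with this loop: `soup.title` is `None` for a page that has no `<title>` element, so `soup.title.string` would raise `AttributeError` there. The sketch below shows one way to guard against that. Note that it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra packages; the names `TitleParser` and `extract_title` are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. in SVG) are ignored.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = ''.join(self._parts).strip()
        return text or None


def extract_title(html_text):
    """Return the page title as a string, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

In the loop, the decoded page text could be passed to `extract_title`, and a `None` result skipped or reported instead of stopping the whole run.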
