Question edit history

Revision 2: Title changed
title
CHANGED
@@ -1,1 +1,1 @@
-The Pythonクローリング&スクレイピング sample code
+I ran the Pythonクローリング&スクレイピング sample code, but an error occurs
body
No changes
Revision 1: Improved formatting, added details
title
No changes
body
CHANGED
@@ -1,78 +1,123 @@
 ###Prerequisites / what I want to achieve
 Pythonクローリング&スクレイピング ―データ収集・解析のための実践開発ガイド―
 http://gihyo.jp/book/2017/978-4-7741-8367-1/support
-I downloaded the sample code above, and the tabelog program in 6-7
+I downloaded the sample code above and ran the tabelog program in 6-7, but an error occurred.
 
+
+###Prerequisites / what I want to achieve
+I want to turn the information listed on Tabelog into a CSV list.
+
 ###Problem / error messages
-
+
+```
+(scraping3.4) NozomuI-no-MacBook:6-7 nozomui$ scrapy crawl tabelog -o a.csv
+2017-08-17 15:08:56 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myproject)
+2017-08-17 15:08:56 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproject', 'SPIDER_MODULES': ['myproject.spiders'], 'DOWNLOAD_DELAY': 1, 'FEED_FORMAT': 'csv', 'FEED_URI': 'a.csv', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'myproject.spiders'}
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled extensions:
+['scrapy.extensions.feedexport.FeedExporter',
+ 'scrapy.extensions.memusage.MemoryUsage',
+ 'scrapy.extensions.logstats.LogStats',
+ 'scrapy.extensions.telnet.TelnetConsole',
+ 'scrapy.extensions.corestats.CoreStats']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
+['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+ 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
+ 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
+ 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
+ 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
+ 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
+ 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
+ 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
+ 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
+ 'scrapy.downloadermiddlewares.stats.DownloaderStats']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled spider middlewares:
+['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
+ 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
+ 'scrapy.spidermiddlewares.referer.RefererMiddleware',
+ 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
+ 'scrapy.spidermiddlewares.depth.DepthMiddleware']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled item pipelines:
+[]
+2017-08-17 15:08:56 [scrapy.core.engine] INFO: Spider opened
+2017-08-17 15:08:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
+2017-08-17 15:08:56 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
+2017-08-17 15:08:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/robots.txt> (referer: None)
+2017-08-17 15:08:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1> (referer: None)
+2017-08-17 15:09:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/rstLst/lunch/2/?LstCosT=2&RdoCosTp=1> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
+2017-08-17 15:09:00 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://tabelog.com/tokyo/rstLst/lunch/2/?LstCosT=2&RdoCosTp=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
+2017-08-17 15:09:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/A1309/A130902/13000852/> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
+2017-08-17 15:09:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tabelog.com/tokyo/A1309/A130902/13000852/> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
 Traceback (most recent call last):
-  File "/Users/nozomui/.pyenv/versions/3.6.1/bin/scrapy", line 11, in <module>
-    sys.exit(execute())
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/cmdline.py", line 148, in execute
-    cmd.crawler_process = CrawlerProcess(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 243, in __init__
-    super(CrawlerProcess, self).__init__(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 134, in __init__
-    self.spider_loader = _get_spider_loader(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
-    return loader_cls.from_settings(settings.frozencopy())
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
-    return cls(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
-    self._load_all_spiders()
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
-    for module in walk_modules(name):
-  File "/Users/nozomui/.pyenv/versions/3.
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
+    yield next(it)
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
+    for x in result:
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
+    return (_set_referer(r) for r in result or ())
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
-
+    return (r for r in result or () if _filter(r))
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
+    return (r for r in result or () if _filter(r))
-  File "/Users/nozomui/.pyenv/versions/3.
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spiders/crawl.py", line 78, in _parse_response
-
+    for requests_or_item in iterate_spider_output(cb_res):
-  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
-  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
-  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
-  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
-  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
-  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
-  File "/Users/nozomui/
+  File "/Users/nozomui/scraping3.4/6-7/myproject/spiders/tabelog.py", line 38, in parse_restaurant
-    from myproject.utils import get_content
-
+    address=response.css('[rel="address"]').xpath('string()').extract_first().strip(),
-    import readability
-
+AttributeError: 'NoneType' object has no attribute 'strip'
+^Z
+[4]+ Stopped scrapy crawl tabelog -o a.csv
 ```
 
 ###Relevant source code
-```
+```python
 from scrapy.spiders import CrawlSpider, Rule
 from scrapy.linkextractors import LinkExtractor
 
-from myproject.items import
+from myproject.items import Restaurant
 
 
-class
+class TabelogSpider(CrawlSpider):
-    name = "
+    name = "tabelog"
-    allowed_domains = ["
+    allowed_domains = ["tabelog.com"]
     start_urls = (
+        # URL of the Tokyo lunch ranking.
+        # Browsing the site normally produces a URL with more parameters,
+        # but the pager links show that parameters whose value is 0 can be omitted.
-
+        'https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1',
     )
 
     rules = [
         # Follow the pager (up to 9 pages).
+        # Changing \d to \d+ in the regex lets the spider follow page 10 and beyond.
-        Rule(LinkExtractor(allow=r'
+        Rule(LinkExtractor(allow=r'/\w+/rstLst/lunch/\d/')),
-        #
+        # Parse restaurant detail pages.
-        Rule(LinkExtractor(allow=r'
+        Rule(LinkExtractor(allow=r'/\w+/A\d+/A\d+/\d+/$'),
-             callback='
+             callback='parse_restaurant'),
     ]
 
-    def
+    def parse_restaurant(self, response):
         """
-
+        Parse a restaurant detail page.
         """
+        # Get the latitude and longitude from the Google Static Maps image URL.
+        latitude, longitude = response.css(
+            'img.js-map-lazyload::attr("data-original")').re(
+            r'markers=.*?%7C([\d.]+),([\d.]+)')
+
-
+        # Create a Restaurant object, specifying a value for each key.
-        item =
+        item = Restaurant(
+            name=response.css('.display-name').xpath('string()').extract_first().strip(),
-
+            address=response.css('[rel="address"]').xpath('string()').extract_first().strip(),
+            latitude=latitude,
+            longitude=longitude,
+            station=response.css('dt:contains("最寄り駅")+dd span::text').extract_first(),
-
+            score=response.css('[rel="v:rating"] span::text').extract_first(),
-
         )
 
         yield item
 
-```
+```
+
+###What I tried
+I ran it with Python 3.4.6. The Scrapy version is 1.4.0.
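
The AttributeError at the end of the revised question follows directly from the failing line: `extract_first()` returns `None` when its selector matches nothing, and `None` has no `.strip()` method. Below is a minimal defensive sketch, not the book's code: it assumes the same `Restaurant` item and spider layout as the question, and the selectors are copied verbatim from it, so they may simply no longer match Tabelog's current markup.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from myproject.items import Restaurant  # the book's item class, as in the question


class TabelogSpider(CrawlSpider):
    name = "tabelog"
    allowed_domains = ["tabelog.com"]
    start_urls = ('https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1',)

    rules = [
        Rule(LinkExtractor(allow=r'/\w+/rstLst/lunch/\d/')),
        Rule(LinkExtractor(allow=r'/\w+/A\d+/A\d+/\d+/$'),
             callback='parse_restaurant'),
    ]

    def parse_restaurant(self, response):
        # extract_first() returns None when nothing matches; passing
        # default='' keeps the chained .strip() from raising AttributeError.
        name = response.css('.display-name').xpath('string()').extract_first(default='').strip()
        address = response.css('[rel="address"]').xpath('string()').extract_first(default='').strip()
        if not address:
            # Nothing matched on this page: log it and skip the item
            # instead of killing the callback with a traceback.
            self.logger.warning('no address found on %s', response.url)
            return
        yield Restaurant(name=name, address=address)
```

Whether a selector still matches can be checked interactively, e.g. `scrapy shell 'https://tabelog.com/tokyo/A1309/A130902/13000852/'` followed by `response.css('[rel="address"]')`; an empty result there would confirm that the page markup has changed rather than anything being wrong with the spider itself.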