Question edit history (2 edits)

Title change

test  CHANGED
@@ -1 +1 @@
-The sample code from "Python Crawling & Scraping"
+Ran the sample code from "Python Crawling & Scraping", but an error occurs

test  CHANGED
File without changes

Improved the formatting and added details

test  CHANGED
File without changes

test  CHANGED
@@ -4,77 +4,137 @@
 
 http://gihyo.jp/book/2017/978-4-7741-8367-1/support
 
-I downloaded the sample code above, and the program called tabelog in 6-7
+I downloaded the sample code above and ran the program called tabelog in 6-7, but an error occurred.
+
+###Prerequisites / what I want to achieve
+
+I want to turn the information listed on Tabelog into a CSV list.
 
 ###Problem / error message
 
+```
+(scraping3.4) NozomuI-no-MacBook:6-7 nozomui$ scrapy crawl tabelog -o a.csv
+2017-08-17 15:08:56 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myproject)
+2017-08-17 15:08:56 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproject', 'SPIDER_MODULES': ['myproject.spiders'], 'DOWNLOAD_DELAY': 1, 'FEED_FORMAT': 'csv', 'FEED_URI': 'a.csv', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'myproject.spiders'}
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled extensions:
+['scrapy.extensions.feedexport.FeedExporter',
+ 'scrapy.extensions.memusage.MemoryUsage',
+ 'scrapy.extensions.logstats.LogStats',
+ 'scrapy.extensions.telnet.TelnetConsole',
+ 'scrapy.extensions.corestats.CoreStats']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
+['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+ 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
+ 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
+ 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
+ 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
+ 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
+ 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
+ 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
+ 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
+ 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
+ 'scrapy.downloadermiddlewares.stats.DownloaderStats']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled spider middlewares:
+['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
+ 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
+ 'scrapy.spidermiddlewares.referer.RefererMiddleware',
+ 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
+ 'scrapy.spidermiddlewares.depth.DepthMiddleware']
+2017-08-17 15:08:56 [scrapy.middleware] INFO: Enabled item pipelines:
+[]
+2017-08-17 15:08:56 [scrapy.core.engine] INFO: Spider opened
+2017-08-17 15:08:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
+2017-08-17 15:08:56 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
+2017-08-17 15:08:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/robots.txt> (referer: None)
+2017-08-17 15:08:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1> (referer: None)
+2017-08-17 15:09:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/rstLst/lunch/2/?LstCosT=2&RdoCosTp=1> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
+2017-08-17 15:09:00 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://tabelog.com/tokyo/rstLst/lunch/2/?LstCosT=2&RdoCosTp=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
+2017-08-17 15:09:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tabelog.com/tokyo/A1309/A130902/13000852/> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
+2017-08-17 15:09:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tabelog.com/tokyo/A1309/A130902/13000852/> (referer: https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1)
 
 Traceback (most recent call last):
 
-  File "/Users/nozomui/.pyenv/versions/3.6.1/bin/scrapy", line 11, in <module>
-    sys.exit(execute())
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/cmdline.py", line 148, in execute
-    cmd.crawler_process = CrawlerProcess(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 243, in __init__
-    super(CrawlerProcess, self).__init__(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 134, in __init__
-    self.spider_loader = _get_spider_loader(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
-    return loader_cls.from_settings(settings.frozencopy())
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
-    return cls(settings)
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
-    self._load_all_spiders()
-  File "/Users/nozomui/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
-    for module in walk_modules(name):
-  File "/Users/nozomui/.pyenv/versions/3.6
-  File "/Users/nozomui/.pyenv/versions/3.6
-    ret
-  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
-  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
-  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
-  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
-  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
-  File "/Users/nozomui/
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
+    yield next(it)
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
+    for x in result:
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
+    return (_set_referer(r) for r in result or ())
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
+    return (r for r in result or () if _filter(r))
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
+    return (r for r in result or () if _filter(r))
+  File "/Users/nozomui/.pyenv/versions/3.4.6/lib/python3.4/site-packages/scrapy/spiders/crawl.py", line 78, in _parse_response
+    for requests_or_item in iterate_spider_output(cb_res):
+  File "/Users/nozomui/scraping3.4/6-7/myproject/spiders/tabelog.py", line 38, in parse_restaurant
+    address=response.css('[rel="address"]').xpath('string()').extract_first().strip(),
+AttributeError: 'NoneType' object has no attribute 'strip'
+
+^Z
+[4]+  Stopped                 scrapy crawl tabelog -o a.csv
 
 ```
 
@@ -82,7 +142,7 @@
 
 ###Relevant source code
 
-```
+```python
 
 from scrapy.spiders import CrawlSpider, Rule
 
@@ -90,21 +150,27 @@
 
-from myproject.items import
+from myproject.items import Restaurant
 
-class
+class TabelogSpider(CrawlSpider):
 
-    name = "a
+    name = "tabelog"
 
-    allowed_domains = ["a
+    allowed_domains = ["tabelog.com"]
 
     start_urls = (
+        # URL of the Tokyo lunch ranking.
+        # Browsing the website normally produces many more parameters, but
+        # the pager links show that parameters whose value is 0 can be omitted.
+        'https://tabelog.com/tokyo/rstLst/lunch/?LstCosT=2&RdoCosTp=1',
     )
 
@@ -114,35 +180,53 @@
 
         # Follow the pager (up to 9 pages).
+        # Changing \d to \d+ in the regex also follows page 10 and beyond.
-        Rule(LinkExtractor(allow=r'
+        Rule(LinkExtractor(allow=r'/\w+/rstLst/lunch/\d/')),
 
-        #
+        # Parse restaurant detail pages.
 
-        Rule(LinkExtractor(allow=r'
+        Rule(LinkExtractor(allow=r'/\w+/A\d+/A\d+/\d+/$'),
-             callback='parse_
+             callback='parse_restaurant'),
     ]
 
-    def parse_
+    def parse_restaurant(self, response):
         """
-
+        Parse a restaurant detail page.
         """
+        # Get the latitude and longitude from the Google Static Maps image URL.
+        latitude, longitude = response.css(
+            'img.js-map-lazyload::attr("data-original")').re(
+            r'markers=.*?%7C([\d.]+),([\d.]+)')
+
+        # Create a Restaurant object, specifying values for its keys.
-        item =
+        item = Restaurant(
+            name=response.css('.display-name').xpath('string()').extract_first().strip(),
+            address=response.css('[rel="address"]').xpath('string()').extract_first().strip(),
+            latitude=latitude,
+            longitude=longitude,
+            station=response.css('dt:contains("最寄り駅")+dd span::text').extract_first(),
+            score=response.css('[rel="v:rating"] span::text').extract_first(),
 
         )
 
@@ -153,3 +237,9 @@
 
 
 ```
+
+###What I tried
+
+I ran it with Python 3.4.6. The Scrapy version is 1.4.0.
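
Editor's note on the error recorded in this history: the revised traceback fails at the `address=` line because `response.css('[rel="address"]')` matches nothing on that page, so `extract_first()` returns `None` and the chained `.strip()` raises the `AttributeError`. The sketch below reproduces that mechanism outside the crawl; the HTML snippet is invented for illustration, and `extract_first(default='')` is one common guard, not necessarily the book's intended fix.

```python
from scrapy.selector import Selector

# Invented stand-in for a restaurant page; the real Tabelog markup is not
# shown in the question, so this only demonstrates the mechanism of the error.
html = '<html><body><p class="display-name"> Some Restaurant </p></body></html>'
sel = Selector(text=html)

# When the selector matches, extract_first() returns a string, so .strip() works.
name = sel.css('.display-name').xpath('string()').extract_first().strip()
print(repr(name))  # 'Some Restaurant'

# Nothing here carries rel="address", so extract_first() returns None, and
# None.strip() raises AttributeError: 'NoneType' object has no attribute 'strip',
# exactly as in the traceback above.
address_sel = sel.css('[rel="address"]').xpath('string()')
print(repr(address_sel.extract_first()))  # None

# One possible guard: have extract_first() return a default string instead of None.
address = address_sel.extract_first(default='').strip()
print(repr(address))  # ''
```

Since the book's selectors presumably worked at publication time, one plausible explanation is that Tabelog's markup has changed since then, in which case the selectors in `parse_restaurant` would need to be updated against the current page rather than merely guarded.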