[Python] How to scrape an SPA site with Scrapy

Background / what I want to achieve

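I am new to Python and am learning web scraping.

I want to use Scrapy and Splash in Python to fetch data from an SPA.
From the player roster page below, I want to get the following fields:

Name
Pos.
Date of birth
Height/weight

https://www.jleague.jp/club/kashima/player/

However, I only get a single record, for the first player on the page.
How can I get the data for every player on the page?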

Problem / error message


2019-08-26 00:11:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jleague.jp/club/kashima/player/>
{'name': 'クォン\u3000スンテ', 'position': 'GK', 'birthday': '1984/9/11', 'hgt_wgt': '184/85'}
2019-08-26 00:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-26 00:11:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 971,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 192400,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 9.557882,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 25, 22, 11, 8, 441137),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'memusage/max': 51572736,
 'memusage/startup': 51572736,
 'response_received_count': 3,
 'robotstxt/request_count': 2,
 'robotstxt/response_count': 2,
 'robotstxt/response_status_count/200': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2019, 8, 25, 22, 10, 58, 883255)}
2019-08-26 00:11:08 [scrapy.core.engine] INFO: Spider closed (finished)

Relevant source code

Python
from ..items import FootballAnaliticsItem
from scrapy_splash import SplashRequest
from urllib.request import urlopen
import scrapy


class JleagueSpider(scrapy.Spider):
    name = 'jleague'

    def start_requests(self):
        teams = ['kashima', 'yokohamafm']
        for team in teams:
            url = 'https://www.jleague.jp/club/' + team + '/player/'
            yield SplashRequest(url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.0})

    def parse(self, response):
        for selector in response.css('tbody'):
            yield {
                'name': selector.css('td:nth-child(4)::text').extract_first(),
                'position': selector.css('td:nth-child(5)::text').extract_first(),
                'birthday': selector.css('td:nth-child(7)::text').extract_first(),
                'hgt_wgt': selector.css('td:nth-child(8)::text').extract_first(),
            }

1 answer

Best answer

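I don't use Scrapy much, but here is what I see.

The loop iterates over tbody, and extract_first() returns only the first matching td inside that tbody, so each field comes from the first row only. Iterate over the rows instead:

python
for selector in response.css('tbody > tr'):

You need to select the tr elements first and then read the td cells inside each tr; otherwise you cannot extract more than one row.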

Posted 2019/08/26 00:18

barobaro

Total score: 1286
