Scraping with Python: cannot extract the text inside a span

Background / what I want to achieve

I want to extract the text inside a span element.

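Problem / error message

URL = https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/

I want to extract the personal data listed under FACTSHEET on the page above.

However, the <span class="info"> elements inside <div class="player-personal-info"> are the only part I cannot extract. In the browser the values appear as normal text, but when I print them I get {{details.player.person.XXX}} instead, as shown below.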

# Example output (without .text)
<span class="label">Weight:</span>
<span class="info">{{details.player.person.weight}} kg</span>
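The {{...}} text in this output is typical of a client-side template placeholder: the server sends the markup with placeholders, and JavaScript fills in the values in the browser, which requests.get() does not run. The following is a minimal sketch, using only the standard library, for checking whether fetched markup still contains such placeholders. The names PLACEHOLDER and find_placeholders are hypothetical and introduced here for illustration; the sample string is the span from the output above.

```python
import re

# Matches template placeholders such as {{details.player.person.weight}}.
PLACEHOLDER = re.compile(r"\{\{\s*[^{}]+?\s*\}\}")


def find_placeholders(html):
    """Return every {{...}} placeholder still present in the markup."""
    return PLACEHOLDER.findall(html)


# The span reported in the question, as returned without running JavaScript.
raw_html = '<span class="info">{{details.player.person.weight}} kg</span>'

leftover = find_placeholders(raw_html)
if leftover:
    print("JavaScript has not filled these in yet:", leftover)
```

If the list is non-empty, the page most likely needs to be rendered with a JavaScript-capable tool before parsing.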

Relevant source code

Python
from bs4 import BeautifulSoup as bs
import requests as req

url = "https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/"
r = req.get(url)
soup = bs(r.text, "html.parser")
lists = []

player_infos = soup.find_all("div", class_="player-info-row")

for player_info in player_infos:
    datas = player_info.find_all("span")
    for data in datas:
        xxx = data.text
        lists.append(xxx)

for i in lists:
    print(i)

What I tried

I searched online but could not find any relevant information.

2 answers

Best answer

To get the page rendered, I tried using HTMLSession.
However,
resp.html.render()
alone did not retrieve the values, so I changed it to
resp.html.render(sleep=1, keep_page=True)

How about this? (Your original code is commented out with #.)

python
from bs4 import BeautifulSoup as bs
#import requests as req
from requests_html import HTMLSession

#url = "https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/"
#r = req.get(url)
#soup = bs(r.text, "html.parser")

session = HTMLSession()
resp = session.get("https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/")
resp.html.render(sleep=1, keep_page=True)
soup = bs(resp.html.html, "html.parser")

Sites I referred to:
https://gammasoft.jp/blog/how-to-download-web-page-created-javascript/
https://stackoverflow.com/questions/56745062/data-scraping-from-a-webpage-with-javascript-using-python

Posted 2021/06/06 04:20

Deleted user

Total score 0

AKICHILD

2021/06/06 12:53

Thank you for your answer. I used it as a reference. With resp.html.render(sleep=1, keep_page=True), I got the error pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 8000 ms exceeded., so I modified the code as follows.

from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get("https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/")
resp.html.render(timeout=20, keep_page=True)
soup = bs(resp.html.html, "html.parser")
lists = []
datas = resp.html.find(".info")
for data in datas:
    xxx = data.text
    lists.append(xxx)
print(lists)

However, the output was as follows.

[',', '', '', '', 'cm', 'kg']

What am I doing wrong? Site I referred to: https://laboratory.kazuuu.net/install-requests-html-that-can-parse-html-as-simply-as-possible-windows10/
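One difference between this attempt and the answer's code is that sleep=1 was dropped when timeout=20 was added. The thread does not confirm that this is the cause, but the output (a lone comma and bare units) would be consistent with the template having been rendered before the player data arrived. The following is a minimal sketch, under that assumption, that passes both sleep and timeout to render() and retries with a longer wait while any value is still blank. The helper names all_filled and fetch_info_values, the retry count, and the wait times are hypothetical choices, not something taken from the thread.

```python
def all_filled(values):
    """True only if there is at least one value and none of them is blank."""
    return bool(values) and all(v.strip() for v in values)


def fetch_info_values(url, attempts=3):
    """Render the page, waiting longer on each attempt, until no .info span is blank."""
    # requests_html is third-party; it is imported here so that all_filled()
    # can be used without it installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    values = []
    try:
        for attempt in range(1, attempts + 1):
            resp = session.get(url)
            # sleep: seconds to wait after the page loads, before reading the HTML.
            # timeout: seconds allowed for the page load itself.
            resp.html.render(sleep=attempt * 2, timeout=20)
            values = [element.text for element in resp.html.find(".info")]
            if all_filled(values):
                break
    finally:
        session.close()
    return values
```

Calling fetch_info_values() with the player URL from the question would return the list of .info texts from the last attempt. Whether a longer sleep is enough on a given machine or network is not something this sketch can guarantee.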

python
from bs4 import BeautifulSoup as bs
#import requests as req
from requests_html import HTMLSession

#url = "https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/"
#r = req.get(url)
#soup = bs(r.text, "html.parser")

session = HTMLSession()
resp = session.get("https://ehfcl.eurohandball.com/women/2020-21/player/ZtM-k_SFYU_lbwphYcnzTQ/cleopatre-darleux/")
resp.html.render(sleep=1, keep_page=True)
soup = bs(resp.html.html, "html.parser")

lists = []

player_infos = soup.find_all("div", class_="player-info-row")

for player_info in player_infos:
    datas = player_info.find_all("span")
    for data in datas:
        xxx = data.text
        lists.append(xxx)

for i in lists:
    print(i)

I saved the code above as tera.py and ran it under the time command
to measure the processing time.
The result is as follows.

$ time python tera.py 
Name:
Darleux, Cleopatre
Age:
31
Place of birth:
Mulhouse
Nationality:
France
Height:
176 cm
Weight:
72 kg

real	0m8.710s
user	0m1.160s
sys	0m0.312s

It takes quite a while, 8.7 seconds, but the information is retrieved.
I wonder why it cannot be retrieved on your side.
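Since the printed lines alternate between a label and its value, they can be paired into a dictionary for easier lookup. The following is a minimal sketch using the output above as sample data; pair_labels is a hypothetical helper name, and it assumes every label is immediately followed by exactly one value.

```python
# Lines printed by tera.py above, alternating label and value.
lines = [
    "Name:", "Darleux, Cleopatre",
    "Age:", "31",
    "Place of birth:", "Mulhouse",
    "Nationality:", "France",
    "Height:", "176 cm",
    "Weight:", "72 kg",
]


def pair_labels(items):
    """Pair each label (even index) with the value that follows it."""
    labels = [label.rstrip(":").strip() for label in items[0::2]]
    values = [value.strip() for value in items[1::2]]
    return dict(zip(labels, values))


factsheet = pair_labels(lines)
print(factsheet["Height"])  # prints "176 cm"
```

In the script itself, the same helper could be applied to lists once the loop has finished.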

Posted 2021/06/06 13:33

Deleted user

Total score 0