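I want to get only the URLs inside elements with a specific class in Python

Background / what I want to achieve

I am extracting links from a site, but instead of every link on the page I want to extract only the links under class=HorseName.

Relevant source code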

from bs4 import BeautifulSoup
import urllib.request as req

url = "https://race.netkeiba.com/race/shutuba.html?race_id=202006030111&rf=race_list"
res = req.urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
url_items = soup.find_all(class_='HorseName')
for x in url_items:
    print(x.get('href'))

Problem / error message

Running the code above returns None for every item, as shown below.
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None

The HTML containing the links I want to extract is as follows.

<div> <span class="HorseName"> <a title="アイスバブル" href="https://db.netkeiba.com/horse/2015104689" target="_blank">アイスバブル<img width="18" class="disp_none Favorite" id="myhorse_2015104689" alt="" src="https://cdn.netkeiba.com/img.racev3/common/img/icon/icon_horse.png?2019073001"> </a> </span> </div>

What I tried

Running print(x) inside the same loop returns the following.

<span class="HorseName"><a href="https://db.netkeiba.com/horse/2015105090" target="_blank" title="レッドレオン">レッドレオン<img alt="" class="disp_none Favorite" id="myhorse_2015105090" src="https://cdn.netkeiba.com/img.racev3/common/img/icon/icon_horse.png?2019073001" width="18"/></a></span>

The HTML itself is extracted correctly, but extracting the link inside it,
a href="https://db.netkeiba.com/horse/2015105090"
does not seem to work.

quickquip

2020/03/26 14:45

Have you tried print(x) or similar to see what you are actually getting? Also, please edit the code so that it is at least readable. See https://teratail.com/help/question-tips#questionTips3-5-1 and https://teratail.com/help#about-markdown for reference.

shiratamadango

2020/03/26 14:59

Thank you for pointing that out.

1 answer

Best answer

To get the URL, you need to go one step further and pull out the a tag inside the span.
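The None values come from calling get('href') on the span elements themselves, which have no href attribute. As a minimal offline sketch of that behavior, the snippet below parses the HTML fragment from the question (with the img tag left out for brevity, and the variable names span_href and link_href chosen here for illustration) instead of fetching the live page.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (the <img> tag is omitted for brevity).
html = (
    '<div><span class="HorseName">'
    '<a title="アイスバブル" href="https://db.netkeiba.com/horse/2015104689" '
    'target="_blank">アイスバブル</a>'
    '</span></div>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find(class_="HorseName")

# The matched element is the <span>, which has no href attribute of its own,
# so .get("href") falls back to None.
span_href = span.get("href")

# The href lives on the <a> nested inside the span.
link_href = span.a["href"]

print(span.name, span_href)
print(link_href)
```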

Python
from bs4 import BeautifulSoup
import urllib.request as req

url = "https://race.netkeiba.com/race/shutuba.html?race_id=202006030111&rf=race_list"
res = req.urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
url_items = soup.select('.HorseName a')
for a in url_items:
    print(a['href'])
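If you want the links as a list rather than printed output, something like the following might work. This is only a sketch: extract_horse_links is a hypothetical helper name, the sample HTML is made up for illustration (only the two db.netkeiba.com URLs come from the question), and the selector uses a[href] so that an anchor without an href is skipped instead of raising KeyError.

```python
from bs4 import BeautifulSoup


def extract_horse_links(html):
    """Return the href of every <a> nested under an element with class HorseName."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" matches only anchors that actually carry an href attribute,
    # so a["href"] below cannot raise KeyError.
    return [a["href"] for a in soup.select(".HorseName a[href]")]


# Made-up sample page: two matching links, one link under another class,
# and one anchor without an href.
sample = """
<div><span class="HorseName"><a href="https://db.netkeiba.com/horse/2015104689">アイスバブル</a></span></div>
<div><span class="HorseName"><a href="https://db.netkeiba.com/horse/2015105090">レッドレオン</a></span></div>
<div><span class="Other"><a href="https://example.com/ignored">ignored</a></span></div>
<div><span class="HorseName"><a>no link</a></span></div>
"""

links = extract_horse_links(sample)
print(links)
```

For the live page, the same function could be called with the response from req.urlopen(url) in place of sample.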