I can't figure out how to specify a CSS selector for web scraping

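I'm a beginner.
I want to use Python's requests to get the link stored in the value attribute shown in the image below, but
I don't know how to write the CSS selector.
div.link_value="value" did not work.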

import requests
import lxml.html

response = requests.get('hoge.com')  # placeholder URL
root = lxml.html.fromstring(response.content)
# The selector below is the attempt that did not work
for a in root.cssselect('div.link_value="value"'):
    value = a.get("value")
    print(value)

Environment

macOS 10.14.1
Python 3.7.2

Please tell me how to specify this with a CSS selector.


2 answers

Best answer

Python
root.cssselect('div input')

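Doesn't that work? It would help to see the full HTML or the URL.

The easiest way is to go to the element whose CSS selector you want in the browser's developer tools and choose

Right-click > Copy > CSS Selector

which should copy the CSS selector for you.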

Posted 2019/01/22 02:40

yamato_user

Total score: 2321

kazu130

2019/01/22 02:45

Thank you for your answer. ('div input') did not work. The URL is https://www.db.yugioh-card.com/yugiohdb/card_list.action and I want to scrape the links on this page. I also tried Right-click > Copy > CSS Selector, but that selector was rejected as well.

yamato_user

2019/01/22 02:58

Beautiful Soup is the most commonly used HTML parser, so I used that instead. Also, in a CSS selector copied from Chrome DevTools, each nth-child(n) has to be rewritten as nth-of-type(n) (Beautiful Soup releases before 4.7 only implement nth-of-type in select()). With that in mind, here is the code:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.db.yugioh-card.com/yugiohdb/card_list.action')
soup = BeautifulSoup(response.text)
hit = soup.select("#card_list_1 > table > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > div.list_body > div:nth-of-type(2) > div:nth-of-type(1)")
print(hit)

I was able to scrape the page with this.
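
The nth-of-type rewrite described in this comment can be tried offline on a small document. This is only a sketch: the `#menu` list and its links are made-up sample data, and the built-in `html.parser` is used so that nothing beyond `bs4` is required.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; not taken from the real page.
html = """
<ul id="menu">
  <li><a href="/a">A</a></li>
  <li><a href="/b">B</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# DevTools would typically emit "li:nth-child(2)"; the nth-of-type form
# counts only <li> siblings and selects the same element here.
hit = soup.select("#menu > li:nth-of-type(2) > a")
hrefs = [a["href"] for a in hit]
print(hrefs)  # ['/b']
```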

kazu130

2019/01/22 03:13

Thank you for the detailed reply. When I ran the code above, I got the following warning:

python list2.py
list2.py:4: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 4 of the file list2.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

I do have lxml installed, though... I'll look into it a bit.

yamato_user

2019/01/22 03:16

Try changing it to something like this: soup = BeautifulSoup(response.text, 'html.parser')

yamato_user

2019/01/22 03:17

Or like this: soup = BeautifulSoup(response.text, features="lxml")

kazu130

2019/01/22 03:19

Thank you for the quick response. After making that change, the output was []. Am I missing some installation?

yamato_user

2019/01/22 03:22

Ah, my apologies. This should probably work! soup = BeautifulSoup(response.text, 'html5lib')

kazu130

2019/01/22 03:30

Thank you. When I run that, I get bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library? Does that mean I need to install a parser library?

yamato_user

2019/01/22 03:32

Try running pip install html5lib

kazu130

2019/01/22 03:37

Thank you!! I was able to get it!! I'm really grateful.
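
A likely explanation for the empty list earlier in this thread: browsers insert a `<tbody>` into every table in the DOM, so a selector copied from DevTools contains `tbody` even when the HTML sent by the server does not. `html.parser` and `lxml` keep the source as written, while `html5lib` builds the tree the way a browser does. The sketch below uses made-up markup (it is an assumption that the real page omits `<tbody>` in its source) and shows the effect with `html.parser` alone.

```python
from bs4 import BeautifulSoup

# Made-up markup whose source has no <tbody>, as servers often send it.
html = (
    '<div id="card_list_1"><table><tr><td>'
    '<input class="link_value" value="/x.action?ope=1">'
    '</td></tr></table></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Selector as copied from a browser: it expects the <tbody> the browser added.
with_tbody = soup.select("#card_list_1 > table > tbody > tr input.link_value")

# Same target without relying on <tbody>.
without_tbody = soup.select("#card_list_1 > table tr input.link_value")

print(len(with_tbody), len(without_tbody))  # 0 1
```

Dropping `tbody` from the selector is therefore an alternative to switching parsers.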

kazu130

2019/01/22 03:48

One more thing: I'd also like to get the relative URLs in the value attributes under card_list_1. Is that possible as well? Sorry to keep asking.


Could you get it with something like the following?

Python
import lxml.html

res = """
<div class="pack pack_ja">
<input type="hidden" class="link_value" value="/hoge/huga.action?ope=1">
</div>
"""

root = lxml.html.fromstring(res)
ret = root.cssselect('div > .link_value')
for r in ret:
    print(r.attrib['value']) # /hoge/huga.action?ope=1
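
For the follow-up about relative URLs under card_list_1, the same extraction might look like this with Beautiful Soup, with `urllib.parse.urljoin` turning each relative path into an absolute URL. This is a sketch under assumptions: the sample markup is invented, and it is not confirmed that the real page nests its `link_value` inputs inside `#card_list_1`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the page the relative links would be resolved against.
BASE_URL = "https://www.db.yugioh-card.com/yugiohdb/card_list.action"

# Invented sample markup; the nesting under #card_list_1 is an assumption.
html = """
<div id="card_list_1">
  <div class="pack pack_ja">
    <input type="hidden" class="link_value" value="/hoge/huga.action?ope=1">
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the raw (relative) value attributes.
relative = [tag["value"] for tag in soup.select("#card_list_1 input.link_value")]

# Resolve each one against the page URL.
absolute = [urljoin(BASE_URL, path) for path in relative]

print(relative)  # ['/hoge/huga.action?ope=1']
print(absolute)  # ['https://www.db.yugioh-card.com/hoge/huga.action?ope=1']
```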