I want to extract text with BeautifulSoup in Python

Background / what I want to achieve

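I am trying to extract a specific piece of text from a web page with BeautifulSoup in Python, but it is not working.
The result also includes HTML markup that I do not need.

Any guidance would be appreciated.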

Problem / error message

Relevant source code

python
#!/usr/bin/env python

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome("/test/test/chromedriver")
try:
    # Open the page
    driver.get('https://test/')
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    for shopList in soup.find_all('div', class_='shop'):
        results = soup.find_all("div", class_="shop_name")
        print(results)
except Exception as e:
    print("【取得エラー】")  # "retrieval error"

With the code above, the output looks like the following.
From this output, I want to extract only the "取得したい文字" part (the text I want).

<div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>, <div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>, <div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>

Additional information (framework/tool versions, etc.)

Python 3.7


2 answers

Best answer

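The code in the question calls find_all('div', class_='shop'), so I will answer on the assumption that
the HTML contains <div class="shop"> elements,
and, because find_all() is used, that there are several of them.

Inside the for loop, the code calls soup.find_all('div', class_='shop_name'),
but that makes the loop variable shopList pointless, because every iteration searches the whole document again.
Inside the loop, call find_all on shopList instead of on soup.

That gives you, for each div with class shop, the div elements with class shop_name that it contains.
Looping over those and reading .text yields "取得したい文字 test" (with surrounding whitespace and newlines),
so calling split() and taking index [0] returns the text you are after.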

python
from bs4 import BeautifulSoup

# Sample HTML modeled on the output shown in the question
target = '''<div class="shop">
<div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>, <div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>, <div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>
</div>'''


soup = BeautifulSoup(target, 'html.parser')
for shopList in soup.find_all('div', class_='shop'):
    # Search inside each shop element, not the whole document
    results = shopList.find_all("div", class_="shop_name")
    for result in results:
        # split() breaks on any whitespace; [0] is the first token
        print(result.text.split()[0])
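
One caveat with split()[0]: it keeps only the first whitespace-separated token, so a name that itself contains spaces would be cut short. The following is a minimal sketch of an alternative that takes the first text fragment of each element via stripped_strings. The second shop name ("Shop With Spaces") and the helper first_text are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the second name is a hypothetical
# example added to show what happens when a name contains spaces.
html = '''<div class="shop">
<div class="shop_name">
    取得したい文字<br/>
<span class="en">test</span>
</div>
<div class="shop_name">
    Shop With Spaces<br/>
<span class="en">test</span>
</div>
</div>'''


def first_text(tag):
    """Return the first non-empty, stripped text fragment inside tag."""
    return next(tag.stripped_strings, "")


soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="shop_name")

names = [first_text(div) for div in divs]
truncated = [div.text.split()[0] for div in divs]

print(names)      # ['取得したい文字', 'Shop With Spaces']
print(truncated)  # ['取得したい文字', 'Shop']
```

Whether this matters depends on the real page; if the names never contain whitespace, both approaches give the same result.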

Posted 2020/10/19 04:11

nto

Total score 1438

takahiro00

2020/10/27 12:12

That solved it. Thank you.


Perhaps print(results.contents[0].strip())?
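
A minimal sketch of how this suggestion might be applied, assuming results is the list-like value returned by find_all() as in the question. In that case .contents has to be read from each element rather than from results itself. It also assumes the first child of each div is a text node, which holds for the sample output in the question but may not hold for every page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
html = '''<div class="shop_name">
            取得したい文字<br/>
<span class="en">test</span>
</div>'''

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", class_="shop_name")

# contents[0] is the text node before <br/>; strip() removes the
# surrounding newline and indentation.
names = [div.contents[0].strip() for div in results]
print(names)  # ['取得したい文字']
```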

Posted 2020/10/15 14:23

otn

Total score 85901

takahiro00

2020/10/18 01:51

Thank you for the answer. I tried it, but I end up with the "取得エラー" (retrieval error) message.

otn

2020/10/18 02:18

Please post the specific error.

toast-uz

2020/10/18 05:11

It is probably this: AttributeError: ResultSet object has no attribute 'contents'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()? To the asker: a similar question and answer already exists here. https://teratail.com/questions/165149
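
The error described above can be reproduced without Selenium. This is a minimal sketch, assuming markup like the sample in the question: it prints the exception instead of a fixed message, so the cause is visible, and then uses find() to get a single element. The exact wording of the message may differ between BeautifulSoup versions.

```python
from bs4 import BeautifulSoup

# One-element sample modeled on the question's output.
html = '<div class="shop_name">取得したい文字<br/><span class="en">test</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, which has no .contents.
results = soup.find_all("div", class_="shop_name")
error_message = None
try:
    results.contents
except AttributeError as e:
    # Print the real exception rather than a fixed string.
    error_message = str(e)
    print(f"{type(e).__name__}: {e}")

# find() returns a single Tag (or None when nothing matches).
first = soup.find("div", class_="shop_name")
name = first.contents[0].strip() if first is not None else None
print(name)  # 取得したい文字
```

Catching a broad Exception and printing only a fixed string, as the question's code does, hides this kind of message; printing the exception object is usually enough to see what went wrong.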
