Extracting the body text of a Medium article with BeautifulSoup

Hello. I am doing web scraping with Python.
For example, take this article:
https://medium.com/cotinetwork/coti-newsletter-september-20th-a9cac08e22df
I want to extract only the body text from it.

Looking at the HTML, I see:

```html
<p id="8149" class="hy hz ct ia b ib ic id ie if ig ih ii ij ik il im in io ip iq ir is it iu iv cl dq" data-selectable-paragraph="">Following the <a class="el iw" rel="noopener" href="/cotinetwork/beyond-payments-cotis-growth-plan-to-become-a-next-gen-financial-ecosystem-2907949df002">announcement</a> of COTI’s growth plan, various media outlets have provided coverage on COTI’s roadmap to become a next-generation financial ecosystem. COTI was recently featured on <a href="https://www.crypto-news-flash.com/coti-update-explains-native-tokens-recent-rally/" class="el iw" target="_blank" rel="noopener ugc nofollow">Crypto New Flash</a>, <a href="https://coinquora.com/coti-outlines-plans-for-the-future-as-native-token-soars/" class="el iw" target="_blank" rel="noopener ugc nofollow">CoinQuora</a>, and <a href="https://u.today/coti-releases-growth-plan-treasury-stablecoin-factory-and-ecosystem" class="el iw" target="_blank" rel="noopener ugc nofollow">U.TODAY.</a></p>
```
… (rest omitted)

so I can tell that the class name of the body text is
hy hz ct ia b ib ic id ie if ig ih ii ij ik il im in io ip iq ir is it iu iv cl dq
but is there a way to obtain it with BeautifulSoup?

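[What I tried]
① contents = soup.find_all('p', class_="hy hz ct ia b ib ic id ie if ig ih ii ij ik il im in io ip iq ir is it iu iv cl dq")
This does retrieve the body text. However, I want Python to obtain the class name itself automatically.
② a = soup.select("body > div > div > div:nth-child(3) > article > div > div > section > div:nth-child(3) > div > p")
I also considered CSS selector paths like this one, but the nesting differs from blog to blog and from article to article, so I cannot come up with a reliable selector.
③ I also tried feedparser.
feeds.entries[0].content
This returned the body text with its tags, but it includes extra elements such as <figure>, <li>, and <img> tags, which I want to remove. For that reason I think BeautifulSoup would be the more efficient choice.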

Any advice would be appreciated. Libraries other than BeautifulSoup are fine too.

Thank you in advance.


1 answer

Best answer

First, use soup.select_one() to get a single p element and read the value of its class attribute. Then use that value to extract the body text.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://medium.com/cotinetwork/coti-newsletter-september-20th-a9cac08e22df'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Take the first p element that has both id and class attributes and
# rebuild its class attribute as a single space-separated string
cls_name = ' '.join(soup.select_one('p[id][class]').get('class'))
# Collect the text of every p element whose class attribute equals that string
text = '\n'.join(i.text for i in soup.select(f'p[class="{cls_name}"]'))

print(text)

# Output (line breaks added for readability):
# Following the announcement of COTI’s growth plan, various media outlets have provided coverage
# on COTI’s roadmap to become a next-generation financial ecosystem. COTI was recently featured on
# Crypto New Flash, CoinQuora, and U.TODAY.
```

Note: the value of the class attribute apparently differs slightly from one p tag to the next, so the code above retrieves only the first paragraph.

```html
class="hy hz ct ia b ib ic id ie if ig ih ii ij ik il im in io ip iq ir is it iu iv cl dq"
```

Matching on only the first 8 characters (hy hz ct) extracts 4626 characters, which looks like the entire body text.

```python
text = '\n'.join(i.text for i in soup.select(f'p[class*="{cls_name[:8]}"]'))
```
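As a possible alternative that does not depend on the generated class names at all: the p tag quoted in the question also carries a data-selectable-paragraph attribute. The sketch below is not part of the original answer. It assumes that every body paragraph has this attribute, which only the single quoted tag confirms, and it runs on made-up sample markup (SAMPLE_HTML, the shortened class strings, and the second id are all hypothetical) instead of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the <p> tag quoted in the question.
# The class strings, the second id, and the caption are made up for illustration.
SAMPLE_HTML = """
<article>
  <p id="8149" class="hy hz ct ia dq" data-selectable-paragraph="">First paragraph.</p>
  <figure><figcaption class="jk">Caption</figcaption></figure>
  <p id="1b2c" class="hy hz ct ia jc" data-selectable-paragraph="">Second paragraph.</p>
</article>
"""


def extract_paragraphs(html):
    """Return the text of every <p> that carries a data-selectable-paragraph attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # [attr] matches on the presence of the attribute, whatever its value is
    return [p.get_text() for p in soup.select("p[data-selectable-paragraph]")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print("\n".join(paragraphs))
```

To try it on the real article, pass r.content from the code above to extract_paragraphs(). Whether it then returns the whole body depends on the assumption about the attribute holding for that page.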

Posted 2021/11/23 11:03

Edited 2021/11/23 12:10

melian

Total score: 20655

Aya_K

2021/11/23 12:34

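Great, it worked. Thank you! However, this does not seem to be the whole body text. The reason is that the class name of the first paragraph differs slightly from that of the second and later paragraphs. Is it possible to get the whole body?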

melian

2021/11/23 12:44

Yes. Please change text = '\n'.join(i.text for i in soup.select(f'p[class="{cls_name}"]')) to the following: text = '\n'.join(i.text for i in soup.select(f'p[class*="{cls_name[:8]}"]')) This selects the p tags whose class attribute contains "hy hz ct" (apparently the later part of the string is what differs).
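The difference between the two selectors in this comment can be checked offline. The following is a minimal sketch on made-up markup (SAMPLE_HTML, the shortened class strings, and all ids except 8149 are hypothetical). It also shows the standard CSS prefix selector [class^=...], which is not used in the thread but may be a slightly stricter choice, because [class*=...] matches the substring anywhere in the attribute value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two body paragraphs whose class strings share a prefix,
# plus one unrelated paragraph that merely contains the same substring.
SAMPLE_HTML = """
<p id="8149" class="hy hz ct ia dq">First paragraph.</p>
<p id="1b2c" class="hy hz ct ia jc">Second paragraph.</p>
<p id="9f3e" class="zz hy hz ct">Unrelated paragraph.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same idea as in the answer: read the class string of the first matching <p>
cls_name = " ".join(soup.select_one("p[id][class]").get("class"))
prefix = cls_name[:8]  # first 8 characters of the class string

# [class="..."]  : the whole attribute value must be equal
exact = [p.get_text() for p in soup.select(f'p[class="{cls_name}"]')]
# [class*="..."] : the attribute value contains the substring anywhere
contains = [p.get_text() for p in soup.select(f'p[class*="{prefix}"]')]
# [class^="..."] : the attribute value starts with the string
starts_with = [p.get_text() for p in soup.select(f'p[class^="{prefix}"]')]

print("exact:      ", exact)
print("contains:   ", contains)
print("starts with:", starts_with)
```

On this sample the exact match finds only the first paragraph, the substring match also picks up the unrelated paragraph, and the prefix match returns just the two body paragraphs. Whether the prefix stays stable on real Medium pages is an assumption that would need to be checked per article.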

Aya_K

2021/11/23 13:27

Thank you so much for the detailed explanation! I will post again if I have further questions.
