Extracting href attribute values from the HTML at a given URL (scraping)

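Background / what I want to achieve

I want to extract the URLs of the "next article" and "previous article" links from the following Asahi Shimbun page.
https://www.asahi.com/articles/DA3S14526484.html?iref=pc_rensai_article_long_16_prev

Problem / error message

I am trying to extract them with BeautifulSoup4, but neither select nor find_all returns them.
Am I doing something wrong?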

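Relevant source code

Python
import requests
from bs4 import BeautifulSoup

url = "https://www.asahi.com/articles/DA3S14526484.html?iref=pc_rensai_article_long_16_prev"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# select() returns a list, so select_one() is used to get a single element.
# In my case neither element is found, which is the problem I am asking about.
prev_link = soup.select_one("#PrevLink > div > a")
next_link = soup.select_one("#NextLink > div > a")
print(prev_link.get("href") if prev_link else None)
print(next_link.get("href") if next_link else None)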

What I tried

I also tried lxml's xpath and find_all, but could not retrieve the links.


3 answers

Best answer

With Selenium, I was able to get them with the following code.
Reference: https://qiita.com/Azunyan1111/items/b161b998790b1db2ff7a

python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a window.
# (set_headless() and the chrome_options= keyword are deprecated in newer Selenium releases.)
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.asahi.com/articles/DA3S14526484.html?iref=pc_rensai_article_long_16_prev")
    # page_source holds the DOM after the page's JavaScript has run
    html = driver.page_source
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.find(id="NextLink").find("a")["href"])
print(soup.find(id="PrevLink").find("a")["href"])

Posted 2020/06/28 22:12

Penpen7

Total score 698

onegai

2020/06/28 23:38

I had never heard of Selenium before! Thank you!


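Here is a method that does not use Selenium.
In this case BeautifulSoup4 is not needed either.

How to investigate where the target information comes from

1. Open the Network tab of the browser's developer tools.
2. In the search box at the top left of the panel, search for a keyword related to the DOM element that displays the target information. Here I searched for the DOM id "NextLink".
3. The search results, showing the content of the resources the browser requested, appear just below the search box. Click a result.
4. The row for the corresponding request in the timeline is then highlighted in dark gray. Click it.
5. Click the [Preview] tab.
6. Press Ctrl + F to open a search box, search again for a keyword related to the DOM element, and read through the code patiently.

Sooner or later you will find the place where the target information is fetched.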
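Before going through the Network tab, it can help to confirm that the links really are absent from the HTML the server returns and only appear after JavaScript runs. The following is a minimal standard-library sketch of that check. The class name `LinkFinder`, the helper `find_links`, and both HTML snippets are hypothetical stand-ins made up for illustration, not the real page markup, and the depth tracking assumes well-formed, properly closed tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # 0 means we are outside the target element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the target element when its id matches.
            if attrs.get("id") == self.element_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def find_links(html, element_id):
    finder = LinkFinder(element_id)
    finder.feed(html)
    return finder.hrefs


# Hypothetical markup: what the server might send vs. what the browser shows later.
static_html = '<div id="NextLink"></div><div id="PrevLink"></div>'
rendered_html = '<div id="NextLink"><div><a href="/articles/X.html">next</a></div></div>'

print(find_links(static_html, "NextLink"))    # []
print(find_links(rendered_html, "NextLink"))  # ['/articles/X.html']
```

If the same check on `requests.get(url).text` comes back empty while the browser shows a link, the link is being filled in by a script, and the steps above are the way to find which request supplies it.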

Code

python
import re
import requests

# URL of the current page
url_current_page = "https://www.asahi.com/articles/DA3S14526484.html?iref=pc_rensai_article_long_16_prev"

# URL of the series (rensai) information
#
# To generate this URL exactly, port the following code to Python (omitted here):
# https://www.asahicom.jp/js/asahi-article.js, line 73
# this.rensaiJson = "/rensai/json/" + rensaiId +".json?" + Math.floor((new Date()).getTime() / (1000 * 60 * 60 * 4));//4h
url_page_list = "https://www.asahi.com/rensai/json/da16.json?110650"

# Get the article ID
#
# Ported to Python from the following code:
# https://www.asahicom.jp/js/asahi-article.js, line 371
# // function that returns the article ID
# getKijiid:function(path,suffix){
#   let b = path.replace(/^.*[\/\\]/g, '');
#   if (typeof(suffix) == 'string' && b.substr(b.length - suffix.length) == suffix) {
#     b = b.substr(0, b.length - suffix.length);
#   }
# return b;
# },
# Drop everything up to the last slash, then cut at ".html" (this also drops the query string)
path_current_page = re.sub(r'^.*/', '', url_current_page)
index_last = path_current_page.rfind(".html")
page_id_current = path_current_page[:index_last]
print(page_id_current)

# Request the JSON and load it as a dict
res_page_list = requests.get(url_page_list)
page_list = res_page_list.json()

# Find the pages immediately before and after the current one
#
# Ported to Python from the following code:
# https://www.asahicom.jp/js/asahi-article.js, line 95
# $.each(articleItems,function(i,data){
#   if(articleItems[i].id == articleId){
#     nowPage = articleItems[i];
#   }else{
#     if(nowPage){
#       prevPage = articleItems[i];
#       return false;
#     }else{
#       nextPage = articleItems[i];
#     }
#   }
# });
now_page = None
prev_page = None  # stays None if the current page is the oldest item
next_page = None  # stays None if the current page is the newest item
for item in page_list["items"]:
    if item["id"] == page_id_current:
        now_page = item
    elif now_page is not None:
        # First item after the current page: the previous article
        prev_page = item
        break
    else:
        # Last item seen before the current page: the next article
        next_page = item

print(prev_page)
print(now_page)
print(next_page)
Output:

console
$ pipenv run python test.py
DA3S14526484
{'count': 1497, 'description': 'ちぐはぐな対応に不信を抱いた人も多いのではないか。政府の新型コロナウイルス対策に医学的見地から助言してきた専門家会議を廃止し、新たな会議体を設けると、おととい西村康稔担当相が表明した。改組することに異', 'id': 'DA3S14526483', 'image': None, 'limited': None, 'release_date': '20200626050000', 'title': '（社説）専門家会議\u3000最後の提言\u3000政府は胸に'}
{'count': 1498, 'description': '中国とインドとの長年に及ぶ国境問題が再燃している。ともに約１４億人の人口を抱え、核兵器を保有する大国である。ぶつかり合えば、世界全体を揺るがしかねない。両国は最大限の自制に努めるべきだ。中国西部チベッ', 'id': 'DA3S14526484', 'image': None, 'limited': None, 'release_date': '20200626050000', 'title': '（社説）中国とインド\u3000成長大国の責任自覚を'}
{'count': 1499, 'description': '不透明さを批判されている政府事業の民間への委託について、問題の事業を所管する経済産業省が、有識者による改善策の検討を始めた。経産省が委託のやり方を見直すのは当然である。しかし委託契約は他の省庁でも行わ', 'id': 'DA3S14527841', 'image': None, 'limited': None, 'release_date': '20200627050000', 'title': '（社説）民間への委託\u3000統一ルールが必要だ'}
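Two things are left open above: the query value in the JSON URL is hardcoded, and the result is printed as dicts rather than URLs. Here is a rough sketch that fills both gaps. The function names `build_rensai_json_url` and `article_url` are my own, the series ID "da16" is simply taken from the URL used above, and the `/articles/<id>.html` pattern is an assumption based on the URL in the question, so check it against the real site before relying on it.

```python
import time

RENSAI_JSON_BASE = "https://www.asahi.com/rensai/json/"
FOUR_HOURS_MS = 1000 * 60 * 60 * 4


def build_rensai_json_url(rensai_id, now_seconds=None):
    """Python version of the JavaScript expression quoted in the code comment.

    The query value is the current time in milliseconds divided by four hours
    and rounded down, so it changes once every four hours.
    """
    if now_seconds is None:
        now_seconds = time.time()
    bucket = int(now_seconds * 1000) // FOUR_HOURS_MS
    return f"{RENSAI_JSON_BASE}{rensai_id}.json?{bucket}"


def article_url(item):
    """Turn one JSON item into an article URL.

    Assumption: articles live at /articles/<id>.html, as in the question's URL.
    """
    return f"https://www.asahi.com/articles/{item['id']}.html"


# 1593360000 is a timestamp on 2020-06-28, around when this answer was written.
print(build_rensai_json_url("da16", now_seconds=1593360000))
# https://www.asahi.com/rensai/json/da16.json?110650
print(article_url({"id": "DA3S14526483"}))
# https://www.asahi.com/articles/DA3S14526483.html
```

Passing prev_page and next_page from the code above to `article_url` would then give the two URLs the question asks for, as long as neither is None.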