[python] I want to scrape the IDs of the papers embedded in a site, but I cannot get it to work

I am trying to scrape information using Google Chrome and Selenium.

Specifically, I want to get the IDs of all the papers listed on the site below.

Site:
https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press

As shown in the attached image, if I hover over a paper title in Google Chrome (e.g. "Enhanced molecular selectivity and plasticization resistance in ring-opened Tröger's base polymer membranes"), right-click, and choose "Inspect", the panel on the right shows what appears to be the ID for that paper.

The page lists several papers, and I want the IDs of all of them. How exactly can I get them?

I wrote the code below and looked at the source, but it differs from what the right-hand panel in the image shows, and the IDs I want do not appear anywhere in it.

If anyone knows how to get the IDs, I would appreciate your guidance.

My code

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Raw string so the backslashes in the Windows path are not read as escapes
path = r"C:\...\chromedriver.exe"
driver = webdriver.Chrome(executable_path=path)
url = 'https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press'
driver.get(url)
time.sleep(5)  # wait for the page to finish rendering
driver.page_source

Output produced by my code

<html lang="en-us"><head>\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta charset="utf-8">\n<title>Journal of Membrane Science | ScienceDirect.com by Elsevier</title>\n<meta data-react-helmet="true" name="SDTech" content="Proudly brought to you by the SD Technology team in London, Dayton, and Amsterdam"><meta data-react-helmet="true" name="description" content="Read the latest articles of Journal of Membrane Science at ScienceDirect.com, Elsevier’s leading platform of peer-reviewed scholarly literature"><meta data-react-helmet="true" name="robots" content="INDEX,FOLLOW,NOARCHIVE,NOODP,NOYDIR">\n<link data-react-helmet="true" rel="next" href="https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press?page=2"><link data-react-helmet="true" rel="canonical" href="https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press">\n<link rel="shortcut icon" href="https://sdfestaticassets-us-east-1.sciencedirectassets.com/shared-assets/16/images/favSD.ico" type="image/x-icon">\n<link rel="icon" href="https://sdfestaticassets-us-east-1.sciencedirectassets.com/shared-assets/16/images/favSD.ico" type="image/x-icon">\n<link href="https://sdfestaticassets-us-east-1.sciencedirectassets.com" rel="dns-prefetch">\n<link href="https://sdfestaticassets-us-east-1.sciencedirectassets.com" rel="preconnect" crossorigin="anonymous">\n<link href="https://smetrics.elsevier.com" rel="dns-prefetch">\n<link href="https://smetrics.elsevier.com" rel="preconnect" crossorigin="anonymous">\n<link href="https://assets.adobedtm.com" rel="dns-prefetch">\n<link href="https://assets.adobedtm.com" rel="preconnect" crossorigin="anonymous">\n<link rel="stylesheet" href="https://sdfestaticassets-us-east-1.sciencedirectassets.com/prod/815bab6b8c52e658d36e290d97a477ad50c70a24/style.css">\n<script src="https://www.googletagservices.com/activeview/js/current/osd.js?cb=%2Fr20100101"></script><script type="text/javascript" 
src="https://bam.nr-data.net/1/7ac4127487?a=1080559012&amp;sa=1&amp;v=1169.7b094c0&amp;t=Unnamed%20Transaction&amp;rst=1758&amp;ck=1&amp;ref=https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press&amp;be=967&amp;fe=1245&amp;dc=1245&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1620417641067,%22n%22:0,%22u%22:787,%22ue%22:793,%22f%22:4,%22dn%22:31,%22dne%22:32,%22c%22:32,%22s%22:41,%22ce%22:67,%22rq%22:68,%22rp%22:765,%22rpe%22:783,%22dl%22:797,%22di%22:928,%22ds%22:944,%22de%22:944,%22dc%22:960,%22l%22:961,%22le%22:984%7D,%22navigation

Deleted user

2021/05/08 11:00 (edited)

(Correction) The site's terms of use contain the following statement, so please take care not to do anything they prohibit.
https://www.elsevier.com/legal/elsevier-website-terms-and-conditions
> You may not use any robots, spiders, crawlers or other automated downloading programs, algorithms or devices, or any similar or equivalent manual process, to: (i) continuously and automatically search, scrape, extract, deep link or index any Content; (ii) harvest personal information from the Services for purposes of sending unsolicited or unauthorized material ...

taC-h

2021/05/08 10:44

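This is a DeepL translation, but: """ You may not do the following: (i) continuously and automatically search, scrape, extract, deep link, or index any content; (ii) collect personal information from the Services for the purpose of sending unsolicited or unauthorized material; or (iii) interfere with the operation of the Services or with others' use of the Services. If the Services contain robot exclusion files or robot exclusion headers, you agree to honor them and not to use any device, software, or routine to circumvent them. """ I think scraping is permitted as long as these clauses are observed.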
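
As a rough illustration of the "robot exclusion file" clause quoted above, the sketch below checks a URL against robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt content, the `my-bot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and exist only so the example runs offline; they are not the rules of the site in question, and passing such a check does not by itself settle what the terms of use allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; against a live site,
    # set_url(...) followed by read() would download the file instead.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/journal/articles"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/page"))
```

With the sample rules, the first URL is allowed and the second, under `/private/`, is not.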


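1 answer

Best answer

There are several possible reasons why the source looks different, but the most likely one is that Windows-specific character encoding is interfering when the source is output.
Once you have the correct source, locate the elements with a suitable lookup method and read each ID with WebElement.get_attribute("id").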

python
import time
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not read as escapes
path = r"C:\...\chromedriver.exe"
# executable_path is the Selenium 3 form; newer Selenium 4 releases take a Service object instead
driver = webdriver.Chrome(executable_path=path)
url = "https://www.sciencedirect.com/journal/journal-of-membrane-science/articles-in-press"
driver.get(url)
time.sleep(5)  # wait for the page to finish rendering

# Save the source (for checking)
with open("hoge.html", "w", encoding="UTF-8") as f:
    f.write(driver.page_source)

# Use whichever lookup method suits the page
class_name = "article-content-title"
# find_elements(By.CLASS_NAME, ...) replaces find_elements_by_class_name, which newer Selenium releases removed
articles = driver.find_elements(By.CLASS_NAME, class_name)

id_list = [a.get_attribute("id") for a in articles]
pprint(id_list)
"""output
['S0376738821003471',
 'S0376738821003562',
 'S0376738821003355',
 'S0376738821003446',
 'S0376738821003392',
 'S0376738821003501',
 'S0376738821003525',
 'S0376738821003513',
 'S0376738821003422',
 'S0376738821003367',
 'S0376738821003409',
 'S0376738821003458',
 'S0376738821003483',
 'S0376738821003252',
 'S0376738821002878',
 'S0376738821003343',
 'S0376738821003197',
 'S0376738821003057',
 'S0376738821003264',
 'S037673882100329X']
"""
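
To confirm that the saved source really contains the IDs, the file can also be inspected without a browser. The following is a minimal sketch using only the standard-library `html.parser`. It assumes, as in the code above, that each title element carries the class `article-content-title` and holds the ID in its `id` attribute. `TitleIdCollector`, `extract_ids`, and the sample markup are hypothetical names and data made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser
from typing import List


class TitleIdCollector(HTMLParser):
    """Collect the id attribute of every element carrying a given CSS class."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.ids: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # The class attribute may list several space-separated class names
        classes = (attr_map.get("class") or "").split()
        element_id = attr_map.get("id")
        if self.class_name in classes and element_id:
            self.ids.append(element_id)


def extract_ids(html: str, class_name: str = "article-content-title") -> List[str]:
    """Return the ids of all elements in html that have class_name."""
    collector = TitleIdCollector(class_name)
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical markup that only mimics the assumed structure of the page
SAMPLE_HTML = """
<ol>
  <li><a class="anchor article-content-title" id="S0376738821003471" href="#">First title</a></li>
  <li><a class="anchor article-content-title" id="S0376738821003562" href="#">Second title</a></li>
  <li><a class="anchor other" id="not-an-article" href="#">Something else</a></li>
</ol>
"""

print(extract_ids(SAMPLE_HTML))  # ['S0376738821003471', 'S0376738821003562']
```

To run it on the file written above, read `hoge.html` with `encoding="UTF-8"` and pass its contents to `extract_ids`. An empty result would suggest that the saved source does not contain the expected elements, for example because the page had not finished rendering.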