On a page that only reveals more content as you scroll, some element texts come back empty when I read them after scrolling all the way down.

Environment

Windows 10
Python 3.7.9 64-bit

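What I want to achieve

On Reddit, I would like to get the number of upvotes (likes) from the cats subreddit (the cat community) page after scrolling it.

What I did

Here is the actual code. It is somewhat verbose, but I would appreciate it if you could take a look.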

python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from webdriver_manager.chrome import ChromeDriverManager
import time

# Defined but not used below; the driver binary comes from ChromeDriverManager.
chrome_path = r"C:\Users\hoge\Downloads\chromedriver_win32 (4)\chromedriver.exe"
options = Options()
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ['enable-automation'])
driver = webdriver.Chrome(ChromeDriverManager().install(), options = options)
# Built but never passed to the driver.
capabilities = DesiredCapabilities.CHROME.copy()
capabilities["acceptInsecureCerts"] = True

url = "https://www.reddit.com/r/cats/new/"

def change_view():
    driver.get(url)
    # Open the layout switcher
    change_view_xpath = '//*[@id="LayoutSwitch--picker"]'
    button = driver.find_element(By.XPATH, change_view_xpath)
    button.click()
    time.sleep(2)

    # The menu is attached to an unknown top-level div, so try each index
    # (XPath positions start at 1).
    for i in range(1, 30):
        try:
            change_view_xpath_span = "/html/body/div[{}]/div/button[3]".format(i)
            button_second = driver.find_element(By.XPATH, change_view_xpath_span)
            button_second.click()
            time.sleep(3)
            break
        except Exception:
            pass

def scrolling():
    SCROLL_PAUSE_TIME = 0.5
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    n = 0
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Stop after 10 unchanged checks, or once the page is tall enough
            n += 1
            if n == 10:
                break
            if new_height >= 10000:
                break
        else:
            print(new_height)
            n = 0
        last_height = new_height
    time.sleep(5)

def get_score():
    score_lis = []
    score_class = "_1rZYMD_4xY3gRcSS3p8ODO._25IkBM0rRUqWX5ZojEMAFQ"
    score_elems = driver.find_elements(By.CLASS_NAME, score_class)
    for score_elem in score_elems:
        score = score_elem.text
        score_lis.append(score)
    # The message means "number of scores retrieved"
    print(str(len(score_elems)), " 取得できたscoreの数")
    print(score_lis)

change_view()
scrolling()
get_score()

Running this scrolls the cats subreddit a little and then collects the upvote counts. The output is as follows.

175  取得できたscoreの数
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '13', '39', '8', '5', '55', 
'0', '2', '43', '0', '33', '21', '14', '19', '23', '13', '5', '50', '32', '13', '17', '18', '19', '17', '1', '24', '39', '9', 
'15', '13', '27', '31', '21', '13', '17', '15', '15', '23', '19', '11', '10', '20', '19', '45', '14', '1', '9', '16', '12', '8', '49', '27', '17', '14', '20', '15', '7', '18', '25', '17', '12', '15', '1', '35', '17', '24', '20', '29', '16', '27', '26', '2', '2']

What I investigated

If I run the code above without scrolling (with the call to scrolling() disabled), the output is as follows.

25  取得できたscoreの数
['Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote', 'Vote']

The elements that came back empty when I scrolled first do return text when I do not scroll.
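
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Selenium's `.text` returns only text that is currently rendered as visible, so elements the page hides once they scroll out of view would yield an empty string. The sketch below falls back to the DOM `textContent` property, which does not depend on visibility. `read_scores` is a hypothetical helper name, not part of the original code.

```python
def read_scores(elements):
    """Return each element's text, falling back to textContent when .text is empty."""
    scores = []
    for elem in elements:
        text = elem.text
        if not text:
            # textContent is read from the DOM and does not depend on visibility
            text = elem.get_attribute("textContent") or ""
        scores.append(text.strip())
    return scores
```

Inside `get_score()` it could be called as `read_scores(score_elems)`. Whether this recovers the missing values depends on whether the page keeps the text in the DOM for off-screen posts, which has not been verified here.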

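I considered scrolling, collecting the URLs of all posts, and then reading the upvote count from each post's page, but I suspected there might be a better solution. I also searched for things like

python scraping reddit good score
but could not find a site that seemed useful, so I am asking here.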

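Question

How can I scroll the page and still get every upvote count as text?
I would appreciate help from anyone knowledgeable.

If the only option is to collect the URLs of all posts and read the upvote count from each post's page, I will redo it that way.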


1 answer

Best answer

Using the API is another option.

reddit.com: API documentation

Below is a code example that retrieves the upvote counts of the latest 500 posts.

python
import requests

url = 'https://www.reddit.com/r/cats/new.json'
headers = {'User-Agent': 'vitalflux-pybot/0.0.1'}
after = None
upvotes = []
limit = 100 # max 100 items
n_times = 5 # max 30 requests in 60 seconds

for i in range(n_times):
    # Pass the cursor from the previous page so the next request continues from it
    qs = f'limit={limit}' + (f'&after={after}' if after else '')
    r = requests.get(f'{url}?{qs}', headers=headers)
    r.raise_for_status()
    js = r.json()
    after = js['data']['after']
    data = js['data']['children']
    upvotes += [d['data']['ups'] for d in data]
    if after is None:
        # No further page; requesting again would return the first page
        break

print(upvotes)

Addendum

It appeared that the latest 100 posts were being fetched five times.

Please try the following script.

python
import requests

url = 'https://www.reddit.com/r/cats/new.json'
headers = {'User-Agent': 'vitalflux-pybot/0.0.1'}
after = None
limit = 100 # max 100 items
n_times = 5 # max 30 requests in 60 seconds

base_url = 'https://www.reddit.com'
lis = []
for i in range(n_times):
    # Pass the cursor from the previous page so the next request continues from it
    qs = f'limit={limit}' + (f'&after={after}' if after else '')
    r = requests.get(f'{url}?{qs}', headers=headers)
    r.raise_for_status()
    js = r.json()
    after = js['data']['after']
    data = js['data']['children']
    for d in data:
        # Keep the upvote count together with the post's full URL
        dst = [d['data']['ups'], base_url + d['data']['permalink']]
        lis.append(dst)
    if after is None:
        # No further page; requesting again would return the first page
        break

print(lis)
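
The paging logic can also be separated from the HTTP call, which makes it easier to check offline. The following is a minimal sketch under stated assumptions, not a definitive implementation: `fetch_page`, `collect_posts`, and the 2-second `pause` are hypothetical names and values, and the standard library's `urllib` is swapped in for `requests`. It skips any post whose `name` (the post's fullname, the same kind of value the `after` cursor holds) was already collected, and stops once `after` comes back as `None`.

```python
import json
import time
import urllib.request

LISTING_URL = "https://www.reddit.com/r/cats/new.json"
HEADERS = {"User-Agent": "vitalflux-pybot/0.0.1"}


def fetch_page(after=None, limit=100):
    """Fetch one page of the listing and return the parsed JSON (network request)."""
    qs = f"limit={limit}" + (f"&after={after}" if after else "")
    req = urllib.request.Request(f"{LISTING_URL}?{qs}", headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def collect_posts(fetch, max_pages=5, pause=2.0):
    """Follow the 'after' cursor for up to max_pages pages, skipping repeated posts."""
    seen = set()
    posts = []
    after = None
    for page in range(max_pages):
        listing = fetch(after)["data"]
        for child in listing["children"]:
            d = child["data"]
            if d["name"] in seen:  # already collected on an earlier page
                continue
            seen.add(d["name"])
            posts.append((d["ups"], d["permalink"]))
        after = listing["after"]
        if after is None:  # no further page
            break
        if page < max_pages - 1:
            time.sleep(pause)  # space out requests between pages
    return posts
```

A real run would be `collect_posts(fetch_page)`. Passing a stub function instead of `fetch_page` lets the cursor handling be tested without contacting Reddit.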

Posted 2022/05/18 20:40

Edited 2022/05/19 09:43

melian

Total score 21265

dd_

2022/05/19 08:30 (edited)

melian, thank you as always for your answer. I appreciate the code and the link to the documentation.

dst = []
base_url = 'https://www.reddit.com'
.....
for d in data:
    dst = [d['data']['ups'], base_url + d['data']['permalink']]
    lis.append(dst)

I am not sure this is correct, but when I added the code above and looked at the retrieved data, it appeared that the latest 100 posts were being fetched five times. I also read through the documentation, which says

limit the maximum number of items desired (default: 25, maximum: 100)

Does this mean that this API can only return the latest 100 posts at most?
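
One way to check that observation is to count how many distinct permalinks the collected rows contain. This is a small sketch; `summarize_duplicates` is a hypothetical helper name, and it assumes rows shaped like the `[ups, permalink]` pairs appended to `lis` above.

```python
from collections import Counter


def summarize_duplicates(rows):
    """Return (total rows, distinct permalinks, permalinks seen more than once)."""
    counts = Counter(permalink for _ups, permalink in rows)
    repeated = sorted(p for p, c in counts.items() if c > 1)
    return len(rows), len(counts), repeated
```

Calling `summarize_duplicates(lis)` after the loop would show whether the five requests returned 500 different posts or the same 100 posts repeated.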