Trying to retrieve the chat from a YouTube Live archive with Python, but it isn't working

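Background / what I want to achieve

I want to retrieve the chat from a YouTube Live archive using Python.

I am using the code from the following site as is.
http://watagassy.hatenablog.com/entry/2018/10/08/132939

Problem / error message

After running the .py file from cmd, nothing is written to the text file (comment_data.txt).
Neither cmd nor Visual Studio Code reports any error message or problem.

I can't figure out how to solve this and am stuck.

Relevant source code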

py
from bs4 import BeautifulSoup
import json
import requests

target_url = "https://www.youtube.com/watch?v=xxxxx"  # URL of the video to fetch
dict_str = ""
next_url = ""
comment_data = []
session = requests.Session()
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# First, request the video page, get its HTML source, and find the first live_chat_replay URL
html = requests.get(target_url)
soup = BeautifulSoup(html.text, "html.parser")

for iframe in soup.find_all("iframe"):
    if("live_chat_replay" in iframe["src"]):
        next_url= iframe["src"]


while(1):

    try:
        html = session.get(next_url, headers=headers)
        soup = BeautifulSoup(html.text,"lxml")


        # Use find_all to locate the script holding the next URL's data, then trim it with split
        for scrp in soup.find_all("script"):
            if "window[\"ytInitialData\"]" in scrp.text:
                dict_str = scrp.text.split(" = " , 1)[1]

        # The text is JavaScript notation, so adjust it further: convert false and true to Python spelling
        dict_str = dict_str.replace("false","False")
        dict_str = dict_str.replace("true","True")

        # Treating it as a dict makes the data easy to read, but the trailing junk must go first (strip "two spaces + \n + ;")
        dict_str = dict_str.rstrip("  \n;")
        # Convert to a dict
        dics = eval(dict_str)

        # "https://www.youtube.com/live_chat_replay?continuation=" + continue_url is the next live_chat_replay URL
        continue_url = dics["continuationContents"]["liveChatContinuation"]["continuations"][0]["liveChatReplayContinuationData"]["continuation"]
        next_url = "https://www.youtube.com/live_chat_replay?continuation=" + continue_url
        # dics["continuationContents"]["liveChatContinuation"]["actions"] is the list of comment data. The first item is noise, so store [1:]
        for samp in dics["continuationContents"]["liveChatContinuation"]["actions"][1:]:
            comment_data.append(str(samp)+"\n")

    # Stop once next_url can no longer be obtained
    except:
        break

# Write the comment data to comment_data.txt
with open("comment_data.txt", mode='w', encoding="utf-8") as f:
    f.writelines(comment_data)

What I tried

I am using Python 3.8.3.
BeautifulSoup, requests, and lxml are installed.

pip list
Package Version

astroid 2.4.1
beautifulsoup4 4.9.1
bs4 0.0.1
certifi 2020.4.5.1
chardet 3.0.4
colorama 0.4.3
idna 2.9
isort 4.3.21
lazy-object-proxy 1.4.3
lxml 4.5.1
mccabe 0.6.1
pip 20.1.1
pylint 2.5.2
requests 2.23.0
selenium 3.141.0
setuptools 41.2.0
six 1.14.0
soupsieve 2.0.1
toml 0.10.1
urllib3 1.25.9
wrapt 1.12.1


2 answers

target_url = "https://www.youtube.com/live_chat_replay?continuation=xxxxxxxx"

This is what the code has, but shouldn't target_url hold the URL of the video page
(https://www.youtube.com/watch?v=xxxxxx〜)
instead?

Also,

dict_str = scrp.text.split(" = ")[1]

is what the code has, but if a chat message itself contains " = ", the text is split into three or more parts and eval may fail.
So shouldn't it be

dict_str = scrp.text.split(" = ", 1)[1]

instead?
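The difference is easy to see on a short string. The sample below is invented for illustration (it is not real YouTube output); it only assumes that a chat message may itself contain " = ".

```python
# Hypothetical script text: the assignment to split on, followed by
# a chat message that itself contains " = ".
script_text = 'window["ytInitialData"] = {"message": "1 + 1 = 2"};'

# Without maxsplit, every " = " is a split point, so index 1 is cut short.
parts_all = script_text.split(" = ")
print(len(parts_all))   # 3
print(parts_all[1])     # {"message": "1 + 1

# With maxsplit=1, only the first " = " is used and the payload stays whole.
parts_once = script_text.split(" = ", 1)
print(len(parts_once))  # 2
print(parts_once[1])    # {"message": "1 + 1 = 2"};
```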

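Posted 2020/05/20 23:19

Edited 2020/05/20 23:39

kotori_a

Total score 820

macarooon

2020/05/21 14:59

Thank you for your answer. I fixed the parts you pointed out, but the chat still isn't written to the text file...

kotori_a

2020/05/22 09:45 (edited)

I have added an answer with a revised version.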


Best answer

This may be down to how BeautifulSoup behaves, but in

            if "window[\"ytInitialData\"]" in scrp.text:
                dict_str = scrp.text.split(" = " , 1)[1]

how about changing "text" to "next"? (See below.)

            if "window[\"ytInitialData\"]" in scrp.next:
                dict_str = scrp.next.split(" = " , 1)[1]

Also, regarding

        for samp in dics["continuationContents"]["liveChatContinuation"]["actions"][1:]:

I think this drops the first item every time a batch of chat data is fetched.
So, to retrieve everything without gaps, the final 1 needs to be changed to 0. (See below.)

        for samp in dics["continuationContents"]["liveChatContinuation"]["actions"][0:]:
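To illustrate the effect of the slice, here is a small sketch with made-up pages of actions. The short strings are hypothetical stand-ins for the real action dictionaries.

```python
# Hypothetical stand-ins for the "actions" list of three successive replay pages.
pages = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1"],
]

skipping_first = []
keeping_all = []
for actions in pages:
    skipping_first.extend(actions[1:])  # drops the first action of every page
    keeping_all.extend(actions[0:])     # same items as iterating over actions

print(skipping_first)  # ['a2', 'a3', 'b2']
print(keeping_all)     # ['a1', 'a2', 'a3', 'b1', 'b2', 'c1']
```

A page that holds a single action contributes nothing at all with `[1:]`, which may be one way a run ends up with an empty output file.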

Note that the original article's script, even with the changes above, can only retrieve the "Top chat" messages.
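Separately from the fixes above, the script in the question converts the JavaScript literal by replacing true/false as plain text and then calling eval, which also rewrites those words when they appear inside a chat message. One possible alternative, which is not taken from the original article, is to parse the text with the standard json module. The payload below is invented for illustration and is far simpler than what YouTube returns.

```python
import json

# Hypothetical payload in the shape left after split(" = ", 1)[1]:
# a JSON object followed by a semicolon and trailing whitespace.
dict_str = '{"isLive": false, "message": "is that true?"};  \n'

# Approach used in the question: textual replacement, then eval.
replaced = dict_str.replace("false", "False").replace("true", "True")
via_eval = eval(replaced.rstrip("  \n;"))
print(via_eval["message"])   # is that True?   (the message text was altered)

# Alternative: strip the trailing characters and let json handle the literals.
via_json = json.loads(dict_str.rstrip(" \n;"))
print(via_json["isLive"])    # False
print(via_json["message"])   # is that true?
```

json.loads also accepts null, which eval would reject with a NameError, and it never executes the text as Python code.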

Posted 2020/05/22 09:40

Edited 2020/05/23 03:04

kotori_a

Total score 820

macarooon

2020/05/23 20:16

I changed [1] to [0] and was able to retrieve the chat! Thank you so much for your answer!!
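A side note on why the original run showed no error message: the loop in the question wraps its whole body in a bare except that simply breaks, so any exception (for example, a request made while next_url is still empty) likely ends the loop quietly and an empty file is written. A minimal sketch of that effect follows; fetch is a hypothetical stand-in for the network call, not part of requests.

```python
def fetch(url):
    # Hypothetical stand-in for session.get(); rejects an empty URL.
    if not url:
        raise ValueError("empty URL")
    return "page for " + url


def collect_silently(url):
    results = []
    while True:
        try:
            results.append(fetch(url))
            url = ""  # pretend there is no further page
        except:       # swallows every error, as in the question's script
            break
    return results


def collect_reporting(url):
    results, errors = [], []
    while True:
        try:
            results.append(fetch(url))
            url = ""
        except Exception as exc:  # keep the reason the loop stopped
            errors.append(repr(exc))
            break
    return results, errors


print(collect_silently(""))    # []
print(collect_reporting(""))   # ([], ["ValueError('empty URL')"])
```

Recording or printing the exception before breaking makes it much easier to tell "no more pages" apart from a genuine failure.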
