[Python] Error when trying to scrape Yahoo! News comments

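To scrape Yahoo! News comments, I ran the code from this article (https://qiita.com/yaju/items/501874c39d569fd53f91) on Google Colab and got the error below.
While searching, I read that some HTML knowledge is needed, so I tried looking up XPaths (possibly the wrong ones) and entering what I believe are CSS selectors, but the same error occurred.
I would appreciate it if you could tell me how to resolve this.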

NoSuchElementException                    Traceback (most recent call last)
<ipython-input-12-4dbbfbbd0f97> in <module>()
    121     driver.get(URL + "&page={}".format(page))
    122 
--> 123     iframe = driver.find_element_by_class_name("iframe.news-comment-plguin-iframe")
    124     driver.switch_to.frame(iframe)
    125 

3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".news-comment-plguin-iframe"}
  (Session info: headless chrome=91.0.4472.77)

The code I ran is shown below.

!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run the browser in the background (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Launch the browser
driver = webdriver.Chrome('chromedriver',options=options)

URL = 'https://news.yahoo.co.jp/articles/6da1faebfd938f7301f2964807ae66a8a1d740b4'

# Get the text of the target element
def getItem(element, name, name2):
    result = ""
    elem = element.find_elements_by_class_name(name)
    if len(elem) > 0:
        if name2 == "":
            result = elem[0].text.strip()
        else:
            result = elem[0].find_element_by_class_name(name2).text.strip()

    return result

# Print comments from verified commenters
def print_authorComment(no):
    comment_boxes = driver.find_elements_by_css_selector('li[id^="authorcomment-"]')
    for comment_box in comment_boxes:
        # Get the comment
        elem_comment = comment_box.find_element_by_class_name("comment")
        comment = elem_comment.text.strip().rstrip('...もっと見る')
        comment += comment_box.find_element_by_class_name("hideAthrCmtText").get_attribute("textContent")
        # Get the user name
        name = getItem(comment_box, "rapidnofollow", "")
        # Get the date
        date = getItem(comment_box, "date", "")
        # Get the "helpful" count
        refcnt = comment_box.find_elements_by_css_selector('li.reference a em')[0].text

        no += 1
        print('{:0=4}\t{}\t{}\t{}\t{}\t{}\t{}'.format(no,comment.replace('\n', ' '), refcnt, "0", name, date, "0"))

    return no

# Print comments from general users
def print_generalComment(no):
    comment_boxes = driver.find_elements_by_css_selector('li[id^="comment-"]')
    for comment_box in comment_boxes:
        # Get the comment
        comment = getItem(comment_box, "cmtBody", "")
        # Get the user name
        name = getItem(comment_box, "rapidnofollow", "")
        # Get the date
        date = getItem(comment_box, "date", "")
        # Get the "good" count
        agree = getItem(comment_box, "good", "userNum")
        # Get the "bad" count
        disagree = getItem(comment_box, "bad", "userNum")
        # Number of replies
        reply = int(getItem(comment_box, "reply", "num") or "0")

        no += 1
        print('{:0=4}\t{}\t{}\t{}\t{}\t{}\t{}'.format(no, comment.replace('\n', ' '), agree, disagree, name, date, reply))

        if reply == 0:
            continue

        # Print the replies
        print_reply(comment_box, reply, no)

    return no

# Print reply comments
def print_reply(element, reply, no):
    # Click the "返信" (reply) links
    rep_links = element.find_elements_by_css_selector('a.btnView.expandBtn')
    for rep_link in rep_links:
        rep_link.click()
        time.sleep(2)

    # Click the "もっと見る" (show more) links
    response_boxes = element.find_elements_by_class_name("response")
    for i in range(int(reply/10)):
        if len(response_boxes) > 0 and (reply % 10) > 0:
            rep_links = response_boxes[0].find_elements_by_css_selector('a.moreReplyCommentList')
            for rep_link in rep_links:
                rep_link.click()
                time.sleep(2)

    # Extract the reply comments
    replys = response_boxes[0].find_elements_by_css_selector('li[id^="reply-"]')
    cno = 1
    for reply in replys:
        cmtBodies = reply.find_elements_by_css_selector('div.action article p span.cmtBody')
        if len(cmtBodies) == 0:
            continue
        # Get the comment
        comment = cmtBodies[0].text.strip()
        # Get the user name
        name = getItem(reply, "rapidnofollow", "")
        # Get the date
        date = getItem(reply, "date", "")
        # Get the "good" count
        agree = getItem(reply, "good", "userNum")
        # Get the "bad" count
        disagree = getItem(reply, "bad", "userNum")

        print('{:0=4}-{:0=3}\t{}\t{}\t{}\t{}\t{}'.format(no, cno, comment.replace('\n', ' '), agree, disagree, name, date))
        cno += 1

# Extract the comments
start = 1
end = 2

no = 0
for page in range(start, end + 1):
    driver.get(URL + "&page={}".format(page))

    iframe = driver.find_element_by_class_name("news-comment-plguin-iframe")
    driver.switch_to.frame(iframe)

    # Comments from verified commenters
    if page == 1:
        no = print_authorComment(no)
    # Comments from general users
    no = print_generalComment(no)



1 answer

Best answer

User-agent: *

Disallow: /comment/plugin/
Disallow: /comment/violation
Disallow: /profile/violation
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Disallow: /articles/*/order
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
Sitemap: https://news.yahoo.co.jp/byline/sitemap.xml
Sitemap: https://news.yahoo.co.jp/polls/sitemap.xml

https://news.yahoo.co.jp/robots.txt

This is the robots.txt for Yahoo! News.
It appears to disallow fetching comments, so I think it would be better to avoid doing so.
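As a rough illustration of how the Disallow rules quoted above could be checked against a URL, here is a minimal sketch. It is a simplified matcher written for this example, not a full robots.txt implementation: it only handles `*` wildcards and prefix matching, and ignores `Allow` lines, `$` anchors, and per-agent groups. The function names and the `/comment/plugin/example` path are hypothetical; which paths the comment iframe is actually served from is not stated in this thread.

```python
import re
from urllib.parse import urlparse

# Disallow rules copied from the robots.txt quoted above (User-agent: *).
DISALLOW_RULES = [
    "/comment/plugin/",
    "/comment/violation",
    "/profile/violation",
    "/polls/widgets/",
    "/articles/*/comments",
    "/articles/*/order",
]

# Article URL taken from the question.
URL = "https://news.yahoo.co.jp/articles/6da1faebfd938f7301f2964807ae66a8a1d740b4"


def rule_to_regex(rule):
    """Convert one Disallow pattern to a regex.

    '*' matches any run of characters; the rule is matched as a path prefix.
    """
    parts = [re.escape(part) for part in rule.split("*")]
    return re.compile("^" + ".*".join(parts))


def is_disallowed(url, rules=DISALLOW_RULES):
    """Return True if the path of url matches any of the Disallow rules."""
    path = urlparse(url).path or "/"
    return any(rule_to_regex(rule).match(path) for rule in rules)


print(is_disallowed(URL))                # the article page itself
print(is_disallowed(URL + "/comments"))  # the comments page of that article
```

Under this simplified reading, the article page itself is not matched by any rule, while paths under `/articles/*/comments` and `/comment/plugin/` are.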

Posted 2021/06/08 04:29

glyzinieh

Total score: 222

52kkp

2021/06/08 12:22

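Thank you for pointing this out. Through my own lack of study I nearly ended up breaking the rules. I will not run this code, but if there is a way to resolve the error that occurred, I would still appreciate your advice.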

glyzinieh

2021/06/08 13:53 (edited)

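I have hardly ever used selenium, so I apologize that I cannot give you precise advice. From a little research, this error seems to be raised when the contents of an iframe cannot be accessed, but your code already contains iframe = driver.find_element_by_class_name("news-comment-plguin-iframe") and driver.switch_to.frame(iframe), so I can't tell what the cause is... It seems that driver.page_source lets you see the source currently loaded, so why not try that? (I think it would be best to do so after driver.switch_to.frame(iframe).) This page looks like a useful reference: https://office54.net/python/scraping/selenium-element-iframe I'm sorry I couldn't be of more help.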
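One way to act on the driver.page_source suggestion is to list the iframe elements that are actually present in the source and compare their class names with the one used in the locator. The sketch below uses only the standard library; the class and function names and the sample HTML are hypothetical stand-ins, and in practice the string would come from driver.page_source. Since the traceback shows the exception being raised on the find_element line, the source would have to be inspected before switch_to.frame is reached.

```python
from html.parser import HTMLParser


class IframeCollector(HTMLParser):
    """Collect the attributes of every <iframe> start tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "iframe":
            self.iframes.append(dict(attrs))


def list_iframes(page_source):
    """Return one dict of attributes per <iframe> found in page_source."""
    collector = IframeCollector()
    collector.feed(page_source)
    collector.close()
    return collector.iframes


def has_iframe_with_class(page_source, class_name):
    """True if any <iframe> has class_name among its space-separated classes."""
    return any(
        class_name in (attrs.get("class") or "").split()
        for attrs in list_iframes(page_source)
    )


# Hypothetical HTML standing in for driver.page_source.
sample_source = """
<html><body>
  <iframe class="ad-frame" src="/ads"></iframe>
  <iframe class="comment-frame lazy" src="/comments"></iframe>
</body></html>
"""

for attrs in list_iframes(sample_source):
    print(attrs.get("class"), attrs.get("src"))
print(has_iframe_with_class(sample_source, "news-comment-plguin-iframe"))
```

If the class name from the locator does not appear in the list, the element either has a different class in the current page or had not been added to the page when the source was captured.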

52kkp

2021/06/09 05:02

Not at all, thank you very much for your help. I will try to work out a solution from the reference page. Thank you for your kindness.
