How to handle robots.txt when crawling

Background / what I want to achieve

I want to build a crawler that respects robots.txt.

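Problem / error message

(I am using Python 3, but the language probably does not matter here.)
The target site's robots.txt contains rules under "User-agent: *" that allow crawling of a specific path, yet there seem to be quite a few cases where access to robots.txt itself is forbidden (status code 403 is returned). When building a crawler, which should take precedence: "what the content of robots.txt permits" or "robots.txt must not be read"?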

Relevant source code

robots.txt of the target site

User-agent: *
Disallow: /akamai-test/
Disallow: /cookie_alert_index.html
Disallow: /cookie_warning.html
Disallow: /css/
Disallow: /favicon.ico
Disallow: /img/
Disallow: /images/
Disallow: /js/
Disallow: /missing_page.htm
Disallow: /missing_page_glb.htm
Disallow: /priv_exhibition/
Disallow: /shared/
Allow:    /shared/pkgimage/
Disallow: /xref/
Disallow: /eng/
Disallow: /zz_xref/
Disallow: /cn/campaign-e/
Disallow: /cn/campaign-a/

Running the code below returns False.

# -*- coding: utf-8 -*-

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://toshiba.semicon-storage.com/robots.txt")
rp.read()  # fetch robots.txt and feed it to the parser

# Ask whether any user agent ("*") may fetch the path listed under Allow.
print(rp.can_fetch("*", "https://toshiba.semicon-storage.com/shared/pkgimage/"))

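What I tried

When I add the print statement below to the read method of RobotFileParser (robotparser.py) and run the code, "Error: Forbidden 403" is displayed. In other words, when the request for robots.txt is rejected with 401 or 403, RobotFileParser sets itself to "disallow all".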

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            print("Error: {}  {}".format(err.msg, err.code))  # added line
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
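
The behaviour described above can be reproduced without touching the network. The following is a minimal sketch, assuming the standard library's read() behaves as in the excerpt shown in the question: urlopen is replaced with a mock that raises a simulated HTTPError, and the helper name can_fetch_after_status and the example.com URLs are hypothetical, introduced only for illustration.

```python
import io
import urllib.error
import urllib.request
from unittest import mock
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"  # placeholder, never actually requested


def can_fetch_after_status(code):
    """Return can_fetch() after the robots.txt request fails with `code`."""
    # Simulated error response; no real HTTP request is made.
    err = urllib.error.HTTPError(ROBOTS_URL, code, "simulated", None, io.BytesIO(b""))
    rp = RobotFileParser(ROBOTS_URL)
    # Make urlopen raise the simulated error when read() calls it.
    with mock.patch("urllib.request.urlopen", side_effect=err):
        rp.read()
    return rp.can_fetch("*", "https://example.com/any/path")


if __name__ == "__main__":
    for status in (401, 403, 404):
        print(status, can_fetch_after_status(status))
```

With the read() logic quoted in the question, 401 and 403 should lead to "disallow all", while other 4xx codes such as 404 should lead to "allow all".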

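Google, on the other hand, appears to handle a 4xx status when fetching robots.txt as follows.

"Google treats all 4xx errors in the same way and assumes that no valid robots.txt file exists. It is assumed that there are no restrictions. This is a 'full allow' for crawling. This includes the 401 'Unauthorized' and 403 'Forbidden' HTTP result codes."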
https://developers.google.com/search/reference/robots_txt?hl=ja
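
For comparison, the policy quoted from Google could be applied to a RobotFileParser roughly as follows. This is only a sketch: the function name apply_google_style_policy is hypothetical, the caller is assumed to have fetched robots.txt by some other means and to pass in the status code and body, and treating everything other than 2xx and 4xx as "disallow all" is a conservative assumption of mine, not something stated in the quote.

```python
from urllib.robotparser import RobotFileParser


def apply_google_style_policy(rp, status_code, body=""):
    """Configure `rp` from the HTTP status of the robots.txt request.

    Hypothetical helper: any 4xx means "no robots.txt, no restrictions".
    """
    if 400 <= status_code < 500:
        # Per the quoted policy: all 4xx (including 401/403) mean full allow.
        rp.allow_all = True
    elif 200 <= status_code < 300:
        # Normal case: parse the rules that were returned.
        rp.parse(body.splitlines())
    else:
        # Assumption: be conservative for anything else (e.g. 5xx).
        rp.disallow_all = True
    return rp


if __name__ == "__main__":
    rp = apply_google_style_policy(RobotFileParser(), 403)
    print(rp.can_fetch("*", "https://example.com/shared/pkgimage/"))
```

The only difference from the standard library's behaviour shown in the question is that 401 and 403 are no longer singled out.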

Additional information (framework/tool versions, etc.)

Ubuntu 18.04.1 LTS
Python 3.7.1

2 answers

I rewrote urllib.robotparser.RobotFileParser.read() as follows, and robots.txt can now be fetched.

import requests

def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    f = requests.get(self.url)
    if f.status_code in (401, 403):
        self.disallow_all = True
    elif 400 <= f.status_code < 500:
        self.allow_all = True
    else:
        # Parse the body only when the request did not fail with a 4xx status.
        raw = f.content
        self.parse(raw.decode("utf-8").splitlines())

Posted 2019/01/15 08:02

kafer

Total score 13

Best answer

"https://toshiba.semicon-storage.com/robots.txt"
I was able to fetch it normally, so you have probably just made a mistake somewhere.

If it is visible from a browser, follow it.

Posted 2019/01/15 06:58

kawax

Total score 10377

kafer

2019/01/15 07:43

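Thank you for your answer. As you point out, it can be read without any problem from a browser, with wget, and with Python's requests.get, so it is not the case that robots.txt "must not be read". The problem is probably in how robotparser.py uses urllib.request.urlopen. I will either use a different parser or write my own based on requests.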
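
One possible explanation, not confirmed anywhere in this thread, is that the server rejects the default User-Agent that urllib sends while accepting the ones sent by browsers, wget, and requests. Under that assumption, a minimal sketch that stays with the standard library is to fetch robots.txt with an explicit User-Agent and hand the text to parse(). The names build_robots_request and load_robots and the "my-crawler/0.1" string are hypothetical.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def build_robots_request(robots_url, user_agent):
    """Build a request for robots.txt that carries an explicit User-Agent."""
    return urllib.request.Request(robots_url, headers={"User-Agent": user_agent})


def load_robots(robots_url, user_agent):
    """Fetch robots.txt with the given User-Agent and return a loaded parser.

    urllib.error.HTTPError is left to propagate so that the caller can
    decide how to treat 4xx/5xx responses.
    """
    rp = RobotFileParser(robots_url)
    with urllib.request.urlopen(build_robots_request(robots_url, user_agent)) as f:
        rp.parse(f.read().decode("utf-8").splitlines())
    return rp


if __name__ == "__main__":
    # Performs a real HTTP request to the site from the question.
    rp = load_robots("https://toshiba.semicon-storage.com/robots.txt", "my-crawler/0.1")
    print(rp.can_fetch("my-crawler/0.1", "https://toshiba.semicon-storage.com/"))
```

If the 403 still occurs with an explicit User-Agent, the cause lies elsewhere and this sketch will not help.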
