I want to use pandas to fetch a table of horse-racing jockeys' scheduled rides from a website, but I get an error.

Background

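I am studying web scraping because I want to collect horse-racing data.
I am a complete beginner, but using Python and pandas' "read_html" I managed to fetch a table of race data from netkeiba. However, when I tried to fetch a table from the JRA jockey riding-schedule page in the same way, I got an HTTP error.
Possible causes I can think of:

・The page does not contain a table in the first place
・The site prohibits scraping
・The URL is not appropriate

I can come up with causes like these, but not with a way to resolve them.
I am inexperienced, but I would be grateful for guidance from those who know more.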

What I want to achieve

https://www.jra.go.jp/JRADB/accessK.html
・I want to know why an error occurs when fetching the riding-schedule table on the page above with "read_html".
・I would then like to know how to fix it.

Problem / error message

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Input In [69], in <cell line: 4>()
      2 # import pandas
      3 URL = 'https://www.jra.go.jp/JRADB/accessK.html'
----> 4 pd.read_html(URL)

File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\anaconda3\lib\site-packages\pandas\io\html.py:1113, in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1109 validate_header_arg(header)
   1111 io = stringify_path(io)
-> 1113 return _parse(
   1114     flavor=flavor,
   1115     io=io,
   1116     match=match,
   1117     header=header,
   1118     index_col=index_col,
   1119     skiprows=skiprows,
   1120     parse_dates=parse_dates,
   1121     thousands=thousands,
   1122     attrs=attrs,
   1123     encoding=encoding,
   1124     decimal=decimal,
   1125     converters=converters,
   1126     na_values=na_values,
   1127     keep_default_na=keep_default_na,
   1128     displayed_only=displayed_only,
   1129 )

File ~\anaconda3\lib\site-packages\pandas\io\html.py:919, in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    916 p = parser(io, compiled_match, attrs, encoding, displayed_only)
    918 try:
--> 919     tables = p.parse_tables()
    920 except ValueError as caught:
    921     # if `io` is an io-like object, check if it's seekable
    922     # and try to rewind it before trying the next parser
    923     if hasattr(io, "seekable") and io.seekable():

File ~\anaconda3\lib\site-packages\pandas\io\html.py:239, in _HtmlFrameParser.parse_tables(self)
    231 def parse_tables(self):
    232     """
    233     Parse and return all tables from the DOM.
    234 
   (...)
    237     list of parsed (header, body, footer) tuples from tables.
    238     """
--> 239     tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    240     return (self._parse_thead_tbody_tfoot(table) for table in tables)

File ~\anaconda3\lib\site-packages\pandas\io\html.py:758, in _LxmlFrameParser._build_doc(self)
    756             pass
    757     else:
--> 758         raise e
    759 else:
    760     if not hasattr(r, "text_content"):

File ~\anaconda3\lib\site-packages\pandas\io\html.py:739, in _LxmlFrameParser._build_doc(self)
    737 try:
    738     if is_url(self.io):
--> 739         with urlopen(self.io) as f:
    740             r = parse(f, parser=parser)
    741     else:
    742         # try to parse the input in the simplest way

File ~\anaconda3\lib\site-packages\pandas\io\common.py:239, in urlopen(*args, **kwargs)
    233 """
    234 Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
    235 the stdlib.
    236 """
    237 import urllib.request
--> 239 return urllib.request.urlopen(*args, **kwargs)

File ~\anaconda3\lib\urllib\request.py:214, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    212 else:
    213     opener = _opener
--> 214 return opener.open(url, data, timeout)

File ~\anaconda3\lib\urllib\request.py:523, in OpenerDirector.open(self, fullurl, data, timeout)
    521 for processor in self.process_response.get(protocol, []):
    522     meth = getattr(processor, meth_name)
--> 523     response = meth(req, response)
    525 return response

File ~\anaconda3\lib\urllib\request.py:632, in HTTPErrorProcessor.http_response(self, request, response)
    629 # According to RFC 2616, "2xx" code indicates that the client's
    630 # request was successfully received, understood, and accepted.
    631 if not (200 <= code < 300):
--> 632     response = self.parent.error(
    633         'http', request, response, code, msg, hdrs)
    635 return response

File ~\anaconda3\lib\urllib\request.py:561, in OpenerDirector.error(self, proto, *args)
    559 if http_err:
    560     args = (dict, 'default', 'http_error_default') + orig_args
--> 561     return self._call_chain(*args)

File ~\anaconda3\lib\urllib\request.py:494, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    492 for handler in handlers:
    493     func = getattr(handler, meth_name)
--> 494     result = func(*args)
    495     if result is not None:
    496         return result

File ~\anaconda3\lib\urllib\request.py:641, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
    640 def http_error_default(self, req, fp, code, msg, hdrs):
--> 641     raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 403: Forbidden

Relevant source code

!pip install lxml html5lib beautifulsoup4
!pip install pandas
import pandas as pd
URL = 'https://www.jra.go.jp/JRADB/accessK.html'
pd.read_html(URL)

What I tried

I also tried a riding-schedule page on another site, but got the same error.
・https://www.keibalab.jp/db/jockey/01019/


meg_

2022/10/24 11:58

Regarding "the site prohibits scraping", couldn't you find that out by checking the site's terms of use?

nankotu

2022/10/24 12:53

Thank you for your comment. I did not find any such statement in JRA's terms of use. That said, I am still learning and may have overlooked something, so here is the URL: https://sp.jra.jp/use/

CHERRY

2022/10/24 13:02

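> https://www.jra.go.jp/JRADB/accessK.html
> ・I want to know why an error occurs when fetching the riding-schedule table on the page above with "read_html".

Connecting directly to the URL above gives

> Error 013 Parameter error
> The specified page could not be reached.

so isn't the cause "the URL is not appropriate"?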


2 answers

Best answer

For https://www.keibalab.jp/db/jockey/01019/, the request seems to go through if you add a user-agent to the request headers.

Python
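from io import StringIO

import requests
import pandas as pd

# Send a browser-like User-Agent; without it the site appears to answer 403.
headers = {
    'user-agent': 'Mozilla/5.0'
}

response = requests.get('https://www.keibalab.jp/db/jockey/01019/', headers=headers)
# Wrap the HTML in StringIO: newer pandas deprecates passing literal HTML strings.
print(pd.read_html(StringIO(response.text)))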
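
As a rough illustration of why the header matters: in the traceback in the question, "pd.read_html(URL)" ends up in the standard library's urllib, which identifies itself as "Python-urllib/x.y" unless told otherwise. The sketch below only inspects headers locally and sends no request. The helper name browser_like_request is hypothetical, and whether a given site actually filters on this header is an assumption.

```python
import urllib.request

# Default headers urllib attaches when the caller supplies none.
default_headers = dict(urllib.request.OpenerDirector().addheaders)
print(default_headers)  # the User-agent value starts with "Python-urllib/"


def browser_like_request(url, user_agent="Mozilla/5.0"):
    """Build (but do not send) a Request carrying a custom User-Agent.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = browser_like_request("https://www.keibalab.jp/db/jockey/01019/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing a prepared request like this to urllib.request.urlopen, or passing headers to requests.get as in the answer, are two ways of achieving the same effect.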

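As for https://www.jra.go.jp/JRADB/accessK.html, the request is missing required content. It appears to be a POST request in the first place, and it needs data from the referring page. This situation is explained on the following site.

How to scrape netkeiba

Although there are no terms that address scraping, that site speculates that the reason it is unwelcome is that it "would affect the revenue of 'JRA-VAN Data Lab'".

Incidentally, if you send the request together with the data from the referring page, you can fetch the table.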

Python
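from io import StringIO

import requests
import pandas as pd

# Form data that the referring page would normally submit.
data = {
    'cname': 'pt01kld00999999993101/48',
}

response = requests.post('https://www.jra.go.jp/JRADB/accessK.html', data=data)

# Let requests guess the page encoding from the content before decoding.
response.encoding = response.apparent_encoding
print(pd.read_html(StringIO(response.text)))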
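
To make the GET versus POST difference concrete, here is a minimal sketch using only the standard library. It builds both kinds of request object without sending anything, so it shows what the form data looks like on the wire rather than how the site responds. The variable names are illustrative, and the 'cname' value is simply the one quoted in the code above.

```python
import urllib.parse
import urllib.request

url = "https://www.jra.go.jp/JRADB/accessK.html"
form = {"cname": "pt01kld00999999993101/48"}  # value quoted in the answer

# With no body, urllib builds a GET request (what pd.read_html(URL) does).
get_req = urllib.request.Request(url)

# Supplying `data` turns the request into a POST with a form-encoded body.
body = urllib.parse.urlencode(form).encode("ascii")
post_req = urllib.request.Request(url, data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(body)                   # b'cname=pt01kld00999999993101%2F48'
```

This is the same body that requests.post(..., data=data) encodes internally, which is why a plain GET to the same URL lacks the parameter the page expects.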

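Basically, scraping JRA will be hard going without knowledge of HTTP and HTML5. Given how much JRA seems to dislike scraping, I think the courteous thing is not to do it.

In agreement with AbeTakashi's answer, I will not describe how to obtain the referring-page data 'cname': 'pt01kld00999999993101/48' used here. Note that this value alone only returns the jockey ranking for the top 20 riders.

The answer above only shows a way of fetching without errors, to establish that "the table can be fetched". Scraping methods can be found by searching online, but use them at your own risk.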

Posted 2022/10/24 13:08

Edited 2022/10/24 14:21

ps_aux_grep

Total score 1581

nankotu

2022/10/25 12:08

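ps_aux_grep, thank you for your answer. I have only just started learning Python, but thanks to both of your answers I feel I have learned a little about the social side of scraping. I was able to get the riding schedule with the first code sample, so I will use that one. Thank you very much.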


The error is response code 403, which means that access by that method is not permitted.

Reference:
https://developer.mozilla.org/ja/docs/Web/HTTP/Status/403?language=ja
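
As a small sketch of how the status code can be read in code rather than from a long traceback, the example below catches urllib.error.HTTPError and inspects its code and reason attributes. The error is constructed locally to mirror the one in the question, so no site is contacted. The function name explain_http_error and its wording are hypothetical.

```python
import io
import urllib.error
from email.message import Message


def explain_http_error(err):
    """Return a short note for an HTTPError (hypothetical helper)."""
    if err.code == 403:
        return f"{err.code} {err.reason}: the server refused this request"
    if err.code == 404:
        return f"{err.code} {err.reason}: the URL was not found"
    return f"{err.code} {err.reason}"


# Simulate the error from the question locally instead of contacting the site.
simulated = urllib.error.HTTPError(
    "https://www.jra.go.jp/JRADB/accessK.html",
    403,
    "Forbidden",
    Message(),
    io.BytesIO(b""),
)

try:
    raise simulated
except urllib.error.HTTPError as err:
    message = explain_http_error(err)
    print(message)
```

In real use the try block would wrap the call that performs the request, such as pd.read_html(URL). The status code says only that access was refused, not why.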

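JRA decides what kinds of access to allow or deny, so I think no one outside JRA can determine the cause (they may have put in place a measure corresponding to "the site prohibits scraping", but only people inside JRA know whether that is the case).

There may be a way to get around this measure, but even if there is, it is not the kind of thing to be shared on a public site like this, so unfortunately I think it will be quite difficult to resolve this here.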

Posted 2022/10/24 13:08

AbeTakashi

Total score 4932

nankotu

2022/10/25 12:02

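AbeTakashi, thank you. I also read the reference page. I am very glad just to learn that the cause was among those I listed in my question. As someone who is being provided with the data, I have no intention of trying to get around the restriction. Thank you very much for such a careful answer to a beginner.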
