How to pinpoint the source of a problem with Python BeautifulSoup

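What I want to achieve

I want to pinpoint where the problem occurs.

The problem / what I don't understand

I am using Python BeautifulSoup to check the content of my own site.
I loop over roughly 20,000 pieces of content and run the code shown further below on each one.
When I do, for some of the content I get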

plain
/usr/local/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

or

plain
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

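as output.
I understand that in one case, content that should be html is actually xml,
and that in the other, content that should be utf-8 is in some other encoding.
However, because there are so many items, I have not been able to identify which URLs are affected, and that is what I want to find out.

Please tell me if there is a way to identify the URL, or a way to raise an exception when this output occurs.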

Relevant source code

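python
import urllib3
from bs4 import BeautifulSoup

# 'certificate file' is a placeholder for the path to the CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs='certificate file')
# (omitted)
html = http.request('GET', 'URL')
soup = BeautifulSoup(html.data, "html.parser")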

What I tried / looked into

Searched teratail, Google, etc.
Modified the source code on my own
Asked an acquaintance
Other

Details and results of the above

I did not get what I was looking for.

Additional notes

I know I could print the target URL at the soup line, but I am wondering whether there is a smarter way.

poto568

2025/01/23 06:27

It seems you could find out just by writing something like print(URL) before the soup line... (If there are other requirements, please edit the question.)

showkit

2025/01/23 07:14

I understand your point, so I will revise the question.


1 answer

Best answer

In one case, content that should be html is actually xml.

You can use the warnings module to treat the warning as an error.
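
As a minimal sketch of that idea, the filter can be limited to a single warning category and a single block with `warnings.catch_warnings()`, so unrelated warnings are left alone. To keep the example free of third-party imports, `MarkupWarning`, `parse`, and the example.com URLs are hypothetical stand-ins; with BeautifulSoup the category would presumably be `XMLParsedAsHTMLWarning` and the call would be the `BeautifulSoup(...)` constructor.

```python
import warnings


class MarkupWarning(UserWarning):
    """Hypothetical stand-in for bs4's XMLParsedAsHTMLWarning."""


def parse(markup: str) -> str:
    """Hypothetical parser: warns when the input looks like XML."""
    if markup.lstrip().startswith("<?xml"):
        warnings.warn("XML parsed as HTML", MarkupWarning)
    return markup.strip()


def find_flagged(pages: dict) -> list:
    """Return the URLs whose parse raised MarkupWarning."""
    flagged = []
    for url, markup in pages.items():
        # The filter change is undone when the with-block exits.
        with warnings.catch_warnings():
            warnings.simplefilter("error", category=MarkupWarning)
            try:
                parse(markup)
            except MarkupWarning:
                flagged.append(url)
    return flagged


pages = {
    "https://example.com/a": "<html><body>ok</body></html>",
    "https://example.com/b": '<?xml version="1.0"?><urlset/>',
}
flagged = find_flagged(pages)
print(flagged)
```

Because the exception is caught inside the loop body, the URL is in scope at the point where the warning surfaces, which is what makes the offending item identifiable.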

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

In the other, content that should be utf-8 is in some other encoding

In this case, contains_replacement_characters is set to True, so you can check that attribute.
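
If the only question is whether the raw bytes are valid UTF-8, a strict decode is a possible complementary check. This is a sketch of a different technique from the one BeautifulSoup uses (it only validates UTF-8 and ignores any declared charset); `first_invalid_utf8_offset` and the example.com URLs are hypothetical names chosen for illustration.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Simulated response bodies; in practice these would be the fetched bytes.
samples = {
    "https://example.com/utf8": "日本語".encode("utf-8"),
    "https://example.com/sjis": "日本語".encode("shift_jis"),
}

report = {url: first_invalid_utf8_offset(body) for url, body in samples.items()}
for url, offset in report.items():
    if offset is not None:
        print(f"not valid UTF-8 (byte offset {offset}): {url}")
```

The offset can help locate the offending bytes within a large page, which the boolean flag alone does not provide.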

I know I could print the target URL at the soup line, but I am wondering whether there is a smarter way.

I can't say whether the approach above counts as smart.

python
import requests
import warnings
import logging
from bs4 import BeautifulSoup
from bs4.builder import XMLParsedAsHTMLWarning

# Turn every warning into an exception so it can be caught per URL.
warnings.resetwarnings()
warnings.simplefilter('error')
# Silence the "Some characters could not be decoded" log message;
# contains_replacement_characters is checked instead.
logger = logging.getLogger('bs4.dammit')
logger.disabled = True

urls = [
    'https://teratail.com/',
    'https://www.sitemaps.org/sitemap.xml',
    'https://www.jleague.jp/standings/j1/',
]

for url in urls:
    res = requests.get(url)
    res.raise_for_status()

    try:
        soup = BeautifulSoup(res.content, 'html.parser')
        if soup.contains_replacement_characters:
            print(f'Some characters could not be decoded: {url}')
    except XMLParsedAsHTMLWarning:
        print(f'XMLParsedAsHTMLWarning: {url}')
    except Exception as e:
        print(f'{url}: {e}')


# Output:
# XMLParsedAsHTMLWarning: https://www.sitemaps.org/sitemap.xml
# Some characters could not be decoded: https://www.jleague.jp/standings/j1/
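
Instead of disabling the logger, another option might be to attach a handler that records each message together with the URL being processed. The sketch below assumes, as the code above implies, that the decoding message is emitted through the `bs4.dammit` logger. `UrlTaggingHandler` and the example.com URLs are hypothetical, and the `logger.warning(...)` call only simulates what parsing would log.

```python
import logging


class UrlTaggingHandler(logging.Handler):
    """Collect (url, message) pairs for records emitted while a URL is active."""

    def __init__(self) -> None:
        super().__init__()
        self.current_url = None
        self.hits = []

    def emit(self, record: logging.LogRecord) -> None:
        self.hits.append((self.current_url, record.getMessage()))


logger = logging.getLogger("bs4.dammit")
handler = UrlTaggingHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the untagged message off the console

for url in ["https://example.com/a", "https://example.com/b"]:
    handler.current_url = url
    # Stand-in for BeautifulSoup(...): simulate the decoder's log message.
    if url.endswith("/b"):
        logger.warning("Some characters could not be decoded")

print(handler.hits)
```

This keeps the loop body unchanged apart from setting `current_url`, and the collected pairs can be written to a report after all items have been processed.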