Answer edit history

Changed the handling for when "Some characters could not be decoded ..." occurs

2025/01/23 08:25

Posted

melian

Score: 21341

answer CHANGED

@@ -1,6 +1,6 @@
 > One problem is that the content is XML where it should be HTML.
-The `warning` module can be used to treat warnings as errors.
+The `warnings` module can be used to treat warnings as errors.
 > Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
 >
@@ -15,12 +15,14 @@
 ```python
 import requests
 import warnings
+import logging
 from bs4 import BeautifulSoup
 from bs4.builder import XMLParsedAsHTMLWarning
-# Turn warnings into exceptions
 warnings.resetwarnings()
 warnings.simplefilter('error')
+logger = logging.getLogger('bs4.dammit')
+logger.disabled = True
 urls = [
     'https://teratail.com/',
@@ -31,17 +33,17 @@
 for url in urls:
     res = requests.get(url)
     res.raise_for_status()
     try:
         soup = BeautifulSoup(res.content, 'html.parser')
         if soup.contains_replacement_characters:
-            print(url)
+            print(f'Some characters could not be decoded: {url}')
     except XMLParsedAsHTMLWarning:
-        print(f'XMLParsedAsHTMLWarning occured in {url}')
+        print(f'XMLParsedAsHTMLWarning: {url}')
     except Exception as e:
         print(f'{url}: {e}')
-# XMLParsedAsHTMLWarning occured in https://www.sitemaps.org/sitemap.xml
+# XMLParsedAsHTMLWarning: https://www.sitemaps.org/sitemap.xml
-# Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
-# https://www.jleague.jp/standings/j1/
+# Some characters could not be decoded: https://www.jleague.jp/standings/j1/
 ```
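
As a small offline illustration of the "treat a warning as an error" idea used in the answer, the sketch below relies only on the standard `warnings` module. `DemoParseWarning`, `parse`, and `classify` are hypothetical names invented for this example (they stand in for `XMLParsedAsHTMLWarning` and the `BeautifulSoup(...)` call) and are not part of bs4. It also scopes the filter with `warnings.catch_warnings()` rather than setting it globally, which is a variation on the answer's code, not what the answer does.

```python
import warnings


class DemoParseWarning(UserWarning):
    """Hypothetical warning category standing in for XMLParsedAsHTMLWarning."""


def parse(markup):
    # Hypothetical parser: it warns (rather than raises) on XML-looking input.
    if markup.lstrip().startswith('<?xml'):
        warnings.warn('input looks like XML', DemoParseWarning)
    return markup.upper()


def classify(markup):
    # catch_warnings() restores the previous filter state on exit, so the
    # 'error' filter does not leak into the rest of the program.
    with warnings.catch_warnings():
        # Promote only DemoParseWarning to an exception.
        warnings.simplefilter('error', DemoParseWarning)
        try:
            parse(markup)
        except DemoParseWarning as exc:
            return f'warning raised as exception: {exc}'
    return 'parsed cleanly'


print(classify('<html></html>'))
print(classify('<?xml version="1.0"?><urlset/>'))
```

Passing a category to `simplefilter` limits the promotion to that one warning type, whereas the answer's `warnings.simplefilter('error')` with no category turns every warning into an exception for the whole process. Which of the two fits better probably depends on how much other code runs in the same script.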