zlib.error occurs when running a scraping script

Background / what I want to achieve

I tried to run a Python script that writes scraping results to a Google Spreadsheet, but it fails with the error message below.
Could you tell me the cause and any possible fix?

If any other information is needed to identify the cause, please let me know and I will add it.

The problem / error message

Traceback (most recent call last):
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/urllib3/response.py", line 401, in _decode
    data = self._decoder.decompress(data)
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/urllib3/response.py", line 88, in decompress
    ret += self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect data check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/models.py", line 753, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/urllib3/response.py", line 768, in read_chunked
    decoded = self._decode(
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/urllib3/response.py", line 404, in _decode
    raise DecodeError(
urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect data check'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/test_user/copy_test.py", line 21, in <module>
    response = requests.get(url)
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/sessions.py", line 697, in send
    r.content
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/models.py", line 831, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/Users/test_user/.pyenv/versions/3.9.4/lib/python3.9/site-packages/requests/models.py", line 758, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect data check'))

Relevant source code

The actual site URL, secret key, and similar details are masked.

import requests
from bs4 import BeautifulSoup
import re
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from datetime import datetime
import sys
from time import sleep

def get_gspread_book(secret_key, book_name):
    scope = ['https://www.googleapis.com/auth/spreadsheets',
            'https://www.googleapis.com/auth/drive']
    # Set up credentials from the JSON key file
    credentials = ServiceAccountCredentials.from_json_keyfile_name(secret_key, scope)
    gc = gspread.authorize(credentials)  # Log in to the Google API with the credentials
    book = gc.open(book_name)  # Open the Google Spreadsheet by file name
    return book  # Return the opened Google Spreadsheet

# Download the website content using requests
url = 'https://****.jp/'
response = requests.get(url)

# Pass the downloaded content and the "html.parser" parser to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")

# Extract the elements whose href attribute contains "***/pickup" (up to 8, per limit=8)
elems = soup.find_all(href = re.compile("***/pickup"),limit=8)

# Open the spreadsheet using the JSON key file and the book name
secret_key = '/Users/test_user/modular-******.json'
book_name = 'test_sheet'
sheet_name = 'シート1'
sheet = get_gspread_book(secret_key, book_name).worksheet(sheet_name)

# Get the last row of the spreadsheet
values1 = sheet.col_values(1)  # Fetch the whole column as a list
lastrow1 = len(values1)  # List length = number of rows

# Write the scraped results to the spreadsheet
# The last element (the one with a photo) raises an error, so handle it
for elem in elems:
    lastrow1 += 1
    try:
        sheet.update_acell('B' + str(lastrow1), elem.contents[0])
    except Exception:
        print('An error occurred')
        print(elem.attrs['href'])
    else:
        datetimestr = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
        sheet.update_acell('A' + str(lastrow1), datetimestr)
        sheet.update_acell('B' + str(lastrow1), elem.contents[0])
        sheet.update_acell('C' + str(lastrow1), elem.attrs['href'])
        sleep(2)
print(datetimestr,'Scraping finished.')

Additional information (framework/tool versions, etc.)

The package versions are as follows:

Python 3.9.4

Package              Version
-------------------- ---------
beautifulsoup4       4.9.3
cachetools           4.2.1
certifi              2020.12.5
chardet              4.0.0
google-auth          1.28.0
google-auth-oauthlib 0.4.4
gspread              3.7.0
httplib2             0.19.1
idna                 2.10
oauth2client         4.1.3
oauthlib             3.1.0
pip                  20.2.3
pyasn1               0.4.8
pyasn1-modules       0.2.8
pyparsing            2.4.7
requests             2.25.1
requests-oauthlib    1.3.0
rsa                  4.7.2
setuptools           49.2.1
six                  1.15.0
soupsieve            2.2.1
urllib3              1.26.4


2 answers

Best answer

Received response with content-encoding: gzip, but failed to decode it.

As the message says, the response from the web server carried a content-encoding: gzip header, but the body was not valid gzip (it could not be decompressed as gzip).
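
For reference, the specific zlib message in the traceback, "incorrect data check", can be reproduced offline. The sketch below is only an illustration under stated assumptions, not a diagnosis of the site in question: it gzips a small payload, flips one byte of the CRC32 stored in the gzip trailer, and then decodes it with zlib.decompressobj(16 + zlib.MAX_WBITS), which is the gzip-aware mode of zlib. The helper names corrupt_crc and try_gunzip are hypothetical.

```python
import gzip
import zlib


def corrupt_crc(gz_bytes: bytes) -> bytes:
    """Flip one byte of the CRC32 field (the gzip trailer is CRC32 + ISIZE, 8 bytes)."""
    broken = bytearray(gz_bytes)
    broken[-8] ^= 0xFF  # first byte of the stored CRC32
    return bytes(broken)


def try_gunzip(body: bytes) -> str:
    """Return "ok" if the body decodes as gzip, otherwise the zlib error text."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = expect a gzip wrapper
    try:
        decoder.decompress(body)
    except zlib.error as exc:
        return str(exc)
    return "ok"


payload = gzip.compress(b"<html>hello</html>")
print(try_gunzip(payload))               # intact gzip body
print(try_gunzip(corrupt_crc(payload)))  # same body with a damaged checksum
```

With the intact payload the helper returns "ok"; with the damaged trailer zlib reports the same "incorrect data check" text as in the traceback. That suggests the response body was recognised as gzip but failed the integrity check at the end, rather than being plain uncompressed HTML.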

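First, I would check whether the page looks fine in a browser and whether access from a command such as curl or wget works.
If those work, I would suspect one of the following:
(1) The data is actually malformed, but browsers and similar clients do something like ignoring the content-encoding header (I don't know whether that can really happen)
(2) The web server does not return a correct response to access from Python (whether intentionally or not)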

If your network access goes through a proxy, I would add:
(3) The proxy does not return a correct response to access from Python (whether intentionally or not)
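
One way to tell these cases apart from Python is to look at the raw, undecoded body. The following is a minimal sketch using only the standard library, since urllib.request does not decompress responses on its own. The function names inspect_body and fetch_raw, the timeout, and the optional user_agent parameter are assumptions made for illustration; the URL stays masked as in the question, so the network call is defined but not executed here.

```python
import gzip
import urllib.request
import zlib


def inspect_body(content_encoding, body):
    """Compare what the Content-Encoding header claims with what the raw body contains."""
    report = {
        "claims_gzip": "gzip" in (content_encoding or "").lower(),
        "has_gzip_magic": body[:2] == b"\x1f\x8b",  # gzip streams start with 1f 8b
        "decodes": False,
        "error": None,
    }
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        decoder.decompress(body)
        report["decodes"] = decoder.eof  # True only if the stream ended cleanly
    except zlib.error as exc:
        report["error"] = str(exc)
    return report


def fetch_raw(url, user_agent=None):
    """Fetch url with urllib (no automatic decompression) and inspect the raw body."""
    headers = {"Accept-Encoding": "gzip"}
    if user_agent:
        headers["User-Agent"] = user_agent
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return inspect_body(response.headers.get("Content-Encoding"), response.read())


# Offline self-check with a locally built gzip body and a plain HTML body.
print(inspect_body("gzip", gzip.compress(b"<html></html>")))
print(inspect_body("gzip", b"<html></html>"))
```

Calling fetch_raw once without and once with a browser-like User-Agent would show whether the server treats the two clients differently, which is what case (2) is about.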

Posted 2021/04/06 05:09

Edited 2021/04/06 05:12

quickquip

Total score 11235

Deleted user

2021/04/06 06:44

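Thank you for your answer. Access with the curl command worked without problems. I'm not going through a proxy, so (1) or (2) seem to be the possibilities. If either of them is the cause, is there anything I can do on my side to resolve it?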

quickquip

2021/04/06 07:32

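Possible workarounds would be explicitly removing Accept-Encoding from the headers, or supplying a User-Agent. Judging from the urllib3 issue https://github.com/urllib3/urllib3/issues/206#issuecomment-34958040, if the problem reproduces with curl --compressed ..., that would pretty much confirm (1).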
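
The two workarounds mentioned here could be written roughly as follows with requests. This is a minimal sketch, not a confirmed fix: the helper names build_headers and fetch, the "identity" value, the timeout, and the example User-Agent string are all assumptions made for illustration.

```python
def build_headers(disable_compression=True, user_agent=None):
    """Return request headers for the two workarounds (hypothetical helper)."""
    headers = {}
    if disable_compression:
        # "identity" asks the server not to apply any content coding,
        # so there should be no gzip body for the client to decode.
        headers["Accept-Encoding"] = "identity"
    if user_agent:
        # Some servers answer differently when they see the default
        # python-requests User-Agent.
        headers["User-Agent"] = user_agent
    return headers


def fetch(url, **kwargs):
    """Fetch a page with the adjusted headers."""
    import requests  # third-party; imported here so build_headers() works without it

    response = requests.get(url, headers=build_headers(**kwargs), timeout=10)
    response.raise_for_status()
    return response


# The User-Agent string below is only an example value.
print(build_headers(user_agent="Mozilla/5.0 (compatible; example-scraper)"))
```

Headers passed to requests.get override the library defaults of the same name, so trying each option separately shows which one the server actually reacts to.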

Deleted user

2021/04/06 10:00

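I ran it with curl --compressed ..., but Accept-Encoding was not included. However, from what I found, the site in question is said to return text/html with Content-Encoding: gzip, so I feel (1) is the more likely cause. Because of my lack of skill and knowledge, I couldn't work out how to actually write "explicitly removing Accept-Encoding from the headers, or supplying a User-Agent" even after searching, so I would be grateful if you could explain the details.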

quickquip

2021/04/06 10:06

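I can't give concrete code without looking things up (in the documentation, for example). As for setting the User-Agent, I believe someone has already answered that on Teratail.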

quickquip

2021/04/06 10:08

If you search Teratail for tag:python requests user-agent, the top hit is https://teratail.com/questions/163429; the code for case 2 there is what you want.

Deleted user

2021/04/07 03:07

Sorry for the late reply. After adding "accept-encoding":"zlib, deflate, br" to the headers, it worked. Thank you very much for answering so kindly. It was a great help.
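
Applied to the script in the question, the reported fix would look roughly like this. The header value is copied verbatim from the comment above; the names HEADERS, offered_encodings and fetch_page and the timeout are assumptions. One plausible reason it works is that the value no longer offers gzip, so the server presumably stops sending a gzip body, but the thread does not confirm this.

```python
# Header reported to work in the thread (value copied verbatim).
HEADERS = {"accept-encoding": "zlib, deflate, br"}


def offered_encodings(headers):
    """List the content codings offered in an Accept-Encoding header (hypothetical helper)."""
    value = next((v for k, v in headers.items() if k.lower() == "accept-encoding"), "")
    # Drop any ";q=..." weights and normalise case.
    return [token.split(";")[0].strip().lower() for token in value.split(",") if token.strip()]


def fetch_page(url):
    """Replacement for the original `response = requests.get(url)` line."""
    import requests  # third-party; imported here so the helper above works without it

    return requests.get(url, headers=HEADERS, timeout=10)


print(offered_encodings(HEADERS))
```

Note that the value also offers br. As far as I know, requests can only decode Brotli responses when an optional Brotli package is installed, so if the body ever comes back undecoded, dropping br from the list would be the first thing to try.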
