Loading word vectors in Google Colaboratory

Question

###### I probably need to change the file link, but I don't know which part to change.

I am studying natural language processing (text processing) with the help of a video by いまにゅ:
https://www.youtube.com/watch?v=gPV7SuZiVu4

I am copying and pasting the lecture code attached to that video and working through it to understand it. In "15. Loading word vectors", I ran the following code:

```shell
FILE_ID = "0B7XkCwpI5KDYNlNUTTlSS21pQmM"
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1 /p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
```

It returned:

```
--2022-10-05 07:26:16--  https://docs.google.com/uc?export=download&confirm=&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
Resolving docs.google.com (docs.google.com)... 172.253.63.101, 172.253.63.138, 172.253.63.139, ...
Connecting to docs.google.com (docs.google.com)|172.253.63.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’

GoogleNews-vectors- [ <=> ] 2.33K --.-KB/s in 0s

2022-10-05 07:26:16 (41.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [2386]
```

This message differs from the one shown in the lecture code, but since no error was raised I went on to the next cell:

```python
from gensim.models import KeyedVectors
```

and then:

```python
model = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary=True)
```

Running the code above produces the following errors:

```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
 in
----> 1 model = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary=True)

5 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
   1436         return _load_word2vec_format(
   1437             cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1438             limit=limit, datatype=datatype)
   1439
   1440     def get_keras_embedding(self, train_embeddings=False):

/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    170     logger.info("loading projection weights from %s", fname)
    171     with utils.smart_open(fname) as fin:
--> 172         header = utils.to_unicode(fin.readline(), encoding=encoding)
    173         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    174         if limit:

/usr/lib/python3.7/gzip.py in readline(self, size)
    383     def readline(self, size=-1):
    384         self._check_not_closed()
--> 385         return self._buffer.readline(size)
    386
    387

/usr/lib/python3.7/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

/usr/lib/python3.7/gzip.py in read(self, size)
    472                 # jump to the next member, if there is one.
    473                 self._init_read()
--> 474                 if not self._read_gzip_header():
    475                     self._size = self._pos
    476                     return b""

/usr/lib/python3.7/gzip.py in _read_gzip_header(self)
    420
    421         if magic != b'\037\213':
--> 422             raise OSError('Not a gzipped file (%r)' % magic)
    423
    424         (method, flag,

OSError: Not a gzipped file (b'

 in
----> 1 model = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary=True)

4 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
   1436         return _load_word2vec_format(
   1437             cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1438             limit=limit, datatype=datatype)
   1439
   1440     def get_keras_embedding(self, train_embeddings=False):

/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    210                     word.append(ch)
    211                 word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 212                 weights = fromstring(fin.read(binary_len), dtype=REAL).astype(datatype)
    213                 add_word(word, weights)
    214         else:

/usr/lib/python3.7/gzip.py in read(self, size)
    285             import errno
    286             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287         return self._buffer.read(size)
    288
    289     def read1(self, size=-1):

/usr/lib/python3.7/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

/usr/lib/python3.7/gzip.py in read(self, size)
    491                 break
    492             if buf == b"":
--> 493                 raise EOFError("Compressed file ended before the "
    494                                "end-of-stream marker was reached")
    495

EOFError: Compressed file ended before the end-of-stream marker was reached
```

![Image description](https://ddjkaamml8q8x.cloudfront.net/questions/2022-10-06/99ecb093-a106-47eb-8fe7-6bae13fc28ea.png)

Accepted Answer

Here is the download procedure that ended up working in my environment.

### Step 1
In a web browser that has developer tools, such as Chrome, open `https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM` and then open the developer tools.

### Step 2
Figure below: the linked page (left) and the developer tools opened on that page (right).
![Developer tools](https://ddjkaamml8q8x.cloudfront.net/questions/2022-10-06/076aa3bf-6566-49a0-94dd-ac4cc8aaccf7.png)
Copy the link that follows `action` in the form `form id="downloadForm"`, which sits directly above the button `input type="submit"`. In the screenshot above, it is the part that looks like `https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=xxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxxxx`.
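
For readers who would rather not copy the link by hand, the lookup in Step 2 can be sketched in Python with the standard library's `html.parser`. This is only an illustration under assumptions: `DownloadFormParser` and `extract_download_url` are hypothetical names, the page is assumed to contain the `form id="downloadForm"` described above, and the HTML is assumed to have been saved from the browser. The page layout may change, and a page fetched from Colab may differ from the one the browser shows.

```python
from html.parser import HTMLParser
from typing import Optional


class DownloadFormParser(HTMLParser):
    """Remembers the `action` attribute of the first form with the given id."""

    def __init__(self, form_id: str = "downloadForm") -> None:
        super().__init__()
        self.form_id = form_id
        self.action: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <form> tags, and stop after the first match.
        if tag != "form" or self.action is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == self.form_id:
            # HTMLParser has already unescaped entities such as &amp; here.
            self.action = attr_map.get("action")


def extract_download_url(page_html: str, form_id: str = "downloadForm") -> Optional[str]:
    """Return the form's action URL, or None if no matching form is found."""
    parser = DownloadFormParser(form_id)
    parser.feed(page_html)
    return parser.action
```

Passing the saved page source to `extract_download_url` should return the same link that the developer tools show; a result of `None` means the page does not have the expected form.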

### Step 3
Use the copied link with `wget` in Google Colaboratory.

```shell
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget -O $FILE_NAME "https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=xxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxxxx"
```
Write the command as shown, paste in the link you copied in place of the one above, and run it.

### Step 4
This should download the complete file, and `ls -lh` should report a size of `1.6G`. In my environment, running

```shell
!md5sum $FILE_NAME
```

printed `1c892c4707a8a1a508b01a01735c0339  GoogleNews-vectors-negative300.bin.gz`. If the file was not corrupted during the download, you should get the same MD5 hash, `1c892c4707a8a1a508b01a01735c0339`.
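
The same check can be done from Python before the file is handed to gensim. Below is a minimal sketch using the standard `hashlib` module; `md5_of_file` and `matches_expected` are hypothetical helper names, and the expected value is simply the hash reported above.

```python
import hashlib

# MD5 hash reported in this answer for the complete file.
EXPECTED_MD5 = "1c892c4707a8a1a508b01a01735c0339"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so a 1.6G file is never read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_md5: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()
```

In Colab, `matches_expected("/content/GoogleNews-vectors-negative300.bin.gz")` returning `False` would mean the download should be repeated before calling `load_word2vec_format`.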

If the file was not retrieved in full, or if it is corrupted, you will probably get an `EOFError` like the one above.
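
The `wget` output in the question already hints at the problem: the response was `text/html` and only 2386 bytes were saved, which suggests that an HTML page rather than the archive was written to disk. A quick way to tell the cases apart is to look at the first bytes of the file, since a gzip file starts with the two bytes `b'\037\213'` that appear in the traceback. The sketch below uses only the standard library, and `describe_download` is a hypothetical helper name.

```python
import os

# Same two bytes as b'\037\213', which Python's gzip module checks for.
GZIP_MAGIC = b"\x1f\x8b"


def describe_download(path: str) -> str:
    """Give a rough description of what was actually saved at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(GZIP_MAGIC):
        return f"gzip data, {size} bytes"
    # A page saved instead of the model usually begins with an HTML prologue.
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return f"HTML page, {size} bytes (not the model file)"
    return f"unknown content, {size} bytes"
```

A result of "gzip data" only shows that the header is right. A truncated archive passes this check and still raises `EOFError`, so the MD5 comparison remains the decisive test.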

![Execution example](https://ddjkaamml8q8x.cloudfront.net/questions/2022-10-06/5e06ce0a-77b7-4f62-bb3d-30cb07115b21.png)

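As for the statement in your addendum that you "were able to download the file": if the MD5 hash differs, the download did not actually succeed. Matching hash values indicate that the files are identical, so please check yours.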
