PyTorch raises an error when I try to use the GPU

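PyTorch raises an error whenever I try to use the GPU.
Calling torch.cuda.current_device() or torch.cuda.is_available() fails inside cudaGetDeviceCount(), so the GPU is unusable.
I honestly don't know much about GPUs or hardware in general. I searched for the error but couldn't identify the cause. What could be causing this?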

python
>>> import torch
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/xxx/.venv/yyy/lib/python3.6/site-packages/torch/cuda/__init__.py", line 366, in current_device
    _lazy_init()
  File "/xxx/.venv/yyy/lib/python3.6/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory
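For anyone trying to reproduce this, a small probe like the one below may help gather the relevant facts in one place without the interpreter session ending in a traceback. It is only a sketch: the function name `probe_torch_cuda` and the dictionary keys are made up for illustration, and it assumes nothing beyond the public `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()` attributes.

```python
def probe_torch_cuda():
    """Collect basic facts about the installed torch build without crashing.

    The function name and the dictionary keys are illustrative only.
    """
    info = {"torch_installed": False}
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing, or one of its shared libraries failed to load.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version the wheel was built against (None for CPU-only builds).
    info["built_for_cuda"] = torch.version.cuda
    try:
        available = torch.cuda.is_available()
        info["cuda_available"] = available
        info["device_count"] = torch.cuda.device_count() if available else 0
    except RuntimeError as exc:
        # This is where an error such as "Error 2: out of memory" would land.
        info["cuda_available"] = False
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    for key, value in probe_torch_cuda().items():
        print(f"{key}: {value}")
```

The printed dictionary can be pasted into a question as-is, which makes it easier for others to compare builds.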

Addendum

GPU

nvidia-smi returns the output below, so I assume the GPUs themselves are working.

nvidia-smi
Tue Nov 10 23:49:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 00000000:03:00.0 Off |                    0 |
| N/A   32C    P8    21W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P8    21W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          On   | 00000000:82:00.0 Off |                    0 |
| N/A   31C    P8    20W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
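One detail in this output is easy to misread: the "CUDA Version" field in the header is the highest CUDA version the installed driver supports, not the version of the CUDA toolkit installed on the machine. The sketch below pulls both numbers out of the header line so they can be compared with the toolkit version separately. The regular expression and the names `HEADER_RE` and `parse_smi_header` are assumptions made for this example and are written against the header layout shown above. Other driver versions may format the line differently.

```python
import re

# Pattern written against the header line shown above (assumed layout).
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+[\d.]+\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return the driver version and the maximum CUDA version it supports."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return {"driver": match.group("driver"), "max_cuda": match.group("cuda")}


if __name__ == "__main__":
    sample = (
        "| NVIDIA-SMI 450.66       Driver Version: 450.66"
        "       CUDA Version: 11.0     |"
    )
    print(parse_smi_header(sample))
```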

OS and PyTorch versions

PyTorch version: 1.7.0+cu101
OS: Debian 10.6
As far as I remember, I first installed PyTorch with a plain pip install torch, then uninstalled it after running into the GPU problem and reinstalled it with "pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html".

I checked the CUDA, OS, and PyTorch versions as follows.

bash
aoies: ~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

aoies: ~$ cat /etc/debian_version
10.6

python
>>> import torch
>>> print(torch.__version__)
1.7.0+cu101
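To make the version comparison above less error-prone, the two strings can be reduced to the same "major.minor" form. The following is a rough sketch. The helper names `cuda_from_wheel_tag` and `cuda_from_nvcc` are invented here, and the rule "last digit of the cuXYZ tag is the minor version" is an assumption that holds for tags such as cu92, cu101 and cu110 but is not a documented guarantee.

```python
import re


def cuda_from_wheel_tag(version):
    """Turn a version such as '1.7.0+cu101' into '10.1'.

    Assumes the last digit of the cuXYZ tag is the minor version.
    Returns None when the version has no such tag.
    """
    _, sep, local = version.partition("+")
    digits = local[2:]
    if not sep or not local.startswith("cu") or not digits.isdigit() or len(digits) < 2:
        return None
    return f"{digits[:-1]}.{digits[-1]}"


def cuda_from_nvcc(output):
    """Extract 'major.minor' from the 'release X.Y' part of nvcc --version."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


if __name__ == "__main__":
    wheel = cuda_from_wheel_tag("1.7.0+cu101")
    toolkit = cuda_from_nvcc("Cuda compilation tools, release 10.1, V10.1.243")
    print("wheel:", wheel, "toolkit:", toolkit, "match:", wheel == toolkit)
```

A match here only says the two version strings agree. It does not by itself show that the CUDA runtime can initialize.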

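Addendum 2

I also tried the command for CUDA 11.0 (pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html), but got the same error.

I also installed CuPy to test GPU access from a different library.
The CUDA 11.0 build (cupy-cuda110) failed on import with a version-mismatch error. The CUDA 10.1 build (cupy-cuda101) did not report a version mismatch, but raised a memory error as soon as I called a function.

This suggests that:
1. CUDA 10.1 is probably the correct version.
2. The problem lies somewhere between Python and the GPU, or in the Python installation itself.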
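Point 2 could be narrowed down by calling the CUDA runtime directly, bypassing both PyTorch and CuPy. The sketch below does that with `ctypes`. It assumes a `libcudart` shared library can be found by the dynamic loader (the fallback name `libcudart.so` is a guess and may need a version suffix on a given machine), and the function name `cuda_device_count` is invented for this example. `cudaGetDeviceCount` and `cudaGetErrorString` are the real CUDA runtime entry points.

```python
import ctypes
import ctypes.util


def cuda_device_count(lib_name=None):
    """Call cudaGetDeviceCount() directly through ctypes.

    Returns (error_code, error_message, device_count), or None when the
    CUDA runtime library cannot be loaded at all.
    """
    # The fallback file name is an assumption; adjust it for the local install.
    name = lib_name or ctypes.util.find_library("cudart") or "libcudart.so"
    try:
        runtime = ctypes.CDLL(name)
    except OSError:
        return None

    runtime.cudaGetDeviceCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    runtime.cudaGetDeviceCount.restype = ctypes.c_int
    runtime.cudaGetErrorString.argtypes = [ctypes.c_int]
    runtime.cudaGetErrorString.restype = ctypes.c_char_p

    count = ctypes.c_int(0)
    code = runtime.cudaGetDeviceCount(ctypes.byref(count))
    message = (runtime.cudaGetErrorString(code) or b"").decode()
    return code, message, count.value


if __name__ == "__main__":
    print(cuda_device_count())
```

If this also reports error code 2, the failure is presumably below the Python packages (driver, runtime, or process environment). If it succeeds, the Python environment becomes the more likely suspect.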

With cupy-cuda110

python
>>> import cupy
Traceback (most recent call last):
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/__init__.py", line 21, in <module>
    from cupy import core  # NOQA
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/__init__.py", line 42, in <module>
    six.reraise(ImportError, ImportError(msg), exc_info[2])
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/__init__.py", line 21, in <module>
    from cupy import core  # NOQA
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: CuPy is not correctly installed.

If you are using wheel distribution (cupy-cudaXX), make sure that the version of CuPy you installed matches with the version of CUDA on your host.
Also, confirm that only one CuPy package is installed:
  $ pip freeze

If you are building CuPy from source, please check your environment, uninstall CuPy and reinstall it with:
  $ pip install cupy --no-cache-dir -vvvv

Check the Installation Guide for details:
  https://docs.cupy.dev/en/latest/install.html

original error: libcublas.so.11: cannot open shared object file: No such file or directory
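The import fails because the dynamic loader cannot find `libcublas.so.11`, which is consistent with only the CUDA 10.1 toolkit being installed. A quick way to see which cuBLAS versions are loadable is sketched below. The helper `can_load` is a made-up name, and `libcublas.so.10` as the file name for the 10.x toolkit is an assumption that should be checked against the local CUDA lib64 directory.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # The 10.x file name is assumed; the 11 name comes from the error above.
    for name in ("libcublas.so.10", "libcublas.so.11"):
        print(name, "loadable" if can_load(name) else "not found")
```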

With cupy-cuda101

python
>>> import cupy as cp
>>> x = cp.arange(6).reshape(2, 3).astype('f')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/_creation/ranges.py", line 55, in arange
    ret = cupy.empty((size,), dtype=dtype)
  File "/xxx/.venv/tff-IwBB_zea/lib/python3.6/site-packages/cupy/_creation/basic.py", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/core/core.pyx", line 138, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 578, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1250, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1270, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/device.pyx", line 25, in cupy.cuda.device.get_device_id
  File "cupy_backends/cuda/api/runtime.pyx", line 275, in cupy_backends.cuda.api.runtime.getDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 247, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

meg_

2020/11/10 22:28

Is the GPU working?

meg_

2020/11/11 00:07

How did you install PyTorch? Which OS and PyTorch version are you using?

1 answer

The nvidia-smi output shows "CUDA Version: 11.0".

How about installing with the command for CUDA 11.0 (pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html)?

Posted 2020/11/11 10:52

meg_

aoies

2020/11/12 23:47 (edited)

I also tried the command for version 11, but got the same error...

meg_

2020/11/13 03:23

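I see. How about asking the manufacturer which CUDA versions your GPU (Tesla K40m) supports? It is a discontinued product, and there is not much information about it. (The nvidia-smi output says CUDA Version: 11.0, but I suspect the card does not support it.) Does it support CUDA 10.2?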

aoies

2020/11/13 09:16

Other people who share the same GPUs are able to use them from cupy and pytorch with version 10.1, so I think the problem lies elsewhere.

meg_

2020/11/13 10:26

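> Other people who share the same GPUs are able to use them from cupy and pytorch with version 10.1

Oh, I see! Comparing your environment with theirs should lead to a solution quickly. I'm sorry for making nothing but off-target suggestions.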
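One way to do that comparison is to have both users run the same small script and diff the results. The following is only a sketch. The function names are invented, and the choice of settings to record (a few environment variables and the process address-space limit) is an assumption about what might differ between accounts, not a confirmed cause of this error.

```python
import os
import resource


def environment_snapshot():
    """Record a few per-user settings that could plausibly differ between accounts."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        # None means the soft address-space limit is unlimited.
        "address_space_limit": None if soft == resource.RLIM_INFINITY else soft,
    }


def diff_snapshots(mine, theirs):
    """Return only the keys whose values differ, as (mine, theirs) pairs."""
    keys = sorted(set(mine) | set(theirs))
    return {k: (mine.get(k), theirs.get(k)) for k in keys if mine.get(k) != theirs.get(k)}


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

Each user can print their snapshot, and `diff_snapshots` then highlights the settings worth looking at first.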
