cuDNN initialization error on Google Colaboratory

Background

I am implementing U-Net on Google Colab to perform semantic segmentation.
I wrote the program by following a reference site, but the following error occurs:

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node model_1/conv2d_24/Conv2D (defined at <ipython-input-22-c8244c51f154>:3) ]] [Op:__inference_train_function_5385]

I would appreciate any advice on how to resolve it.

What I want to achieve

Run the reference site's program on Google Colab.
Specifically, I want to run the training part of the reference site's program.

Problem / error message

An error is raised saying that cuDNN failed to initialize.

UnknownError:  Failed to get convolution algorithm. 
This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model_1/conv2d_24/Conv2D (defined at <ipython-input-22-c8244c51f154>:3) ]] [Op:__inference_train_function_5385]

Relevant source code

python
# Install each module to match the versions used by the reference site
!pip install tensorflow-gpu==2.5.0
!pip install keras==2.4.3
!pip install matplotlib==3.4.2
!pip install tqdm

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Add the repository stored on Google Drive to the import path
import sys
sys.path.append('/content/drive/My Drive/unet-master/unet-master')

# Import data.py
from data import *

# Run the data augmentation part from the reference site
data_gen_args = dict(rotation_range=0.2,  # rotation
                    width_shift_range=0.05,  # horizontal shift
                    height_shift_range=0.05,  # vertical shift
                    shear_range=0.05,  # shear
                    zoom_range=0.05,  # zoom
                    horizontal_flip=True,  # horizontal flip
                    fill_mode='nearest')
myGenerator = trainGenerator(20,'/content/drive/My Drive/unet-master/unet-master/data/membrane/train','image','label',
                             data_gen_args,save_to_dir = "/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug")
num_batch = 3
for i,batch in enumerate(myGenerator):
    if i >= num_batch:
        break
image_arr,mask_arr = geneTrainNpy("/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug/","/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug/")

# Run the training part from the reference site
from model import *
from data import *

model = unet()
model_checkpoint = ModelCheckpoint('unet_membrane.hdf5', monitor='loss',verbose=1, save_best_only=True)

# The part where the error occurs
imgs_train,imgs_mask_train = geneTrainNpy("/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug","/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug")
history = model.fit(imgs_train, imgs_mask_train, batch_size=8, epochs=1000, verbose=1,validation_split=0.2,
                    shuffle=True, callbacks=[model_checkpoint])

What I tried

I searched for the error and found https://qiita.com/Ka-k/items/cb942855ab669ff60630 , but it did not solve the problem.
I checked the cuDNN and CUDA versions to see whether they are compatible.
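For reference, a version check like the one described above could be sketched as follows. This is only an illustration, not the code actually used in the question: the helper names (`parse_nvcc_release`, `installed_cuda_version`, `tensorflow_build_info`) are hypothetical, and it assumes that `nvcc` is on the PATH of the Colab runtime. It compares the CUDA toolkit on the machine with the CUDA/cuDNN versions that the installed TensorFlow wheel reports it was built against.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def installed_cuda_version():
    """Ask nvcc for the CUDA toolkit version; None when nvcc is not available."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def tensorflow_build_info():
    """Report the CUDA/cuDNN versions the TensorFlow wheel was built against."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment
    info = tf.sysconfig.get_build_info()  # CPU-only builds have no CUDA keys
    return {
        "tensorflow": tf.__version__,
        "cuda": info.get("cuda_version"),
        "cudnn": info.get("cudnn_version"),
    }


print("CUDA toolkit on the machine:", installed_cuda_version())
print("TensorFlow build:", tensorflow_build_info())
```

If the two CUDA versions differ, a version mismatch is one plausible cause of the cuDNN initialization failure.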

jbpb0

2022/09/30 05:42

> Failed to get convolution algorithm. This is probably because cuDNN failed to initialize,

According to point "2. You're out of memory" in waterproof's answer (Jun 9, 2019 at 4:09) at https://stackoverflow.com/questions/53698035/failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-in , this error also seems to appear when the GPU runs out of memory.

> history = model.fit(imgs_train, imgs_mask_train, batch_size=8, epochs=1000, verbose=1,validation_split=0.2, shuffle=True, callbacks=[model_checkpoint])

If that is the line raising the error, what happens when you make "batch_size=" drastically smaller, such as 1 or 2?
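If GPU memory were the cause, another commonly suggested measure besides a smaller batch size is to let TensorFlow allocate GPU memory on demand instead of reserving it all at startup. The following is a minimal sketch under that assumption, not something taken from this thread; the function name `enable_gpu_memory_growth` is hypothetical, and the call has to run before any model is built or any tensor is placed on the GPU.

```python
def enable_gpu_memory_growth():
    """Turn on on-demand GPU memory allocation; return the number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed, nothing to configure
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU has been initialized by TensorFlow.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


print("GPUs configured:", enable_gpu_memory_growth())
```

On a runtime without a GPU this simply reports 0, which is itself a useful hint that the accelerator setting of the notebook should be checked.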

osumosan

2022/09/30 05:46 (edited)

Epoch 1/10
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-34-7315ca317ecc> in <module>
      1 imgs_train,imgs_mask_train = geneTrainNpy("/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug","/content/drive/My Drive/unet-master/unet-master/data/membrane/train/aug")
      2 history = model.fit(imgs_train, imgs_mask_train, batch_size=1, epochs=10, verbose=1,validation_split=0.2,
----> 3                     shuffle=True, callbacks=[model_checkpoint])

6 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model_2/conv2d_48/Conv2D (defined at <ipython-input-33-1f69bafe71d6>:3) ]] [Op:__inference_train_function_7990]

Function call stack:
train_function

That is what I get with batch_size=1. I also tried reducing the number of epochs, but the same error occurs.


1 answer

Self-resolved

As of 2022/09/30, the CUDA version on Google Colab is 11.1.
After checking the TensorFlow/CUDA compatibility table, I downgraded to tensorflow==2.4.0,
and the program then ran correctly.
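The reasoning behind the downgrade can be sketched as a small lookup. This is an illustration under stated assumptions, not an official tool: the table below contains only the two versions discussed in this thread (TensorFlow 2.4.0 tested with CUDA 11.0, TensorFlow 2.5.0 tested with CUDA 11.2, as listed in TensorFlow's tested build configurations), the helper names are hypothetical, and it assumes the simplified rule that a TensorFlow build should not require a newer CUDA than the one on the machine.

```python
# Tested CUDA version per TensorFlow release (only the two releases discussed here).
TESTED_CUDA = {
    "2.4.0": "11.0",
    "2.5.0": "11.2",
}


def as_tuple(version):
    """Turn '11.1' into (11, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_supported(installed_cuda, table=TESTED_CUDA):
    """Return the newest TensorFlow release whose tested CUDA is not newer
    than the installed one, or None if no release in the table qualifies."""
    candidates = [
        tf_version
        for tf_version, cuda in table.items()
        if as_tuple(cuda) <= as_tuple(installed_cuda)
    ]
    return max(candidates, key=as_tuple) if candidates else None


print(newest_supported("11.1"))  # prints 2.4.0
```

With CUDA 11.1, TensorFlow 2.5.0 is ruled out because it was tested with CUDA 11.2, which leaves 2.4.0 and matches the change described here.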

・Before

python
# Install each module to match the versions used by the reference site
!pip install tensorflow-gpu==2.5.0
!pip install keras==2.4.3
!pip install matplotlib==3.4.2
!pip install tqdm

・After

python
# Install each module to match the versions used by the reference site
!pip install tensorflow==2.4.0
!pip install keras==2.4.3
!pip install matplotlib==3.4.2
!pip install tqdm