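Background and goal
I am setting up an environment to run LeapMind's blueoil. blueoil is a software stack that makes it easy to train machine learning models optimized for edge devices.
blueoil
Because blueoil runs in Docker, you can train on a GPU without installing nvidia-cuda-toolkit or cuDNN on your own PC.
Problem and error message
Building the Docker image provided by blueoil worked without problems, but when I actually start training, I get the error below.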
2021-12-04 05:37:15.880455: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-12-04 05:37:15.909105: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv1/conv2d/Conv2D}}]]
     [[train/gradients/AddN_31/_233]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv1/conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/blueoil", line 33, in <module>
    sys.exit(load_entry_point('blueoil', 'console_scripts', 'blueoil')())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/blueoil/blueoil/cmd/main.py", line 102, in train
    experiment_id, checkpoint_name = run_train(config, experiment_id, recreate, profile_step)
  File "/home/blueoil/blueoil/cmd/train.py", line 310, in train
    run(config_file, experiment_id, recreate, profile_step)
  File "/home/blueoil/blueoil/cmd/train.py", line 301, in run
    start_training(config, profile_step)
  File "/home/blueoil/blueoil/cmd/train.py", line 232, in start_training
    run_metadata=run_meta,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv1/conv2d/Conv2D (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[train/gradients/AddN_31/_233]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv1/conv2d/Conv2D (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv1/conv2d/Conv2D':
  File "usr/local/bin/blueoil", line 33, in <module>
    sys.exit(load_entry_point('blueoil', 'console_scripts', 'blueoil')())
  File "usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "home/blueoil/blueoil/cmd/main.py", line 102, in train
    experiment_id, checkpoint_name = run_train(config, experiment_id, recreate, profile_step)
  File "home/blueoil/blueoil/cmd/train.py", line 310, in train
    run(config_file, experiment_id, recreate, profile_step)
  File "home/blueoil/blueoil/cmd/train.py", line 301, in run
    start_training(config, profile_step)
  File "home/blueoil/blueoil/cmd/train.py", line 105, in start_training
    output = model.inference(images_placeholder, is_training_placeholder)
  File "home/blueoil/blueoil/networks/classification/base.py", line 67, in inference
    base = self.base(images, is_training)
  File "home/blueoil/blueoil/networks/classification/lmnet_v1.py", line 69, in base
    x = _lmnet_block('conv1', images, 32, 3)
  File "home/blueoil/blueoil/blocks.py", line 93, in lmnet_block
    data_format=data_format)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/convolutional.py", line 424, in conv2d
    return layer.apply(inputs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 197, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1134, in __call__
    return self.conv_op(inp, filter)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 639, in __call__
    return self.call(inp, filter)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 238, in __call__
    name=self.name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2010, in conv2d
    name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
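The exception text itself suggests checking for warnings printed before the traceback. As a rough sketch of how one might pull those lines out of a long log, the snippet below filters TensorFlow's C++ log lines by level. The regular expression is an assumption based only on the two log lines quoted in this post, and the function name is made up for illustration.

```python
import re

# Assumed line format, inferred from the log lines in this post:
# 2021-12-04 05:37:15.880455: E tensorflow/.../cuda_dnn.cc:329] message
TF_LOG = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+:\d+)\] (?P<message>.*)$"
)


def warnings_and_errors(lines):
    """Yield (level, source, message) for W/E/F-level TensorFlow log lines."""
    for line in lines:
        m = TF_LOG.match(line.strip())
        if m and m.group("level") in "WEF":
            yield m.group("level"), m.group("source"), m.group("message")


# Two lines taken from the output above: one log line, one traceback line.
sample = [
    "2021-12-04 05:37:15.880455: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR",
    "Traceback (most recent call last):",
]

for level, source, message in warnings_and_errors(sample):
    print(level, source, message)
```

In this log, the only such lines are the two `Could not create cudnn handle` errors, so the cuDNN handle creation is the earliest reported failure.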
Command I ran
docker run --rm --gpus '"device=0"' \
    -v $(pwd)/cifar:/home/blueoil/cifar \
    -v $(pwd)/config:/home/blueoil/config \
    -v $(pwd)/saved:/home/blueoil/saved \
    blueoil_root:v0.30.0-7-g0c9160b \
    blueoil train -c config/cifar10_test.py
What I tried
The message points to a cuDNN error, but I can't tell whether that is really the cause.
Just in case, I ran
export TF_FORCE_GPU_ALLOW_GROWTH=true
but it made no difference.
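One thing worth ruling out: if the `export` was run in the host shell, `docker run` does not forward host environment variables into the container unless they are passed with `-e`/`--env`. The sketch below rebuilds the command from this post with that flag added and prints it for inspection. The helper name `build_train_command` is hypothetical, and whether setting the variable inside the container resolves the error is not confirmed here.

```python
import shlex
from pathlib import Path


def build_train_command(workdir, image, config, env=None):
    """Return the `docker run` argv for the training command quoted in this post."""
    workdir = Path(workdir)
    # The literal double quotes around device=0 match the original command.
    cmd = ["docker", "run", "--rm", "--gpus", '"device=0"']
    # -e NAME=VALUE sets the variable inside the container; a host-side
    # `export` alone is not forwarded by `docker run`.
    for name, value in (env or {}).items():
        cmd += ["-e", f"{name}={value}"]
    # Same three bind mounts as in the original command.
    for sub in ("cifar", "config", "saved"):
        cmd += ["-v", f"{workdir / sub}:/home/blueoil/{sub}"]
    cmd += [image, "blueoil", "train", "-c", config]
    return cmd


cmd = build_train_command(
    workdir=Path.cwd(),
    image="blueoil_root:v0.30.0-7-g0c9160b",
    config="config/cifar10_test.py",
    env={"TF_FORCE_GPU_ALLOW_GROWTH": "true"},
)
# Print only; to actually run it, use subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Printing the command first makes it easy to confirm that `-e TF_FORCE_GPU_ALLOW_GROWTH=true` appears before the image name, which is where `docker run` expects its options.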
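On my PC, only nvidia-driver==470, nvidia-container-runtime==3.7.0-1, and docker-ce==20.10.11 are installed.
I would appreciate it if you could tell me what is causing this error and where I should look.
Thank you in advance.
Additional information (firmware/tool versions, etc.)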
OS : ubuntu 20.04.2 LTS
GPU : NVIDIA GeForce GTX 1660 Ti