I am using Windows 10 with an RTX 2070, and I am training a Keras implementation of YOLOv3.
However, to get training to run I have to lower the batch size to 4 and insert code like the following to keep GPU memory usage down.
Python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
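For reference, I understand that instead of memory growth it is also possible to put a hard cap on how much GPU memory TensorFlow may allocate. The following is only a minimal sketch, assuming the installed TensorFlow version exposes tf.config.experimental.set_virtual_device_configuration, and the 6144 MB limit is just an example value, not a recommendation.

Python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Cap the first GPU at roughly 6 GB instead of letting allocations grow.
        # The 6144 MB figure is only an illustrative value.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
    except RuntimeError as e:
        # Virtual devices must be configured before GPUs have been initialized
        print(e)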
When training starts, the GPU's name (RTX 2070) is printed, so the GPU itself should be recognized; why do I still have to do this?
I am asking because the GTX 1080 in our lab, which also has 8 GB of VRAM, runs the same Keras implementation of YOLOv3 on Ubuntu with a batch size of 32, so this struck me as strange. I apologize if this is not a programming question and therefore not something I should be asking on Teratail.
For reference, the error message that is printed looks like this.
ErrorMessage
shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
Training sometimes also stops with a different error, so I am adding that as well.
ErrorMessage2
OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[16,104,104,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    _main()
  File "train.py", line 102, in _main
    callbacks=[logging, checkpoint, reduce_lr, early_stopping])
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\engine\training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "D:\Users\myusername\anaconda3\envs\yolov3_gpu_2\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[16,104,104,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node leaky_re_lu_9/LeakyRelu}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[loss_1/add_74/_5295]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[16,104,104,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node leaky_re_lu_9/LeakyRelu}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
This looks like the same error as in the following article.
https://qiita.com/enoughspacefor/items/1c09a27877877c56f25a
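The Hint lines in the message say to add report_tensor_allocations_upon_oom to RunOptions in order to see which tensors are allocated when the OOM happens. I have not tried this yet, but a minimal sketch of how it might be passed through could look like the following, assuming the standalone Keras TensorFlow backend forwards an options keyword from compile() down to Session.run(); the tiny model below is only a stand-in to show where the keyword goes, not the YOLOv3 model from train.py.

Python
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense

# Ask TensorFlow to report allocated tensors when an OOM error occurs,
# as suggested by the Hint lines in the traceback above.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Trivial placeholder model; in train.py this would be the YOLOv3 model instead.
model = Sequential([Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse', options=run_opts)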