Context and goal

I am working through a convolutional neural network tutorial using TensorFlow (GPU).
Problem / error message
```
C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\python.exe C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
2018-05-05 14:02:37.074310: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-05-05 14:02:37.322147: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.60GiB
2018-05-05 14:02:37.322485: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-05-05 14:02:49.060262: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 957.03MiB. Current allocation summary follows.
(snip)
2018-05-05 14:02:49.125925: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1012.71MiB
2018-05-05 14:02:49.126076: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats:
Limit:        1442503065
InUse:        1061908480
MaxInUse:     1061908480
NumAllocs:           172
MaxAllocSize: 1003520000
2018-05-05 14:02:49.126345: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:277] ***************************************************************************_________________________
2018-05-05 14:02:49.126559: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\framework\op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call
    return fn(*args)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn
    status, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py", line 51, in <module>
    train_loss = sess.run(loss , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run
    options, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'Conv2D', defined at:
  File "C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py", line 16, in <module>
    h1_conv = tf.nn.relu(conv2d(tf.reshape(x , [-1,28,28,1]) , w1) + b1)
  File "C:\Users\TAKANO\PycharmProjects\180505_CNN\func_set.py", line 27, in conv2d
    return tf.nn.conv2d(x , W , strides = [1,1,1,1] , padding = "SAME")
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 725, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
    op_def=op_def)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1
```
Relevant source code

```python
import numpy as np
import tensorflow as tf
import pickle


from func_set import *
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/" , one_hot = True)
x = tf.placeholder(tf.float32 , [None,784])
y = tf.placeholder(tf.float32 , [None,10])

w1 = W_init([5,5,1,32])
b1 = B_init([32])

h1_conv = tf.nn.relu(conv2d(tf.reshape(x , [-1,28,28,1]) , w1) + b1)
h1_pool = max_pool2x2(h1_conv)

w2 = W_init([5,5,32,64])
b2 = B_init([64])

h2_conv = tf.nn.relu(conv2d(h1_pool , w2) + b2)
h2_pool = max_pool2x2(h2_conv)

w_fc1 = W_init([7*7*64,512])
b_fc1 = B_init([512])
h_fc1 = tf.nn.relu(tf.matmul(tf.reshape(h2_pool , [-1,7*7*64]) , w_fc1) + b_fc1)

# dropout
keep_drop = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1 , keep_drop)

w_fc2 = W_init([512,10])
b_fc2 = B_init([10])
out = tf.nn.softmax(tf.matmul(h_fc1_drop , w_fc2) + b_fc2)

loss = Error_cross_entropy(out , y)
accu = accuracy(out , y)
train = AdamOptimizer(0.01 , loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(2000):
        batch = mnist.train.next_batch(2)
        feed_dict = {x:batch[0] , y:batch[1] , keep_drop:0.5}
        sess.run(train , feed_dict = feed_dict)
        if i % 10 == 0:
            train_accu = sess.run(accu , feed_dict = {x:batch[0] , y:batch[1] , keep_drop:1.0})
            train_loss = sess.run(loss , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
            print("step:{} accuracy:{} loss:{}".format(i+1 , train_accu , train_loss))
    final_accu = sess.run(accu , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
    print("Optimization DONE!")
    print("Final Accuracy:{}".format(final_accu))
```
What I tried

Looking at other people's questions, the most common answer was "reduce the batch size", so I reduced the batch size as far as it would go, but I still get the same error.
Supplementary information (framework/tool versions, etc.)
Windows 10 Home 64-bit
Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
8.00 GB RAM
GeForce GTX 1050
Python 3.6
tensorflow-gpu 1.5.0rc0
The GPU is running out of memory trying to allocate an enormous tensor (shape [10000,32,28,28]). A batch dimension of 32 would be fine, but that 10000 must be the size of the test set. Why is that allocation happening at all?
I suspect evaluation also has to be done in mini-batches or it will struggle. I don't know the exact fix, though.
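To make the mini-batch-evaluation idea concrete: the OOM comes from feeding all of `mnist.test.images` (10000 samples) in one `sess.run`, so the fix is to evaluate in chunks and take a size-weighted average. A minimal sketch of just that chunk-and-average logic; `eval_in_batches` and `metric_fn` are hypothetical names, not part of the question's code:

```python
def eval_in_batches(metric_fn, images, labels, batch_size=500):
    """Average a per-batch metric over the dataset in chunks, so the
    full 10000-image test set is never sent to the GPU at once.

    metric_fn(xb, yb) stands in for a call like
        sess.run(accu, feed_dict={x: xb, y: yb, keep_drop: 1.0})
    """
    total, n = 0.0, len(images)
    for start in range(0, n, batch_size):
        xb = images[start:start + batch_size]
        yb = labels[start:start + batch_size]
        # Weight each chunk by its actual size so a short final
        # chunk does not skew the average.
        total += metric_fn(xb, yb) * len(xb)
    return total / n
```

In the question's `main.py`, the two full-test-set `sess.run` calls (the `train_loss` and `final_accu` lines) could then become something like `eval_in_batches(lambda xb, yb: sess.run(loss, feed_dict={x: xb, y: yb, keep_drop: 1.0}), mnist.test.images, mnist.test.labels)`.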
By default TensorFlow tries to grab all of the GPU's memory, so configuring it to allocate only what it actually needs might help.
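For reference, a sketch of that session configuration with the TF 1.x API (note this only changes how memory is reserved; it cannot make the single 957 MiB tensor fit a 2 GiB card, so batched evaluation is still needed):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving
# (almost) all of it when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternative: cap the fraction of GPU memory TensorFlow may use.
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    ...  # same training loop as in the question
```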
To isolate the problem: if you run the official MNIST tutorial, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_deep.py , do you get the same error?
When I ran that tutorial, it ended in a strange way: neither an error message nor the 'test accuracy' output was displayed.