Context and goal

I am working through a convolutional neural network tutorial using TensorFlow (GPU).
Problem / error message
```
C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\python.exe C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
2018-05-05 14:02:37.074310: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-05-05 14:02:37.322147: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.60GiB
2018-05-05 14:02:37.322485: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-05-05 14:02:49.060262: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 957.03MiB. Current allocation summary follows.
(snip)
2018-05-05 14:02:49.125925: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1012.71MiB
2018-05-05 14:02:49.126076: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats:
Limit:        1442503065
InUse:        1061908480
MaxInUse:     1061908480
NumAllocs:           172
MaxAllocSize: 1003520000
2018-05-05 14:02:49.126345: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:277] ***************************************************************************_________________________
2018-05-05 14:02:49.126559: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\framework\op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call
    return fn(*args)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn
    status, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py", line 51, in <module>
    train_loss = sess.run(loss , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run
    options, run_metadata)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'Conv2D', defined at:
  File "C:/Users/TAKANO/PycharmProjects/180505_CNN/main.py", line 16, in <module>
    h1_conv = tf.nn.relu(conv2d(tf.reshape(x , [-1,28,28,1]) , w1) + b1)
  File "C:\Users\TAKANO\PycharmProjects\180505_CNN\func_set.py", line 27, in conv2d
    return tf.nn.conv2d(x , W , strides = [1,1,1,1] , padding = "SAME")
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 725, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
    op_def=op_def)
  File "C:\Users\TAKANO\Anaconda3\envs\gputf15rc0\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1
```
Relevant source code

```python
import numpy as np
import tensorflow as tf
import pickle


from func_set import *
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/" , one_hot = True)
x = tf.placeholder(tf.float32 , [None,784])
y = tf.placeholder(tf.float32 , [None,10])

w1 = W_init([5,5,1,32])
b1 = B_init([32])

h1_conv = tf.nn.relu(conv2d(tf.reshape(x , [-1,28,28,1]) , w1) + b1)
h1_pool = max_pool2x2(h1_conv)

w2 = W_init([5,5,32,64])
b2 = B_init([64])

h2_conv = tf.nn.relu(conv2d(h1_pool , w2) + b2)
h2_pool = max_pool2x2(h2_conv)

w_fc1 = W_init([7*7*64,512])
b_fc1 = B_init([512])
h_fc1 = tf.nn.relu(tf.matmul(tf.reshape(h2_pool , [-1,7*7*64]) , w_fc1) + b_fc1)

# dropout
keep_drop = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1 , keep_drop)

w_fc2 = W_init([512,10])
b_fc2 = B_init([10])
out = tf.nn.softmax(tf.matmul(h_fc1_drop , w_fc2) + b_fc2)

loss = Error_cross_entropy(out , y)
accu = accuracy(out , y)
train = AdamOptimizer(0.01 , loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(2000):
        batch = mnist.train.next_batch(2)
        feed_dict = {x:batch[0] , y:batch[1] , keep_drop:0.5}
        sess.run(train , feed_dict = feed_dict)
        if i % 10 == 0:
            train_accu = sess.run(accu , feed_dict = {x:batch[0] , y:batch[1] , keep_drop:1.0})
            train_loss = sess.run(loss , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
            print("step:{} accuracy:{} loss:{}".format(i+1 , train_accu , train_loss))
    final_accu = sess.run(accu , feed_dict = {x:mnist.test.images , y:mnist.test.labels , keep_drop:1.0})
    print("Optimization DONE!")
    print("Final Accuracy:{}".format(final_accu))
```
What I tried

Looking at other people's questions, the most common answer was "reduce the batch size", so I reduced the batch size as far as it would go, but I still get the same error.
Supplementary information (framework/tool versions, etc.)
Windows 10 Home 64-bit
Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
8.00 GB RAM
GeForce GTX 1050
Python 3.6
tensorflow-gpu 1.5.0rc0
The GPU is running out of memory trying to allocate an enormous tensor (shape [10000,32,28,28]). A batch dimension of 32 would be fine, but that 10000 must be the size of the test set. Why is that allocation happening at all?
I suspect evaluation also has to be done in mini-batches or it will struggle. I don't know the exact fix, though.
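To make the mini-batch-evaluation idea concrete: the OOM comes from feeding all of `mnist.test.images` (10000 samples) in one `sess.run`, so the fix is to evaluate in chunks and take a size-weighted average. A minimal sketch of just that chunk-and-average logic; `eval_in_batches` and `metric_fn` are hypothetical names, not part of the question's code:

```python
def eval_in_batches(metric_fn, images, labels, batch_size=500):
    """Average a per-batch metric over the dataset in chunks, so the
    full 10000-image test set is never sent to the GPU at once.

    metric_fn(xb, yb) stands in for a call like
        sess.run(accu, feed_dict={x: xb, y: yb, keep_drop: 1.0})
    """
    total, n = 0.0, len(images)
    for start in range(0, n, batch_size):
        xb = images[start:start + batch_size]
        yb = labels[start:start + batch_size]
        # Weight each chunk by its actual size so a short final
        # chunk does not skew the average.
        total += metric_fn(xb, yb) * len(xb)
    return total / n
```

In the question's `main.py`, the two full-test-set `sess.run` calls (the `train_loss` and `final_accu` lines) could then become something like `eval_in_batches(lambda xb, yb: sess.run(loss, feed_dict={x: xb, y: yb, keep_drop: 1.0}), mnist.test.images, mnist.test.labels)`.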
By default TensorFlow tries to grab all of the GPU's memory, so configuring it to allocate only what it actually needs might help.
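For reference, a sketch of that session configuration with the TF 1.x API (note this only changes how memory is reserved; it cannot make the single 957 MiB tensor fit a 2 GiB card, so batched evaluation is still needed):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving
# (almost) all of it when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternative: cap the fraction of GPU memory TensorFlow may use.
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    ...  # same training loop as in the question
```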
To isolate the problem: if you run the official MNIST tutorial, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_deep.py , do you get the same error?
When I ran that tutorial, it ended in a strange way: neither an error message nor the 'test accuracy' output was displayed.