TensorFlow freezes when training on a GPU

Background / what I want to achieve

I am trying to train a model on a GPU with TensorFlow.
The same program runs without problems on the CPU,
but on the GPU it freezes at the sess.run(train) call below.

sess = tf.Session()
sess.run(tf.global_variables_initializer())

sess.run(train, feed_dict={X: x_data, Y:y_data})

(Shapes: x_data = (1000, 500), y_data = (1000, 250))

Incidentally, pressing Ctrl+C does not stop the process.
When I check the GPU with nvidia-smi, the GPU does appear to be in use.
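Ctrl+C is delivered to Python as a KeyboardInterrupt only when the interpreter gets back to running Python code, so a call that is stuck inside native code never sees it. One way to find out where the process hangs is a watchdog from the standard library's faulthandler module. The following is a minimal sketch, not a fix: the helper name run_with_watchdog is hypothetical, and the lambda stands in for the real sess.run call. If the process is blocked inside the GPU driver itself, even this forced exit may not succeed.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout, file=None):
    """Call fn(); if it has not returned after `timeout` seconds,
    dump every thread's traceback to `file` and terminate the process."""
    out = sys.stderr if file is None else file
    # The watchdog runs in a separate C-level thread, so it does not rely
    # on the Python-level signal handling that Ctrl+C goes through.
    faulthandler.dump_traceback_later(timeout, exit=True, file=out)
    try:
        return fn()
    finally:
        # The call returned in time: disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Hypothetical stand-in for the real training step, e.g.
    # lambda: sess.run(train, feed_dict={X: x_data, Y: y_data})
    result = run_with_watchdog(lambda: sum(range(10)), timeout=60)
    print(result)
```

The traceback printed on timeout shows which Python line was executing, which helps confirm that the hang is inside sess.run rather than in data preparation.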

It seems to stall when feed_dict is used. How can I fix this?
Any advice would be appreciated.
I can post more code or machine specs if needed.
The nvidia-smi output is shown below.
I am concerned that GPU-Util is 0%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:02:00.0 Off |                  N/A |
| 22%   48C    P2    67W / 250W |  11645MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7449      C   python                                     11634MiB |
+-----------------------------------------------------------------------------+
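To read the two numbers in that output side by side, here is a small sketch that parses the memory and utilization columns from the GPU row above. The regular expression and the function name parse_usage are my own and only target this table layout. By default TensorFlow 1.x reserves nearly all GPU memory when a session is created, so the high memory figure alone does not show that any computation is running. The 0% utilization is the more telling value.

```python
import re

# Matches "<used>MiB / <total>MiB | <util>%" in an nvidia-smi table row.
ROW = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def parse_usage(line):
    """Return memory use (MiB) and GPU utilization (%) from one table row,
    or None if the row does not contain those columns."""
    match = ROW.search(line)
    if match is None:
        return None
    used, total, util = (int(g) for g in match.groups())
    return {"used_mib": used, "total_mib": total, "util_percent": util}


# The GPU row copied from the output above.
line = "| 22%   48C    P2    67W / 250W |  11645MiB / 12205MiB |      0%      Default |"
usage = parse_usage(line)
memory_percent = round(100 * usage["used_mib"] / usage["total_mib"], 1)
print(usage)
print("memory in use: {}%".format(memory_percent))
```

About 95% of the memory is reserved while utilization is 0%, which is consistent with a process that has started up on the GPU but is not executing any kernels.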

Below is an example that ran successfully.

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np

# Create phony x, y data in NumPy (1000 samples x 100 features), y = x * 0.1 + 0.3
x_data = np.random.rand(1000,100).astype(np.float32)
y_data = x_data * 0.1 + 0.3
print(x_data)

X = tf.placeholder(dtype = tf.float32, shape = [None, x_data.shape[1]])
Y = tf.placeholder(dtype = tf.float32, shape = [None, y_data.shape[1]])
# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but TensorFlow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([100,100], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))

###### Difference ######
y = W * X + b
#y=tf.matmul(W,X)+b      

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y -Y))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.global_variables_initializer()

# Launch the graph.
sess = tf.Session()
sess.run(init)
#print(x_data)
#print(y_data)

# Fit the line.
for step in range(201):
    #sess.run(train)
    sess.run(train,feed_dict={X:x_data[0:100],Y:y_data[0:100]})
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))

# Learns best fit is W: [0.1], b: [0.3]

# Close the Session when we're done.
sess.close()

Below is an example that did not work and froze.

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np

# Create phony x, y data in NumPy (1000 samples x 100 features), y = x * 0.1 + 0.3
x_data = np.random.rand(1000,100).astype(np.float32)
y_data = x_data * 0.1 + 0.3
print(x_data)

X = tf.placeholder(dtype = tf.float32, shape = [None, x_data.shape[1]])
Y = tf.placeholder(dtype = tf.float32, shape = [None, y_data.shape[1]])
# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but TensorFlow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([100,100], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))

###### Difference ######
#y = W * X + b
y=tf.matmul(W,X)+b


# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y -Y))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.global_variables_initializer()

# Launch the graph.
sess = tf.Session()
sess.run(init)
#print(x_data)
#print(y_data)

# Fit the line.
for step in range(201):
    #sess.run(train)
    sess.run(train,feed_dict={X:x_data[0:100],Y:y_data[0:100]})
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))

# Learns best fit is W: [0.1], b: [0.3]

# Close the Session when we're done.
sess.close()
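The only line that differs between the two scripts is the one that computes y. As a rough illustration of what changes, the sketch below reproduces both expressions in NumPy with the same shapes as one fed batch. It does not reproduce the freeze; it only shows that W * X is an elementwise product while tf.matmul(W, X) is a matrix product, and that matmul(W, X) works here only because the batch size happens to equal 100. The X @ W variant is my own addition showing the more common samples-in-rows layout.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 100), dtype=np.float32)  # one batch, like x_data[0:100]
W = rng.uniform(-1.0, 1.0, size=(100, 100)).astype(np.float32)
b = np.zeros(1, dtype=np.float32)

elementwise = W * X + b  # counterpart of `y = W * X + b`
matrix_wx = W @ X + b    # counterpart of `y = tf.matmul(W, X) + b`
matrix_xw = X @ W + b    # samples-in-rows layout (not in the original scripts)

# All three have the same shape, but the values differ.
print(elementwise.shape, matrix_wx.shape, matrix_xw.shape)
print(np.allclose(elementwise, matrix_wx))

# With any batch size other than 100, W @ X is no longer defined.
X3 = X[:3]
try:
    W @ X3
except ValueError as err:
    print("W @ X fails for a batch of 3:", err)
print((X3 @ W + b).shape)
```

Since the scripts are otherwise identical, the matrix-multiplication path on the GPU is the natural place to look for the cause of the hang.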


3 answers

https://github.com/tensorflow/tensorflow/issues/1947
A similar problem seems to be reported there, and many commenters say that upgrading the driver fixed it.
If that does not help, the motherboard may be the cause.

Posted 2018/05/21 05:14

puroko3

Total score: 185

6123_sadaharu-.

2018/05/21 10:43

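Thank you for your answer. The issue at that URL does look similar. I installed the driver only recently and 384.111 is still fairly new, so I will ask NVIDIA whether it is the right one for my GPU. As for the motherboard, this is a machine shared with other people, so changing it would be difficult. Please let me know if you notice anything else.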


Self-resolved

Checked whether the driver version matches the GPU.
-> Checked PATH
-> Ran nvcc -V and saw an error
-> Set PATH so that nvcc -V works
-> Fixed
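The checks above can be sketched in Python using only the standard library. The function names are hypothetical, and the thread does not say which directory had to be added to PATH; /usr/local/cuda/bin is a common default on Linux and is only an assumption here.

```python
import os
import shutil
import subprocess


def cuda_path_entries(path=None):
    """Return the PATH entries whose name mentions CUDA."""
    if path is None:
        path = os.environ.get("PATH", "")
    return [p for p in path.split(os.pathsep) if "cuda" in p.lower()]


def nvcc_version():
    """Return the output of `nvcc -V`, or None if nvcc cannot be run."""
    nvcc = shutil.which("nvcc")  # None when nvcc is not on PATH
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "-V"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


if __name__ == "__main__":
    print("CUDA entries on PATH:", cuda_path_entries())
    version = nvcc_version()
    if version is None:
        # Assumed location; adjust to wherever CUDA is actually installed.
        print("nvcc not runnable; is e.g. /usr/local/cuda/bin on PATH?")
    else:
        print(version)
```

If nvcc_version() returns None, adding the CUDA bin directory to PATH in the shell that launches Python and rerunning the check mirrors the steps that resolved the problem in this thread.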

Posted 2018/05/22 14:19

6123_sadaharu-.

Total score: 6

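I have never run into this myself, so I cannot guarantee this will work, but here is a suggestion.
By default, tf.Session() apparently allocates all of the GPU's memory to a single process, which may be the cause. You can apparently control the allocation by passing a config parameter to tf.Session, so it may be worth trying. I have not set it myself; searching for "tensorflow gpu memory" should turn up example code to check.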
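For reference, a minimal sketch of that configuration with the TensorFlow 1.x API is shown below. The helper names gpu_option_kwargs and make_session are hypothetical; tf.GPUOptions, tf.ConfigProto and the config argument of tf.Session are the real TensorFlow 1.x names. TensorFlow is imported inside the function so the option-building part can be checked without a GPU. Note that the asker's follow-up reports that these options did not resolve the freeze in this case.

```python
def gpu_option_kwargs(allow_growth=True, memory_fraction=None):
    """Build keyword arguments for tf.GPUOptions (TensorFlow 1.x)."""
    kwargs = {"allow_growth": allow_growth}
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        kwargs["per_process_gpu_memory_fraction"] = memory_fraction
    return kwargs


def make_session(**options):
    """Create a tf.Session that does not reserve all GPU memory up front."""
    import tensorflow as tf  # imported lazily; TensorFlow 1.x API assumed

    gpu_options = tf.GPUOptions(**gpu_option_kwargs(**options))
    # The config must be passed by keyword: the first positional
    # argument of tf.Session is `target`, not `config`.
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))


# Example: let memory grow on demand, capped at 80% of the GPU.
print(gpu_option_kwargs(allow_growth=True, memory_fraction=0.8))
```

Calling make_session(allow_growth=True) in place of tf.Session() is enough to try the setting; afterwards nvidia-smi should show a much smaller memory figure for the process if the option took effect.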

Posted 2018/05/21 03:56

R.Shigemori

Total score: 3378

6123_sadaharu-.

2018/05/21 04:07

Thank you for your answer. I set the GPU options, but it did not help. Specifically, I tried tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(...))) with either allow_growth=True or per_process_gpu_memory_fraction=0.8. Also, it seems to freeze when tf.matmul is computed with feed_dict.