Question edit history

2018/07/24 13:33

Posted

trafalbad

Score 303

title CHANGED Viewed

File without changes

body CHANGED Viewed

@@ -9,9 +9,12 @@
 I would appreciate any guidance.
 Addendum
+```python
+config = tf.ConfigProto(log_device_placement=True)
+sess = tf.Session(config=config)
-config = tf.ConfigProto(log_device_placement=True)
-sess = tf.Session(config=config) K.set_session(sess)
+K.set_session(sess)
+```
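 I am wondering whether it would be enough to change the session setup to the above, reduce the image size, and remove the processing in the input function that increases the number of images.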
 ```python
 # Error

2018/07/24 13:33

Posted

trafalbad

Score 303

title CHANGED Viewed

File without changes

body CHANGED Viewed

@@ -7,6 +7,12 @@
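 What could be the cause?
 Note that I ran this from a terminal on AWS EC2, not in Jupyter.
 I would appreciate any guidance.
+Addendum
+config = tf.ConfigProto(log_device_placement=True)
+sess = tf.Session(config=config) K.set_session(sess)
+I am wondering whether it would be enough to change the session setup to the above, reduce the image size, and remove the processing in the input function that increases the number of images.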
 ```python
 # Error
 W tensorflow/core/common_runtime/bfc_allocator.cc:279] *************************************************************************************************xxx

Question changed

2018/07/24 13:32

Posted

trafalbad

Score 303

title CHANGED Viewed

	@@ -1,1 +1,1 @@
1	- About a simple detection algorithm similar to Google's trending words
1	+ About the GPU error 'OOM when allocating tensor'

body CHANGED Viewed

@@ -1,17 +1,180 @@
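-Google and Yahoo! have a feature called trending words.
+I apologize for changing the question.
-From what I have read in the literature, it combines multiple algorithms and is not something that can be built easily.
-I built an anomaly detection algorithm for search-word counts based on this resource (https://www.albert2005.co.jp/knowledge/machine_learning/anomaly_detection_basics/anomaly_detection_time),
-but it cannot identify surges in individual words.
+Running the code on the GPU produces the error below.
-Using the anomaly detection logic from the resource above, what kinds of algorithms could detect trending words?
+The environment is an AWS p2 instance (p2.8xlarge), so I do not think memory is insufficient, yet the error occurs even with a batch size of 8.
-My own idea is
-・Use something like a dictionary that pairs each word with an id, count each word per day to build a time series for every word, and apply the same kind of anomaly detection algorithm as above
+What could be the cause?
+Note that I ran this from a terminal on AWS EC2, not in Jupyter.
+I would appreciate any guidance.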
+```python
+# Error
+W tensorflow/core/common_runtime/bfc_allocator.cc:279] *************************************************************************************************xxx
+2018-07-24 08:58:04.962110: W tensorflow/core/framework/op_kernel.cc:1295] OP_REQUIRES failed at constant_op.cc:75 : Resource exhausted: OOM when allocating tensor of shape [1,1,1088,192] and type float
+2018-07-24 08:58:04.962293: E tensorflow/core/common_runtime/executor.cc:660] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [1,1,1088,192] and type float
+	 [[Node: training/SGD/zeros_176 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1,1,1088,192] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
+error
+Traceback (most recent call last):
+  File "Inception_resnet_v2_train.py", line 303, in <module>
+    coord.join(threads)
+  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
+    six.reraise(*self._exc_info_to_raise)
+  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
+    raise value
+  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
+    enqueue_callable()
+  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run
+    self._call_tf_sessionrun(None, {}, [], target_list, None)
+  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
+    run_metadata)
+tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[150,150,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
+	 [[Node: Cast_1 = Cast[DstT=DT_FLOAT, SrcT=DT_UINT8, _class=["loc:@random_flip_left_right/Switch_1"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape)]]
+Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
-or something along those lines. Would that work?
+	 [[Node: per_image_standardization/_25 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_58_per_image_standardization", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
+Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
+```
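The shapes reported in the log give a quick sanity check on how large the failing allocations actually are. The following is a minimal sketch, assuming 4 bytes per `float` element; the helper name `tensor_bytes` is hypothetical and is not part of TensorFlow.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shapes copied from the two OOM messages in the log.
optimizer_slot = tensor_bytes([1, 1, 1088, 192])  # node training/SGD/zeros_176
input_image = tensor_bytes([150, 150, 3])         # node Cast_1

print("optimizer slot: %d bytes" % optimizer_slot)
print("input image:    %d bytes" % input_image)
```

Both requests come out below 1 MB, which suggests that GPU memory was already almost full when they were made, rather than either tensor being too large in itself.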
+Code (excerpt)
+```python
+# Input function
+from __future__ import print_function
-**・Question: What would be a realistic algorithm for something like trending words?**
+from __future__ import absolute_import
+import warnings
+import time
+import os
+import math
+import numpy as np
+import tensorflow as tf
+from keras.optimizers import SGD
+from keras.callbacks import History
+from keras.callbacks import Callback
+from keras.callbacks import ModelCheckpoint
+from keras.callbacks import TensorBoard
+from keras.callbacks import CSVLogger
+from keras import layers
+from keras.preprocessing import image
+from keras.models import Model
+from keras.layers import Activation
+from keras.layers import AveragePooling2D
+from keras.layers import BatchNormalization
+from keras.layers import Concatenate
+from keras.layers import Conv2D
+from keras.layers import Dense
+from keras.layers import GlobalAveragePooling2D
+from keras.layers import GlobalMaxPooling2D
+from keras.layers import Input
+from keras.layers import Lambda
+from keras.layers import MaxPooling2D
+from keras.utils.data_utils import get_file
+from keras.engine.topology import get_source_inputs
+from keras import backend as K
+from keras import metrics
+from keras import utils as np_utils
+from keras.utils.vis_utils import plot_model, model_to_dot
+import matplotlib.pyplot as plt
+from keras.callbacks import EarlyStopping
+tf.logging.set_verbosity(tf.logging.ERROR)
+# In[2]:
+from tensorflow.python.client import device_lib
+device_lib.list_local_devices()
+# In[4]:
+def input_data(data_dir, batch_size, distort=False):
+    num_class = 45
+    filenames = [os.path.join(data_dir, 'train_%d.tfrecords' % i)
-Any advice, ideas, or other opinions would be welcome.
+               for i in range(1, 61)]
+    for f in filenames:
+        if not tf.gfile.Exists(f):
+            raise ValueError('Failed to find file: ' + f)
+    # Create a queue that produces the filenames to read.
+    filename_queue = tf.train.string_input_producer(filenames)
+    reader = tf.TFRecordReader()
+    _, serialized_example = reader.read(filename_queue)
+    features = tf.parse_single_example(serialized_example,
+      features={"label": tf.FixedLenFeature([], tf.int64),
+          "image": tf.FixedLenFeature([], tf.string)})
+    label = tf.cast(features["label"], tf.int32)
+    imgin = tf.reshape(tf.decode_raw(features["image"], tf.uint8), tf.stack([150, 150, 3]))
+    float_image = tf.cast(imgin, tf.float32)
+    num_preprocess_threads = 16
+    min_fraction_of_examples_in_queue = 0.4
+    NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 2900000
+    if distort is True:
+        distorted_image = tf.image.random_flip_left_right(float_image)
+        distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
+        distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
+        distorted_image = tf.image.per_image_standardization(distorted_image)
+        distorted_image.set_shape([150, 150, 3])
+        min_fraction_of_examples_in_queue = 0.4
+        min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN *
+                            min_fraction_of_examples_in_queue)
+        print ('Filling queue with %d CIFAR images before starting to train. '
+         'This will take a few minutes.' % min_queue_examples)
+        images, label_batch = tf.train.shuffle_batch([distorted_image, label], batch_size=batch_size,
+        num_threads=num_preprocess_threads, capacity=min_queue_examples + 3 * batch_size,
+        min_after_dequeue=min_queue_examples)
+    else:
+        images, label_batch = tf.train.batch([float_image, label], batch_size=batch_size,
+        num_threads=num_preprocess_threads, capacity=min_queue_examples + 3 * batch_size,
+        min_after_dequeue=min_queue_examples)
+    return tf.subtract(tf.div(images,127.5), 1.0), tf.one_hot(tf.reshape(label_batch, [batch_size]),num_class)
+# Session execution part
+config = tf.ConfigProto(allow_soft_placement=True)
+config.gpu_options.allocator_type = 'BFC'
+config.gpu_options.per_process_gpu_memory_fraction = 0.40
+config.gpu_options.allow_growth=True
+sess = K.get_session()
+train_image, train_labels = input_data('/home/ubuntu/train_tf',16, distort=True)
+input_ = Input(tensor=train_image)
+output_ = InceptionResNetV2(img_input=input_)
+train_model = Model(input_, output_, name='inception_resnet_v2')
+train_model.compile(optimizer=SGD(decay=0.1, momentum=0.9, nesterov=True),
+                        loss='categorical_crossentropy',
+                    metrics=['accuracy'], target_tensors=[train_labels])
+# In[7]:
+history = History()
+callback = []
+# callbacks.append(ModelCheckpoint(filepath="model.best.h5", save_best_only=True))
+callback.append(history)
+callback.append(ModelCheckpoint(filepath="/home/ubuntu/check_dir/model.ep{epoch:02d}.h5"))
+callback.append(EarlyStopping("loss", patience=1))
+# In[8]:
+coord = tf.train.Coordinator()
+threads = tf.train.start_queue_runners(sess, coord)
+try:
+    history = train_model.fit(epochs=10, steps_per_epoch=int(np.ceil(2900000/16)), callbacks=callback)
+    print(history)
+except:
+    print('error')
+coord.request_stop()
+coord.join(threads)
+```
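One thing that may be worth checking in `input_data` is the size of the shuffle queue. With `distort=True`, `min_after_dequeue` is 40% of `NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN`, so the queue is asked to hold more than a million decoded images. The rough estimate below assumes each queued image is a 150x150x3 float32 tensor, as in the code above; the function name `shuffle_queue_bytes` is hypothetical.

```python
def shuffle_queue_bytes(num_examples, min_fraction, image_shape, batch_size,
                        bytes_per_element=4):
    """Estimate the bytes held by a full tf.train.shuffle_batch queue."""
    min_queue_examples = int(num_examples * min_fraction)
    # Same capacity formula as in the question's input_data function.
    capacity = min_queue_examples + 3 * batch_size
    elements_per_image = 1
    for dim in image_shape:
        elements_per_image *= dim
    return capacity * elements_per_image * bytes_per_element


# Values taken from the question's code.
total = shuffle_queue_bytes(2900000, 0.4, (150, 150, 3), 16)
print("estimated queue size: %.1f GB" % (total / 1e9))

# For comparison, a much smaller fraction (an illustrative value only).
small = shuffle_queue_bytes(2900000, 0.001, (150, 150, 3), 16)
print("with fraction 0.001:  %.2f GB" % (small / 1e9))
```

Under these assumptions the queue alone would need several hundred gigabytes, so lowering `min_fraction_of_examples_in_queue` looks like a reasonable first experiment. Whether that memory is taken from the host or from the GPU depends on device placement, which the log indicates is at least partly on GPU:0 (`Cast_1`).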