[CNN-LSTM] How do I load time-series data for video classification?

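What I want to do

I want to classify self-recorded grayscale videos of hand and finger gestures using fine-tuning and an LSTM, but I'm stuck because I don't know how to load the images.

The dataset's directory structure is as follows.

Each of the 35 directories (building, clothes, etc.) contains 100 images (100×100), one per captured frame, and I want to treat these as time-series data.

I based the model structure on the following site.
Link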

Environment
・Google Colab TPU
・Keras library bundled with TensorFlow 2.0.0
・Python 3.6.9

Source code (predict_camera.py)

Python
import os
import glob

import numpy as np
import tensorflow as tf
from PIL import Image
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Input, GlobalAveragePooling2D, LSTM, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.utils import to_categorical

# For TPU
# Details: https://www.tensorflow.org/guide/distributed_training#tpustrategy
tpu_grpc_url = "grpc://" + os.environ["COLAB_TPU_ADDR"]
tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_grpc_url)
tf.config.experimental_connect_to_cluster(tpu_cluster_resolver)

# Run with TF 1.x behavior (graph mode)
tf.compat.v1.disable_v2_behavior()
tf.compat.v1.disable_eager_execution()

CATEGORIES = 35
BATCH_SIZE = 32
frames = 100
rows = 100
columns = 100
channels = 3

# Recording sessions to load; the remaining folders are commented out for now
folder = ["00", "01", "02"]#, "03", "04", "05", "06", "07", "08", "09",
          #"10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
          #"20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
          #"30", "31", "32", "33", "34", "35", "36", "37", "38", "39"]

classes = ["building", "clothes", "cooking", "do", "eat", "go", "gohome",
           "here", "house", "how", "left", "money", "no", "now",
           "old", "place", "purpose", "rainy", "right", "signlanguage", "study",
           "sunny", "sushi", "time", "toilet", "tomorrow", "understand", "want",
           "weather", "what", "when", "which", "who", "why", "you"]

X = []  # one array of shape (frames, rows, columns, channels) per clip
Y = []  # class index of each clip

for number in folder:
    DIR = "./image/" + number
    for index, name in enumerate(classes):
        clip_dir = DIR + "/" + name
        # Sort by file name so the frames stay in chronological order
        files = sorted(glob.glob(clip_dir + "/*.png"))
        F = []
        Y.append(index)
        for file in files:
            image = Image.open(file)
            image = image.convert("RGB")  # VGG16 expects 3 channels
            data = np.asarray(image)
            F.append(data)
        F = np.array(F).astype(np.float32)
        F = F / 255.0  # scale pixel values to [0, 1]
        X.append(F)

X = np.array(X)
Y = np.array(Y)

Y = to_categorical(Y, CATEGORIES)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

def build_model():
    video = Input(shape=(frames,
                         rows,
                         columns,
                         channels))
    cnn_base = VGG16(input_shape=(rows,
                                  columns,
                                  channels),
                     weights="imagenet",
                     include_top=False)
    cnn_out = GlobalAveragePooling2D()(cnn_base.output)
    cnn = Model(inputs=cnn_base.input, outputs=cnn_out)
    cnn.trainable = False  # freeze the pretrained VGG16 weights
    # Apply the same CNN to every frame, then feed the sequence to the LSTM
    encoded_frames = TimeDistributed(cnn)(video)
    encoded_sequence = LSTM(256)(encoded_frames)
    hidden_layer = Dense(1024, activation="relu")(encoded_sequence)
    outputs = Dense(CATEGORIES, activation="softmax")(hidden_layer)
    model = Model([video], outputs)
    optimizer = Nadam(lr=0.002,
                  beta_1=0.9,
                  beta_2=0.999,
                  epsilon=1e-08,
                  schedule_decay=0.004)

    model.compile(loss="categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["categorical_accuracy"])
    return model

model = build_model()

model.summary()

early_stopping = EarlyStopping(patience=2)
model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=100,
          verbose=1,
          validation_split=0.2,
          shuffle=True,
          callbacks=[early_stopping])

evaluation = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE, verbose=1)

model.save('camera.hdf5')

Current program behavior

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 100, 100, 100, 3) 0         
_________________________________________________________________
time_distributed (TimeDistri (None, 100, 512)          14714688  
_________________________________________________________________
lstm (LSTM)                  (None, 256)               787456    
_________________________________________________________________
dense (Dense)                (None, 1024)              263168    
_________________________________________________________________
dense_1 (Dense)              (None, 35)                35875     
=================================================================
Total params: 15,801,187
Trainable params: 1,086,499
Non-trainable params: 14,714,688
_________________________________________________________________
Train on 67 samples, validate on 17 samples
Epoch 1/100
2020-12-18 17:01:29.295779: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 8192000000 exceeds 10% of system memory.
tcmalloc: large alloc 8192000000 bytes == 0xc62d4000 @  0x7fd92effbb6b 0x7fd92f01b379 0x7fd9175cfc27 0x7fd9173c2a7f 0x7fd91728e3cb 0x7fd917254526 0x7fd9172553b3 0x7fd917255583 0x7fd91dec45b1 0x7fd9174f5afc 0x7fd9174e8205 0x7fd9175a8811 0x7fd9175a5f08 0x7fd92d8fb6df 0x7fd92e9dd6db 0x7fd92ed1671f
tcmalloc: large alloc 3456065536 bytes == 0x2aef54000 @  0x7fd92f0191e7 0x7fd91b034ab2 0x7fd91da96e8a 0x7fd91de97282 0x7fd91de98afd 0x7fd91dec089e 0x7fd91dec3d76 0x7fd91dec4837 0x7fd9174f5afc 0x7fd9174e8205 0x7fd9175a8811 0x7fd9175a5f08 0x7fd92d8fb6df 0x7fd92e9dd6db 0x7fd92ed1671f
2020-12-18 17:01:43.445971: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 8192000000 exceeds 10% of system memory.
tcmalloc: large alloc 8192000000 bytes == 0x2aef54000 @  0x7fd92effbb6b 0x7fd92f01b379 0x7fd9175cfc27 0x7fd9173c2a7f 0x7fd91728e3cb 0x7fd917254526 0x7fd9172553b3 0x7fd917255583 0x7fd91dec45b1 0x7fd9174f5afc 0x7fd9174e8205 0x7fd9175a8811 0x7fd9175a5f08 0x7fd92d8fb6df 0x7fd92e9dd6db 0x7fd92ed1671f
tcmalloc: large alloc 73728262144 bytes == 0x565bca000 @  0x7fd92f0191e7 0x7fd91b034ab2 0x7fd91da96e8a 0x7fd91de97282 0x7fd91de98afd 0x7fd91dec089e 0x7fd91dec3d76 0x7fd91dec4837 0x7fd9174f5afc 0x7fd9174e8205 0x7fd9175a8811 0x7fd9175a5f08 0x7fd92d8fb6df 0x7fd92e9dd6db 0x7fd92ed1671f
^C (forced stop)

jbpb0

2020/12/18 10:08

Isn't this the error saying that VGG16 is meant for color images and needs a 3-channel input, but input_shape has only 1 channel?
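One way to satisfy the 3-channel requirement with grayscale frames is to repeat the single channel three times. The sketch below does this with NumPy; the helper name `gray_to_three_channels` is hypothetical, and the random frame is only a stand-in for a real 100×100 image. Calling `image.convert("RGB")` on a PIL image, as the question's code does, should give the same result.

```python
import numpy as np

def gray_to_three_channels(gray):
    """Repeat a (rows, columns) grayscale frame into (rows, columns, 3)."""
    gray = np.asarray(gray)
    # Add a trailing channel axis, then copy it three times
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

# Stand-in for one 100x100 grayscale frame (assumption: 8-bit pixels)
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)

rgb = gray_to_three_channels(frame)
print(rgb.shape)
```

All three channels hold identical values, so no information is added; the point is only to match the input shape the pretrained network expects.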

MyuW

2020/12/18 10:22

I changed channels from 1 to 3 and uncommented image = image.convert("RGB"), but when I run it I still get an error like the one above: ValueError: The input must have 3 channels; got `input_shape=(3, 100, 100)`.

jbpb0

2020/12/18 10:38 (edited)

Maybe it would work with weights=None, as described in countermeasure 1 in the addendum to the answer at https://teratail.com/questions/157913 ?

MyuW

2020/12/18 10:54

With weights=None, I now get a new error: ValueError: Input size must be at least 32x32; got `input_shape=(1, 100, 100)`. I looked into it, and it seems to be raised by VGG16 and VGG19 when the input image is smaller than a certain size (32×32 in this case?), but...

jbpb0

2020/12/18 11:12

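If I remember correctly, with TensorFlow as the backend the channel has to be the last dimension. Your code puts the channel first, so whether it is 1 or 3, that value is read as an image dimension smaller than 32 and trips the check, I think.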
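To make the axis order concrete: with the channels-last convention, a shape tuple is read as (rows, columns, channels), so (3, 100, 100) is interpreted as 3 rows, 100 columns, and 100 channels. The following is a minimal sketch, assuming clips stored channels-first as (frames, channels, rows, columns); `to_channels_last` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def to_channels_last(clip):
    """Convert (frames, channels, rows, columns) to (frames, rows, columns, channels)."""
    # Move axis 1 (channels) to the end; the other axes keep their order
    return np.moveaxis(clip, 1, -1)

# Stand-in clip: 100 frames, 3 channels, 100x100 pixels, channels first
clip_channels_first = np.zeros((100, 3, 100, 100), dtype=np.float32)
clip_channels_last = to_channels_last(clip_channels_first)
print(clip_channels_last.shape)  # (100, 100, 100, 3)
```

If the images are loaded with PIL and converted with `np.asarray`, they are already channels-last, so only the `input_shape` arguments need to follow the (rows, columns, channels) order.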

MyuW

2020/12/18 11:50

I should have studied this more. Thank you for pointing that out! I want to use VGG16's ImageNet-pretrained weights, so I now load the input images as RGB, set channels to 3, and run with the following change: video = Input(shape=(frames, rows, columns, channels)) cnn_base = VGG16(input_shape=(rows, columns, channels),weights="imagenet",include_top=False). This produced a new error: RuntimeError: Additional GRPC error information: {"created":"@1608290676.637613791","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"","grpc_status":12}. I'll look into it later and add the full error message to the question. As for running out of memory, I'm thinking of switching from model.fit to model.fit_generator. I expect I'll need to write my own generator for fit_generator, so I may come back with another question later.
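A custom generator for fit_generator can be an ordinary Python generator that yields (x, y) batches indefinitely, so that only one batch of clips is held in memory at a time. The sketch below is one possible shape for it, not a tested drop-in: `clip_batch_generator`, `load_clip`, and `fake_load_clip` are hypothetical names, and `fake_load_clip` stands in for a real loader that would read the sorted PNG frames of one directory.

```python
import numpy as np

def clip_batch_generator(clip_dirs, labels, batch_size, load_clip, num_classes):
    """Yield (x, y) batches forever, loading only one batch of clips at a time."""
    indices = np.arange(len(clip_dirs))
    while True:
        np.random.shuffle(indices)  # new clip order on every pass
        for start in range(0, len(indices), batch_size):
            batch_idx = indices[start:start + batch_size]
            # x: (batch, frames, rows, columns, channels)
            x = np.stack([load_clip(clip_dirs[i]) for i in batch_idx])
            # y: one-hot labels, (batch, num_classes)
            y = np.zeros((len(batch_idx), num_classes), dtype=np.float32)
            y[np.arange(len(batch_idx)), [labels[i] for i in batch_idx]] = 1.0
            yield x, y

def fake_load_clip(path):
    """Hypothetical stand-in: a real loader would read the PNG frames under `path`."""
    return np.zeros((100, 100, 100, 3), dtype=np.float32)

clip_dirs = ["./image/00/building", "./image/00/clothes", "./image/00/cooking"]
labels = [0, 1, 2]
gen = clip_batch_generator(clip_dirs, labels, batch_size=2,
                           load_clip=fake_load_clip, num_classes=35)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)
```

Such a generator would be passed as `model.fit_generator(gen, steps_per_epoch=..., epochs=...)`, where `steps_per_epoch` is the number of batches per pass over the data.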

jbpb0

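2020/12/18 13:25 (edited)

I'm trying to get the code running, but it isn't going smoothly: for example, units isn't defined for Dense() inside build_model(). That might be because I'm using tf 1.x, but according to the tf 2.x documentation units seems to be required for Dense() there as well, so I don't quite understand. Does the code on the website you referred to actually run? [Addendum] After rewriting output_dim= to units= in Dense(), I got as far as model.summary() (on a PC without a GPU).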

jbpb0

2020/12/18 13:23 (edited)

I think the code you're referring to dates from the tf 1.x era, so if you run it on tf 2.x it may be better to add the following: tf.compat.v1.disable_v2_behavior() tf.compat.v1.disable_eager_execution()

jbpb0

2020/12/18 13:46

When you run the code in your environment and then execute print(x_train.shape) print(y_train.shape), what is displayed? On my side, the data doesn't match the network's input dimensions, so model.fit() fails.

MyuW

2020/12/18 17:35 (edited)

I've updated the code above to the current version. It differs from the previous code as follows:
・Added the EarlyStopping import I had forgotten
・Added the two lines that switch TensorFlow 2.x behavior to 1.x
・Rewrote the Dense argument output_dim as units
Also, when I load the images in directories 00, 01, and 02, the output in my environment is:
print(x_train.shape) # (84, 100, 100, 100, 3)
print(y_train.shape) # (84, 35)
print(x_test.shape) # (21, 100, 100, 100, 3)
print(y_test.shape) # (21, 35)

MyuW

2020/12/18 17:48

When I load the five directories 00 to 04, the output is print(x_train.shape) # (140, 100, 100, 100, 3) print(y_train.shape) # (140, 35) print(x_test.shape) # (35, 100, 100, 100, 3) print(y_test.shape) # (35, 35), and the output after the summary ends up as Train on 112 samples, validate on 28 samples.

jbpb0

2020/12/18 23:30

The data dimensions are correct. They probably don't match on my side because I haven't reproduced the folder layout for the images. My apologies.

jbpb0

2020/12/18 23:54

Reducing batch_size=32 in model.fit() might lower memory usage.
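A rough calculation suggests why batch size matters here. The 8192000000-byte allocation in the log equals the size of a float32 activation of shape (32 × 100 frames, 100, 100, 64), which is what the first VGG16 convolution (64 filters, same padding) would output for one batch of 32 clips. Attributing the allocation to that layer is an inference from the numbers, not something the log states. The helper `conv_activation_bytes` below is hypothetical.

```python
def conv_activation_bytes(batch_size, frames, rows, columns, filters, bytes_per_value=4):
    """Bytes needed for one float32 conv activation over every frame in a batch."""
    return batch_size * frames * rows * columns * filters * bytes_per_value

# Assumption: 64 filters, as in the first VGG16 convolution block
for batch_size in (32, 4, 1):
    size = conv_activation_bytes(batch_size, frames=100, rows=100, columns=100, filters=64)
    print(batch_size, size)
# 32 8192000000
# 4 1024000000
# 1 256000000
```

Under this estimate the activation memory scales linearly with both the batch size and the number of frames per clip, so halving either one halves this term.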

MyuW

2020/12/19 09:01

Sorry for the late reply. After lowering batch_size to 2 or 4, I was able to load the images from 00 to 09. I also tried shrinking the images to 50×50 so that I could load everything from 00 to 39, but the shapes became print(x_train.shape) #(1120,) print(y_train.shape) #(1120, 35) print(x_test.shape) #(280,) print(y_test.shape) #(280, 35), and I get ValueError: Error when checking input: expected input_1 to have 5 dimensions, but got array with shape (1120, 1). When I load only 00 to 09, the shapes come out correctly, but... Also, 50×50 makes the resolution very low, so I'd like to train at 100×100 if possible.
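A shape of (1120,) instead of five dimensions suggests that np.array(X) could not build a regular array, which typically happens when the clips do not all have the same shape, for example if some directory holds fewer than 100 PNG files or none at all. That is only one possible cause. The sketch below checks each clip's shape before stacking; `find_inconsistent_clips` is a hypothetical helper and the small arrays are stand-ins for real clips.

```python
import numpy as np

def find_inconsistent_clips(clips, expected_shape):
    """Return (index, shape) for every clip whose shape differs from expected_shape."""
    expected_shape = tuple(expected_shape)
    return [(i, np.shape(clip))
            for i, clip in enumerate(clips)
            if np.shape(clip) != expected_shape]

# Stand-in data: clip 1 is missing a frame, clip 2 comes from an empty directory
clips = [
    np.zeros((5, 4, 4, 3), dtype=np.float32),
    np.zeros((4, 4, 4, 3), dtype=np.float32),
    np.array([], dtype=np.float32),
]
bad = find_inconsistent_clips(clips, expected_shape=(5, 4, 4, 3))
print(bad)  # [(1, (4, 4, 4, 3)), (2, (0,))]
```

Running a check like this on X with expected_shape=(frames, rows, columns, channels) before calling np.array(X) would show which directories, if any, need attention.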

jbpb0

2020/12/19 09:22

Doesn't it work with batch_size=1? Another option is to keep only every other frame, halving the 100 frames.
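Keeping every other frame can be done with a slice along the frame axis. This is a minimal sketch assuming the clips are stacked as (samples, frames, rows, columns, channels); `subsample_frames` is a hypothetical helper, and the small spatial size is only there to keep the demo light.

```python
import numpy as np

def subsample_frames(clips, step=2):
    """Keep every `step`-th frame along axis 1 (the frame axis)."""
    return clips[:, ::step]

# Stand-in batch: 3 clips, 100 frames each, tiny 8x8 RGB frames
clips = np.zeros((3, 100, 8, 8, 3), dtype=np.float32)
halved = subsample_frames(clips, step=2)
print(halved.shape)  # (3, 50, 8, 8, 3)
```

The frames value passed to Input(shape=(frames, rows, columns, channels)) would have to be changed to 50 to match.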