Training a lip-reading model that combines a CNN and an LSTM in Keras is not working well.

Background / what I want to achieve

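I am a beginner in machine learning and am currently studying it.
I am building a 25-word lip-reading model in Keras that combines a CNN and an LSTM, but its recognition accuracy is only about 5%, which is very low.
I cannot tell whether this is caused by a coding mistake or whether the model architecture is simply a poor fit for the task.
If anyone can tell what is going wrong, I would appreciate your help.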
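
As a reference point for the 5% figure: with 25 equally frequent classes, a model that guesses uniformly at random is right about 4% of the time, so 5% is close to chance level. A minimal sketch of that arithmetic, assuming balanced classes (the class distribution is not stated here, and `chance_accuracy` is a hypothetical helper):

```python
def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    if n_classes <= 0:
        raise ValueError("n_classes must be positive")
    return 1.0 / n_classes


n_labels = 25  # number of words in the model described above
baseline = chance_accuracy(n_labels)
print(f"chance level: {baseline:.1%}")  # chance level: 4.0%
```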

Relevant source code

Python
import os

import cv2
import numpy as np
from tqdm import tqdm
from tensorflow import keras
from tensorflow.keras import applications
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input, LSTM, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical

timesteps = 1  # number of input frames per clip for the LSTM
n_labels = 25  # number of dataset labels
Learning_rate = 0.0001  # optimizer learning rate (Adam in this case)
batch_size = 32
num_epochs = 1
DATA_PATH = "LFROI"
img_channel = 3  # RGB
image_size = 32


def load_images(dir_name):
    """Load `timesteps` frames from a clip directory as one array."""
    file_list = os.listdir(dir_name)
    frame_num = len(file_list)

    if frame_num < timesteps:
        # Too few frames: pad by repeating the first and last frame.
        dframe = timesteps - frame_num
        iframe = round(dframe / 2)
        fframe = dframe - iframe
        iframes = [1 for x in range(iframe)]
        nframes = [x + 1 for x in range(frame_num)]
        fframes = [frame_num for x in range(fframe)]
        frames = iframes + nframes + fframes
    else:
        # Enough frames: sample evenly spaced frame numbers (1-based).
        frames = [round(x * frame_num / timesteps + 1) for x in range(timesteps)]
    frame_array = []

    for i in range(timesteps):
        # Frame files are named 00001.jpg, 00002.jpg, ...
        image_name = os.path.join(dir_name, str(frames[i]).zfill(5) + ".jpg")
        img = cv2.imread(image_name)
        if img is None:
            print("ERROR: can not read image : ", image_name)
        else:
            img = cv2.resize(img, (image_size, image_size))
            frame_array.append(img)

    return np.array(frame_array)


def load_data(list_file):
    """Read a list file of "<clip directory> <label>" lines and load every clip."""
    with open(list_file, "r") as f:
        lines = f.readlines()
    X = []
    labels = []
    pbar = tqdm(total=len(lines))

    for line in lines:
        temp = line.split()
        file_name = temp[0]
        label = temp[1]
        pbar.update(1)
        dir_name = os.path.join(DATA_PATH, file_name)
        labels.append(int(label))
        X.append(load_images(dir_name))
    pbar.close()

    return np.array(X), labels


print("loading training data...")
x_train, y_train = load_data("…/training_LF-ROI.txt")
print("loading test data...")
x_test, y_test = load_data("…/test_LF-ROI.txt")

# Shape the clips as (samples, timesteps, height, width, channels) and one-hot encode the labels.
X_train = x_train.reshape((x_train.shape[0], timesteps, image_size, image_size, img_channel))
X_test = x_test.reshape((x_test.shape[0], timesteps, image_size, image_size, img_channel))
Y_train = to_categorical(y_train, n_labels)
Y_test = to_categorical(y_test, n_labels)
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")

print("X_shape:{}\nY_shape:{}".format(X_train.shape, Y_train.shape))
print("X_shape:{}\nY_shape:{}".format(X_test.shape, Y_test.shape))

# Per-frame encoder: frozen ImageNet-pretrained MobileNet followed by a small dense head.
video = Input(shape=(timesteps, image_size, image_size, img_channel))
model = applications.MobileNet(input_shape=(image_size, image_size, img_channel), weights="imagenet", include_top=False)
model.trainable = False
x = model.output
x = Flatten()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.3)(x)
cnn_out = Dense(128, activation="relu")(x)
Lstm_inp = Model(inputs=model.input, outputs=cnn_out)

# Apply the encoder to every frame, then summarize the sequence with an LSTM.
encoded_frames = TimeDistributed(Lstm_inp)(video)
encoded_sequence = LSTM(256)(encoded_frames)
hidden_Drop = Dropout(0.3)(encoded_sequence)
hidden_layer = Dense(128, activation="relu")(hidden_Drop)
outputs = Dense(n_labels, activation="softmax")(hidden_layer)
model = Model([video], outputs)

adam = keras.optimizers.Adam(learning_rate=Learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss="categorical_crossentropy", optimizer=adam, metrics=["accuracy"])

hist = model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_test, Y_test), shuffle=True, epochs=num_epochs)
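
The frame selection in `load_images` can be checked on its own, without any images. The sketch below reproduces that index logic in a standalone function (`sample_frame_numbers` is a hypothetical name introduced here for illustration). With `timesteps = 1` it returns only frame 1, so the LSTM receives a one-frame sequence per clip. That this contributes to the near-chance accuracy is an inference from the code, not a confirmed diagnosis.

```python
def sample_frame_numbers(frame_num, timesteps):
    """Return the 1-based frame numbers that load_images would read."""
    if frame_num < timesteps:
        # Too few frames: repeat the first and last frame to pad the clip.
        missing = timesteps - frame_num
        head = round(missing / 2)
        tail = missing - head
        return [1] * head + list(range(1, frame_num + 1)) + [frame_num] * tail
    # Enough frames: pick evenly spaced frames across the clip.
    return [round(x * frame_num / timesteps + 1) for x in range(timesteps)]


print(sample_frame_numbers(10, 1))  # [1]
print(sample_frame_numbers(10, 4))  # [1, 4, 6, 8]
print(sample_frame_numbers(3, 6))   # [1, 1, 1, 2, 3, 3]
```

Note that Python's `round` rounds halves to the nearest even integer, which is why a 10-frame clip sampled at 4 steps yields frame 8 rather than 9 for the last position.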

Additional information (framework/tool versions, etc.)

Microsoft Visual Studio 2017
tensorflow 2.4.1
keras 2.4.3
Python 3.6.13

toast-uz

2021/07/07 13:27

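When something is not working, it helps to find out how far things do work and narrow the problem down from there. In this case there are several possible causes: the model, the code, or the data. First, look for existing code for video clip classification together with an existing dataset, and confirm that it runs. Next, swap in your own data and check again. Finally, replace the model with your own and check once more. That is the kind of step-by-step procedure you need.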
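
One way to start on the data side of this narrowing-down is to inspect the integer labels before they are one-hot encoded. Keras's `to_categorical(y, num_classes)` expects labels from 0 to `num_classes - 1`, and the question does not say whether the list files are 0-based or 1-based. The sketch below is only an illustration: `check_labels` is a hypothetical helper and the labels are toy values, not taken from the question's dataset.

```python
from collections import Counter


def check_labels(labels, n_labels):
    """Summarize integer class labels before they are one-hot encoded."""
    # Labels outside 0 .. n_labels - 1 cannot be one-hot encoded into n_labels columns.
    out_of_range = sorted({y for y in labels if not 0 <= y < n_labels})
    counts = Counter(labels)
    # Accuracy of always predicting the most frequent class.
    majority_share = max(counts.values()) / len(labels) if labels else 0.0
    return {
        "out_of_range": out_of_range,
        "n_classes_seen": len(counts),
        "majority_baseline": majority_share,
    }


# Toy labels for illustration only.
toy_labels = [0, 1, 1, 2, 25]
report = check_labels(toy_labels, n_labels=25)
print(report)
```

If `out_of_range` is non-empty, or `n_classes_seen` is smaller than expected, the data loading deserves a closer look before the model does. The `majority_baseline` value gives an accuracy to compare the trained model against.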