Python ValueError: Data cardinality is ambiguous: how to fix this error

Background / what I want to achieve

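I am building a speech emotion classification system in Python using a CNN model.
The error message below appeared. I believe the sizes of x and y have to match,
but I could not work out how to fix it even after searching. Could someone please advise?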

Problem / error message

(637, 2913, 40, 1) (762, 4)
(450, 2913, 40, 1) (450, 4)
Traceback (most recent call last):
  File "cnn_model.py", line 329, in <module>
    main()
  File "cnn_model.py", line 286, in main
    callbacks=[valid_metrics, chkPoint]
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/data_adapter.py", line 1657, in _check_data_cardinality
    raise ValueError(msg)
ValueError: Data cardinality is ambiguous:
  x sizes: 637
  y sizes: 762
Make sure all arrays contain the same number of samples.
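
The check that fails here compares only the first axis of x and y. As a minimal sketch (not Keras's own implementation), the same condition can be tested before calling fit. The helper name `check_same_sample_count` is hypothetical, and the dummy arrays use the sample counts from the log above with the feature dimensions shrunk to save memory.

```python
import numpy as np


def check_same_sample_count(x, y):
    """Raise ValueError if x and y disagree on the number of samples (axis 0)."""
    if x.shape[0] != y.shape[0]:
        raise ValueError(
            f"x has {x.shape[0]} samples but y has {y.shape[0]} samples"
        )


# Dummy arrays: 637 feature samples vs. 762 label rows, as in the log above.
x_train = np.zeros((637, 4, 2, 1), dtype="float32")
y_train = np.zeros((762, 4), dtype="float32")

try:
    check_same_sample_count(x_train, y_train)
except ValueError as err:
    print(err)
```

Running such a check right after loading the data points at the loading code rather than at model.fit.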

Relevant source code

Python
import glob
import os

import numpy as np

# fold, slen and f_dim are defined elsewhere in cnn_model.py (not shown).
f_train_path = os.getcwd() + '/feature/Session' + fold + '/train/'


def hard_label(path):
    """Read the emotion label (column index 9) from every file matching `path`."""
    label = []
    f_list = sorted(glob.glob(path))
    for file_path in f_list:
        # The with block closes the file, so no explicit close() is needed.
        with open(file_path, 'r') as file:
            for line in file:
                line_sp = line.replace('Happiness', '0')
                line_sp = line_sp.replace('Anger', '1')
                line_sp = line_sp.replace('Neutral', '2')
                line_sp = line_sp.replace('Sadness', '3')
                line_sp = line_sp.replace('\n', '')
                line_sp = line_sp.split(',')
                # Keep only 10-column rows whose last column is one of the four classes.
                if len(line_sp) == 10 and line_sp[9] in ('0', '1', '2', '3'):
                    label.append(line_sp[9])
                    print(line_sp[0])
    label = np.array(label)
    label = label.astype('int16')
    return label


def load_data(path):
    """Load every *.npy file under `path`, zero-padded to slen frames each."""
    f_list = sorted(glob.glob(path + '*.npy'))
    X = np.zeros((len(f_list), slen, f_dim), dtype='float32')

    for ii, fname in enumerate(f_list):
        tmp = np.load(fname)
        padd = np.zeros((slen - len(tmp), f_dim))
        X[ii] = np.vstack((tmp, padd))
    return X.reshape(len(X), slen, f_dim, 1)


def main():
    x_train = load_data(f_train_path)

    # ... (the rest of the data loading and the model definition are omitted) ...

    print(x_train.shape, y_train.shape)
    print(x_test.shape, y_test.shape)

    # Trains the model for a fixed number of epochs (iterations on a dataset)
    model_history = model.fit(x=x_train,
                              y=y_train,
                              batch_size=batch_size,
                              epochs=epochs,
                              #class_weight=class_weight,
                              verbose=0,
                              validation_data=(x_test, y_test),
                              callbacks=[valid_metrics, chkPoint]
                              )

Additional information (framework/tool versions, etc.)

Google Colaboratory
Python 3.6.5

jbpb0

2021/12/16 11:22

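> (637, 2913, 40, 1) (762, 4)
x_train has 637 samples and y_train has 762 samples; I think the error occurs because those counts differ.
x_train is created by "x_train = load_data(f_train_path)", and judging from "def load_data(path):", its sample count appears to be the number of files named "*.npy" in the directory "f_train_path = os.getcwd() + '/feature/Session' + fold + '/train/'".
I cannot tell what determines the sample count of y_train, because the code that creates it is not included in the question.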

Deleted user

2021/12/17 07:47

Thank you for your reply. def main(): x_train = load_data(f_train_path) y_train = hard_label(l_train_path) Happiness=np.count_nonzero(y_train==0) Anger=np.count_nonzero(y_train==1) Neutral=np.count_nonzero(y_train==2) Sadness=np.count_nonzero(y_train==3) n_max=max(Happiness, Anger, Neutral, Sadness) y_train = to_categorical(y_train, emo_classes) Is this the part you mean?

jbpb0

2021/12/17 08:15

> y_train = hard_label(l_train_path)
I mean the definition of "hard_label()": the body of something like "def hard_label(path):".

Deleted user

2021/12/17 08:17

My apologies. Here it is. The CSV files contain the label information of the database. The label I want to extract is in the 9th column. def hard_label(path): label=[] f_list = list(ii for ii in sorted(glob.glob(path))) for file_path in f_list: with open(file_path, 'r') as file: for line in file: line_sp = line.replace('Happiness', '0') line_sp = line_sp.replace('Anger', '1') line_sp = line_sp.replace('Neutral', '2') line_sp = line_sp.replace('Sadness', '3') line_sp = line_sp.replace('\n', '') line_sp = line_sp.split(',') if (len(line_sp)==10): if (line_sp[9] == '0') or (line_sp[9] == '1') or (line_sp[9] == '2') or (line_sp[9] == '3'): label.append(line_sp[9]) print(line_sp[0]) file.close() label = np.array(label) label = label.astype('int16') return label

jbpb0

2021/12/17 08:23

First, please count the files named "*.npy" in the directory "f_train_path = os.getcwd() + '/feature/Session' + fold + '/train/'".
There are probably 637 of them, and that is the sample count of x_train.
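
A minimal sketch of that count, using the same "*.npy" pattern as load_data. The helper name `count_npy_files` is hypothetical, and the demo runs on a throwaway temporary directory; in practice the argument would be f_train_path.

```python
import glob
import os
import tempfile

import numpy as np


def count_npy_files(dir_path):
    """Number of *.npy files directly inside dir_path (the pattern load_data uses)."""
    return len(glob.glob(os.path.join(dir_path, "*.npy")))


# Demo on a temporary directory holding three dummy feature files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(3):
        np.save(os.path.join(tmp_dir, f"utt{i}.npy"), np.zeros((5, 40)))
    print(count_npy_files(tmp_dir))  # 3
```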

jbpb0

2021/12/17 08:24

Please add "def hard_label(path):" by editing the question.
When it is posted here, the indentation is lost and it is hard to follow.

jbpb0

2021/12/17 08:35

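> The CSV files contain the label information of the database. The label I want to extract is in the 9th column.
Please count the rows that contain a label across all the files in the directory path "l_train_path".
There are probably 762 such rows, and that is the sample count of y_train.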
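
A minimal sketch of that row count. It approximately mirrors the filter in hard_label (10 comma-separated fields, with one of the four emotion names in the field at index 9). The helper name `count_label_rows` and the CSV contents in the demo are made up for illustration; in practice the argument would be l_train_path.

```python
import glob
import os
import tempfile

# The four emotion names that hard_label() maps to 0-3.
EMOTIONS = {"Happiness", "Anger", "Neutral", "Sadness"}


def count_label_rows(pattern):
    """Return {file_path: number of rows that hard_label() would keep}."""
    counts = {}
    for file_path in sorted(glob.glob(pattern)):
        kept = 0
        with open(file_path, "r") as f:
            for line in f:
                fields = line.rstrip("\n").split(",")
                if len(fields) == 10 and fields[9] in EMOTIONS:
                    kept += 1
        counts[file_path] = kept
    return counts


# Demo with a made-up CSV file.
with tempfile.TemporaryDirectory() as tmp_dir:
    rows = [
        "utt1,a,b,c,d,e,f,g,h,Happiness",  # kept
        "utt2,a,b,c,d,e,f,g,h,Fear",       # dropped: not one of the four classes
        "utt3,a,b,c,d,e,f,g,Anger",        # dropped: only 9 fields
        "utt4,a,b,c,d,e,f,g,h,Sadness",    # kept
    ]
    with open(os.path.join(tmp_dir, "labels.csv"), "w") as f:
        f.write("\n".join(rows) + "\n")
    per_file = count_label_rows(os.path.join(tmp_dir, "*.csv"))
    print(sum(per_file.values()))  # 2
```

Returning the count per file makes it easier to see which label file contributes more rows than expected.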

jbpb0

2021/12/17 08:39

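・The sample count of x_train is the number of files named "*.npy" in "f_train_path"
・The sample count of y_train is the total number of labels written in all the (CSV?) files in "l_train_path"
I think the problem is that these two do not match.
Please check whether the data was prepared correctly in the first place.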

Deleted user

2021/12/18 07:04

I see, thank you. I will check.


1 answer

Best answer

(637, 2913, 40, 1) (762, 4)

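x_train has 637 samples and y_train has 762 samples; I think the error occurs because those counts differ.

Judging from the code:
・The sample count of x_train is the number of files named "*.npy" in "f_train_path"
・The sample count of y_train is the total number of labels written in all the (CSV?) files in "l_train_path"
Please check whether the training data was prepared correctly in the first place.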
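
One way to carry out that check is to list which labels have no feature file and which feature files have no label. The sketch below rests on an assumption the question does not confirm: that the first CSV column (the value hard_label prints) is an utterance ID equal to the .npy file name without its extension. The helper names `feature_ids`, `label_ids` and `compare` are hypothetical.

```python
import glob
import os
import tempfile

import numpy as np

EMOTIONS = {"Happiness", "Anger", "Neutral", "Sadness"}


def feature_ids(dir_path):
    """IDs taken from the *.npy file names (file name without extension)."""
    return {
        os.path.splitext(os.path.basename(p))[0]
        for p in glob.glob(os.path.join(dir_path, "*.npy"))
    }


def label_ids(pattern):
    """IDs (ASSUMED to be column 0) of the rows hard_label() would keep."""
    ids = set()
    for file_path in glob.glob(pattern):
        with open(file_path, "r") as f:
            for line in f:
                fields = line.rstrip("\n").split(",")
                if len(fields) == 10 and fields[9] in EMOTIONS:
                    ids.add(fields[0])
    return ids


def compare(dir_path, pattern):
    """Return (labels without a feature file, feature files without a label)."""
    feats, labels = feature_ids(dir_path), label_ids(pattern)
    return sorted(labels - feats), sorted(feats - labels)


# Demo with made-up data: two feature files, three labelled rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("utt1", "utt2"):
        np.save(os.path.join(tmp_dir, name + ".npy"), np.zeros((5, 40)))
    with open(os.path.join(tmp_dir, "labels.csv"), "w") as f:
        for name, emo in (("utt1", "Anger"), ("utt2", "Neutral"), ("utt3", "Sadness")):
            f.write(f"{name},a,b,c,d,e,f,g,h,{emo}\n")
    print(compare(tmp_dir, os.path.join(tmp_dir, "*.csv")))  # (['utt3'], [])
```

If the real IDs follow a different naming scheme, only the two ID-extraction helpers would need to change. Note also that hard_label and load_data each sort their own inputs independently, so equal counts alone do not guarantee that label i belongs to feature file i.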

Posted 2021/12/20 02:22

jbpb0

Total score: 7658