Random forest regression rejects the data I load, and I can't find the cause

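Background

I want to build a random forest regression model that predicts y from the explanatory variables x1-x15.

What I want to achieve

I prepared the training data and the test data as CSV files.
I am stuck at the stage of loading the data into the model.

Problem / error message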

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [39], in <cell line: 28>()
     25 rfr = RandomForestRegressor(n_estimators=10)
     27 # Run training
---> 28 rfr.fit(x_train, y_train)
     29 # Run prediction on the test data
     30 predict_y = rfr.predict(x_test)

File ~\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py:327, in BaseForest.fit(self, X, y, sample_weight)
    325 if issparse(y):
    326     raise ValueError("sparse multilabel-indicator for y is not supported.")
--> 327 X, y = self._validate_data(
    328     X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
    329 )
    330 if sample_weight is not None:
    331     sample_weight = _check_sample_weight(sample_weight, X)

File ~\Anaconda3\lib\site-packages\sklearn\base.py:581, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    579         y = check_array(y, **check_y_params)
    580     else:
--> 581         X, y = check_X_y(X, y, **check_params)
    582     out = X, y
    584 if not no_val_X and check_params.get("ensure_2d", True):

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:964, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    961 if y is None:
    962     raise ValueError("y cannot be None")
--> 964 X = check_array(
    965     X,
    966     accept_sparse=accept_sparse,
    967     accept_large_sparse=accept_large_sparse,
    968     dtype=dtype,
    969     order=order,
    970     copy=copy,
    971     force_all_finite=force_all_finite,
    972     ensure_2d=ensure_2d,
    973     allow_nd=allow_nd,
    974     ensure_min_samples=ensure_min_samples,
    975     ensure_min_features=ensure_min_features,
    976     estimator=estimator,
    977 )
    979 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
    981 check_consistent_length(X, y)

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:746, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744         array = array.astype(dtype, casting="unsafe", copy=False)
    745     else:
--> 746         array = np.asarray(array, order=order, dtype=dtype)
    747 except ComplexWarning as complex_warning:
    748     raise ValueError(
    749         "Complex data not supported\n{}\n".format(array)
    750     ) from complex_warning

File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:2064, in NDFrame.__array__(self, dtype)
   2063 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
-> 2064     return np.asarray(self._values, dtype=dtype)

ValueError: could not convert string to float: '(28)'

Relevant source code

python3
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Move to the folder that holds the training data
os.chdir('C:/Analysis')
# Load the training data CSV file
df_train_all = pd.read_csv('inputdata_train.csv', encoding="utf-8")
# Use y as the target variable
y_train = df_train_all[u"y"].values.tolist()
x_train = df_train_all

# Load the test data CSV
df_test_all = pd.read_csv('inputdata_test.csv', encoding="utf-8")
# Use y as the target variable
y_test = df_test_all[u"y"].values.tolist()
x_test = df_test_all


# Create the random forest regression object
rfr = RandomForestRegressor(n_estimators=10)

# Run training
rfr.fit(x_train, y_train)
# Run prediction on the test data
predict_y = rfr.predict(x_test)
# Evaluate with the R2 coefficient of determination
r2_score = r2_score(y_test, predict_y)
print(r2_score)
# Get the feature importances
feature = rfr.feature_importances_
# Get the feature name labels
label = data_train.columns[0:]
# Print the features in descending order of importance
indices = np.argsort(feature)[::-1]
for i in range(len(feature)):
    print(str(i + 1) + "   " +
          str(label[indices[i]]) + "   " + str(feature[indices[i]]))

# Plot actual vs. predicted values
plt.subplot(121, facecolor='white')
plt.title('forecast')
plt_label = [i for i in range(1, 32)]
plt.plot(plt_label, y_test, color='blue')
plt.plot(plt_label, predict_y, color='red')
# Bar chart of feature importances
plt.subplot(122, facecolor='white')
plt.title('forecast')
plt.bar(
    range(
        len(feature)),
    feature[indices],
    color='blue',
    align='center')
plt.xticks(range(len(feature)), label[indices], rotation=45)
plt.xlim([-1, len(feature)])
plt.tight_layout()
# Show the plots
plt.show()

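What I tried

① I checked all the data: there are no non-numeric characters and no blank cells.
② I specified utf-8 as the encoding when reading the CSV.
→ df_train_all = pd.read_csv('inputdata_train.csv',encoding="utf-8")

Additional information (framework/tool versions, etc.)

x1-x15 contain only numbers, including values with decimals.
I confirmed that all column names use half-width characters.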
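
A visual check in a spreadsheet can miss cells that look numeric but are stored as text. The following is a minimal sketch, not the asker's code, of letting pandas report which cells it cannot parse as numbers. The inline CSV is a hypothetical stand-in for inputdata_train.csv; the value "(28)" is taken from the error message above.

```python
import io

import pandas as pd

# Hypothetical stand-in for inputdata_train.csv (3 columns instead of 16).
csv_text = "x1,x2,y\n1.5,(28),0.1\n2.0,3,0.2\n(4.5),7,0.3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Columns that could not be parsed as numbers are not int64/float64 here.
print(df.dtypes)

# Try to convert every column; cells that fail to parse become NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# A cell is "bad" if it held a value originally but is NaN after conversion.
bad_cells = converted.isna() & df.notna()

# Print the offending raw values column by column.
for col in df.columns:
    mask = bad_cells[col]
    if mask.any():
        print(col, df.loc[mask, col].tolist())
```

Any column listed by the loop would make `rfr.fit` fail with "could not convert string to float".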

meg_

2022/10/14 11:30

> ① I checked all the data: there are no non-numeric characters and no blank cells.
Could you tell me how you confirmed that the data contains only numbers?

melian

2022/10/14 11:39

Regarding inputdata_train.csv: is it possible that the header spans multiple rows (for example, rows 1 and 2)?

jbpb0

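2022/10/14 12:27 (edited)

> rfr.fit(x_train, y_train)
What is displayed if you add the following just above that line and run it?
    print(x_train.dtypes)
[Addendum] Adding the following shows the type of every element of x_train in detail:
    print(x_train.applymap(type))
If there are many rows or columns, run the following first so that the output is not truncated (replace MAX_ROWS and MAX_COLUMNS with numbers):
    pd.set_option('display.max_rows', MAX_ROWS)
    pd.set_option('display.max_columns', MAX_COLUMNS)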

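Deleted user

2022/10/17 01:03

>meg_, thank you for your comment. I opened the CSV in Excel and checked it by eye. After that, I double-checked with filters and the like. Finally, when copying and pasting, I pasted values only. However, when I tried jbpb0's method, a lot of object columns showed up. Negative numbers were formatted in parentheses (red text), so they were being read as strings... Thank you for helping me notice this.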

Deleted user

2022/10/17 01:04

>melian, thank you for your comment. The header does not span multiple rows, but negative numbers were formatted in parentheses (red text), so they were being read as objects...

Deleted user

2022/10/17 01:07 (edited)

>jbpb0, thank you for going as far as providing code to check with.
x1 int64
x2 int64
x3 int64
x4 int64
x5 int64
x6 int64
x7 int64
x8 int64
x9 int64
x10 int64
x11 int64
x12 int64
x13 int64
x14 int64
x15 int64
y float64
dtype: object
At first, negative numbers were displayed in parentheses (red text), so several columns were object. After I changed the number format, they became int. However, the y data are all negative numbers with decimals, and only this column is still float64. How can I make it int64...?

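Deleted user

2022/10/17 01:10

The digits after the decimal point had not been carried over, so I fixed that, and now all of the data is float64. That is only to be expected, but... is random forest unable to handle numbers with a fractional part...?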
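
The number formats described in the comments above were corrected in the spreadsheet. As an alternative, here is a minimal sketch of doing the same conversion in pandas, assuming the negatives were exported in accounting style such as "(28)". The inline CSV and the helper name `parse_accounting` are hypothetical and only serve as illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported CSV.
csv_text = "x1,x2,y\n1.5,(28),0.1\n2.0,3,0.2\n(4.5),7,0.3\n"
df = pd.read_csv(io.StringIO(csv_text))


def parse_accounting(column: pd.Series) -> pd.Series:
    """Convert strings such as "(28)" to -28; plain numbers pass through."""
    text = column.astype(str).str.strip()
    # "(28)" -> "-28"; values without parentheses are left unchanged.
    text = text.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    # Drop thousands separators such as "1,234" (an assumption about the export).
    text = text.str.replace(",", "", regex=False)
    # Raises ValueError if anything non-numeric is still left.
    return pd.to_numeric(text)


df_clean = df.apply(parse_accounting)
print(df_clean.dtypes)
print(df_clean)
```

Columns that come out as float64 are fine as model input; the conversion only has to make every column numeric.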

Deleted user

2022/10/17 01:11

Input contains NaN, infinity or a value too large for dtype('float32').

jbpb0

2022/10/17 01:43 (edited)

> Is random forest unable to handle numbers with a fractional part...?
I don't think that is the case. The following runs, doesn't it?
    import numpy as np
    x_train = np.random.rand(100, 15)
    y_train = np.random.rand(100,)
    print(x_train.shape)
    print(y_train.shape)
    print(x_train.dtype)
    print(y_train.dtype)
    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor(n_estimators=10)
    rfr.fit(x_train, y_train)
> Input contains NaN, infinity or a value too large for dtype('float32').
Which line produces this error when it is run?

Deleted user

2022/10/17 02:13 (edited)

(1299, 16)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [2], in <cell line: 26>()
     23 import numpy as np
     25 print(x_train.shape)
---> 26 print(y_train.shape)
     27 print(x_train.dtype)
     28 print(y_train.dtype)

AttributeError: 'list' object has no attribute 'shape'

With the code you gave me, it stops here.
> Which line produces this error when it is run?
Here is the full error output.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 30>()
     28 rfr.fit(x_train, y_train)
     29 # Run prediction on the test data
---> 30 predict_y = rfr.predict(x_test)
     31 # Evaluate with the R2 coefficient of determination
     32 r2_score = r2_score(y_test, predict_y)

File ~\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py:971, in ForestRegressor.predict(self, X)
    969 check_is_fitted(self)
    970 # Check data
--> 971 X = self._validate_X_predict(X)
    973 # Assign chunk of trees to jobs
    974 n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

File ~\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py:579, in BaseForest._validate_X_predict(self, X)
    576 """
    577 Validate X whenever one tries to predict, apply, predict_proba."""
    578 check_is_fitted(self)
--> 579 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
    580 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
    581 raise ValueError("No support for np.int64 index based sparse matrices")

File ~\Anaconda3\lib\site-packages\sklearn\base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    564 raise ValueError("Validation should be done on X, y or both.")
    565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
    567 out = X
    568 elif no_val_X and not no_val_y:

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:800, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    794 raise ValueError(
    795 "Found array with dim %d. %s expected <= 2."
    796 % (array.ndim, estimator_name)
    797 )
    799 if force_all_finite:
--> 800 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    802 if ensure_min_samples > 0:
    803 n_samples = _num_samples(array)

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:114, in _assert_all_finite(X, allow_nan, msg_dtype)
    107 if (
    108 allow_nan
    109 and np.isinf(X).any()
    110 or not allow_nan
    111 and not np.isfinite(X).all()
    112 ):
    113 type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114 raise ValueError(
    115 msg_err.format(
    116 type_err, msg_dtype if msg_dtype is not None else X.dtype
    117 )
    118 )
    119 # for object dtype data, we only check for NaNs (GH-13254)
    120 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

jbpb0

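2022/10/17 02:20

> AttributeError: 'list' object has no attribute 'shape'
> With the code you gave me, it stops here.
That code is meant for the data created by the following lines:
    x_train = np.random.rand(100, 15)
    y_train = np.random.rand(100,)
Please run the code I wrote in full, exactly as it is. It runs without errors, doesn't it? In my code, x_train and y_train are both float64, so you can see that floating-point numbers are handled without any problem. The code is there for you to confirm that.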

jbpb0

2022/10/17 02:23

> ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
If this is raised at
> ---> 30 predict_y = rfr.predict(x_test)
then the earlier line
> 28 rfr.fit(x_train, y_train)
ran without any problem. That also shows that
> Is random forest unable to handle numbers with a fractional part...?
is not the case.

Deleted user

2022/10/17 02:24 (edited)

My apologies. The following worked!!!
    import numpy as np
    x_train = np.random.rand(100, 15)
    y_train = np.random.rand(100,)
    print(x_train.shape)
    print(y_train.shape)
    print(x_train.dtype)
    print(y_train.dtype)
    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor(n_estimators=10)
    rfr.fit(x_train, y_train)
(100, 15)
(100,)
float64
float64
RandomForestRegressor(n_estimators=10)

jbpb0

2022/10/17 02:29

> ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
is raised at
> ---> 30 predict_y = rfr.predict(x_test)
so doesn't df_test_all, which x_test comes from, contain NaN? Add the following just above the line where the error occurs and run it:
    print(df_test_all.isnull().any())
Reference: https://note.nkmk.me/python-pandas-nan-judge-count/ , the section "行・列ごとに欠損値をひとつでも含むか判定" (checking whether each row or column contains at least one missing value)

Deleted user

2022/10/17 02:32

This is what I got.
x1 False
x2 False
x3 False
x4 False
x5 False
x6 False
x7 False
x8 False
x9 False
x10 False
x11 False
x12 False
x13 False
x14 False
x15 False
y True
dtype: bool
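
When only one column is flagged, it can help to see how many values are missing in it. A minimal sketch of that follow-up check, using a hypothetical three-row frame whose y column is entirely empty (the real df_test_all has 15 feature columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_test_all with an empty y column.
df_test_all = pd.DataFrame(
    {"x1": [0.1, 0.2, 0.3], "x2": [1.0, 2.0, 3.0], "y": [np.nan] * 3}
)

has_nan = df_test_all.isnull().any()     # per column: is any value missing?
nan_counts = df_test_all.isnull().sum()  # per column: how many are missing?
print(has_nan)
print(nan_counts)

# Names of the columns that would trip scikit-learn's finite-value check.
nan_columns = has_nan[has_nan].index.tolist()
print(nan_columns)
```

A count equal to the number of rows means the column is completely empty, which points to the CSV itself rather than to a few stray cells.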

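Deleted user

2022/10/17 02:37

Could it be that... The train data is the training data, so the y column contains values. The test data is the data to predict on, so it has a column named y, but that column is blank. Is that why the commas are being read as characters...?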

jbpb0

2022/10/17 02:48 (edited)

> y True
so it looks like y contains NaN.
> x_test = df_test_all
so is y in x_test as well? If it is, isn't y unnecessary in x_test?
> x_train =df_train_all
so is y in x_train too?
[Addendum] If y is in x_train, then the same values as y_train in
> rfr.fit(x_train, y_train)
are also in x_train. In that case the model learns to predict y_train from the y in x_train. The y in x_train is identical to y_train, so that is cheating. Remove y from both x_train and x_test.

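Deleted user

2022/10/17 02:49

>jbpb0, thank you for replying so patiently, again and again. The data I am currently using is as follows. I split the real data in two, into training data and prediction data.
inputdata_train.csv → training data
inputdata_test.csv → prediction data
The training data has to include the answers, so the y column exists and is filled with data. In the prediction data, the y column has only its column name and is otherwise blank. In this case, should I delete the y column from the prediction data?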

Deleted user

2022/10/17 02:50

Our messages crossed! I will do as you instructed!

jbpb0

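2022/10/17 02:57 (edited)

> In this case, should I delete the y column from the prediction data?
Neither x_train (for training) nor x_test (for testing) needs y. After reading the CSV files, make sure y does not end up in either of these:
> x_train =df_train_all
> x_test = df_test_all
[Addendum] You don't need to touch the CSV files, as long as you handle this properly after reading them.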

Deleted user

2022/10/17 03:30

I am embarrassed that I cannot solve any of this on my own... I get an error with the following code.
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    # Move to the folder that holds the training data
    os.chdir('C:/Analysis')
    # Load the training data CSV file
    df_train_all = pd.read_csv('inputdata_train.csv',encoding="utf-8")
    # Use y as the target variable
    y_train = df_train_all[u'y'].values.tolist()
    # Drop the y column to get the explanatory variables
    data_train = df_train_all.drop([u'y'], axis=1)
    x_train =df_train_all
    # Load the test data CSV
    df_test_all = pd.read_csv('inputdata_test.csv',encoding="utf-8")
    # Use y as the target variable
    y_test = df_test_all[u'y'].values.tolist()
    # Drop the y column to get the explanatory variables
    data_test = df_test_all.drop([u'y'], axis=1)
    x_test = df_test_all
    # Create the random forest regression object
    rfr = RandomForestRegressor(n_estimators=10)
    # Run training
    rfr.fit(x_train, y_train)
    # Run prediction on the test data
    predict_y = rfr.predict(x_test)
    # Evaluate with the R2 coefficient of determination
    r2_score = r2_score(y_test, predict_y)
    print(r2_score)
    # Get the feature importances
    feature = rfr.feature_importances_
    # Get the feature name labels
    label = data_train.columns[0:]
    # Print the features in descending order of importance
    indices = np.argsort(feature)[::-1]
    for i in range(len(feature)):
        print(str(i + 1) + " " + str(label[indices[i]]) + " " + str(feature[indices[i]]))
    # Plot actual vs. predicted values
    plt.subplot(121, facecolor='white')
    plt.title('forecast')
    plt_label = [i for i in range(1, 32)]
    plt.plot(plt_label, y_test, color='blue')
    plt.plot(plt_label, predict_y, color='red')
    # Bar chart of feature importances
    plt.subplot(122, facecolor='white')
    plt.title('forecast')
    plt.bar(
        range(
            len(feature)),
        feature[indices],
        color='blue',
        align='center')
    plt.xticks(range(len(feature)), label[indices], rotation=45)
    plt.xlim([-1, len(feature)])
    plt.tight_layout()
    # Show the plots
    plt.show()

Deleted user

2022/10/17 03:30

Has y not actually been removed?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [16], in <cell line: 35>()
     33 rfr.fit(x_train, y_train)
     34 # Run prediction on the test data
---> 35 predict_y = rfr.predict(x_test)
     36 # Evaluate with the R2 coefficient of determination
     37 r2_score = r2_score(y_test, predict_y)

File ~\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py:971, in ForestRegressor.predict(self, X)
    969 check_is_fitted(self)
    970 # Check data
--> 971 X = self._validate_X_predict(X)
    973 # Assign chunk of trees to jobs
    974 n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

File ~\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py:579, in BaseForest._validate_X_predict(self, X)
    576 """
    577 Validate X whenever one tries to predict, apply, predict_proba."""
    578 check_is_fitted(self)
--> 579 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
    580 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
    581 raise ValueError("No support for np.int64 index based sparse matrices")

File ~\Anaconda3\lib\site-packages\sklearn\base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    564 raise ValueError("Validation should be done on X, y or both.")
    565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
    567 out = X
    568 elif no_val_X and not no_val_y:

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:800, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    794 raise ValueError(
    795 "Found array with dim %d. %s expected <= 2."
    796 % (array.ndim, estimator_name)
    797 )
    799 if force_all_finite:
--> 800 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    802 if ensure_min_samples > 0:
    803 n_samples = _num_samples(array)

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:114, in _assert_all_finite(X, allow_nan, msg_dtype)
    107 if (
    108 allow_nan
    109 and np.isinf(X).any()
    110 or not allow_nan
    111 and not np.isfinite(X).all()
    112 ):
    113 type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114 raise ValueError(
    115 msg_err.format(
    116 type_err, msg_dtype if msg_dtype is not None else X.dtype
    117 )
    118 )
    119 # for object dtype data, we only check for NaNs (GH-13254)
    120 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

jbpb0

2022/10/17 05:08

> rfr.fit(x_train, y_train)
Add the following just above that line and run it:
    print(x_train.isnull().any())
    print(x_test.isnull().any())
Looking at the output, is y absent from both, and is there no True in either?

jbpb0

2022/10/17 05:13 (edited)

You have
    data_train = df_train_all.drop([u'y'], axis=1)
    data_test = df_test_all.drop([u'y'], axis=1)
so why do you then write
    x_train =df_train_all
    x_test = df_test_all
? It should be
    x_train = data_train
    x_test = data_test
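
A minimal, self-contained sketch of the separation discussed here: y goes into the target, and every other column goes into the features, for both the training and the test frame. The random data, the three-column layout, and `random_state=0` are assumptions for illustration; the real files have x1-x15.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
cols = ["x1", "x2", "x3"]

# Hypothetical stand-ins for the frames read from the two CSV files.
df_train_all = pd.DataFrame(rng.random((100, 3)), columns=cols)
df_train_all["y"] = df_train_all["x1"] * 2 + rng.normal(0, 0.1, 100)
df_test_all = pd.DataFrame(rng.random((20, 3)), columns=cols)
df_test_all["y"] = df_test_all["x1"] * 2 + rng.normal(0, 0.1, 20)

# Target: only y. Features: everything except y.
y_train = df_train_all["y"]
x_train = df_train_all.drop(columns=["y"])
y_test = df_test_all["y"]
x_test = df_test_all.drop(columns=["y"])

rfr = RandomForestRegressor(n_estimators=10, random_state=0)
rfr.fit(x_train, y_train)
predict_y = rfr.predict(x_test)

print(list(x_train.columns))  # y must not appear here
print(predict_y.shape)
```

`drop` returns a new frame and leaves `df_train_all` untouched, so the result has to be assigned to the variable that is passed to `fit` and `predict`.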

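Deleted user

2022/11/01 09:30

My apologies for the very late reply. (I was holding my wedding ceremony...) If you don't mind, I would be grateful if you could continue to guide me. As described above, I added the print calls and put in the corrected names. First, here is the code I entered.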

Deleted user

2022/11/01 09:30

    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    # Move to the folder that holds the training data
    os.chdir('C:/Analysis')
    # Load the training data CSV file
    df_train_all = pd.read_csv('inputdata_train.csv',encoding="utf-8")
    # Use y as the target variable
    y_train = df_train_all[u'y'].values.tolist()
    # Drop the y column to get the explanatory variables
    data_train = df_train_all.drop([u'y'], axis=1)
    x_train = data_train
    # Load the test data CSV
    df_test_all = pd.read_csv('inputdata_test.csv',encoding="utf-8")
    # Use y as the target variable
    y_test = df_test_all[u'y'].values.tolist()
    # Drop the y column to get the explanatory variables
    data_test = df_test_all.drop([u'y'], axis=1)
    x_test = data_test
    # Create the random forest regression object
    rfr = RandomForestRegressor(n_estimators=10)
    # Run training
    rfr.fit(x_train, y_train)
    print(x_train.isnull().any())
    print(x_test.isnull().any())
    # Run prediction on the test data
    predict_y = rfr.predict(x_test)
    # Evaluate with the R2 coefficient of determination
    r2_score = r2_score(y_test, predict_y)
    print(r2_score)
    # Get the feature importances
    feature = rfr.feature_importances_
    # Get the feature name labels
    label = data_train.columns[0:]
    # Print the features in descending order of importance
    indices = np.argsort(feature)[::-1]
    for i in range(len(feature)):
        print(str(i + 1) + " " + str(label[indices[i]]) + " " + str(feature[indices[i]]))
    # Plot actual vs. predicted values
    plt.subplot(121, facecolor='white')
    plt.title('forecast')
    plt_label = [i for i in range(1, 32)]
    plt.plot(plt_label, y_test, color='blue')
    plt.plot(plt_label, predict_y, color='red')
    # Bar chart of feature importances
    plt.subplot(122, facecolor='white')
    plt.title('forecast')
    plt.bar(
        range(
            len(feature)),
        feature[indices],
        color='blue',
        align='center')
    plt.xticks(range(len(feature)), label[indices], rotation=45)
    plt.xlim([-1, len(feature)])
    plt.tight_layout()
    # Show the plots
    plt.show()

Deleted user

2022/11/01 09:30

The result I got is below.

Deleted user

2022/11/01 09:31

x1 False
x2 False
x3 False
x4 False
x5 False
x6 False
x7 False
x8 False
x9 False
x10 False
x11 False
x12 False
x13 False
x14 False
x15 False
dtype: bool
x1 False
x2 False
x3 False
x4 False
x5 False
x6 False
x7 False
x8 False
x9 False
x10 False
x11 False
x12 False
x13 False
x14 False
x15 False
dtype: bool
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 42>()
     40 predict_y = rfr.predict(x_test)
     41 # Evaluate with the R2 coefficient of determination
---> 42 r2_score = r2_score(y_test, predict_y)
     43 print(r2_score)
     44 # Get the feature importances

File ~\Anaconda3\lib\site-packages\sklearn\metrics\_regression.py:789, in r2_score(y_true, y_pred, sample_weight, multioutput)
    702 def r2_score(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
    703 """:math:`R^2` (coefficient of determination) regression score function.
    704
    705 Best possible score is 1.0 and it can be negative (because the
    (...)
    787 -3.0
    788 """
--> 789 y_type, y_true, y_pred, multioutput = _check_reg_targets(
    790 y_true, y_pred, multioutput
    791 )
    792 check_consistent_length(y_true, y_pred, sample_weight)
    794 if _num_samples(y_pred) < 2:

File ~\Anaconda3\lib\site-packages\sklearn\metrics\_regression.py:95, in _check_reg_targets(y_true, y_pred, multioutput, dtype)
     61 """Check that y_true and y_pred belong to the same regression task.
     62
     63 Parameters
    (...)
     92 the dtype argument passed to check_array.
     93 """
     94 check_consistent_length(y_true, y_pred)
---> 95 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
     96 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
     98 if y_true.ndim == 1:

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:800, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    794 raise ValueError(
    795 "Found array with dim %d. %s expected <= 2."
    796 % (array.ndim, estimator_name)
    797 )
    799 if force_all_finite:
--> 800 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    802 if ensure_min_samples > 0:
    803 n_samples = _num_samples(array)

File ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py:114, in _assert_all_finite(X, allow_nan, msg_dtype)
    107 if (
    108 allow_nan
    109 and np.isinf(X).any()
    110 or not allow_nan
    111 and not np.isfinite(X).all()
    112 ):
    113 type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114 raise ValueError(
    115 msg_err.format(
    116 type_err, msg_dtype if msg_dtype is not None else X.dtype
    117 )
    118 )
    119 # for object dtype data, we only check for NaNs (GH-13254)
    120 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Deleted user

2022/11/01 09:32

I am sorry to lean on you for everything, but should I understand this as meaning that the type of the numeric data inside does not match, so it needs to be converted?

jbpb0

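2022/11/01 10:01 (edited)

> ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
is raised at
> ---> 42 r2_score = r2_score(y_test, predict_y)
so please examine y_test and predict_y and see whether they contain anything odd.
The error in the original question was raised at
> # Run training
> rfr.fit(x_train, y_train)
and once that was fixed, the next error was raised at
> # Run prediction on the test data
> predict_y = rfr.predict(x_test)
and with that fixed too, the error is now raised at
> # Evaluate with the R2 coefficient of determination
> r2_score = r2_score(y_test, predict_y)
so you are making progress, little by little.
If you disable the following (put "#" at the start of each line to turn it into a comment), does the rest run to the end?
    # Evaluate with the R2 coefficient of determination
    r2_score = r2_score(y_test, predict_y)
    print(r2_score)
One of the following lines may raise an error:
> plt.plot(plt_label, y_test, color='blue')
> plt.plot(plt_label, predict_y, color='red')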

Deleted user

2022/11/01 23:52

Thank you for your reply. First, I changed the line to #r2_score = r2_score(y_test, predict_y). The result was as follows.
--------------------------------------------------------------------------------------------
1 x13 0.8329134472116941
2 x14 0.03518887742193335
3 x15 0.02122176455932032
4 x9 0.018532556894252962
5 x11 0.018116616749020008
6 x1 0.01407167314821801
7 x3 0.01240505809528215
8 x6 0.012351605727518766
9 x8 0.006757431668461389
10 x10 0.0065468592845417325
11 x2 0.005889648901127411
12 x12 0.00542049325026797
13 x4 0.005163394686602255
14 x5 0.003553672702399675
15 x7 0.0018668996993598988
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [2], in <cell line: 53>()
     51 plt.title('forecast')
     52 plt_label = [i for i in range(1, 32)]
---> 53 plt.plot(plt_label, y_test, color='blue')
     54 plt.plot(plt_label, predict_y, color='red')
     55 # Bar chart of feature importances

File ~\Anaconda3\lib\site-packages\matplotlib\pyplot.py:2757, in plot(scalex, scaley, data, *args, **kwargs)
   2755 @_copy_docstring_and_deprecators(Axes.plot)
   2756 def plot(*args, scalex=True, scaley=True, data=None, **kwargs):
-> 2757 return gca().plot(
   2758 *args, scalex=scalex, scaley=scaley,
   2759 **({"data": data} if data is not None else {}), **kwargs)

File ~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:1632, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
   1390 """
   1391 Plot y versus x as lines and/or markers.
   1392
   (...)
   1629 (``'green'``) or hex strings (``'#008000'``).
   1630 """
   1631 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1632 lines = [*self._get_lines(*args, data=data, **kwargs)]
   1633 for line in lines:
   1634 self.add_line(line)

File ~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py:312, in _process_plot_var_args.__call__(self, data, *args, **kwargs)
    310 this += args[0],
    311 args = args[1:]
--> 312 yield from self._plot_args(this, kwargs)

File ~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py:498, in _process_plot_var_args._plot_args(self, tup, kwargs, return_kwargs)
    495 self.axes.yaxis.update_units(y)
    497 if x.shape[0] != y.shape[0]:
--> 498 raise ValueError(f"x and y must have same first dimension, but "
    499 f"have shapes {x.shape} and {y.shape}")
    500 if x.ndim > 2 or y.ndim > 2:
    501 raise ValueError(f"x and y can be no greater than 2D, but have "
    502 f"shapes {x.shape} and {y.shape}")

ValueError: x and y must have same first dimension, but have shapes (31,) and (150,)

Deleted user

2022/11/01 23:56

I wondered whether the numbers might have too many digits to fit the type. As you pointed out, the error this time was raised here: plt.plot(plt_label, y_test, color='blue')

Deleted user

2022/11/02 00:10 (edited)

Also, every value in y_test turned out to be NaN.
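
A minimal sketch of how y_test and predict_y could be inspected before scoring. The values are hypothetical, with one true value missing on purpose, and the guard of "at least two known values" is an assumption of this sketch. The result is stored as `score` because `r2_score = r2_score(...)` rebinds the function name to a number, so running the same notebook cell a second time fails.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values; one true value is missing on purpose.
y_test = [1.0, 2.0, float("nan"), 4.0]
predict_y = np.array([1.1, 1.9, 3.2, 3.8])

y_true = np.asarray(y_test, dtype=float)
print("NaN in y_test:   ", int(np.isnan(y_true).sum()))
print("NaN in predict_y:", int(np.isnan(predict_y).sum()))

# Keep only the rows whose true value is known.
known = ~np.isnan(y_true)
if known.sum() >= 2:
    # New name for the result, so the r2_score function stays usable.
    score = r2_score(y_true[known], predict_y[known])
    print(score)
else:
    score = None
    print("Not enough known y values to compute R2.")
```

If every value in y_test is NaN, as reported here, no rows remain and there is nothing to score; the test data then needs real y values.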

Deleted user

2022/11/01 23:58

I will first check the part where I split the data. After that, please let me consult you here again.
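
One way to do the split in code rather than by hand is scikit-learn's `train_test_split`, which keeps the known y values on both sides. This is a minimal sketch under assumptions: the data frame is random stand-in data with three features, and `test_size=0.25` and `random_state=0` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical labelled data standing in for the original, unsplit file.
df_all = pd.DataFrame(rng.random((200, 3)), columns=["x1", "x2", "x3"])
df_all["y"] = 3 * df_all["x1"] - df_all["x2"] + rng.normal(0, 0.05, 200)

X = df_all.drop(columns=["y"])
y = df_all["y"]

# Both parts keep their true y values, so the predictions can be scored.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rfr = RandomForestRegressor(n_estimators=10, random_state=0)
rfr.fit(x_train, y_train)
predict_y = rfr.predict(x_test)

score = r2_score(y_test, predict_y)
print(len(x_train), len(x_test))
print(score)
```

Rows whose y is genuinely unknown can still be passed to `rfr.predict` afterwards; they just cannot take part in the R2 evaluation.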

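Deleted user

2022/11/02 00:10

I reviewed each dataset, but... y_test is the data that is going to be predicted, so I think it is natural for it to be NaN. Is that correct? Note that the original inputdata_test.csv still has the y column name, but all of its cells are blank.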

jbpb0

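2022/11/02 04:35 (edited)

> y_test is the data that is going to be predicted, so I think it is natural for it to be NaN. Is that correct?
That is wrong.
> # Run prediction on the test data
> predict_y = rfr.predict(x_test)
predicts predict_y from x_test.
> # Evaluate with the R2 coefficient of determination
> r2_score = r2_score(y_test, predict_y)
evaluates how well the predicted predict_y agrees with y_test, the correct answers. If y_test does not hold the correct answers, the predicted predict_y cannot be evaluated.
> The original inputdata_test.csv still has the y column name, but all of its cells are blank.
That is already where things go wrong.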

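Deleted user

2022/11/02 07:51

Thank you for your answer. First of all, I have now got it working. Thank you very much. I put the answers into y_test. Next, I dealt with the following. Because this error was raised:
ValueError: x and y must have same first dimension, but have shapes (31,) and (150,)
I changed the numbers in this line to match the number in the error:
    plt_label = [i for i in range(0, 150)]
and the graph was drawn correctly. The numbers 0-150 are the x-axis values... I assume this is the number of data points, but entering it every time is a hassle. Can't it be adjusted automatically?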

jbpb0

2022/11/02 08:49

> # Plot actual vs. predicted values
What is displayed if you add
    print(len(y_test))
just above that line and run it?

jbpb0

2022/11/10 23:11

How did the above go?
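
On the question of sizing the x-axis automatically: the hint about `len(y_test)` suggests deriving the x positions from the data instead of typing a fixed range. A minimal sketch under that assumption, with hypothetical values in place of the real y_test and predict_y:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.random(150)                       # hypothetical true values
predict_y = y_test + rng.normal(0, 0.05, 150)  # hypothetical predictions

# One x position per sample, whatever the size of the test set.
plt_label = np.arange(1, len(y_test) + 1)

fig, ax = plt.subplots()
ax.set_title("forecast")
ax.plot(plt_label, y_test, color="blue", label="actual")
ax.plot(plt_label, predict_y, color="red", label="predicted")
ax.legend()
```

Call `plt.show()` afterwards to display the figure. matplotlib also accepts `ax.plot(y_test)` without x values, in which case it uses 0, 1, 2, ... as the x positions.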