Python: cannot compare data with DecisionTreeClassifier. Error: Unknown label type: 'continuous'

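### Background / what I want to achieve
I am a beginner at programming. Ultimately I would like to run an analysis using a random forest. As a first step I tried to compare the "合計" (total) column against four other columns, and got the error shown below. Is it because y has only a single column? I would be grateful if you could tell me how to fix this so that the data can be compared. Thank you in advance.

The "合計" data (train_y) and the four columns I want to compare it with (train_x: 気温 = temperature, 降水量 = precipitation, 風速 = wind speed, 相対湿度 = relative humidity) are shown below.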

train_x
970 0.000000
367 0.000000
803 0.000000
151 0.000000
851 0.000000
590 0.000000
265 0.000000
386 0.000000
524 0.000000
1095 0.000000
609 0.000000
510 0.000000
462 0.000000
1214 0.000000
53 0.000000
753 0.000000
425 0.001755
236 0.000000
477 0.000000
310 0.006713
132 0.000000
912 0.000000
465 0.000000
9 0.000000
68 0.000000
779 0.000000
1094 0.000000
412 0.000000
892 0.000000
997 0.000000
...
186 0.000000
819 0.000000
888 0.000000
1198 0.000000
60 0.000000
202 0.000000
322 0.000000
824 0.000000
836 0.000000
365 0.000000
967 0.000000
1204 0.000000
1098 0.000000
1127 0.000000
577 0.000000
340 0.000000
850 0.000000
321 0.000000
32 0.000000
654 0.000000
397 0.000000
792 0.000000
814 0.000000
985 0.000000
116 0.000000
639 0.000000
71 0.000000
934 0.006466
815 0.000000
103 0.000000

train_y
[852 rows x 4 columns]
気温(℃) 降水量(mm) 風速(m/s) 相対湿度(％)
970 2.9 0.5 3.7 80.0
367 1.4 0.0 3.1 65.0
803 2.8 0.0 1.7 74.0
151 6.7 0.0 3.6 64.0
851 6.7 0.0 3.6 72.0
590 6.6 0.0 2.0 75.0
265 4.9 0.0 4.4 67.0
386 0.1 0.5 6.2 83.0
524 2.8 0.0 2.5 82.0
1095 7.3 1.5 4.5 83.0
609 5.1 0.5 3.2 83.0
510 6.6 1.5 0.7 90.0
462 5.0 0.0 4.0 66.0
1214 7.8 0.0 3.4 49.0
53 5.2 0.0 2.2 78.0
753 6.2 0.0 1.6 73.0
425 6.8 0.0 1.2 89.0
236 8.0 0.0 6.1 41.0
477 -0.6 0.0 1.8 86.0
310 7.5 0.0 3.0 74.0
132 5.5 0.0 1.3 75.0
912 -0.2 2.0 5.8 88.0
465 2.8 1.0 2.2 88.0
9 1.3 0.5 2.0 90.0
68 3.4 1.5 2.7 86.0
779 3.3 0.0 2.0 85.0
1094 4.9 0.0 1.1 74.0
412 3.7 0.0 2.8 67.0
892 7.6 0.0 2.1 58.0
997 1.2 0.0 3.6 76.0
... ... ... ... ...
186 1.1 0.5 6.3 72.0
819 1.4 0.0 1.8 71.0
888 7.2 0.0 5.1 73.0
1198 4.0 0.0 4.4 79.0
60 4.1 1.0 1.9 86.0
202 6.8 0.0 3.1 49.0
322 2.0 0.0 2.0 88.0
824 6.5 0.0 1.8 54.0
836 6.1 0.0 1.7 61.0
365 2.8 0.0 5.4 84.0
967 8.3 0.0 5.4 59.0
1204 6.4 0.0 3.6 65.0
1098 -0.6 0.5 4.4 82.0
1127 5.1 0.0 2.6 71.0
577 -0.5 0.0 1.7 86.0
340 8.3 0.0 4.3 51.0
850 6.9 0.0 3.1 60.0
321 2.0 0.0 2.0 88.0
32 0.5 0.5 1.8 90.0
654 3.9 1.0 3.4 83.0
397 -0.7 0.0 2.8 65.0
792 -0.8 1.5 5.0 88.0
814 -1.3 0.0 6.3 53.0
985 -0.5 2.0 6.1 88.0
116 1.6 1.0 2.0 88.0
639 4.4 0.0 2.8 69.0
71 5.6 1.5 2.9 89.0
934 9.2 0.0 3.1 48.0
815 -1.3 0.0 6.3 53.0
103 5.6 0.0 3.2 79.0

### Problem / error message

Traceback (most recent call last):
  File "C:/Users/aa/Documents/MyPythonProject/snow_train/reference_1.py", line 102, in <module>
    clf = clf.fit(train_x, train_y)
  File "C:\Anaconda\lib\site-packages\sklearn\tree\tree.py", line 739, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Anaconda\lib\site-packages\sklearn\tree\tree.py", line 146, in fit
    check_classification_targets(y)
  File "C:\Anaconda\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
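
As a small illustration of what this check reacts to, the sketch below calls `type_of_target` from `sklearn.utils.multiclass`, the same module that appears in the traceback. The two arrays are made-up stand-ins (the float values merely resemble the "合計" column), so treat this as an assumption-laden demonstration rather than a reproduction of the original data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets: floats with fractional parts vs. plain integer classes.
continuous_y = np.array([0.0, 0.001755, 0.006713])
discrete_y = np.array([0, 1, 2])

# type_of_target reports how scikit-learn interprets a target array.
continuous_kind = type_of_target(continuous_y)
discrete_kind = type_of_target(discrete_y)

print(continuous_kind)  # "continuous": a classifier rejects this kind of target
print(discrete_kind)    # "multiclass": a classifier accepts this kind of target
```

Float targets with fractional parts are reported as continuous, which is the situation the `ValueError` above describes.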

### Relevant source code

from sklearn.model_selection import train_test_split
data_first_list = data_first_list.drop(["列車番号","天気","地点"], axis=1)
#print(data_first_list["合計"])
train_x = data_first_list.drop("合計", axis=1)
train_y = data_first_list["合計"]
(train_x, test_x, train_y, test_y) = train_test_split(train_x, train_y, test_size = 0.3, random_state = 777)

# Decision tree #
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(train_x, train_y)
pred = clf.predict(test_x)

### What I tried
I also tried DecisionTreeRegressor, but that did not work either.


1 answer

Best answer

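A classifier takes labels as Y.
Labels must be discrete class values, such as integers.

The error is telling you that a continuous variable was passed as Y, so it cannot be classified.

Either use a Regressor instead of a Classifier, or map the continuous variable onto integers before using it.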
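
The two options above could be sketched roughly as follows. Everything here is illustrative: the random arrays `X` and `y` are hypothetical stand-ins for the four weather columns and the "合計" column, and the thresholds in `bin_edges` are arbitrary values chosen for the example, not something taken from the original data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical stand-in data: 4 feature columns and a small continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.abs(rng.normal(scale=0.005, size=200))

# Option 1: keep the target continuous and fit a regressor.
reg = DecisionTreeRegressor(random_state=0).fit(X, y)
reg_pred = reg.predict(X[:5])

# Option 2: map the continuous target onto integer classes, then classify.
# The thresholds are arbitrary (assumed) cut points: class 0, 1 or 2.
bin_edges = [0.002, 0.005]
y_class = np.digitize(y, bin_edges)
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
clf_pred = clf.predict(X[:5])

print(reg_pred)   # continuous predictions
print(clf_pred)   # integer class predictions
```

Binning into a few classes tends to be more useful than scaling every value to a distinct integer, because the latter can leave almost every row in its own class.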

Posted 2018/01/08 12:01

mkgrei

Total score: 8562

MasaKoba

2018/01/08 13:38

Thank you very much for your quick reply. I mapped the values to integers as shown below, but I still get an error (the data itself has no missing values and looks fine...). If you notice anything, I would be very grateful for your advice. Thank you.

from sklearn.model_selection import train_test_split
data_first_list = data_first_list.drop(["列車番号","天気","地点"], axis=1)
#print(data_first_list["合計"])
train_x = data_first_list.drop("合計", axis=1)
train_y_0 = data_first_list["合計"]*1000000
train_y =train_y_0.astype(int)

(Error message)
Traceback (most recent call last):
  File "C:/Users/aa/Documents/MyPythonProject/snow_train/reference_1.py", line 112, in <module>
    score = accuracy_score(train_y, pred)
  File "C:\Anaconda\lib\site-packages\sklearn\metrics\classification.py", line 172, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Anaconda\lib\site-packages\sklearn\metrics\classification.py", line 72, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [852, 366]

MasaKoba

2018/01/08 13:42 (edited)

Sorry to bother you again. When I switch to the Regressor, I get a similar error (the wording is slightly different), shown below.

Traceback (most recent call last):
  File "C:/Users/aa/Documents/MyPythonProject/snow_train/reference_1.py", line 111, in <module>
    score = accuracy_score(train_y, pred)
  File "C:\Anaconda\lib\site-packages\sklearn\metrics\classification.py", line 172, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Anaconda\lib\site-packages\sklearn\metrics\classification.py", line 72, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [852, 366]
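
Judging from the traceback alone, a likely cause is that `pred` was computed from `test_x` (366 rows, as in the question's code) while the score compares it against `train_y` (852 rows). The sketch below scores the predictions against `test_y` instead. It uses hypothetical random data, and the row count of 1218 is an assumption chosen only so that a 0.3 split reproduces the 852 and 366 sizes in the error message. Because the target is continuous, it uses regression metrics (`mean_squared_error`, `r2_score`) rather than `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data; 1218 rows is assumed so the split gives 852 / 366.
rng = np.random.default_rng(777)
X = rng.normal(size=(1218, 4))
y = np.abs(rng.normal(scale=0.005, size=1218))

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=777
)

reg = DecisionTreeRegressor(random_state=0).fit(train_x, train_y)
pred = reg.predict(test_x)  # one prediction per test row

# Comparing pred with train_y would mix 852 and 366 samples and fail.
n_train, n_pred = len(train_y), len(pred)
print(n_train, n_pred)

# Score against the labels that belong to the same rows as the predictions.
mse = mean_squared_error(test_y, pred)
r2 = r2_score(test_y, pred)
print(mse, r2)
```

The same pairing applies to a classifier: `accuracy_score(test_y, pred)` when `pred` comes from `test_x`.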