Background

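I used a random forest to build a model that predicts 'churn' from a sample dataset.
              test_score  train_score
RandomForest    0.019183     0.857964
As shown above, the scores on the training data and the test data came out completely different.
I filled the missing values with the column means.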
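For reference, the mean imputation mentioned above might look like the following minimal sketch. The toy frame is an assumption standing in for the real data (only a few column names are borrowed from it), and the exact call used in the original preprocessing is not shown in this post.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df2 (assumption: the real data is not reproduced here).
df2 = pd.DataFrame({
    "rev_Mean": [10.0, np.nan, 30.0],
    "change_mou": [1.0, 2.0, np.nan],
    "churn": [0, 1, 0],
})

# Replace each missing value with the mean of its own column.
df2_filled = df2.fillna(df2.mean())
print(df2_filled)
```

Computing the means on the whole frame before splitting lets test rows influence the fill values; computing them on the training split only and reusing them for the test split avoids that.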

What I want to achieve

How can I bring these two scores closer together?

Relevant source code

python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = df2.drop('churn', axis=1)  # explanatory variables
y = df2['churn']  # target variable

# Split into training and test data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Random forest settings
models = {
    'RandomForest': RandomForestRegressor(random_state=0),
}

# Build the models
# Note: for a regressor, score() returns R^2 (coefficient of determination), not accuracy
scores = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    scores[(model_name, 'train_score')] = model.score(X_train, y_train)
    scores[(model_name, 'test_score')] = model.score(X_valid, y_valid)

# Show the results
pd.Series(scores).unstack()

Additional information (framework/tool versions, etc.)

Here is the information about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 50 columns):

# Column Non-Null Count Dtype

0 rev_Mean 99643 non-null float64
1 mou_Mean 99643 non-null float64
2 totmrc_Mean 99643 non-null float64
3 da_Mean 99643 non-null float64
4 ovrmou_Mean 99643 non-null float64
5 ovrrev_Mean 99643 non-null float64
6 vceovr_Mean 99643 non-null float64
7 datovr_Mean 99643 non-null float64
8 roam_Mean 99643 non-null float64
9 change_mou 99109 non-null float64
10 change_rev 99109 non-null float64
11 drop_vce_Mean 100000 non-null float64
12 drop_dat_Mean 100000 non-null float64
13 blck_vce_Mean 100000 non-null float64
14 blck_dat_Mean 100000 non-null float64
15 unan_vce_Mean 100000 non-null float64
16 unan_dat_Mean 100000 non-null float64
17 plcd_vce_Mean 100000 non-null float64
18 plcd_dat_Mean 100000 non-null float64
19 recv_vce_Mean 100000 non-null float64
20 recv_sms_Mean 100000 non-null float64
21 comp_vce_Mean 100000 non-null float64
22 comp_dat_Mean 100000 non-null float64
23 custcare_Mean 100000 non-null float64
24 ccrndmou_Mean 100000 non-null float64
25 cc_mou_Mean 100000 non-null float64
26 inonemin_Mean 100000 non-null float64
27 threeway_Mean 100000 non-null float64
28 mou_cvce_Mean 100000 non-null float64
29 mou_cdat_Mean 100000 non-null float64
30 mou_rvce_Mean 100000 non-null float64
31 owylis_vce_Mean 100000 non-null float64
32 mouowylisv_Mean 100000 non-null float64
33 iwylis_vce_Mean 100000 non-null float64
34 mouiwylisv_Mean 100000 non-null float64
35 peak_vce_Mean 100000 non-null float64
36 peak_dat_Mean 100000 non-null float64
37 mou_peav_Mean 100000 non-null float64
38 mou_pead_Mean 100000 non-null float64
39 opk_vce_Mean 100000 non-null float64
40 opk_dat_Mean 100000 non-null float64
41 mou_opkv_Mean 100000 non-null float64
42 mou_opkd_Mean 100000 non-null float64
43 drop_blk_Mean 100000 non-null float64
44 attempt_Mean 100000 non-null float64
45 complete_Mean 100000 non-null float64
46 callfwdv_Mean 100000 non-null float64
47 callwait_Mean 100000 non-null float64
48 churn 100000 non-null int64
49 months 100000 non-null int64
dtypes: float64(48), int64(2)
memory usage: 38.1 MB
None

ps_aux_grep

Edited 2023/01/25 02:42

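Hyperparameter tuning with cross-validation looks like a good approach. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html As the library reference shows, many parameters determine the model's performance. For example, n_estimators, which sets the number of weak learners to build, defaults to 100, so it cannot be ruled out that the model is too powerful for the problem you are trying to solve.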
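A minimal sketch of what cross-validated tuning could look like with scikit-learn's GridSearchCV. The synthetic data stands in for the churn dataset, and the parameter grid is a hypothetical example rather than a recommendation; both are assumptions made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn data (assumption: the real df2 is not available here).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Binary target loosely driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: these values are illustrative, not recommendations.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training split only
    scoring="r2",  # same metric that RandomForestRegressor.score reports
)
search.fit(X_train, y_train)

best_params = search.best_params_
train_score = search.best_estimator_.score(X_train, y_train)
valid_score = search.best_estimator_.score(X_valid, y_valid)
print(best_params)
print(f"train R^2: {train_score:.3f}, valid R^2: {valid_score:.3f}")
```

Limiting max_depth or raising min_samples_leaf constrains each tree, which usually narrows the gap between the training and validation scores. Since churn is a 0/1 label, RandomForestClassifier with a classification metric may also be worth comparing.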