Python: LightGBM only takes one explanatory variable into account

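Background

I am trying to implement a program in Python that solves a regression problem with LightGBM.
However, only one explanatory variable is taken into account, and I cannot work out the cause or how to fix it.

What I want to achieve

Build a correct LightGBM model.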

Relevant source code

Python
import lightgbm as lgb

# Use every column of the DataFrame except the last one as explanatory variables
X_train = df_stamp_2019.iloc[:, :df_stamp_2019.shape[1]-1].values
y_train = df_stamp_2019.iloc[:, df_stamp_2019.shape[1]-1].values
X_test = df_stamp_2020.iloc[:, :df_stamp_2020.shape[1]-1].values
y_test = df_stamp_2020.iloc[:, df_stamp_2020.shape[1]-1].values

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_test, y_test)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',  # objective: regression
    'metric': {'rmse'},  # evaluation metric: RMSE (root mean squared error)
}

model_lgbm = lgb.train(
    params = params,
    train_set = train_set,
    valid_sets = [train_set, valid_set],
    num_boost_round = 100
)

y_pred = model_lgbm.predict(X_test)

lgb.plot_importance(model_lgbm)
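
The positional slicing above can also be written by column name, which is what the asker's later code in this thread does. A minimal sketch, assuming a tiny hypothetical DataFrame `df` that stands in for `df_stamp_2019` (its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df_stamp_2019: two features and the target "predict".
df = pd.DataFrame({
    "week_1": [1, 0, 0],
    "count_ago": [0.62, 1.13, 1.24],
    "predict": [1.13, 1.24, 2.06],
})

# Select by name instead of by position; this does not depend on column order.
X = df.drop(columns="predict")
y = df["predict"]

# The positional version from the question yields the same arrays here.
X_pos = df.iloc[:, :df.shape[1] - 1].values
y_pos = df.iloc[:, df.shape[1] - 1].values

print(X.columns.tolist())
```

Keeping `X` as a DataFrame rather than calling `.values` also keeps the column names available for later inspection.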

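The first five rows of df_stamp_2019 are as follows. Every column except predict is used as an explanatory variable, and predict is the target variable.
All variables other than count_ago were generated by one-hot encoding.

The info for df_stamp_2019 (df_stamp_2020 is the same) is as follows.

Result

Plotting the feature importance gives the following: only the 18th column, count_ago, is taken into account.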

jbpb0

2022/11/17 00:51

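This may be unrelated to the question, but why are count_ago and predict of dtype object rather than float? > The first five rows of df_stamp_2019 are as follows. Judging from that, they look like floating-point numbers. Is something non-numeric mixed in?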

ken_seki_1701

2022/11/17 01:00

Thank you for your reply. No, nothing non-numeric is mixed in. The object dtype bothered me as well, so I converted the columns to float, but that did not solve the problem.
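
One way to check whether an object column really contains only numbers is sketched below. It assumes a hypothetical Series `s` whose values are stored as strings; the data is made up and is not taken from the asker's DataFrame.

```python
import pandas as pd

# Hypothetical column: numbers stored as strings, so pandas reports dtype "object".
s = pd.Series(["0.619", "1.133", "-0.512"], dtype="object", name="count_ago")

# errors="coerce" turns anything non-numeric into NaN instead of raising.
converted = pd.to_numeric(s, errors="coerce")

# Entries that became NaN only through the conversion were not numeric.
n_bad = int(converted.isna().sum() - s.isna().sum())
print(s.dtype, "->", converted.dtype, "| non-numeric entries:", n_bad)
```

If `n_bad` is 0, the column held only numeric values and the object dtype came from how the DataFrame was built.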

jbpb0

2022/11/17 01:20

This may also be unrelated to the question, but why does valid_sets = [train_set, valid_set] include train_set?

ken_seki_1701

2022/11/17 01:41

I don't understand this part very well myself. I took code I found online as-is and then modified it, which is how it ended up like this.

jbpb0

2022/11/17 02:00

I asked because https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py specifies only the test data (valid_sets=lgb_eval), and that is also the usual practice with other algorithms. However, the description of evals_result at https://lightgbm.readthedocs.io/en/v3.3.3/pythonapi/lightgbm.train.html, as well as https://zenn.dev/megane_otoko/articles/2021ad_03_simple_regression https://qiita.com/yuki_edy/items/61789121371a60fb5e83 https://blog.amedama.jp/entry/lightgbm-custom-metric, all pass it the way you do, so that is probably fine too. My apologies.

jbpb0

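2022/11/17 02:14 (edited)

After training, if you evaluate the values predicted from X_test against the true values y_test with something like RMSE or R2, how much does performance drop when count_ago is the only explanatory variable? If performance barely changes with count_ago alone, then perhaps count_ago really is the only variable contributing to the prediction. Reference: https://mathmatical22.xyz/2020/04/09/%E3%80%90%E5%88%9D%E5%AD%A6%E8%80%85%E5%90%91%E3%81%91%E3%80%91lightgbm-%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E4%BD%BF%E3%81%84%E6%96%B9-%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90%E7%B7%A8%E3%80%90python/ [Addendum] Even if they don't contribute to the prediction, it does seem odd that they don't appear in the feature importance plot. Ah, so that is what you were asking. My apologies.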
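
For reference, the two metrics mentioned here can be written out directly. A minimal sketch in plain Python; the prediction lists `pred_all` and `pred_one` are hypothetical placeholders for "all features" and "count_ago only", not results from the thread.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
pred_all = [1.5, 2.0, 2.5, 4.0]  # hypothetical predictions using all features
pred_one = [1.5, 2.0, 2.5, 4.0]  # hypothetical predictions using count_ago only

for label, pred in [("all features", pred_all), ("count_ago only", pred_one)]:
    print(label, "RMSE:", rmse(y_true, pred), "R2:", r2(y_true, pred))
```

Identical scores for the two feature sets would suggest the extra features are not changing the predictions at all.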

Deleted user

2022/11/17 21:38

If you don't mind, could you upload the raw data and the full source code?

ken_seki_1701

2022/11/17 22:08

Understood. Please give me a moment.

ken_seki_1701

2022/11/17 22:38

Python
import pandas as pd

df_2019_month = pd.DataFrame({'month': [
    str(1), str(1), str(1), str(1), str(1),
    str(2), str(2), str(2), str(2),
    str(3), str(3), str(3), str(3),
    str(4), str(4), str(4), str(4),
    str(5), str(5), str(5), str(5), str(5),
    str(6), str(6), str(6), str(6),
    str(7), str(7), str(7), str(7),
    str(8), str(8), str(8), str(8), str(8),
    str(9), str(9), str(9), str(9),
    str(10), str(10), str(10), str(10), str(10),
    str(11), str(11), str(11), str(11),
    str(12), str(12), str(12), str(12)]})

df_2020_month = pd.DataFrame({'month': [
    str(1), str(1), str(1), str(1), str(1),
    str(2), str(2), str(2), str(2),
    str(3), str(3), str(3), str(3),
    str(4), str(4), str(4), str(4), str(4),
    str(5), str(5), str(5), str(5),
    str(6), str(6), str(6), str(6),
    str(7), str(7), str(7), str(7), str(7),
    str(8), str(8), str(8), str(8),
    str(9), str(9), str(9), str(9),
    str(10), str(10), str(10), str(10), str(10),
    str(11), str(11), str(11), str(11),
    str(12), str(12), str(12), str(12), str(12)]})

df_2019_week = pd.DataFrame({'week': [
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4)]})

df_2020_week = pd.DataFrame({'week': [
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5),
    str(1), str(2), str(3), str(4),
    str(1), str(2), str(3), str(4), str(5)]})

df_2019_sale = pd.DataFrame({'sale': [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 148,
    2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 124, 26, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 100, 50, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 100, 50, 0, 0]})

df_2020_sale = pd.DataFrame({'sale': [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 100,
    50, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 76, 74, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 52, 98, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 52, 98, 0, 0, 0]})

count_ago_2019 = pd.DataFrame({'count_ago': [
    0.6190180518747931, 1.1332339968657208, 1.2360771858639064, 2.058822697849391,
    0.5675964573757003, 3.70431372182036, 2.213087481346669, -0.5122570271052481,
    1.338920374862092, 2.5730386428403187, -0.3579922436079697, -0.30657064910887694,
    -0.3579922436079697, -0.2037274601106914, 0.3619100793793292, 0.0533805123847725,
    0.3619100793793292, -0.7179434051016191, -0.5636786216043409, 1.081812402366628,
    -1.9006400785807531, -0.5636786216043409, 0.6190180518747931, 0.001958917885679718,
    -1.7977968895825676, -0.2037274601106914, -0.3579922436079697, -0.4094138381070625,
    -0.30657064910887694, -0.7179434051016191, -1.0778945665952686, -0.6665218106025264,
    0.20764529588205083, -0.975051377597083, -0.7179434051016191, -0.10088427111250584,
    -0.25514905460978415, -0.7179434051016191, -0.30657064910887694, -0.7179434051016191,
    -0.3579922436079697, -0.8722081885988975, 0.10480210688386528, -0.3579922436079697,
    2.0074011033502983, -1.4892673225880109, -0.5122570271052481, -0.6665218106025264,
    0.7732828353720714, 0.001958917885679718, -1.2835809445916397]})

predict_2019 = pd.DataFrame({'predict': [
    1.1332339968657208, 1.2360771858639064, 2.058822697849391, 0.5675964573757003,
    3.70431372182036, 2.213087481346669, -0.5122570271052481, 1.338920374862092,
    2.5730386428403187, -0.3579922436079697, -0.30657064910887694, -0.3579922436079697,
    -0.2037274601106914, 0.3619100793793292, 0.0533805123847725, 0.3619100793793292,
    -0.7179434051016191, -0.5636786216043409, 1.081812402366628, -1.9006400785807531,
    -0.5636786216043409, 0.6190180518747931, 0.001958917885679718, -1.7977968895825676,
    -0.2037274601106914, -0.3579922436079697, -0.4094138381070625, -0.30657064910887694,
    -0.7179434051016191, -1.0778945665952686, -0.6665218106025264, 0.20764529588205083,
    -0.975051377597083, -0.7179434051016191, -0.10088427111250584, -0.25514905460978415,
    -0.7179434051016191, -0.30657064910887694, -0.7179434051016191, -0.3579922436079697,
    -0.8722081885988975, 0.10480210688386528, -0.3579922436079697, 2.0074011033502983,
    -1.4892673225880109, -0.5122570271052481, -0.6665218106025264, 0.7732828353720714,
    0.001958917885679718, -1.2835809445916397, -1.1293161610943614]})

count_ago_2020 = pd.DataFrame({'count_ago': [
    -1.5921105115861964, 0.7218612408729786, -0.30657064910887694, -0.04946267661341306,
    0.3619100793793292, -0.975051377597083, 1.6474499418566486, 0.10480210688386528,
    0.10480210688386528, 0.6190180518747931, -1.0778945665952686, -1.4892673225880109,
    -0.8207865940998047, 0.2590668903811436, -0.7693649996007119, 0.3104884848802364,
    0.15622370138295805, 0.001958917885679718, 0.6190180518747931, 2.1616658868475764,
    -0.3579922436079697, -0.4608354326061553, 0.6704396463738859, -0.2037274601106914,
    -0.9236297830979903, 0.3104884848802364, 0.876126024370257, 0.6704396463738859,
    0.6704396463738859, 2.058822697849391, 0.6704396463738859, 1.0303908078675352,
    -0.7693649996007119, -0.2037274601106914, -0.3579922436079697, 0.10480210688386528,
    -1.0778945665952686, -0.7179434051016191, -1.0778945665952686, -0.9236297830979903,
    -0.30657064910887694, -1.3864241335898253, 0.2590668903811436, 0.6704396463738859,
    -1.1293161610943614, 0.41333167387842196, 0.6704396463738859, 0.15622370138295805,
    0.876126024370257, 1.544606752858463, 0.6704396463738859, 0.0533805123847725]})

predict_2020 = pd.DataFrame({'predict': [
    0.7218612408729786, -0.30657064910887694, -0.04946267661341306, 0.3619100793793292,
    -0.975051377597083, 1.6474499418566486, 0.10480210688386528, 0.10480210688386528,
    0.6190180518747931, -1.0778945665952686, -1.4892673225880109, -0.8207865940998047,
    0.2590668903811436, -0.7693649996007119, 0.3104884848802364, 0.15622370138295805,
    0.001958917885679718, 0.6190180518747931, 2.1616658868475764, -0.3579922436079697,
    -0.4608354326061553, 0.6704396463738859, -0.2037274601106914, -0.9236297830979903,
    0.3104884848802364, 0.876126024370257, 0.6704396463738859, 0.6704396463738859,
    2.058822697849391, 0.6704396463738859, 1.0303908078675352, -0.7693649996007119,
    -0.2037274601106914, -0.3579922436079697, 0.10480210688386528, -1.0778945665952686,
    -0.7179434051016191, -1.0778945665952686, -0.9236297830979903, -0.30657064910887694,
    -1.3864241335898253, 0.2590668903811436, 0.6704396463738859, -1.1293161610943614,
    0.41333167387842196, 0.6704396463738859, 0.15622370138295805, 0.876126024370257,
    1.544606752858463, 0.6704396463738859, 0.0533805123847725]})

df_sta_2019 = pd.concat([df_2019_month, df_2019_week, df_2019_sale, count_ago_2019, predict_2019], axis=1)
df_sta_2019_oh = pd.get_dummies(df_sta_2019)
df_sta_2019_oh = df_sta_2019_oh.reindex(columns=['month_1','month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7', 'month_8', 'month_9', 'month_10', 'month_11','month_12', 'week_1', 'week_2', 'week_3', 'week_4', 'week_5', 'sale', 'count_ago', 'predict'])

df_sta_2020 = pd.concat([df_2020_month, df_2020_week, df_2020_sale, count_ago_2020, predict_2020], axis=1)
df_sta_2020_oh = pd.get_dummies(df_sta_2020)
df_sta_2020_oh = df_sta_2020_oh.reindex(columns=['month_1','month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7', 'month_8', 'month_9', 'month_10', 'month_11','month_12', 'week_1', 'week_2', 'week_3', 'week_4', 'week_5', 'sale','count_ago', 'predict'])

ken_seki_1701

2022/11/17 22:40

Python
import lightgbm as lgb
import matplotlib.pyplot as plt

X_train = df_sta_2019_oh.drop('predict', axis=1).values
y_train = df_sta_2019_oh['predict'].values
X_test = df_sta_2020_oh.drop('predict', axis=1).values
y_test = df_sta_2020_oh['predict'].values

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_test, y_test)

params = {
    'objective': 'regression',  # objective: regression
    'metric': {'rmse'},  # evaluation metric: RMSE (root mean squared error)
}

model_lgbm = lgb.train(
    params = params,
    train_set = train_set,
    num_boost_round = 100
)

y_pred = model_lgbm.predict(X_test)

plt.plot(range(1,54), y_pred, 'r', label='predict')
plt.plot(range(1,54), y_test, 'b', label='actual')
plt.legend()

Here it is. Sorry that it is hard to read. I had to transform the data for reasons on my side, so it differs slightly from the code posted in the question.

jbpb0

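2022/11/18 00:54 (edited)

If you add the following at the end of your code and run it, the feature importance of each explanatory variable is printed as a number, so please check it.

print(model_lgbm.feature_importance())

> Even if they don't contribute to the prediction, it does seem odd that they don't appear in the feature importance plot

I made up some data and checked: explanatory variables whose feature importance is 0 did not appear in the plot.
In this question, isn't it the case that every explanatory variable other than count_ago has a feature importance of 0 (contributes nothing to the prediction)?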

jbpb0

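2022/11/18 00:58 (edited)

The numbers of data points don't match, so NaN is appended to the end of count_ago and predict, which have fewer values. This happens for both 2019 and 2020.

> I had to transform the data for reasons on my side

Is that when the counts stopped matching?

[Addendum] To get rid of the NaN values, I used only the first 50 rows:

df_sta_2019_oh = df_sta_2019_oh.iloc[:50, :]
df_sta_2020_oh = df_sta_2020_oh.iloc[:50, :]

After training and predicting, I evaluated the result with:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('RMSE :',rmse)
r2 = r2_score(y_test,y_pred)
print('R2 :',r2)

I then compared it with the evaluation (RMSE, R2) obtained when training and predicting with count_ago as the only explanatory variable:

X_train = df_sta_2019_oh['count_ago'].values.reshape(-1, 1)
X_test = df_sta_2020_oh['count_ago'].values.reshape(-1, 1)

The two results were exactly the same.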
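
The NaN padding described here is how pd.concat with axis=1 handles frames of different lengths. A minimal sketch, assuming two small hypothetical frames (`month` with three rows, `count_ago` with two) rather than the asker's data:

```python
import pandas as pd

# Hypothetical frames of unequal length, mirroring the mismatch described above.
month = pd.DataFrame({"month": ["1", "1", "2"]})
count_ago = pd.DataFrame({"count_ago": [0.62, 1.13]})

# Columns are aligned on the index; missing positions are filled with NaN.
df = pd.concat([month, count_ago], axis=1)
print(df)
print("rows with NaN:", int(df.isna().any(axis=1).sum()))

# One option: drop incomplete rows before building X and y.
df_clean = df.dropna()
print(len(df), "->", len(df_clean))
```

Checking `len()` of each source frame before concatenating is a cheap way to catch this kind of mismatch early.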

jbpb0

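2022/11/22 10:23 (edited)

I made up data whose trend is close to (but slightly different from) count_ago_2019:

count_ago_2019_2 = (count_ago_2019 * 10).set_axis(['count_ago2'], axis=1).astype("int")

and included it among the explanatory variables:

df_sta_2019 = pd.concat([df_2019_month, df_2019_week, df_2019_sale, count_ago_2019, predict_2019], axis=1)
↓ changed to
df_sta_2019 = pd.concat([df_2019_month, df_2019_week, df_2019_sale, count_ago_2019_2, count_ago_2019, predict_2019], axis=1)

To keep things consistent, I also changed the end of

df_sta_2019_oh = df_sta_2019_oh.reindex(...

from

'sale', 'count_ago', 'predict'])
↓ changed to
'sale', 'count_ago2', 'count_ago', 'predict'])

I did the same for count_ago_2020.
After training and predicting with that, I ran

lgb.plot_importance(model_lgbm)

and the resulting feature importances were

[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57 43]

so two explanatory variables now have a non-zero feature importance (the made-up data's importance is non-zero as well).

As this shows, if the explanatory variables include anything besides count_ago that contributes to the prediction, its feature importance should also be non-zero. So I still think that, in the original data, only count_ago contributes to the prediction and the other variables contribute nothing at all.
In other words, the cause of what you are asking about is simply that the data is like that.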

ken_seki_1701

2022/11/23 00:17

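Thank you for the thorough explanation and answer! So the result was that "the variables are taken into account by the model, but they are not important." Thank you for taking the time to answer!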

1 answer

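> Plotting the feature importance gives the following: only the 18th column, count_ago, is taken into account.

I made up some data and checked: explanatory variables whose feature importance is 0 did not appear in the plot.

Isn't it the case that every explanatory variable other than count_ago has a feature importance of 0 (contributes nothing to the prediction)?

If you add the following at the end of your code and run it, the feature importance of each explanatory variable is printed as a number, so please check it.

Python
print(model_lgbm.feature_importance())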

Posted 2022/11/24 04:50

jbpb0

Total score: 7658
