Comparing feature importances across RandomForest, XGBoost, and LightGBM

Background / what I want to achieve

On Kaggle's Titanic dataset, I computed feature importances with RandomForest, XGBoost, and LightGBM and compared them, but I am not sure how to interpret the results.

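Problem / error messages

As shown below, the two boosting methods, XGBoost and LightGBM, score slightly better, and their feature importances are close to each other. However, both assign an abnormally low importance to sex, which is considered a key feature in Titanic. In terms of importance, the RandomForest result is the one that makes more sense to me. Do feature importances really differ this much between RandomForest and boosting?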

RandomForest
f1 score:0.833

Pclass 8.4
Sex 29.5
Age 23.7
SibSp 4.8
Parch 4.5
Fare 20.0
Cabin 4.9
Embarked 4.1

XGBoosting
f1 score:0.848

Pclass 4.6
Sex 6.1
Age 39.9
SibSp 4.9
Parch 1.8
Fare 36.3
Cabin 3.4
Embarked 3.0

LightGBM
f1 score:0.838

Pclass 3.6
Sex 3.1
Age 40.9
SibSp 8.2
Parch 5.9
Fare 33.5
Cabin 2.3
Embarked 2.6

### Relevant source code

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import xgboost as xgb
import lightgbm as lgb

# Build the training data
df = pd.read_csv("train.csv")
X_train = df.drop(["y"], axis=1)
y_train = df.y

# Hold out 20% of the data for score validation
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Use F1 as the scoring metric
f1_scoring = make_scorer(f1_score, pos_label=1)

# RandomForest
print("RandomForest")
forest_param = {
    'n_estimators': [20,100,500],
    'max_depth': [3,5,7,9],
    'min_samples_leaf': [1, 2, 4]
}

# Train with grid search
rf = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=-1), forest_param, scoring=f1_scoring, cv=5)
rf.fit(X_train, y_train)
print('Best parameters: {}'.format(rf.best_params_))
# Validate the score on the held-out data
print('Train score: {:.3f}'.format(rf.score(X_train, y_train)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_test, rf.predict(X_test))))
print('f1 score: {:.3f}'.format(f1_score(y_test, rf.predict(X_test))))
rf_importances = pd.Series(rf.best_estimator_.feature_importances_, index = X_train.columns)
print(rf_importances)

# XGBoosting
print("XGBoosting")
xgb_param = {
    'learning_rate':[0.1,0.2],
    'n_estimators':[20,100,500],
    'max_depth':[3,5,7,9],
    'min_child_weight':[0.5,1,2],
    'max_delta_step':[5],
    'gamma':[1,3,5],
    'subsample':[0.8],
    'colsample_bytree':[0.8],
    'objective':['binary:logistic'],
    'nthread':[4],
    'scale_pos_weight':[1],
    'seed':[0]
}
# Train with grid search
# (named xgb_search so that it does not shadow the xgboost module imported as xgb)
xgb_search = GridSearchCV(xgb.XGBClassifier(
    silent=True, booster='gbtree', reg_alpha=0, reg_lambda=1, base_score=0.5, random_state=0, missing=None),
    xgb_param, scoring=f1_scoring, cv=4)
xgb_search.fit(X_train, y_train)
print('Best parameters: {}'.format(xgb_search.best_params_))
# Validate the score on the held-out data
print('Train score: {:.3f}'.format(xgb_search.score(X_train, y_train)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_test, xgb_search.predict(X_test))))
print('f1 score: {:.3f}'.format(f1_score(y_test, xgb_search.predict(X_test))))
xgb_importances = pd.Series(xgb_search.best_estimator_.feature_importances_, index = X_train.columns)
print(xgb_importances)

# LightGBM
print("LightGBM")
gbm_param = {
    'learning_rate':[0.1,0.2],
    'n_estimators':[20,100,500],
    'max_depth':[3,5,7,9],
    'min_child_weight':[0.5,1,2],
    'min_child_samples':[5,10,20],
    'subsample':[0.8],
    'colsample_bytree':[0.8],
    'verbose':[-1],
    'num_leaves':[80]
}
# Train with grid search
gbm = GridSearchCV(lgb.LGBMClassifier(), gbm_param, scoring=f1_scoring, cv=5)
gbm.fit(X_train, y_train)
print('Best parameters: {}'.format(gbm.best_params_))
# Validate the score on the held-out data
print('Train score: {:.3f}'.format(gbm.score(X_train, y_train)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_test, gbm.predict(X_test))))
print('f1 score: {:.3f}'.format(f1_score(y_test, gbm.predict(X_test))))
gbm_importances = pd.Series(gbm.best_estimator_.feature_importances_, index = X_train.columns)
print(gbm_importances)

What I tried

https://qiita.com/TomokIshii/items/290adc16e2ca5032ca07
That article pointed out insufficient parameter tuning as a cause, so I optimized the parameters with grid search.

Supplementary information (FW/tool versions, etc.)

Python 3.6
scikit-learn==0.19.1
lightgbm==2.1.0
xgboost==0.7
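My replies may be delayed until Monday; thank you for your patience.
If you notice anything in the code that should be fixed, I would appreciate you pointing that out as well.
Thank you in advance.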


1 answer

Best answer

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE

When eliminating features step by step (as RFE, linked above, does), it is common to remove the least relevant features first.
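As a rough illustration of that elimination procedure, here is a minimal sketch using scikit-learn's `RFE`. The data is synthetic (generated with `make_classification`, not Titanic), and the choice of `LogisticRegression` as the ranking estimator and of keeping two features are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 6 features, of which the first 2 are informative.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

# Drop one feature per round (step=1), always the one with the smallest
# absolute coefficient, until only 2 features remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=2, step=1)
selector.fit(X, y)

print("kept   :", selector.support_)   # boolean mask of the surviving features
print("ranking:", selector.ranking_)   # 1 = kept; larger = eliminated earlier
```

`ranking_` records the elimination order, so it can be read as a crude relevance ordering under the chosen estimator.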

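"Importance" is a quantity that is hard to define in a way that applies uniformly to all algorithms.

The algorithms are different, so I would expect the importances to differ as well.
It is unavoidable that things get more complicated once boosting is involved.
If you think of boosting as an algorithm that also manages to predict rare cases well,
then in order to settle those cases it will tend to focus on features other than the globally dominant ones.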
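One concrete way the definitions can diverge: as far as I know, `feature_importances_` in xgboost 0.7 and lightgbm 2.1 is based by default on how often a feature is used in a split, whereas scikit-learn's tree ensembles report impurity decrease (gain). A binary feature such as Sex can only be split on once along any path, so a split count tends to favor continuous features such as Age and Fare. The sketch below is only an illustration under assumptions: it uses synthetic data (not Titanic), made-up effect sizes, and scikit-learn's `GradientBoostingClassifier` as a stand-in for the boosting libraries.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def make_toy_data(n=2000, seed=0):
    """Synthetic, Titanic-like data (assumption): Sex dominates the label."""
    rng = np.random.RandomState(seed)
    sex = rng.randint(0, 2, size=n)        # binary feature
    age = rng.uniform(0, 80, size=n)       # continuous, weak effect
    fare = rng.exponential(30, size=n)     # continuous, pure noise
    p = np.where(sex == 1, 0.75, 0.20) + 0.10 * (age < 15)
    y = (rng.uniform(size=n) < p).astype(int)
    return np.column_stack([sex, age, fare]), y


def split_and_gain_shares(model, n_features):
    """Return (share of splits per feature, impurity-based importance)."""
    counts = np.zeros(n_features)
    for stage in model.estimators_:        # one row of trees per boosting round
        for tree in stage:
            used = tree.tree_.feature      # leaf nodes are marked with negative ids
            counts += np.bincount(used[used >= 0], minlength=n_features)
    return counts / counts.sum(), model.feature_importances_


X, y = make_toy_data()
model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(X, y)

split_share, gain_share = split_and_gain_shares(model, X.shape[1])
for name, s, g in zip(["Sex", "Age", "Fare"], split_share, gain_share):
    print(f"{name:5s} split share={s:.2f}  gain share={g:.2f}")
```

On data like this, Sex should take a much smaller share of the splits than of the gain, which is the same pattern as in the question. For the real libraries, it is worth checking which importance type each one reports before comparing the numbers.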

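I think a good strategy is to do the feature selection with a simple model, and to use a complex model when it comes to getting the accuracy.

The links below are in English, but they contain a lot of useful discussion.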
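A minimal sketch of that two-stage strategy, assuming scikit-learn 0.20 or later. The specific choices here are mine and only illustrative: logistic regression as the simple selector, keeping the top 3 features, gradient boosting as the complex model, and synthetic data in place of Titanic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0,
)

pipeline = Pipeline([
    # Stage 1: a simple linear model ranks features by |coefficient|;
    # threshold=-inf together with max_features=3 keeps exactly the top 3.
    ("select", SelectFromModel(LogisticRegression(max_iter=1000),
                               threshold=-np.inf, max_features=3)),
    # Stage 2: a more complex model is trained on the selected features only.
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Selection is refit inside every fold, so the held-out fold never
# influences which features are kept.
scores = cross_val_score(pipeline, X, y, scoring="f1", cv=5)
print("mean F1: {:.3f}".format(scores.mean()))
```

Keeping both stages in one `Pipeline` also means the whole thing could be passed to `GridSearchCV` in the same way as the estimators in the question.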

https://stats.stackexchange.com/questions/279730/why-gradient-boosting-random-forest-generate-unstable-feature-importance

https://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined

https://www.fabienplisson.com/choosing-right-features/

https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

Posted 2018/02/02 14:35