In ensemble learning for machine learning, where in the training process does OOF (out of fold) arise?

I am studying ways to improve the accuracy of machine learning methods.

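While studying Ensemble Learning, which combines the outputs of multiple classifiers,

I learned that **OOF (Out of Fold)** refers to the data that is not used for training when the data is split, for example by k-Fold.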

However, the hold-out method and K-fold cross-validation both split the data "without overlap", so where does the out-of-fold data come from?

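Is it correct to understand that model.fit does not train on all of the data it is given, so that, much like Dropout in deep learning, some part goes unused while the model is built, and that this unused part is what is called OOF?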

I have been trying to find out where OOF arises, but I could not work it out. I apologize for the vague question.


1 answer

Self-resolved

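I found a program on Kaggle that outputs OOF predictions, so I am posting it here for reference.
What I was able to understand is that, just as a random forest has out-of-bag data, K-fold cross-validation, which splits the data into non-overlapping folds, also leaves some data outside the training set in each round, and that data is the out-of-fold data.
Running the program below on data such as the Titanic survival dataset gives a value of around 0.3, though I cannot tell whether that is high or low.
My knowledge, including the mathematics, is not yet up to the task, so I am sorry that I cannot go into detail.
I will look into it a little more.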
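As a rough sketch of where the out-of-fold data comes from: in K-fold cross-validation, each round trains on K-1 folds, and the one remaining fold is "out of fold" for the model trained in that round. Nothing is dropped inside `fit` itself. The example below assumes synthetic data from `make_regression` and a `Ridge` model, neither of which appears in the Kaggle script; they only keep the example small and self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (an assumption for illustration, not the Kaggle dataset).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_preds = np.zeros(len(y))                  # one prediction slot per row
times_held_out = np.zeros(len(y), dtype=int)  # how often each row was held out

for train_idx, valid_idx in kf.split(X):
    # The model only ever sees the rows in train_idx ...
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    # ... so the rows in valid_idx are "out of fold" for this model.
    oof_preds[valid_idx] = model.predict(X[valid_idx])
    times_held_out[valid_idx] += 1

# cross_val_predict performs the same loop internally.
reference = cross_val_predict(Ridge(alpha=1.0), X, y, cv=kf)

print("each row held out exactly once:", bool(np.all(times_held_out == 1)))
print("matches cross_val_predict:", bool(np.allclose(oof_preds, reference)))
```

Because the folds do not overlap, every row is held out exactly once, so the OOF predictions cover the whole training set even though no single model saw the row it predicts.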

Thank you to everyone who viewed this question.

python
"""
Use Scikit Learn's cross_val_predict to do an Out-of-Fold Cross validation as opposed
to averaging out the scores on each fold.
This **usually** tends to be more stable/reliable compared to within fold average.
This script works for all Scikit Learn models as well as the Scikit Learn APIs of
XGBoost, LightGBM and Keras.
"""
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Read Data
print("Reading Dataset...")
train = pd.read_csv("../input/train.csv")
target = np.array(train["target"])
target_log = np.log1p(target)  # Log transform target as the evaluation metric uses it
# Features start at the third column (the first two are presumably the ID and the target)
xtrain = np.array(train.iloc[:, 2:])
print("Shape of training data: {}".format(np.shape(xtrain)))

# Define Model
xgb_model = XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=70,
                         min_child_weight=100, subsample=1.0,
                         colsample_bytree=0.8, colsample_bylevel=0.8,
                         random_state=42, n_jobs=4)

# Make OOF predictions using 5 folds: each row is predicted by the one model
# that was trained on the other 4 folds and never saw that row
print("Cross Validating...")
oof_preds_log = cross_val_predict(xgb_model, xtrain, target_log, cv=5,
                                  n_jobs=1, method="predict")

# Calculate RMSLE (RMSE of Log(1+y)) over all OOF predictions at once
cv_rmsle = np.sqrt(mean_squared_error(target_log, oof_preds_log))
print("\nOOF RMSLE Score: {:.4f}".format(cv_rmsle))
Pasted from <https://www.kaggle.com/adarshchavakula/out-of-fold-oof-model-cross-validation>
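To judge whether an OOF score is high or low, one option is to compare it with a trivial baseline scored the same way. The sketch below also contrasts the pooled OOF score with the average of per-fold scores that the script's docstring mentions. It assumes synthetic data, a `Ridge` model, and a `DummyRegressor` baseline, none of which come from the Kaggle script; `pooled_oof_rmse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic data (an assumption for illustration, not the Kaggle dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def pooled_oof_rmse(estimator):
    """RMSE computed once over all out-of-fold predictions (hypothetical helper)."""
    preds = cross_val_predict(estimator, X, y, cv=cv)
    return float(np.sqrt(mean_squared_error(y, preds)))


# Pooled OOF score of the model, and of a baseline that always predicts
# the mean of its training folds.
model_rmse = pooled_oof_rmse(Ridge(alpha=1.0))
baseline_rmse = pooled_oof_rmse(DummyRegressor(strategy="mean"))

# Alternative: score each fold separately, then average the fold scores.
fold_mse = -cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                            scoring="neg_mean_squared_error")
fold_avg_rmse = float(np.mean(np.sqrt(fold_mse)))

print("pooled OOF RMSE (model):    {:.4f}".format(model_rmse))
print("pooled OOF RMSE (baseline): {:.4f}".format(baseline_rmse))
print("mean of per-fold RMSEs:     {:.4f}".format(fold_avg_rmse))
```

A model score close to the baseline suggests the model has learned little beyond the mean of the target. The pooled and per-fold numbers are usually close but not identical, since the square root is taken before averaging in one case and after pooling in the other.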