Thanks to hayataka2049, all of the 5,000+ DataConversionWarning messages are gone! Thank you (*≧∀≦)

Previous question: [kaggle x Titanic] DataConversionWarning on a pale pink background

Development environment

Python 3.6.5
Jupyter Notebook
Windows 7

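What I'm stuck on

I copied a kernel for the Kaggle Titanic challenge by hand and got as far as submitting it. However, I pushed ahead while completely ignoring the warnings, so now I'm going back to deal with them.
Kernel I copied: A Data Science Framework: To Achieve 99% Accuracy

This time I'm stuck on UserWarning and ConvergenceWarning.
Relevant code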

```python
import time

from sklearn import model_selection

# vote_est, cv_split, data1, data1_x_bin and Target are defined in earlier cells of the kernel.
# The estimator names in the comments below are inferred from the parameter names.

# WARNING: Running is very computationally intensive and time expensive.
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]

# One parameter grid per estimator, in the same order as vote_est.
grid_param = [
    # AdaBoostClassifier (inferred)
    [{
        'n_estimators': grid_n_estimator,
        'learning_rate': grid_learn,
        'random_state': grid_seed
    }],

    # BaggingClassifier (inferred)
    [{
        'n_estimators': grid_n_estimator,
        'max_samples': grid_ratio,
        'random_state': grid_seed
    }],

    # ExtraTreesClassifier (inferred)
    [{
        'n_estimators': grid_n_estimator,
        'criterion': grid_criterion,
        'max_depth': grid_max_depth,
        'random_state': grid_seed
    }],

    # GradientBoostingClassifier (inferred)
    [{
        'learning_rate': [.05],
        'n_estimators': [300],
        'max_depth': grid_max_depth,
        'random_state': grid_seed
    }],

    # RandomForestClassifier (inferred); oob_score=True is what triggers the OOB UserWarning
    [{
        'n_estimators': grid_n_estimator,
        'criterion': grid_criterion,
        'max_depth': grid_max_depth,
        'oob_score': [True],
        'random_state': grid_seed
    }],

    # GaussianProcessClassifier (inferred)
    [{
        'max_iter_predict': grid_n_estimator,
        'random_state': grid_seed
    }],

    # Logistic regression (inferred); the sag/saga solvers are where ConvergenceWarning comes from
    [{
        'fit_intercept': grid_bool,
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
        'random_state': grid_seed
    }],

    # BernoulliNB (inferred)
    [{
        'alpha': grid_ratio,
    }],

    # GaussianNB (inferred): nothing to tune
    [{}],

    # KNeighborsClassifier (inferred)
    [{
        'n_neighbors': [1, 2, 3, 4, 5, 6, 7],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }],

    # SVC (inferred)
    [{
        'C': [1, 2, 3, 4, 5],
        'gamma': grid_ratio,
        'decision_function_shape': ['ovo', 'ovr'],
        'probability': [True],
        'random_state': grid_seed
    }],

    # XGBClassifier (inferred)
    [{
        'learning_rate': grid_learn,
        'max_depth': [1, 2, 4, 6, 8, 10],
        'n_estimators': grid_n_estimator,
        'seed': grid_seed
    }]
]


start_total = time.perf_counter()
for clf, param in zip(vote_est, grid_param):
    start = time.perf_counter()
    # Search this estimator's grid, scoring each candidate by cross-validated ROC AUC.
    best_search = model_selection.GridSearchCV(estimator=clf[1], param_grid=param, cv=cv_split, scoring='roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target].values.ravel())
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {}  with a runtime of {:.2f} seconds'.format(clf[1].__class__.__name__, best_param, run))
    # Write the best parameters back onto the estimator held in vote_est.
    clf[1].set_params(**best_param)

run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total / 60))

print('-' * 10)
```

Warning messages
The same messages repeat, so this is only an excerpt, but warnings like these keep appearing endlessly.

C:\Users\ayumusato\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:453: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "
C:\Users\ayumusato\Anaconda3\lib\site-packages\sklearn\linear_model\sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)

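What I found: UserWarning: Some inputs do not have OOB scores.

Some inputs have no OOB score.
OOB (Out-Of-Bag): the data that was not selected. It is used to evaluate a random forest's error.
→ Is an OOB score required? How can I deal with this warning?

What I found: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge

max_iter was reached and coef_ did not... concentrate???
ConvergenceWarning: a warning for capturing convergence problems.
How can I resolve this convergence problem(?)

Thank you in advance orz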


2 answers

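① To get OOB, try looking into StratifiedKFold.
Or perhaps it's the random forest → http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html

② "Converge" means that the iterations settle on a stable value.
The warning is saying that the coefficients were not uniquely determined.
Since continuing to iterate may keep producing different numbers, this means the coefficients may be meaningless, or may even cause harm.
That said, increasing max_iter first might make it converge...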
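To make ② concrete, here is a minimal sketch of how the iteration budget relates to the warning. It runs on synthetic data from make_classification rather than the Titanic features, and it assumes the warning in the question comes from a logistic regression with solver='sag'. The helper name warns_about_convergence is made up for this illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def warns_about_convergence(max_iter):
    """Fit with the given iteration budget and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        LogisticRegression(solver="sag", max_iter=max_iter, random_state=0).fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


# A single pass over the data uses up the whole budget before the weights settle.
print("max_iter=1    ->", warns_about_convergence(1))
# A much larger budget gives the solver room to stop on its own.
print("max_iter=5000 ->", warns_about_convergence(5000))
```

If the second call still reports a warning on your own data, that points toward the scaling issue discussed further down the thread rather than the budget alone.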

Posted 2018/09/17 06:20

Edited 2018/09/17 06:26

mkgrei

Total score 8562

Yukiya025

2018/09/17 10:19

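mkgrei, thank you (*≧∀≦)

① StratifiedKFold provides train/test indices for splitting the data into train/test sets. Use StratifiedKFold when you want the split to preserve the class proportions. K-fold cross-validation splits the samples into K groups; typically one group is used as the test set and the remaining K − 1 groups as the training set. Cross-validation runs K times, using each of the K groups as the test set once, and the K results are averaged into a single estimate.
...So to get OOB, should I split the OOB target into K groups and validate?

② > The warning is saying that the coefficients were not uniquely determined.
A coefficient is a constant that multiplies a product of one or more variables. For example, in "y = 7x + 8" the coefficient is 7. max_iter (maximum number of iterations) is a parameter of sklearn models.
...Which part of the code does the coefficient refer to in this case?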
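A small sketch may help with both questions here. Part 1 shows what StratifiedKFold does with class proportions. Note that K-fold splitting is a separate mechanism from OOB, which comes from the bootstrap sampling inside a random forest, so splitting into K groups does not by itself produce OOB scores. Part 2 shows where the coefficients live: they are not written anywhere in the notebook code, they are the weights the model learns during fit and exposes as coef_. The data is synthetic, and LogisticRegression is an assumption about which estimator is involved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Part 1: StratifiedKFold keeps the class ratio in every fold.
y_demo = np.array([0] * 60 + [1] * 40)  # 60% class 0, 40% class 1
X_demo = np.zeros((100, 1))             # feature values do not matter for the split
fold_ratios = [
    y_demo[test_idx].mean()             # share of class 1 in this test fold
    for _, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo)
]
print(fold_ratios)

# Part 2: the "coefficients" are the weights learned by fit(), one per feature.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)
print(model.coef_.shape)   # one row of weights, one column per feature
print(model.intercept_)    # the constant term (the "8" in y = 7x + 8)
print(model.n_iter_)       # how many iterations the solver actually used
```

ConvergenceWarning means the solver was still changing the values in coef_ when it ran out of iterations.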


Best answer

UserWarning: Some inputs do not have OOB scores.

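This warning means the random forest could not compute the OOB error rate for some samples.

Please study how the OOB error rate works on your own.

'oob_score': [False], might be fine (as long as the OOB error rate isn't used anywhere else). The approach is simply not to compute it in the first place.

If that doesn't work, just ignore the warning. It does little real harm.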
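As a rough illustration of when this warning appears, the sketch below fits a RandomForestClassifier on synthetic data (not the Titanic set) and collects any warning that mentions OOB. The helper name oob_warnings is made up for this example. With very few trees, some samples end up inside every tree's bootstrap sample and so never get an out-of-bag prediction; with many trees that becomes extremely unlikely.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def oob_warnings(n_estimators, oob_score):
    """Fit a forest and return the text of any warning that mentions OOB."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        RandomForestClassifier(
            n_estimators=n_estimators, oob_score=oob_score, random_state=0
        ).fit(X, y)
    return [str(w.message) for w in caught if "OOB" in str(w.message)]


print(len(oob_warnings(2, True)))    # too few trees for every sample to be out-of-bag once
print(len(oob_warnings(300, True)))  # plenty of trees
print(len(oob_warnings(2, False)))   # OOB estimate not requested at all
```

In the question's grid, n_estimators starts at 10, which is small enough to trigger the warning on some fits. Since the search there is scored with cross-validated roc_auc, the OOB estimate is not what selects the best parameters.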

ConvergenceWarning: The max_iter was reached which means the coef_ did not converge

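This is probably LogisticRegression.

Ways to deal with non-convergence problems include:

Raise max_iter and run more iterations, hoping that it converges with enough of them.
Raise tol so that a somewhat rough solution still counts as converged.
Try adjusting other parameters in a direction that loosens the constraints.
The documentation appears to say that with the sag or saga solver, convergence is not reliable unless the features are scaled, so pay attention to that.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

If it still won't converge no matter what, there's nothing for it but to ignore the warning...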
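The scaling suggestion can be sketched as follows. This is one possible arrangement, not the kernel's own code: it assumes StandardScaler is an acceptable scaler for the features and uses synthetic data with one deliberately oversized column. Putting the scaler and the model in a pipeline means the scaling is learned from the training data only each time the pipeline is fitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature's scale, like a raw fare next to 0/1 flags

# Scale first, then fit with sag; max_iter and tol are the knobs from the list above.
scaled_sag = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, tol=1e-4, random_state=0),
)
scaled_sag.fit(X, y)

X_scaled = scaled_sag.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0).round(3))  # every column centred near 0
print(X_scaled.std(axis=0).round(3))   # every column with unit spread
print(scaled_sag.named_steps["logisticregression"].n_iter_)  # iterations actually used
```

If such a pipeline is passed to GridSearchCV as the estimator, the grid keys need the step name as a prefix, for example 'logisticregression__solver'.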

Posted 2018/09/17 13:44

hayataka2049

Total score 30939

Yukiya025

2018/09/18 01:19

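hayataka2049, good morning!

> 'oob_score': [False]
All the UserWarnings are gone! Thank you (*≧∀≦)
There doesn't seem to be any real harm, but so many warnings pour out that I have to scroll down and down and down (rest omitted), which is an eyesore (´；ω；｀) On top of that, this cell is one of the slowest to run of all my cells orz

> Raise max_iter and run more iterations, hoping that it converges with enough of them.
If I go with this, should I write it like the line below? Or something like max_iter = 1000?
best_search.fit(data1[data1_x_bin], data1[Target].values.ravel(), max_iter=self.max_iter)
Site I referred to: https://teratail.com/questions/84752

> With the sag or saga solver, convergence is not reliable unless the features are scaled
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
↑ This is the part! So "use the same scale" and "the data can be preprocessed with a scaler from sklearn.preprocessing" (but I don't know how to do that; I'll use it if max_iter and the other options don't work)...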
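On the question of where max_iter goes: it is a constructor parameter of the estimator, not an argument of fit, and self does not exist in a notebook cell, so the fit(..., max_iter=self.max_iter) line above would not work. A minimal sketch of one alternative is to add max_iter to the parameter grid so that GridSearchCV sets it on the estimator. The data here is synthetic, and the values 1000 and cv=3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# max_iter is listed like any other hyperparameter; a one-element list fixes its value.
param_grid = {
    "fit_intercept": [True, False],
    "solver": ["lbfgs", "liblinear"],
    "max_iter": [1000],
}

search = GridSearchCV(
    LogisticRegression(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)  # fit() only receives the data

print(search.best_params_)
print(search.best_estimator_.max_iter)
```

In the question's code, the equivalent change would be adding a 'max_iter' entry to the grid that belongs to the logistic regression estimator.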

hayataka2049

2018/09/18 12:04

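If you just want to get rid of the clutter, there are also ways to hide warnings by category, so look into that. It's meant for people who can judge for themselves whether a warning is safe to ignore, though.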
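Hiding by category can be sketched with the standard warnings module. The example below silences only ConvergenceWarning and the specific OOB message, and leaves every other warning visible. The warnings are raised by hand here so the effect is easy to see, and the function name shown_after_filtering is made up for this illustration; in a notebook, the two filterwarnings calls would simply go near the top.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning


def shown_after_filtering():
    """Return the messages that are still shown once the two targeted filters are installed."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Hide one category entirely.
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        # Hide one specific message (matched as a regex against the start of the text).
        warnings.filterwarnings(
            "ignore",
            message="Some inputs do not have OOB scores",
            category=UserWarning,
        )
        # Simulated warnings, standing in for what scikit-learn would emit.
        warnings.warn("the coef_ did not converge", ConvergenceWarning)
        warnings.warn("Some inputs do not have OOB scores.", UserWarning)
        warnings.warn("some other problem", UserWarning)
    return [str(w.message) for w in caught]


print(shown_after_filtering())
```

Compared with ignoring everything, unrelated warnings still reach you with this approach.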

Yukiya025

2018/09/19 02:06

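hayataka2049, thank you!

import warnings
warnings.filterwarnings('ignore')

With this, none of the warnings are displayed anymore (*≧∀≦)

> It's meant for people who can judge for themselves whether a warning is safe to ignore, though.
I can't judge that myself, but I don't have the skill to fix it either orz, so once I'm skilled enough to fix it, I'll come back to this question, remove the ignore line, and check the warnings ^^/ Thank you so much (*≧∀≦)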
