How can I resolve "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')"?

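I want to build a tool that periodically ingests data and analyzes it at fixed intervals, and I am putting together a prototype while researching how to do it.

What I have tried so far

Continuing from my previous question, and with the help I received there, I have been searching Google for similar cases and fixing errors one at a time, but I now get the following error message.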

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-bc82a3736ff0> in <module>
     95                                 max_depth=3,
     96                                 random_state=0)
---> 97 model3.fit(X_train, Y_train)
     98 
     99 print('Accuracy (train): {:.3f}'.format(model3.score(X_train, Y_train)))

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    814             sample_weight=sample_weight,
    815             check_input=check_input,
--> 816             X_idx_sorted=X_idx_sorted)
    817         return self
    818 

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    128         random_state = check_random_state(self.random_state)
    129         if check_input:
--> 130             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    131             y = check_array(y, ensure_2d=False, dtype=None)
    132             if issparse(X):

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Current code

# The part after the "Added 2/19" comment is the code added this time

import pandas as pd
# Load the data with pandas.ExcelFile
input_book = pd.ExcelFile('FIFA19_data.xlsx')

input_sheet_name = input_book.sheet_names
num_sheet = len(input_sheet_name)
print(input_sheet_name)
print("Number of sheets:", num_sheet)
input_sheet_df = input_book.parse(input_sheet_name[0])

# Exclude only the GK (goalkeeper) rows
input_sheet_df = input_sheet_df[input_sheet_df['Position'] != "GK"]
# Show only the first 10 rows
input_sheet_df.head(10)

import numpy as np

# Extract the columns to use
age = input_sheet_df.Age  # Age
overall = input_sheet_df.Overall  # Overall rating
wage = input_sheet_df.Wage  # Wage
PreferredFoot = input_sheet_df.PreferredFoot  # Preferred foot
Reputation = input_sheet_df.Reputation  # Reputation
least_contract = input_sheet_df.least_contract  # Remaining contract years
crossing = input_sheet_df.Crossing  # Crossing accuracy
Finishing = input_sheet_df.Finishing  # Finishing accuracy
heading = input_sheet_df.HeadingAccuracy  # Heading accuracy
ShortPassing = input_sheet_df.ShortPassing  # Short passing accuracy
Dribbling = input_sheet_df.Dribbling  # Dribbling accuracy
Curve = input_sheet_df.Curve  # Curve accuracy
FKAccuracy = input_sheet_df.FKAccuracy  # Free kick accuracy
LongPassing = input_sheet_df.LongPassing  # Long passing accuracy
BallControl = input_sheet_df.BallControl  # Ball control
Acceleration = input_sheet_df.Acceleration  # Acceleration
SprintSpeed = input_sheet_df.SprintSpeed  # Sprint speed
Agility = input_sheet_df.Agility  # Agility
Reactions = input_sheet_df.Reactions  # Reactions
Balance = input_sheet_df.Balance  # Balance
ShotPower = input_sheet_df.ShotPower  # Shot power
stamina = input_sheet_df.Stamina  # Stamina
Jumping = input_sheet_df.Jumping  # Jumping
Strength = input_sheet_df.Strength  # Strength
LongShots = input_sheet_df.LongShots  # Long shots
Aggression = input_sheet_df.Aggression  # Aggression
Interceptions = input_sheet_df.Interceptions  # Interceptions
Positioning = input_sheet_df.Positioning
Vision = input_sheet_df.Vision
Penalties = input_sheet_df.Penalties
Composure = input_sheet_df.Composure
Marking = input_sheet_df.Marking
StandingTackle = input_sheet_df.StandingTackle
SlidingTackle = input_sheet_df.SlidingTackle

# Specify the variables to use
equation_df2=pd.concat([wage, age, PreferredFoot, Reputation, least_contract, \
                        crossing, Finishing, heading, ShortPassing, Dribbling, Curve, FKAccuracy, \
                       LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, \
                       Balance, ShotPower, stamina, Jumping, Strength, LongShots, Aggression, \
                       Interceptions, Positioning, Vision, Penalties, Composure, Marking, \
                       StandingTackle, SlidingTackle], axis=1)

# Extract the dependent variable
wage2 = pd.DataFrame(equation_df2.Wage)
# Drop the dependent variable, leaving the explanatory variables
x_list2 = equation_df2.drop("Wage", axis=1)

from sklearn import preprocessing, linear_model
import sklearn
import seaborn as sns

# Prepare the data
# Standardize the data
sc = preprocessing.StandardScaler()
sc.fit(x_list2)

X = sc.transform(x_list2)

# Check the correlation coefficients

import matplotlib.pyplot as plt

plt.figure(figsize=(30, 24))
sns.heatmap(x_list2.pct_change().corr(), annot=True, cmap='Blues')

# Added 2/19

from sklearn import model_selection
# Split into training data and test data
# Use an 8:2 split (train:test)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, wage, test_size=0.2, random_state=0)

from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree to X_train and Y_train
model3 = DecisionTreeClassifier(criterion='entropy',
                                max_depth=3,
                                random_state=0)
model3.fit(X_train, Y_train)

print('Accuracy (train): {:.3f}'.format(model3.score(X_train, Y_train)))
print('Accuracy (test): {:.3f}'.format(model3.score(X_test, Y_test)))

Input data (Dropbox link)

https://www.dropbox.com/s/41lap8qzcxez33o/FIFA19_data.xlsx?dl=0

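Notes

While looking at this page on resolving a similar case,
https://qiita.com/twaka/items/eb3ff958f87ca1f4c971

# Drop the columns that contain NaN
x_list2 = x_list2.drop(x_list2.columns[np.isnan(x_list2).any()], axis=1)

I also tried the code above, but it did not resolve the problem. I would be grateful for any help.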


2 answers

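Per Google Translate, the message means:
ValueError: the input contains NaN, infinity, or a value too large for dtype('float32')

So you can see that the input values are invalid.
The next step is to find out which values are invalid and in what way,
then work out why they became invalid and what needs to change so that they are not.

You must not tweak the code after the fact without investigating the cause and then call it done just because it happened to work.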
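
As a minimal sketch of that investigation, assuming the features sit in a numeric pandas DataFrame, one could count the NaN and infinite entries per column before fitting. The small `df` below is made-up data, and `count_non_finite` is a hypothetical helper written for this illustration; it is not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd


def count_non_finite(df):
    """Count NaN and +/-inf values per numeric column (hypothetical helper)."""
    numeric = df.select_dtypes(include=[np.number])
    nan_count = numeric.isna().sum()
    inf_count = np.isinf(numeric).sum()
    return pd.DataFrame({"nan": nan_count, "inf": inf_count})


# Made-up data for illustration only; not the FIFA19 file from the question
df = pd.DataFrame({
    "Age": [21.0, 30.0, np.nan],
    "Crossing": [55.0, np.inf, 70.0],
    "Finishing": [60.0, 61.0, 62.0],
})

report = count_non_finite(df)
# Show only the columns that contain at least one problematic value
print(report[(report["nan"] > 0) | (report["inf"] > 0)])
```

Running the same kind of check on the array that is actually passed to `fit` (after scaling and splitting), and not only on the raw table, helps show at which step the invalid values enter.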

Posted 2021/02/19 03:46

y_waiwai

Total score 87749

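This question is almost two years old, but I happened to hit the same error, and it seems the only way is to patiently track down the rows that cause it. For example, I passed a restricted range of rows to fit like this and shifted the range little by little to see whether the error occurred.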

tmp = X_train[:100]  # Take the first 100 rows
tmp.to_excel("tmp.xlsx")  # Write them to Excel for inspection
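
The snippet above assumes `X_train` is a pandas DataFrame. In the question's code, `X_train` comes from `StandardScaler.transform`, which returns a NumPy array by default, so `to_excel` would not be available there. As a rough alternative to shifting a window by hand, the offending rows can be located directly with `np.isfinite`. The array below is made-up data standing in for `X_train`.

```python
import numpy as np

# Made-up array standing in for X_train (assumption: a 2-D float array)
X_train = np.array([
    [0.1, 1.2, -0.3],
    [np.nan, 0.5, 0.7],
    [0.4, np.inf, 0.2],
    [0.9, 0.8, 0.6],
])

# True for each row in which every value is finite (neither NaN nor +/-inf)
row_is_finite = np.isfinite(X_train).all(axis=1)

# Indices of the rows that contain at least one NaN or infinite value
bad_rows = np.where(~row_is_finite)[0]
print(bad_rows)  # [1 2]

# (row, column) positions of each problematic cell
bad_cells = np.argwhere(~np.isfinite(X_train))
print(bad_cells)
```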

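In my case, it turned out that I had split the data into training and test sets before converting to dummy variables, and the dummy variables that existed only in the test data had become NaN.
In the end, patient investigation really is the only way.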
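
One way such NaN values can arise, shown here only as an assumed reconstruction and not as the exact code from this answer, is to encode the two splits separately with `pd.get_dummies` and then align their columns. The `Foot` column and its values are made up for illustration.

```python
import pandas as pd

# Made-up data: the category "Both" appears only in the test split
train = pd.DataFrame({"Foot": ["Left", "Right", "Right"]})
test = pd.DataFrame({"Foot": ["Left", "Both"]})

# Encoding each split separately yields different sets of dummy columns
train_d = pd.get_dummies(train, dtype=float)
test_d = pd.get_dummies(test, dtype=float)

# Aligning on the union of columns fills the missing dummies with NaN
train_a, test_a = train_d.align(test_d, join="outer", axis=1)
print(train_a.isna().sum())

# Filling the missing dummy columns with 0 keeps both splits finite
train_f, test_f = train_d.align(test_d, join="outer", axis=1, fill_value=0)
print(train_f.isna().sum().sum(), test_f.isna().sum().sum())
```

Creating the dummy variables on the full table before splitting avoids the mismatch as well; which approach fits better depends on how the data arrives.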

Posted 2023/01/11 12:41

usugita_san

Total score 226
