Background
・A problem came up while I was preprocessing data for machine learning with Python's pandas
・I tried to get the mean of each column with pandas.DataFrame.mean()
Problem
・For some columns the result was "nan" or "0.0" instead of the mean
(Note: the DataFrame has 458913 rows and 179 columns)
What I want to know
・Why pandas.DataFrame.mean() returns "nan" or "0.0", and how to get the correct result
Data
Dataset: Kaggle - American Express - Default Prediction
https://www.kaggle.com/datasets/munumbutt/amexfeather
Code I ran
python
import gc

import pandas as pd

## Load the data
for data in ["test", "train"]:
    df = pd.read_feather(f'../input/amexfeather/{data}_data.ftr')
    # Keep only the last row for each customer
    df = df.groupby('customer_ID').tail(1).set_index('customer_ID')
    if data == "test":
        df_test = df
    else:
        df_train = df

del df
gc.collect()

categorical = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

df_train.drop(categorical, axis="columns", inplace=True)
df_test.drop(categorical, axis="columns", inplace=True)

## Where the problem occurs
df_train.mean()
Output
python
df_train.mean()

P_2       NaN
D_39      NaN
B_1       0.000000
B_2       NaN
R_1       0.000000
S_3       NaN
D_41      0.000000
B_3       NaN
D_42      0.177979
D_43      0.000000
D_44      0.000000
B_4       NaN
D_45      NaN
B_5       0.000000
R_2       0.000000
D_46      NaN
D_47      NaN
D_48      NaN
D_49      0.191162
B_6       NaN
B_7       NaN
B_8       NaN
D_50      0.000000
D_51      NaN
B_9       NaN
R_3       0.000000
D_52      NaN
P_3       NaN
B_10      NaN
D_53      0.000000
S_5       0.000000
B_11      0.000000
S_6       NaN
D_54      NaN
R_4       0.000000
S_7       NaN
B_12      0.000000
S_8       NaN
D_55      NaN
D_56      0.000000
B_13      0.000000
R_5       0.000000
D_58      NaN
S_9       0.000000
B_14      0.000000
D_59      NaN
D_60      NaN
D_61      NaN
B_15      0.000000
S_11      NaN
D_62      NaN
D_65      0.000000
B_16      NaN
B_17      NaN
B_18      NaN
B_19      NaN
B_20      NaN
S_12      NaN
R_6       0.000000
S_13      NaN
B_21      0.000000
D_69      NaN
B_22      0.000000
D_70      0.000000
D_71      0.000000
D_72      0.000000
S_15      NaN
B_23      NaN
D_73      0.170654
P_4       0.000000
D_74      NaN
D_75      NaN
D_76      0.143066
B_24      0.000000
R_7       NaN
D_77      0.000000
B_25      0.000000
B_26      0.000000
D_78      0.000000
D_79      0.000000
R_8       0.000000
R_9       0.252930
S_16      0.000000
D_80      0.000000
R_10      0.000000
R_11      0.000000
B_27      0.000000
D_81      0.000000
D_82      0.000000
S_17      0.000000
R_12      NaN
B_28      NaN
R_13      0.000000
D_83      0.000000
R_14      NaN
R_15      0.000000
D_84      0.000000
R_16      0.000000
B_29      0.046021
S_18      0.000000
D_86      0.000000
D_87      1.000000
R_17      0.000000
R_18      0.000000
D_88      0.208130
B_31      NaN
S_19      0.000000
R_19      0.000000
B_32      0.000000
S_20      0.000000
R_20      0.000000
R_21      0.000000
B_33      NaN
D_89      0.000000
R_22      0.000000
R_23      0.000000
D_91      0.000000
D_92      0.000000
D_93      0.000000
D_94      0.000000
R_24      0.000000
R_25      0.000000
D_96      0.000000
S_22      NaN
S_23      NaN
S_24      NaN
S_25      NaN
S_26      0.000000
D_102     NaN
D_103     NaN
D_104     NaN
D_105     NaN
D_106     0.222290
D_107     NaN
B_36      0.000000
B_37      0.000000
R_26      0.087769
R_27      NaN
D_108     0.072083
D_109     0.000000
D_110     0.746582
D_111     0.886230
B_39      0.320068
D_112     NaN
B_40      NaN
S_27      NaN
D_113     NaN
D_115     NaN
D_118     NaN
D_119     NaN
D_121     NaN
D_122     NaN
D_123     0.000000
D_124     NaN
D_125     0.000000
D_127     0.000000
D_128     NaN
D_129     NaN
B_41      0.000000
B_42      0.110535
D_130     NaN
D_131     0.000000
D_132     0.209473
D_133     0.000000
R_28      0.000000
D_134     0.341553
D_135     0.029068
D_136     0.246826
D_137     0.014122
D_138     0.158936
D_139     NaN
D_140     0.000000
D_141     NaN
D_142     0.000000
D_143     NaN
D_144     0.000000
D_145     0.000000
target    0.258934
dtype: float64
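What I tried
・Filling all NaN values with an arbitrary value using fillna() and then calling mean() ⇒ the result did not change
・Computing the mean with mean(skipna=True) ⇒ the result did not change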
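
One thing that might be worth checking alongside the two attempts above is the dtype of the columns. The sketch below assumes, without any confirmation from this post, that the feather file stores the numeric columns as float16, whose largest finite value is 65504. Under that assumption, a row count of 458913 (or a large column sum) cannot be represented in the column's own dtype, which could plausibly lead to 0.0 or NaN. The frame `demo`, its column `x`, and `n_rows` are hypothetical stand-ins and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train: a single float16 column with more
# rows than float16 can represent as a finite number.
n_rows = 100_000
demo = pd.DataFrame({"x": np.full(n_rows, 0.5, dtype=np.float16)})

# Step 1: inspect the dtypes the frame actually holds.
print(demo.dtypes)

# Step 2: compare the row count with the largest finite float16 value.
f16_max = float(np.finfo(np.float16).max)
print(f"rows={n_rows}, float16 max={f16_max}")

# Step 3: the mean computed in the column's own dtype. The exact value
# depends on the pandas version, so it is only printed for comparison.
print(demo["x"].mean())

# Step 4: cast to a wider float type before aggregating.
mean_wide = demo.astype("float64").mean()
print(mean_wide)
```

If df_train.dtypes does show float16, the same idea could be tried on the real data with something like df_train.select_dtypes("float16").astype("float64").mean(). If the dtypes are already float32 or float64, this assumption does not apply and the cause lies elsewhere.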
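Additional information (framework/tool versions, etc.)
All columns other than the S_2 date column have already been converted to numeric types with astype
All of the code above was run in a Kaggle notebook environment (as of September 22, 2022)
I would be very grateful for any advice.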
