How to use drop and logical operators in Python

Question

### Background / what I want to achieve

This is about manipulating a pandas DataFrame. When using `drop`, I want to use the logical operator `&` so that only the rows (outliers) that satisfy both the first and the second condition at the same time are deleted. No error occurs, but far more data is dropped than necessary.

### Relevant source code

https://qiita.com/katsu1110/items/a1c3185fec39e5629bcb

I wrote my program following the URL above.

    # The relevant program
    Xmat = X_train
    Xmat['SalePrice'] = y_train
    Xmat = Xmat.drop(Xmat[(Xmat['TotalSF']>5) & (Xmat['SalePrice']<12.5)].index)
    Xmat = Xmat.drop(Xmat[(Xmat['GrLivArea']>5) & (Xmat['SalePrice']<13)].index)
    y_train = Xmat['SalePrice']
    X_train = Xmat.drop(['SalePrice'], axis=1)

The program above drops more data than necessary.

### What I tried

I checked `Xmat.shape`:

- before dropping: (1460,31)
- after the first drop: (172,32)
- after the second drop: (15,32)

Rows other than the outliers are clearly being deleted.

### Additional information (framework/tool versions, etc.)

Windows 10, Anaconda, and Atom, all updated to the latest version.

Here is my own output for step 7) "Examine the relationship between each feature and the target".

![Image description](863580f232c59fc79a8096f2bb1209c8.png)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv(r'C:\Users ingo\Documents\Python\House Prices\train.csv')
test = pd.read_csv(r'C:\Users ingo\Documents\Python\House Prices\test.csv')
print(train.dtypes)
train.shape
test.shape

# Label-encode every object (string) column using the categories of train and test together
from sklearn.preprocessing import LabelEncoder
for i in range(train.shape[1]):
    if train.iloc[:,i].dtypes == object:
        lbl = LabelEncoder()
        lbl.fit(list(train.iloc[:,i].values) + list(test.iloc[:,i].values))
        train.iloc[:,i] = lbl.transform(list(train.iloc[:,i].values))
        test.iloc[:,i] = lbl.transform(list(test.iloc[:,i].values))

# Visualize missing values
import missingno as msno
train.head(5)
msno.matrix(df=train, figsize=(20,14), color=(0.5,0,0))

train_ID = train['Id']
test_ID = test['Id']
y_train = train['SalePrice']
X_train = train.drop(['Id','SalePrice'], axis=1)
X_test = test.drop('Id', axis=1)

# Combine train and test, drop sparse columns, fill remaining gaps with the median
Xmat = pd.concat([X_train, X_test])
Xmat = Xmat.drop(['LotFrontage','MasVnrArea','GarageYrBlt'], axis=1)
Xmat = Xmat.fillna(Xmat.median())
Xmat['TotalSF'] = Xmat['TotalBsmtSF'] + Xmat['1stFlrSF'] + Xmat['2ndFlrSF']
X_train = Xmat[0:1460]
X_test = Xmat[1460:2920]

# Distribution of the target before and after the log transform
ax = sns.distplot(y_train)
plt.show()
y_train = np.log(y_train)
ax = sns.distplot(y_train)
plt.show()

# Rank features by random forest importance
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=80, max_features='auto')
rf.fit(X_train, y_train)
print('Training done using Random Forest')
ranking = np.argsort(-rf.feature_importances_)
f, ax = plt.subplots(figsize=(11, 9))
sns.barplot(x=rf.feature_importances_[ranking], y=X_train.columns.values[ranking], orient='h')
ax.set_xlabel("feature importance")
plt.tight_layout()
plt.show()

# Keep the 30 most important features and add an interaction term
X_train = X_train.iloc[:,ranking[:30]]
X_test = X_test.iloc[:,ranking[:30]]
X_train["Interaction"] = X_train["TotalSF"]*X_train["OverallQual"]
X_test["Interaction"] = X_test["TotalSF"]*X_test["OverallQual"]

# Plot each feature against the target
fig = plt.figure(figsize=(12,7))
for i in np.arange(30):
    ax = fig.add_subplot(5,6,i+1)
    sns.regplot(x=X_train.iloc[:,i], y=y_train)
plt.tight_layout()
plt.show()

# Outlier removal (the step that drops too many rows)
Xmat = X_train
Xmat['SalePrice'] = y_train
Xmat = Xmat.drop(Xmat[(Xmat['TotalSF']>5) & (Xmat['SalePrice']<12.5)].index)
Xmat = Xmat.drop(Xmat[(Xmat['GrLivArea']>5) & (Xmat['SalePrice']<13)].index)
y_train = Xmat['SalePrice']
X_train = Xmat.drop(['SalePrice'], axis=1)
```

Accepted Answer

I suspect something about your data differs from the article's (for example, an earlier step was done differently or contains a mistake).
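One way to check whether the data is what the conditions expect is to count how many rows each condition matches before calling `drop`. The following is only a minimal sketch: the DataFrame is a small synthetic stand-in with made-up values, not the House Prices data, and the variable names `cond_a`, `cond_b`, and `both` are introduced here for illustration.

```python
import pandas as pd

# Synthetic stand-in for Xmat (hypothetical values, not the real dataset).
Xmat = pd.DataFrame({
    "TotalSF": [1500.0, 2400.0, 7800.0, 3100.0],   # raw square footage
    "SalePrice": [11.8, 12.3, 12.1, 12.9],         # log-transformed prices
})

cond_a = Xmat["TotalSF"] > 5
cond_b = Xmat["SalePrice"] < 12.5
both = cond_a & cond_b  # element-wise AND of the two boolean Series

# Count the rows matched by each condition and by their conjunction.
counts = {
    "TotalSF>5": int(cond_a.sum()),
    "SalePrice<12.5": int(cond_b.sum()),
    "both": int(both.sum()),
}
print(counts)

# Drop exactly the rows where both conditions hold.
remaining = Xmat.drop(Xmat[both].index)
print(remaining.shape)
```

If one of the conditions matches almost every row, `&` is doing its job and the threshold simply does not fit the scale of that column in your data.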

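The article plots the data in **7) Examine the relationship between each feature and the target**, so compare the article's plots carefully with your own. I assume the outliers were identified from the second and fourth plots from the left in the top row.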
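When comparing the plots, the axis ranges are worth a look. If the article's feature axes run over small values (roughly single digits) while yours show raw square footage, a threshold such as `> 5` would mean something very different in the two cases. The sketch below illustrates this under an assumption I have not verified against the article: that the features there were rescaled (for example z-scored) before the thresholds were chosen. The data is synthetic and the names `raw_hits`, `z`, and `z_hits` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, deterministic data (hypothetical, not the House Prices data):
# 199 ordinary houses plus one obvious outlier.
total_sf = np.append(np.linspace(1000.0, 4000.0, 199), 11000.0)
df = pd.DataFrame({"TotalSF": total_sf})

# On the raw scale, "> 5" matches every row.
raw_hits = int((df["TotalSF"] > 5).sum())

# After z-scoring (assumed preprocessing), "> 5" means
# "more than 5 standard deviations above the mean".
z = (df["TotalSF"] - df["TotalSF"].mean()) / df["TotalSF"].std()
z_hits = int((z > 5).sum())

print(raw_hits, z_hits)
```

If your columns are on the raw scale, either apply the same preprocessing as the article before this step or pick thresholds from your own plots.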
