Extracting the rows that remain after removing duplicates with drop_duplicates

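I have a DataFrame df1 (7976 rows × 167 columns).

Running df2 = df1.drop_duplicates() gives df2 (6439 rows × 167 columns).

Some of the rows removed from df1 have indices that I have collected in index_list (12 indices). For each of those removed rows, I want the index of the row in df2 that it duplicates. Put another way: for every removed row whose index is in index_list, I want to know the index of the duplicate that survived because drop_duplicates keeps the first occurrence (keep='first').

Note that rows counted as duplicates match in the values of every column.

I ran the code below and got the following error.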

```python
get_index = []
for i in index_list:
    for j in df2.index:
        if df1.loc[i, :] == df2.loc[j, :]:
            get_index.append(j)

get_index
```

```
---------------------------------------------------------------------------
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

Thank you in advance.

2 answers

Best answer

I think this is what you are after.

```python
get_index = []
for i in index_list:
    for j in df2.index:
        if (df1.loc[i, :] == df2.loc[j, :]).all():
            get_index.append(j)
```
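
For what it's worth, the nested loop above compares every row in index_list against every row of df2, which gets slow as the frames grow. A possible alternative, sketched here on toy data rather than the asker's real frame, is to merge the removed rows against df2 on all columns. The column name kept_index and the sample values are made up for illustration, and the sketch assumes no existing column is already called kept_index.

```python
import pandas as pd

# Toy data standing in for the asker's df1 (the real one is 7976 x 167).
df1 = pd.DataFrame(
    {"a": [1, 2, 1, 3, 2, 1], "b": ["x", "y", "x", "z", "y", "x"]}
)
df2 = df1.drop_duplicates()  # keep="first" is the default

# Hypothetical list of indices that drop_duplicates removed from df1.
index_list = [2, 4, 5]

# Carry df2's index along as an ordinary column so it survives the merge.
# "kept_index" is an invented column name, not part of the original data.
kept = df2.assign(kept_index=df2.index)

# Join the removed rows to the surviving rows on every column.
matched = df1.loc[index_list].merge(kept, on=list(df1.columns), how="left")
get_index = matched["kept_index"].tolist()
print(get_index)  # [0, 1, 0]
```

Because df2 has no duplicate rows, each row from index_list matches at most one row of df2, so get_index comes out in the same order as index_list.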

Posted 2018/06/11 08:47

magichan

Total score: 15898

nouken

2018/06/11 10:25

Thank you for the answer. One question: what does all() do here? I looked it up but could not find an example of it being used this way.

magichan

2018/06/11 10:47

The expression df1.loc[i, :] == df2.loc[j, :] compares two DataFrame rows (each one a Series), so the result is a Series of bool values.

magichan

2018/06/11 10:48

Like this:

```python
import pandas as pd

print(pd.Series([1,2]) == pd.Series([1,2]))
# 0    True
# 1    True
# dtype: bool
print(pd.Series([1,2]) == pd.Series([1,3]))
# 0     True
# 1    False
# dtype: bool
```

magichan

2018/06/11 10:48

pandas.Series.all() is then applied so that the condition is True only when every element of that Series is True. http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.all.html
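
To make the point above concrete, here is a small sketch with made-up Series values. It shows the ambiguous-truth-value error, how .all() resolves it, and one caveat: == treats NaN as unequal to NaN, whereas drop_duplicates counts rows with NaN in the same positions as duplicates. If the real data contains NaN, Series.equals may be a safer comparison. Whether that applies to the asker's df1 is an assumption.

```python
import numpy as np
import pandas as pd

# Two made-up "rows" that are identical, including a NaN in the same position.
row_a = pd.Series([1.0, np.nan, 3.0])
row_b = pd.Series([1.0, np.nan, 3.0])

elementwise = row_a == row_b
print(elementwise.tolist())  # [True, False, True] because NaN == NaN is False

# Using the bool Series directly in an `if` raises the error from the question.
raised = False
try:
    if elementwise:
        pass
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# .all() collapses the Series to one bool, but the NaN makes it False here.
print(elementwise.all())  # False

# Series.equals treats NaN in matching positions as equal.
print(row_a.equals(row_b))  # True
```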

nouken

2018/06/11 11:12

I see! Thank you for the very clear explanation!

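```python
import numpy as np
import pandas as pd

# Build a small frame with one column of random integers in [0, 5),
# so 10 rows are guaranteed to contain duplicates.
df = pd.DataFrame()
df['a'] = np.random.randint(5, size=10)
# Keep only the first occurrence of each value.
df2 = df.drop_duplicates(keep='first')
print(df)
print(df2.index)  # indices of the rows that were kept
print(df2.sort_values('a').index)  # the same indices, ordered by value
```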
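
Building on the toy frame above, one possible way to see which kept index each removed index corresponds to is to group the index labels by value and take the first label in each group. This is only a sketch: it uses a seeded generator (np.random.default_rng) in place of np.random.randint so the run is repeatable, and the names removed and first_label are invented here.

```python
import numpy as np
import pandas as pd

# Seeded generator for repeatability (an assumption, not in the original code).
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(5, size=10)})
df2 = df.drop_duplicates(keep="first")

# Index labels that drop_duplicates removed.
removed = df.index[df.duplicated(keep="first")]

# For every row, the index label of the first row holding the same value.
first_label = df.index.to_series().groupby(df["a"]).transform("first")

# Map each removed index to the index that survived in df2.
print(first_label.loc[removed].to_dict())
```

With more than one column, the grouping key would need to cover every column rather than just 'a'.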