Fast lookup by value in a Python dictionary

Question

With the dictionary structure below, I now need to search by `value`, and I am looking for a good way to do it.
I would appreciate any advice on a fast method.

The dictionary uses strings as `key`s and lists of strings as `value`s, and it holds tens of thousands of records.

Note that `list[0]` is always equal to the key.
```python
test_dict = {
    "A": ["A", "B", "C"],
    "B": ["B", "C", "D"],
    # ... (entries omitted)
    "X": ["X", "Y", "Z"],
}
```

Until now, I have only looked things up in this dictionary by `key`.

```python
>>> print(test_dict["A"])
['A', 'B', 'C']
```

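However, I now need to retrieve the lists whose `list[1]` is `"B"`.

Done naively, this looks like the following, which raises performance concerns, so I am looking for a better approach.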

```python
index = 1     # position in the list to compare
target = "B"  # value to look for

for key, value in test_dict.items():
    if target == value[index]:
        print(value)  # ['A', 'B', 'C']
```

If there is a good way to do this, please let me know.

Accepted Answer

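What you should do depends on how many times you search.

If you search only once, you have to accept the O(N) scan.

If you search many times, it is faster to first build a dictionary for that purpose: building it costs O(N) once, after which each lookup is O(1) on average.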

---

A version that avoids collisions, for the case where several lists share the same value at index 1:

```python
test_dict = {
    "A":["A","B","C"],
    "B":["B","C","D"],
    "X":["X","Y","Z"],
}

test_dict_1 = {}
for k, v in test_dict.items():
    if v[1] in test_dict_1:
        test_dict_1[v[1]].append(k)
    else:
        test_dict_1[v[1]] = [k]

print([test_dict[k] for k in test_dict_1['B']])
```
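
The same idea can be written more compactly with `collections.defaultdict`. The following is a minimal sketch, not part of the original answer; the helper name `build_index` and its `index` parameter are hypothetical, introduced so that the list position to search on can be chosen.

```python
from collections import defaultdict

test_dict = {
    "A": ["A", "B", "C"],
    "B": ["B", "C", "D"],
    "X": ["X", "Y", "Z"],
}


def build_index(data, index):
    """Map each value found at position `index` to the keys of the lists holding it.

    `build_index` is a hypothetical helper, not from the original answer.
    """
    reverse = defaultdict(list)
    for key, values in data.items():
        reverse[values[index]].append(key)
    return reverse


# Build once: O(N). Each later lookup is O(1) on average.
index_1 = build_index(test_dict, 1)

# .get() avoids inserting an empty list into the defaultdict for a missing target.
matches = [test_dict[k] for k in index_1.get("B", [])]
print(matches)  # [['A', 'B', 'C']]
```

The reverse index is a snapshot, so it has to be rebuilt or updated whenever `test_dict` changes.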

Answer

There is little room for improvement using plain Python alone, so I will answer with an approach based on the pandas DataFrame.

```Python
import pandas as pd
test_dict = {
     "A":["A","B","C"],
     "B":["B","C","D"],
     "X":["X","Y","Z"]
}
df = pd.DataFrame.from_dict(test_dict, orient="index")
print(df)

#    0  1  2
# A  A  B  C
# B  B  C  D
# X  X  Y  Z
```

This moves the existing data into a DataFrame.
From here, you can do the following:
```Python
index = 1     # position in the list to compare
target = "B"  # value to look for

result = df[df[index] == target]
result = result.values.tolist()  # this line can be commented out; result then stays a DataFrame
print(result)
# [['A', 'B', 'C']]
```

The result is a list of lists because more than one record may match.
However, access by `key` takes slightly more effort, so depending on the codebase this may increase the amount of refactoring required.

```Python
key = "X"
value = df.loc[key].tolist()
print(value)
# ['X', 'Y', 'Z']
```
Hope this helps.

--------
Addendum
I measured the speed. The bottom line: pandas was slower on average, even for a single search.

Generate some arbitrary data:
```Python
import pandas as pd
from itertools import product
temp = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
test_dict = {s+t+u: [s+t+u+"0", s+t+u+"1", s+t+u+"2"] for s, t, u in product(temp, repeat=3)}
df = pd.DataFrame.from_dict(test_dict, orient="index")
print("head")
print(df.head())
print("tail")
print(df.tail())
print("record count", len(df))

# head
#         0     1     2
# aaa  aaa0  aaa1  aaa2
# aab  aab0  aab1  aab2
# aac  aac0  aac1  aac2
# aad  aad0  aad1  aad2
# aae  aae0  aae1  aae2
# tail
#         0     1     2
# ZZV  ZZV0  ZZV1  ZZV2
# ZZW  ZZW0  ZZW1  ZZW2
# ZZX  ZZX0  ZZX1  ZZX2
# ZZY  ZZY0  ZZY1  ZZY2
# ZZZ  ZZZ0  ZZZ1  ZZZ2
# record count 140608
```

Compare the original method and pandas on this data:
```Python
def get_value0(index, target):
    # Original method: linear scan that stops at the first match
    for key, value in test_dict.items():
        if target == value[index]:
            return value

def get_value1(index, target):
    # pandas-based method
    return df[df[index] == target].values.tolist()[0]

# Data to look for
index = 2        # position in the list to compare
target = "Gcw2"  # value to look for

# Check that both methods return the same result
assert get_value0(index, target) == get_value1(index, target)

# Original method (%timeit is an IPython magic command)
%timeit get_value0(index, target)
# 7.78 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas-based method
%timeit get_value1(index, target)
# 11.7 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

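Depending on the target, the original method may have to run the loop to the very end, so its performance varies. Even so, in the worst case it was only about 1 ms slower than pandas.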

Environment:
Python 3.6.4
pandas==0.22.0
MacBook Pro
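
The measurement above covers only the linear scan and pandas. As a rough sketch of how the prebuilt-dictionary approach from the accepted answer could be measured in the same way, the following uses the standard `timeit` module instead of IPython's `%timeit`. The data is a smaller, assumed variant of the data above, the names `scan`, `lookup`, and `reverse` are hypothetical, and no timings are quoted because they depend on the machine.

```python
import timeit
from itertools import product

# Same shape of data as above, with a smaller alphabet to keep the run short.
letters = "abcdefghijklmnopqrst"
test_dict = {
    s + t + u: [s + t + u + "0", s + t + u + "1", s + t + u + "2"]
    for s, t, u in product(letters, repeat=3)
}

index = 2
target = "ttt2"  # belongs to the last entry, so the scan visits every record


def scan(index, target):
    # Linear scan: O(N) per lookup.
    for value in test_dict.values():
        if value[index] == target:
            return value
    return None


# Reverse lookup table built once; the values at `index` are unique in this data.
reverse = {value[index]: value for value in test_dict.values()}


def lookup(target):
    # Dictionary lookup: O(1) on average.
    return reverse.get(target)


# Check that both approaches return the same result.
assert scan(index, target) == lookup(target)

n = 100
print("scan  :", timeit.timeit(lambda: scan(index, target), number=n) / n, "s per call")
print("lookup:", timeit.timeit(lambda: lookup(target), number=n) / n, "s per call")
```

The cost of building `reverse` is not included in the per-call figure, so the comparison only reflects the case where many lookups share one index.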

Answer

How about restructuring the data in advance like this?

```python
test_dict = {
    "A":["A","B","C"],
    "B":["B","C","D"],
    "X":["X","Y","Z"],
}
 
test_dict_1 = {v[1]: v for v in test_dict.values()}
print(test_dict_1)
```
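
One thing to watch with a plain dict comprehension: if two lists share the same value at index 1, the later one silently replaces the earlier one. The following small sketch illustrates this with made-up data (the `"Y"` entry is an assumption added only for the demonstration).

```python
# Made-up data: "A" and "Y" both have "B" at index 1.
test_dict = {
    "A": ["A", "B", "C"],
    "Y": ["Y", "B", "Z"],
}

# Plain comprehension: one list per value, later entries overwrite earlier ones.
last_wins = {v[1]: v for v in test_dict.values()}
print(last_wins["B"])  # ['Y', 'B', 'Z']

# Keeping every match needs a list per value, for example via setdefault.
all_matches = {}
for v in test_dict.values():
    all_matches.setdefault(v[1], []).append(v)
print(all_matches["B"])  # [['A', 'B', 'C'], ['Y', 'B', 'Z']]
```

If the values at the searched position are guaranteed to be unique, the one-line comprehension is sufficient.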
