Grouping nearby values in Python to reduce the number of data points

I want to take this array

[
{"r": 264.0, "theta": 0.034906585}#A
{"r": 254.0, "theta": 0.10471976}#A

{"r": 284.0, "theta": 1.6057029}#B
{"r": 309.0, "theta": 1.6057029}#B
{"r": 306.0, "theta": 1.6231562}#B

{"r": 67.0, "theta": 1.5882496}#C
{"r": 61.0, "theta": 1.6057029}#C
{"r": 72.0, "theta": 1.5882496}#C

{"r": 133.0, "theta": 1.6057029}#D
{"r": 142.0, "theta": 1.6057029}#D
{"r": 147.0, "theta": 1.5882496}#D
{"r": 131.0, "theta": 1.6057029}#D
{"r": 137.0, "theta": 1.5882496}#D

{"r": 132.0, "theta": 0.6981317}#E
{"r": 142.0, "theta": 0.6457718}#E
{"r": 144.0, "theta": 0.6632251}#E

{"r": -283.0, "theta": 2.5481806}#F
{"r": -292.0, "theta": 2.6179938}#F
{"r": -286.0, "theta": 2.6005406}#F
{"r": -289.0, "theta": 2.565634}#F
]

and output it grouped into six patterns, like this:

[
  {"r": 260, "theta": 0.05 }# pattern A
  {"r": 300, "theta": 1.6  }# pattern B
  {"r":  70, "theta": 1.6  }# pattern C
  {"r": 140, "theta": 1.6  }# pattern D
  {"r": 135, "theta": 0.65 }# pattern E
  {"r":-290, "theta": 2.6  }# pattern F
]

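#Conditions

The source data is not sorted by pattern.
The patterns are distinguished by r and θ, and there are six in total (the number of patterns does not change).
Within each pattern, r varies by roughly ±10 and θ by roughly ±0.1 (the spread should not be large enough to make the patterns hard to tell apart).
The number of elements in each pattern is random.

#Expected output

For both r and θ, I want to output a reasonably central value such as the mean or the median (I have no strong preference).

I have no idea how to approach this, so I would appreciate any guidance.
Thank you in advance.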

LouiS0616

2018/10/13 13:06

If you look into clustering, you may find solutions for similar situations.


2 answers

Let's try K-means clustering.

First, let's visualize the data.

python
import matplotlib.pyplot as plt
import pandas as pd

data = [
    {"r": 264.0, "theta": 0.034906585},
    {"r": 254.0, "theta": 0.10471976},
    {"r": 284.0, "theta": 1.6057029},
    {"r": 309.0, "theta": 1.6057029},
    {"r": 306.0, "theta": 1.6231562},
    {"r": 67.0, "theta": 1.5882496},
    {"r": 61.0, "theta": 1.6057029},
    {"r": 72.0, "theta": 1.5882496},
    {"r": 133.0, "theta": 1.6057029},
    {"r": 142.0, "theta": 1.6057029},
    {"r": 147.0, "theta": 1.5882496},
    {"r": 131.0, "theta": 1.6057029},
    {"r": 137.0, "theta": 1.5882496},
    {"r": 132.0, "theta": 0.6981317},
    {"r": 142.0, "theta": 0.6457718},
    {"r": 144.0, "theta": 0.6632251},
    {"r": -283.0, "theta": 2.5481806},
    {"r": -292.0, "theta": 2.6179938},
    {"r": -286.0, "theta": 2.6005406},
    {"r": -289.0, "theta": 2.565634}
]

# Build the DataFrame after `data` is defined; fix the column order explicitly
df = pd.DataFrame(data, columns=["r", "theta"])

# r on the x-axis, theta on the y-axis
plt.scatter(df["r"], df["theta"])
plt.xlabel("r")
plt.ylabel("theta")
plt.show()

The goal is to assign a label to each of these points.

Python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

scaler = MinMaxScaler()  # preprocessing so that the clustering works well
kmeans = KMeans(n_clusters=6)  # the number of clusters is known in this case

# Chain the preprocessing and the clustering together in a pipeline.
cls = Pipeline([("scaler", scaler), ("cluster", kmeans)])
C = cls.fit_predict(df.values)

# Check whether it worked
plt.scatter(df["r"], df["theta"], c=C)
plt.xlabel("r")
plt.ylabel("theta")
plt.show()
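To illustrate why the scaling step matters, here is a small sketch that uses only NumPy. It applies the per-column formula `(x - min) / (max - min)`, which is what `MinMaxScaler` computes with its default `feature_range=(0, 1)`. The variable names (`x`, `raw_range`, `x_scaled`) are chosen for this example only.

```python
import numpy as np

# (r, theta) pairs from the question
x = np.array([
    (264.0, 0.034906585), (254.0, 0.10471976),
    (284.0, 1.6057029), (309.0, 1.6057029), (306.0, 1.6231562),
    (67.0, 1.5882496), (61.0, 1.6057029), (72.0, 1.5882496),
    (133.0, 1.6057029), (142.0, 1.6057029), (147.0, 1.5882496),
    (131.0, 1.6057029), (137.0, 1.5882496),
    (132.0, 0.6981317), (142.0, 0.6457718), (144.0, 0.6632251),
    (-283.0, 2.5481806), (-292.0, 2.6179938),
    (-286.0, 2.6005406), (-289.0, 2.565634),
])

# Width of each column before scaling: r spans hundreds, theta only a few radians,
# so unscaled Euclidean distances would be dominated by r.
raw_range = x.max(axis=0) - x.min(axis=0)
print("raw range (r, theta):", raw_range)

# Min-max scaling: every column is mapped onto [0, 1]
x_scaled = (x - x.min(axis=0)) / raw_range
print("scaled min:", x_scaled.min(axis=0))
print("scaled max:", x_scaled.max(axis=0))
```

After scaling, a difference in theta counts as much as a difference in r, which is what lets k-means separate patterns such as D and E that have similar r but different theta.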

Let's also add the center of each cluster.

Python
# Convert the cluster centers back to the original (unscaled) units
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers = pd.DataFrame(centers, columns=["r", "theta"])
print(centers)

# Example output (label numbering can differ between runs):
#             r     theta
# 0 -287.500000  2.583087
# 1  138.000000  1.598722
# 2  139.333333  0.669043
# 3  299.666667  1.611521
# 4  259.000000  0.069813
# 5   66.666667  1.594067

plt.scatter(df["r"], df["theta"], c=C)
plt.scatter(centers["r"], centers["theta"], c="red")
plt.xlabel("r")
plt.ylabel("theta")
plt.show()
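The cluster centers above are per-cluster means. Since the question also mentions the median as an acceptable central value, here is a minimal sketch that takes the median of each cluster with a pandas `groupby`. It is self-contained rather than reusing `df` and `C`; the `n_init=10` and `random_state=0` arguments are assumptions added here only to make the run repeatable, and the `label` column name is made up for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# (r, theta) pairs from the question
rows = [
    (264.0, 0.034906585), (254.0, 0.10471976),
    (284.0, 1.6057029), (309.0, 1.6057029), (306.0, 1.6231562),
    (67.0, 1.5882496), (61.0, 1.6057029), (72.0, 1.5882496),
    (133.0, 1.6057029), (142.0, 1.6057029), (147.0, 1.5882496),
    (131.0, 1.6057029), (137.0, 1.5882496),
    (132.0, 0.6981317), (142.0, 0.6457718), (144.0, 0.6632251),
    (-283.0, 2.5481806), (-292.0, 2.6179938),
    (-286.0, 2.6005406), (-289.0, 2.565634),
]
df = pd.DataFrame(rows, columns=["r", "theta"])

# Scale both columns to [0, 1], then cluster the scaled points
scaled = MinMaxScaler().fit_transform(df[["r", "theta"]])
# n_init and random_state are assumptions for repeatability, not part of the answer
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(scaled)

# Median of the original (unscaled) values within each cluster
medians = df.assign(label=labels).groupby("label")[["r", "theta"]].median()
print(medians)
```

Replacing `.median()` with `.mean()` should give the same values as the inverse-transformed cluster centers, up to the order of the labels.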

Posted 2018/10/13 13:36

tachikoma

Total score 3601

Best answer

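How about something like this?

Steps

Normalize, because r and theta are on different scales
Cluster (e.g. with k-means)
Use the clustering result to group the original, un-normalized data

Sample code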

python
import numpy as np
from sklearn.cluster import KMeans

src = [{"r": 264.0, "theta": 0.034906585},  # A
       {"r": 254.0, "theta": 0.10471976},  # A

       {"r": 284.0, "theta": 1.6057029},  # B
       {"r": 309.0, "theta": 1.6057029},  # B
       {"r": 306.0, "theta": 1.6231562},  # B

       {"r": 67.0, "theta": 1.5882496},  # C
       {"r": 61.0, "theta": 1.6057029},  # C
       {"r": 72.0, "theta": 1.5882496},  # C

       {"r": 133.0, "theta": 1.6057029},  # D
       {"r": 142.0, "theta": 1.6057029},  # D
       {"r": 147.0, "theta": 1.5882496},  # D
       {"r": 131.0, "theta": 1.6057029},  # D
       {"r": 137.0, "theta": 1.5882496},  # D

       {"r": 132.0, "theta": 0.6981317},  # E
       {"r": 142.0, "theta": 0.6457718},  # E
       {"r": 144.0, "theta": 0.6632251},  # E

       {"r": -283.0, "theta": 2.5481806},  # F
       {"r": -292.0, "theta": 2.6179938},  # F
       {"r": -286.0, "theta": 2.6005406},  # F
       {"r": -289.0, "theta": 2.565634}]  # F

# Convert to a numpy array with columns (r, theta)
x = np.array([[v['r'], v['theta']] for v in src])
# print(x.shape, x.dtype)

# Unify the scales: normalize each column to [0, 1].
# Subtract the column minimum (negative for r) so that every column starts at 0.
x_normalized = x - np.min(x, axis=0)  # [-a, b] -> [0, a + b]
x_normalized /= np.max(x_normalized, axis=0)  # [0, a + b] -> [0, 1]

# k-means
num_classes = 6
kmean = KMeans(n_clusters=num_classes)
y = kmean.fit_predict(x_normalized)
print(y)

# Show the original (un-normalized) rows that belong to each label
for label in range(num_classes):
    print('label: {}\n{}'.format(label, x[y == label]))

label: 0
[[-283.           2.5481806]
 [-292.           2.6179938]
 [-286.           2.6005406]
 [-289.           2.565634 ]]
label: 1
[[133.          1.6057029]
 [142.          1.6057029]
 [147.          1.5882496]
 [131.          1.6057029]
 [137.          1.5882496]]
label: 2
[[132.          0.6981317]
 [142.          0.6457718]
 [144.          0.6632251]]
label: 3
[[284.          1.6057029]
 [309.          1.6057029]
 [306.          1.6231562]]
label: 4
[[2.6400000e+02 3.4906585e-02]
 [2.5400000e+02 1.0471976e-01]]
label: 5
[[67.         1.5882496]
 [61.         1.6057029]
 [72.         1.5882496]]
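The output above lists the rows in each group but not the single representative value per pattern that the question asks for. The following sketch reduces each group to one `{"r", "theta"}` dict. The `summarize` helper and its `reducer` parameter are hypothetical names introduced here, and the label array is hard-coded from the printed result above so that the example runs without scikit-learn; in practice the `y` returned by `fit_predict` would be passed instead.

```python
import numpy as np

def summarize(points, labels, reducer=np.mean):
    """Return one {"r", "theta"} dict per label, reduced column-wise."""
    result = []
    for label in np.unique(labels):
        r, theta = reducer(points[labels == label], axis=0)
        result.append({"r": float(r), "theta": float(theta)})
    return result

# (r, theta) pairs from the question, in their original order
x = np.array([
    (264.0, 0.034906585), (254.0, 0.10471976),
    (284.0, 1.6057029), (309.0, 1.6057029), (306.0, 1.6231562),
    (67.0, 1.5882496), (61.0, 1.6057029), (72.0, 1.5882496),
    (133.0, 1.6057029), (142.0, 1.6057029), (147.0, 1.5882496),
    (131.0, 1.6057029), (137.0, 1.5882496),
    (132.0, 0.6981317), (142.0, 0.6457718), (144.0, 0.6632251),
    (-283.0, 2.5481806), (-292.0, 2.6179938),
    (-286.0, 2.6005406), (-289.0, 2.565634),
])

# Labels matching the printed grouping (assumption: a real run may number them differently)
y = np.array([4, 4, 3, 3, 3, 5, 5, 5, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0])

summary = summarize(x, y)
for row in summary:
    print(row)
```

Passing `reducer=np.median` gives the per-group median instead of the mean.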

Posted 2018/10/13 13:35