How to cluster data in Python

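Background / what I want to achieve

I currently have data from two types of devices.
I would like to split the points into an upper group and a lower group and run a simple linear regression on each.

As a first step I want to cluster the points by device,
but KMeans does not separate them cleanly.

If there is a good method for this kind of grouping, please let me know.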

Problem / error message

Relevant source code

python
import pandas as pd
import seaborn as sns  # needed for the scatter plot below

url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSA9NhTNG6rcb1BAdVzC2RYgPPCCd0ryo1YconlDj7TK15IAO8rIi3uY9FzRCkXsj48BO4hWtceriKq/pub?gid=0&single=true&output=csv"

# Load the sample data and plot m against ta
df = pd.read_csv(url)

sns.scatterplot(x='ta', y='m', data=df)

What I tried

KMeans

python
df1 = df.copy()

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

# Fit on the DataFrame and store the cluster label of each row
clusters = kmeans.fit(df1)
df1['cluster'] = clusters.labels_

sns.scatterplot(x='ta', y='m', hue='cluster', data=df1)

GaussianMixture

python
df2 = df.copy()

from sklearn.mixture import GaussianMixture

# Two-component Gaussian mixture, one component per device type
model = GaussianMixture(n_components=2)
model.fit(df2)
df2['cluster'] = model.predict(df2)

sns.scatterplot(x='ta', y='m', hue='cluster', data=df2)

SpectralClustering

python
df3 = df.copy()

from sklearn import cluster

spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity='nearest_neighbors')

# Fit on the DataFrame and store the cluster label of each row
clusters = spectral.fit(df3)

df3['cluster'] = clusters.labels_

sns.scatterplot(x='ta', y='m', hue='cluster', data=df3)

LinearRegression

python
df4 = df.copy()

from sklearn.linear_model import LinearRegression

# Fit one regression line to all points
lr = LinearRegression()
lr.fit(df4['ta'].values.reshape(-1, 1), df4['m'].values.reshape(-1, 1))
pred_y = lr.predict(df4['ta'].values.reshape(-1, 1)).reshape(-1)
# 0 = on or above the overall line, 1 = below it
df4['cluster'] = (df4['m'] < pred_y).astype(int)

# Refit a separate regression for each group
for name, dfg in df4.groupby('cluster'):
    lr.fit(dfg['ta'].values.reshape(-1, 1), dfg['m'].values.reshape(-1, 1))
    print(name, lr.coef_, lr.intercept_)

sns.scatterplot(x='ta', y='m', hue='cluster', data=df4)

Additional information (framework/tool versions, etc.)

Python 3.8
pandas

lehshell

2021/11/06 15:19

Have you tried a Gaussian mixture model?
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=2)
model.fit(df)
print(model.predict(df))

Deactivated user

2021/11/06 21:22

Spectral clustering also looks promising: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

barobaro

2021/11/06 23:10

I have added sample data.

barobaro

2021/11/06 23:11

lehshell, I tried a Gaussian mixture model and added the result to the question, but it did not work well. I will try Gaussian mixtures again for a different use case. Thank you.

barobaro

2021/11/06 23:11

fourteenlength, I tried spectral clustering and added the result to the question, but it did not work well. It has many parameters, so I will try other settings as well. Thank you.


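2 answers

I don't have code to share, but running RANSAC more than once is another option: fit once, then run RANSAC a second time on the points that the first pass marked as outliers.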
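
The two-pass RANSAC idea could be sketched roughly as follows. This is an illustration rather than code from the answer: the synthetic data, the `residual_threshold` of 5.0, and the names `line0`, `line1`, and `labels` are all assumptions, and the threshold in particular would need tuning for the real `ta`/`m` data. After both fits, each point is labeled by whichever line it lies closer to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic stand-in data (assumption): two noisy, non-crossing lines
rng = np.random.default_rng(0)
x = np.tile(np.arange(30.0), 2)
y = np.concatenate([
    4 * x[:30] + 0 + rng.uniform(-1, 1, 30),   # first device type
    3 * x[30:] - 40 + rng.uniform(-1, 1, 30),  # second device type
])
X = x.reshape(-1, 1)

# Pass 1: RANSAC on all points; the inliers should follow one of the two lines
line0 = RANSACRegressor(LinearRegression(), residual_threshold=5.0, random_state=0)
line0.fit(X, y)
outliers = ~line0.inlier_mask_

# Pass 2: RANSAC again, using only the points the first pass rejected
line1 = RANSACRegressor(LinearRegression(), residual_threshold=5.0, random_state=0)
line1.fit(X[outliers], y[outliers])

# Label every point by the line with the smaller absolute residual
res0 = np.abs(y - line0.predict(X))
res1 = np.abs(y - line1.predict(X))
labels = (res1 < res0).astype(int)

# Slope and intercept of each fitted line
for name, model in enumerate([line0, line1]):
    print(name, model.estimator_.coef_, model.estimator_.intercept_)
```

Which of the two lines the first pass locks onto is not fixed in advance, so label 0 may correspond to either device type.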

Posted 2021/11/07 12:12

Deactivated user

Total score: 0

Best answer

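If you can assume that there are always exactly two types and that their data never cross,
you can fit one regression to all of the data and then cluster each point by whether it lies above or below that line.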

Python
import pandas as pd
import random
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

# Test data: two noisy lines that do not cross over the x range
xs = list(range(10))
y1 = [4*x + 0 + random.uniform(-5, 5) for x in xs]
y2 = [3*x - 20 + random.uniform(-3, 3) for x in xs]
df = pd.DataFrame({'x': xs*2, 'y': y1 + y2})

# Cluster using the regression fitted to all points
lr = LinearRegression()
lr.fit(df['x'].values.reshape(-1, 1), df['y'].values.reshape(-1, 1))
pred_y = lr.predict(df['x'].values.reshape(-1, 1)).reshape(-1)
df['c'] = (df['y'] < pred_y).astype(int)  # 0 = on or above the line, 1 = below

# Refit a separate regression for each cluster
for name, df2 in df.groupby('c'):
    lr.fit(df2['x'].values.reshape(-1, 1), df2['y'].values.reshape(-1, 1))
    print(name, lr.coef_, lr.intercept_)
# Example output (varies between runs because the data are random):
#0 [[3.98516963]] [0.03326308]
#1 [[3.30785385]] [-22.03818755]

sns.scatterplot(x='x', y='y', hue='c', data=df)
plt.show()