Renumbering cluster labels according to a condition

I have simplified what I want to do.

python
import io

import pandas as pd

# Tab-separated sample: "top" is the coordinate, "row" is the cluster label
df = pd.read_table(io.StringIO("""
top	row
1	2
2	2
3	0
4	0
5	1
6	1
"""))

Clustering gave me the following result:
top row
1 2
2 2
3 0
4 0
5 1
6 1

I want to renumber row starting from 0, in order of top, like this:
top row
1 0
2 0
3 1
4 1
5 2
6 2

Thank you in advance.

The original sample below is no longer needed.

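I want to extract the text from a table in a PDF and build a TSV table from it. My steps are:

Create XML with poppler's pdftohtml
Create a CSV of X/Y coordinates and text from the XML
Cluster the X coordinates and the Y coordinates to assign classes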
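
The first two steps could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not the code used in this question: the XML fragment is a hypothetical sample written in the style of `pdftohtml -xml` output (a `page` element containing `text` elements with `top` and `left` attributes), and `xml_to_rows` is a helper name made up for illustration. The column names match the `page`, `top`, `left`, `text` columns used later.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of `pdftohtml -xml` output (assumed layout).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="120" left="50" width="40" height="12" font="0">Name</text>
<text top="120" left="200" width="40" height="12" font="0"><b>Price</b></text>
<text top="150" left="50" width="40" height="12" font="0">Apple</text>
</page>
</pdf2xml>"""


def xml_to_rows(xml_text):
    """Collect one dict per <text> element: page number, top, left, text."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            rows.append({
                "page": int(page.get("number")),
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                # itertext() also picks up text nested in tags such as <b>
                "text": "".join(el.itertext()).strip(),
            })
    return rows


rows = xml_to_rows(SAMPLE_XML)

# Write the rows as CSV (to an in-memory buffer here instead of a file).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["page", "top", "left", "text"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```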

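The clustering itself worked, but the cluster numbers come out in arbitrary order, so I want to renumber them in X-coordinate order and in Y-coordinate order.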

The columns are:

df2_results['top'] Y coordinate
df2_results['left'] X coordinate
df2_results['row'] class of Y
df2_results['col'] class of X

Thank you in advance.

python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('data.csv')

# Sort by page, then vertical position, then horizontal position
df.sort_values(by=['page', 'top', 'left'], inplace=True)
df.head()

df1 = df.loc[:, ['page', 'top', 'left', 'text']]
# ylim is reversed so the plot is oriented like the page (top at the top)
df1.plot.scatter(x='left', y='top', ylim=(892, 0))

python
# Exclude the header area
df2 = df1[df1['top'] > 100]
df2

# Check the table area
df2.plot.scatter(x='left', y='top', ylim=(892, 0))

# Cluster the Y coordinates
mdl_y = KMeans(n_clusters=33)
df2_y = df2[['top']]
mdl_y.fit(df2_y)
y2 = mdl_y.labels_
print(y2)

# Cluster the X coordinates
mdl_x = KMeans(n_clusters=6)
df2_x = df2[['left']]
mdl_x.fit(df2_x)
x2 = mdl_x.labels_
print(x2)

df2_results = df2.copy()

df2_results['row'] = y2
df2_results['col'] = x2

df2_results.sort_values('top')

python
# Scatter plot colored by Y-coordinate cluster
df2_results.plot.scatter(
    x='left', y='top', c='row', cmap='winter', ylim=(892, 0))

Figure: Y-coordinate scatter plot

python
# Scatter plot colored by X-coordinate cluster
df2_results.plot.scatter(
    x='left', y='top', c='col', cmap='winter', ylim=(892, 0))

Figure: X-coordinate scatter plot


1 answer

Best answer

You can do it like this.

python
>>> import numpy as np
>>> x_cluster_centers_ = np.array([[0], [15], [-30]])  # made-up stand-in for the cluster centers
>>> x_labels_ = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # made-up stand-in for the labels
>>> x_centers = x_cluster_centers_.ravel()  # flatten to a vector (ndim=1 array) so it is easier to handle
>>> x_convert_dic = {k:v for v,k in enumerate(x_centers.argsort())}  # argsort, then build a dict with index and value swapped
>>> [x_convert_dic[l] for l in x_labels_]  # convert the labels with that dict
[1, 1, 1, 2, 2, 2, 0, 0, 0]
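
The same dictionary lookup can also be written as a NumPy lookup array, which may be convenient when there are many labels. This is a minimal sketch under that assumption; `relabel_by_center` is a hypothetical helper name, and it is meant to be fed something like `mdl_x.cluster_centers_` and `mdl_x.labels_` from the question. It assumes one-dimensional clustering, with labels running from 0 to n_clusters - 1.

```python
import numpy as np


def relabel_by_center(cluster_centers, labels):
    """Renumber labels so that 0 is the cluster with the smallest center."""
    # (n_clusters, 1) -> (n_clusters,)
    centers = np.asarray(cluster_centers).ravel()
    # order[i] is the old label whose center is the i-th smallest
    order = centers.argsort()
    # Invert the permutation: rank[old_label] is the new label
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[np.asarray(labels)]


centers = np.array([[0], [15], [-30]])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
new_labels = relabel_by_center(centers, labels)
print(new_labels)  # [1 1 1 2 2 2 0 0 0]
```

The cluster centered at -30 becomes 0, the one at 0 becomes 1, and the one at 15 becomes 2, matching the output of the dictionary version.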

As for the colors being hard to tell apart:

Look for a cmap whose colors are easier to distinguish and specify it instead of winter.
Choosing Colormaps — Matplotlib 2.0.2 documentation

Posted 2018/06/15 12:17

hayataka2049


barobaro

2018/06/15 12:31

I'll try it now, using your code as a reference. I hadn't noticed that there are different color maps; the plot is a bit easier to read now. Thank you.

barobaro

2018/06/15 13:55

I tried it, but I couldn't use argsort properly and it didn't work. I've rewritten the sample to be simpler, so please take another look if you can.

hayataka2049

2018/06/15 13:58

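Hm, it doesn't work? If you show me how you tried it, I may be able to fix it.
As for the additional request, I can't immediately think of a good way to do it, so I'll think it over and answer tomorrow.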

barobaro

2018/06/15 14:50 (edited)

Any time is fine, so please reply whenever you have a moment. I am using the simplified sample:

    y_conv_dic = {k:v for v,k in enumerate(df['top'].argsort())}
    df['test'] = [y_conv_dic[l] for l in df['row']]
    df

       top  row  test
    0    1    3     3
    1    2    3     3
    2    3    1     1
    3    4    1     1
    4    5    2     2
    5    6    2     2

hayataka2049

2018/06/15 14:58

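I think it will work if you put something like mdl_x.cluster_centers_.ravel() in place of df['top'].
If you want to do it without that information, I suppose you would aggregate per cluster, take the mean of top, and sort by that mean. That can be done, but it is a bit of a hassle:

    y_conv_dic = {k:v for v,k in enumerate(df.groupby("row").mean()["top"].argsort())}
    df['test'] = [y_conv_dic[l] for l in df['row']]
    print(df)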
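
One caveat with the groupby approach: argsort returns positions within the grouped result, so using them as dictionary keys relies on the labels being exactly 0 to n - 1 (as KMeans labels are). A variant that keys on the labels themselves might look like the sketch below; the sample data with labels 3, 1, 2 is made up to illustrate the case, and ranking the per-cluster mean is one possible choice, not the only one.

```python
import pandas as pd

# Made-up sample whose cluster labels do not start at 0
df = pd.DataFrame({"top": [1, 2, 3, 4, 5, 6],
                   "row": [3, 3, 1, 1, 2, 2]})

# Mean of top per cluster, indexed by the original cluster label
mean_top = df.groupby("row")["top"].mean()

# Rank the means (1-based), then shift to 0-based integer labels
new_label = mean_top.rank(method="first").astype(int) - 1

# Look up each row's old label in the old-label -> new-label Series
df["test"] = df["row"].map(new_label)
print(df)
```

Here the cluster labeled 3 has the smallest mean top, so it becomes 0, followed by cluster 1 and then cluster 2.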

barobaro

2018/06/15 15:27

It worked. Thank you very much. There is still a lot I don't understand, so I'll keep studying.
