I want to plot each cluster separately.

Background / what I want to achieve

I am trying to implement k-means.
I want to cluster 30 two-dimensional data points (x and y).
The data is read from an Excel file.
I want to try 2 clusters and 3 clusters.

Problem / error message

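I put a program together without fully understanding it, and with 2 clusters the clustering seems to work.
I would like to know how to draw a scatter plot with a different color for each cluster.
Please tell me which statements to add, and where.
Also, when I changed the number of clusters to 3, the following error occurred.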

---> d = sum([x*x for x in point-center[i]])
IndexError: index 2 is out of bounds for axis 0 with size 2

Relevant source code

python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xlrd
from pandas import Series, DataFrame
from numpy.random import randint
from PIL import Image
from numpy.random import normal
from matplotlib import pyplot

xlsFile="data_1.xlsx"
sheetName="data1"
data1=pd.read_excel(xlsFile,sheet_name=sheetName)

clus = [2]  # number of clusters

# k-means processing
def run_kmeans(pixels, k):
    cls =[0]* len(pixels)

    # Set the initial representative points
    center1 = [[10,0],[0,10]]
    center=np.array(center1)
    print ("Initial centers:")
    print ("========================")
    distortion = 0.0

    # Run at most 50 iterations
    for iter_num in range(50):
        center_new= []
        for i in range(k):
            center_new.append(np.array([0.0,0.0]))
            num_points = [0] * k
        distortion_new = 0.0

        # E phase: find the group (representative point) each data point belongs to
        for pix, point in enumerate(pixels):
            min_dist = 256*256*3
            point = np.array(point)
            for i in range(k):
                d = sum([x*x for x in point-center[i]])
                if d < min_dist:
                    min_dist = d
                    cls[pix] = i
            center_new[cls[pix]] += point
            num_points[cls[pix]] += 1
            distortion_new += min_dist

        # M phase: compute the new representative points
        for i in range(k):
            center_new[i] = center_new[i] // num_points[i]
        center = center_new
        print (list(map(lambda x: x.tolist(), center)))
        print ("Distortion: J=%d" % distortion_new)

        # Stop when the change in distortion (J) falls below 0.1%
        if iter_num > 0 and distortion - distortion_new < distortion * 0.001:
            break
        distortion = distortion_new

        for i in range(k):
            labels=point
            print(labels)
            x=labels[:,0]
            y=labels[:,1]
            plt.scatter(x,y)

        df=pd.DataFrame(center)
        plt.scatter(df[0], df[1])

        plt.show()

    return pixels

# Main
if __name__ == '__main__':
    for k in clus:
        print ("")
        print ("========================")
        print ("Number of clusters: K=%d" % k)
        pixels1 = data1
        pixels=np.array(pixels1)
        run_kmeans(pixels, k)

What I tried

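I placed the following just before plt.show() and confirmed that the clustering roughly works:

x=pixels[:,0]
y=pixels[:,1]
plt.scatter(x,y)

What I would like is a scatter plot showing how the representative points and each cluster change over the iterations.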

Additional information (framework/tool versions, etc.)

python3
jupyter notebook

hayataka2049

2018/07/12 04:22

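With that code, wouldn't pixels be an undefined name in Main? Also, is there some constraint that requires you to write k-means yourself? If you have no particular reason to, using something like sklearn is easier and less prone to bugs.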
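For reference, the sklearn route mentioned in this comment could look roughly like the sketch below. It assumes scikit-learn is installed. The synthetic `data` array is only a hypothetical stand-in for the 30 rows in the Excel sheet, and the `n_init` and `random_state` values are arbitrary choices, not something from the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the 30 (x, y) rows read from the Excel sheet.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.5, size=(10, 2))
                  for loc in ([0, 0], [5, 5], [0, 5])])

for k in (2, 3):
    # n_init restarts k-means from several initializations and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(data)  # cluster index (0..k-1) for each row
    print("K=%d" % k)
    print(model.cluster_centers_)     # one (x, y) center per cluster
    print(np.bincount(labels))        # number of points in each cluster
```

With the real data, `data` would be replaced by `np.array(data1)`, and `labels` can be passed to `plt.scatter` through its `c` argument to color the points by cluster.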


1 answer

Best answer

If a quick fix is enough, calling

plt.scatter(pixels[:,0], pixels[:,1], c=cls)

at the end of the run_kmeans function should draw the scatter plot.
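To see what the `c` argument does in isolation, here is a minimal sketch with made-up data. The `points` and `labels` arrays and the `plot_clusters` helper are hypothetical illustrations, not part of the original program: `c` takes one value per point, and `cmap` maps those values to colors.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: six 2-D points and the cluster index of each.
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def plot_clusters(points, labels):
    """Scatter the points, colored by their cluster index."""
    fig, ax = plt.subplots()
    # One color value per point; points with the same label share a color.
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="jet")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return ax

ax = plot_clusters(points, labels)
```

In a script, `plt.show()` after the call displays the figure. In the question's code, `cls` plays the role of `labels`.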

Applying this to the original code gives the following.

Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xlrd
from pandas import Series, DataFrame
from numpy.random import randint
from PIL import Image
from numpy.random import normal
from matplotlib import pyplot

xlsFile="data_1.xlsx"
sheetName="data1"
data1=pd.read_excel(xlsFile,sheet_name=sheetName)

clus = [2]  # number of clusters

# k-means processing
def run_kmeans(pixels, k):
    cls =[0]* len(pixels)

    # Set the initial representative points
    center1 = [[10,0],[0,10]]
    center=np.array(center1)
    print ("Initial centers:")
    print ("========================")
    distortion = 0.0

    # Run at most 50 iterations
    for iter_num in range(50):
        center_new= []
        for i in range(k):
            center_new.append(np.array([0.0,0.0]))
            num_points = [0] * k
        distortion_new = 0.0

        # E phase: find the group (representative point) each data point belongs to
        for pix, point in enumerate(pixels):
            min_dist = 256*256*3
            point = np.array(point)
            for i in range(k):
                d = sum([x*x for x in point-center[i]])
                if d < min_dist:
                    min_dist = d
                    cls[pix] = i
            center_new[cls[pix]] += point
            num_points[cls[pix]] += 1
            distortion_new += min_dist

        # M phase: compute the new representative points
        for i in range(k):
            center_new[i] = center_new[i] // num_points[i]
        center = center_new
        print (list(map(lambda x: x.tolist(), center)))
        print ("Distortion: J=%d" % distortion_new)

        # Stop when the change in distortion (J) falls below 0.1%
        if iter_num > 0 and distortion - distortion_new < distortion * 0.001:
            break
        distortion = distortion_new

    # Draw the scatter plot, colored by cluster index
    plt.scatter(pixels[:,0], pixels[:,1], c=cls, cmap='jet')
    plt.show()

    return pixels

# Main
if __name__ == '__main__':
    for k in clus:
        print ("")
        print ("========================")
        print ("Number of clusters: K=%d" % k)
        pixels1 = data1
        pixels=np.array(pixels1)
        run_kmeans(pixels, k)
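
The code above still raises the IndexError for K=3, because center1 = [[10,0],[0,10]] defines only two centers, so center[2] does not exist. One possible way around this is sketched below. It assumes that starting from k randomly chosen data points is acceptable, which is a swapped-in initialization rather than the original fixed centers. It also uses true division for the means, since `//` floors the center coordinates, and leaves an empty cluster's center where it was. The parameters `max_iter`, `tol` and `seed`, the returned `history` list, and the synthetic `data` are all hypothetical names introduced for illustration.

```python
import numpy as np

def run_kmeans(points, k, max_iter=50, tol=0.001, seed=0):
    """k-means for any k; returns (labels, centers, history of centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: use k distinct data points as initial centers, so there
    # are always exactly k of them (the hard-coded list had only two rows).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    history = [centers.copy()]
    distortion = 0.0
    for it in range(max_iter):
        # E phase: squared distance from every point to every center.
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        distortion_new = sq_dist[np.arange(len(points)), labels].sum()
        # M phase: mean of each cluster; an empty cluster keeps its old center.
        centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        history.append(centers.copy())
        # Stop once the distortion improves by no more than tol (relative).
        if it > 0 and distortion - distortion_new <= distortion * tol:
            break
        distortion = distortion_new
    return labels, centers, history

# Hypothetical stand-in for the 30 (x, y) rows read from data_1.xlsx.
demo_rng = np.random.default_rng(1)
data = np.vstack([demo_rng.normal(loc, 0.3, size=(10, 2))
                  for loc in ([0, 0], [5, 5], [0, 5])])

for k in (2, 3):
    labels, centers, history = run_kmeans(data, k)
    print("K=%d, iterations=%d" % (k, len(history) - 1))
    print(centers)
```

Since `history` holds the centers after each iteration, scattering its entries in turn, together with the points colored by `labels`, would show how the representative points move, which is the transition plot the question asks for.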