When I try to run clustering in Python, I get "ValueError: Length of values does not match length of index" and cannot get past it.

This is the source code that produced the error.

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import matplotlib.lines as mlines
import codecs
from sklearn.cluster import KMeans

f1 = codecs.open('claster_panda_1_1.csv', 'w', 'utf-8')


# Load the dataset

cust_df = pd.read_csv("odds_test_1_1.csv" , sep=",")

print cust_df

# Convert the pandas DataFrame to a NumPy array

cust_array = cust_df.as_matrix().astype(np.int)


# Transpose the array
cust_array = cust_array.T

# Run the cluster analysis (number of clusters = 4)
pred = KMeans(n_clusters=4).fit_predict(cust_array)
print pred

# Add the cluster number to the pandas DataFrame
cust_df['cluster_id']=pred
print cust_df

cust_df.to_csv('claster_panda_1_1.csv', index=None)


# Distribution of the number of samples in each cluster
cust_df['cluster_id'].value_counts()

print cust_df['cluster_id'].value_counts()

# Mean of the data in each cluster

cust_df[cust_df['cluster_id']==0].mean() # cluster number = 0
print cust_df[cust_df['cluster_id']==0].mean() # cluster number = 0

cust_df[cust_df['cluster_id']==1].mean() # cluster number = 1
print cust_df[cust_df['cluster_id']==1].mean() # cluster number = 1

cust_df[cust_df['cluster_id']==2].mean() # cluster number = 2
print cust_df[cust_df['cluster_id']==2].mean() # cluster number = 2

cust_df[cust_df['cluster_id']==3].mean() # cluster number = 3
print cust_df[cust_df['cluster_id']==3].mean() # cluster number = 3


# Visualization (stacked bar chart)
import matplotlib.pyplot as plt

clusterinfo = pd.DataFrame()
for i in range(4):
    clusterinfo['cluster' + str(i)] = cust_df[cust_df['cluster_id'] == i].mean()
clusterinfo = clusterinfo.drop('cluster_id')

my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)

plt.legend(loc='uppper right',
           bbox_to_anchor=(1.05, 0.5, 0.5, 10), 
           borderaxespad=0.,)
plt.show()

my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
plt.show()

This is the dataset being loaded, odds_test_1_1.csv.

temp(0)	temp(1)	temp(2)	temp(3)	temp(4)	temp(5)	temp(6)	temp(7)	temp(8)	temp(9)	temp(10)	temp(11)	temp(12)	arrived
0	0	4	8.1	13.1	12.3	9.1	9.2	6.4	6.6	6.3	6.5	6.9	1
0	6.8	3.7	9.9	16	7	5.3	4.9	5.1	5.1	5.2	4.7	4.7	1
0	3.4	30	61.4	27.8	11.5	11.7	11.9	12.8	13.4	14	14.5	15.6	0
0	3.4	25	9.1	48	38	20.4	17.7	18.3	15	14.6	14.2	14.9	0
0	6.8	9.9	19.7	12.8	13.6	14.3	14	14.2	15	14.3	14.5	15.3	1
0	3.4	25	34.5	156.1	107.4	84.6	59.3	63.2	67.7	65.7	67.2	69.2	0
0	0	37.6	92.2	198.7	137.7	125.3	99.2	90.1	93.8	92.4	93.8	90.9	0
0	0	5.5	12	27.9	28.4	26.9	25.9	27.5	27.4	27.8	27.4	28.7	0
0	0	4.3	1.4	1.1	1.3	1.5	1.7	1.8	1.8	1.8	1.8	1.8	0
0	0	50.1	61.4	118.1	57	58.7	32.5	35.1	38.3	39.5	40.3	42.9	0
0	0	25.6	88.4	88.4	47.1	76.8	70.6	70.6	76.7	81.4	78.9	80.2	0
0	0	12.8	7.1	7.1	5.1	5.6	6.7	6.7	5.3	5.6	5.9	5.9	1
0	0	5.9	7.7	7.7	5.8	4.1	4.3	4.3	3.9	4.1	4.3	4.4	1
0	0	5.4	5.1	5.1	9.8	14.7	15.3	15.3	15.8	13.2	11.6	11.7	0
0	0	3.3	13	13	8.7	14.8	18.9	18.9	19.6	20.8	18	17	0
0	0	76.8	63.2	63.2	68	94.5	117.8	117.8	95	100.5	92	76.6	0
0	0	25.6	29.4	29.4	12.8	12.5	15	15	15.6	15.8	15.7	16	0
0	0	38.4	8.6	8.6	4.6	4.9	3.2	3.2	3.8	3.9	4	4	0
0	0	76.8	13.4	13.4	15.9	8.8	9.6	9.6	8.9	7.5	7.7	7.8	0
0	0	10.9	3	3	7.2	9	11.9	11.9	12.7	11.8	10.6	9.7	1
0	0	38.4	44.2	44.2	81.6	57.1	51.2	51.2	52	54.9	57.1	58.1	0
0	0	12.8	16.3	16.3	22.2	10.5	12.7	12.7	13.7	14.3	14.3	14.3	0
0	0	5.1	20.1	20.1	12.3	18.6	13.8	13.8	15.1	15.6	15.1	15.4	0
0	0	19.7	4	5.7	5.5	5.1	5.2	5.2	5.3	5.1	5	5	1

Running this produces the following error message.

[2693 rows x 14 columns]
[3 3 2 0 1 1 1 1 1 1 1 1 1 3]
Traceback (most recent call last):
File "C:\Users\satoru\satoru_system_2.7\data_test\klaster_1_1.py", line 34, in <module>
cust_df['cluster_id']=pred
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2357, in setitem
self._set_item(key, value)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2423, in _set_item
value = self._sanitize_column(key, value)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2578, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 2770, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

C:\Users\satoru\satoru_system_2.7\data_test>

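I'm sorry, but all I could understand from the error above is that "the length of the values does not match the length of the index."

So I cannot point to the exact place where the error occurs.

However, by adding the source code a little at a time, I narrowed the place where the error appears to:

# Run the cluster analysis (number of clusters = 4)
pred = KMeans(n_clusters=4).fit_predict(cust_array)

cust_df['cluster_id']=pred
print cust_df

cust_df.to_csv('claster_panda_1_1.csv', index=None)

I suspect it is probably in the part shown above.

I would appreciate your guidance.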

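_Victorique__

2017/07/29 04:45

Asking us to pin down where the error is just by reading this code is a bit much... Please at least tell us where the error occurs.

8524ba23

2017/07/29 05:24

For me it fails differently, at line 20, with "ValueError: invalid literal for long() with base 10". Why would that be? Please add details to the question.

akakage13

2017/07/29 05:36

can110, regarding "ValueError: invalid literal for long() with base 10": I'm sorry, but I cannot even tell which part you are pointing at.

8524ba23

2017/07/29 05:38

I simply ran the code as posted with the "odds_test_1_1.csv" you provided. The error I got was different.

akakage13

2017/07/29 05:57

can110, I'm sorry to trouble you, but I tried again and I still get "ValueError: Length of values does not match length of index".

8524ba23

2017/07/29 06:13

Understood. Perhaps something in our environments differs. I can't reproduce it, so I'm sorry I can't offer an answer.

akakage13

2017/07/29 06:18

can110, thank you very much for spending your valuable time on my problem. I look forward to your continued help.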

2 answers

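I think the part that transposes the array is simply unnecessary.

Because the sample code goes out of its way to transpose the array, it ends up clustering the data column by column (that is, it clusters temp(0), temp(1), ... themselves).
Naturally the result contains 14 values (one each for temp(0) through temp(12) and arrived). The error says the lengths do not match because the code then tries to add those 14 values as a new column on the original data, which has 2693 rows according to the output.

So I expect the error will go away if you comment out that part.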
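
The mismatch described above can be reproduced on a small scale. The following is only a minimal sketch: the 20-row random DataFrame is a hypothetical stand-in for odds_test_1_1.csv, and `n_init` and `random_state` are set only to make the run repeatable. It assumes a recent pandas, where `to_numpy()` takes the place of `as_matrix()`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for odds_test_1_1.csv: 20 rows, 3 columns.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(20, 3) * 100,
                  columns=["temp(0)", "temp(1)", "arrived"])

X = df.to_numpy(dtype=float)

# Transposed input: KMeans sees 3 samples (one per column) and returns 3 labels.
labels_per_column = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)

# Untransposed input: one label per row, which matches len(df).
labels_per_row = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

try:
    df["cluster_id"] = labels_per_column  # 3 values for 20 rows
    mismatch_raised = False
except ValueError:
    mismatch_raised = True  # same kind of error as in the question

df["cluster_id"] = labels_per_row  # lengths agree, so this succeeds
```

`fit_predict` returns one label per row of the array it receives, so whether the labels fit into the DataFrame as a new column depends only on whether the array was transposed first.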

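A few other things caught my eye:

1. When converting the DataFrame to an array with as_matrix(), you cast it to int with astype(). Is that intended? The original data appears to be given as decimals...

2. The second half (computing the mean for each cluster and plotting it) seems quite redundant.

I think everything from the comment about the mean of the data in each cluster onward could be written as

Python
# Compute the mean of the data in each cluster
clusterinfo = cust_df.groupby("cluster_id").mean()

# Visualization (stacked bar chart)
clusterinfo.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
# (omitted)
plt.show()

instead.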
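
To see that the `groupby` form gives the same numbers as the per-cluster loop in the question, here is a small sketch. The four-row DataFrame and its cluster labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data with cluster labels already assigned.
df = pd.DataFrame({
    "temp(1)": [1.0, 3.0, 10.0, 20.0],
    "temp(2)": [2.0, 4.0, 30.0, 50.0],
    "cluster_id": [0, 0, 1, 1],
})

# Loop version, in the spirit of the question: one boolean mask per cluster.
loop_means = pd.DataFrame({
    i: df[df["cluster_id"] == i].drop(columns="cluster_id").mean()
    for i in sorted(df["cluster_id"].unique())
}).T

# groupby version: one row per cluster, one column per feature.
group_means = df.groupby("cluster_id").mean()
```

Both results have one row per cluster and one column per feature, so the `groupby` result can be plotted directly without the transpose that the loop version needs.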

Posted 2017/07/29 14:34

Edited 2017/07/29 14:50

magichan

Total score 15898

akakage13

2017/07/30 05:11

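magichan, thank you for your guidance. As you pointed out, I have changed the line to cust_array = cust_df.as_matrix().astype(np.float). As for your suggestion about the part from the comment about the mean of the data in each cluster onward, I will use it as good material for further study. I look forward to your continued help.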
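
The effect of the int cast discussed here can be checked on a few values. This is a minimal sketch using three numbers from the first data row of the question's CSV. It uses the built-in `int` and `float`, since the `np.int` and `np.float` aliases have been removed from recent NumPy releases.

```python
import numpy as np

# Three values taken from the first data row of the question's CSV.
odds = np.array([8.1, 13.1, 12.3])

as_int = odds.astype(int)      # the fractional part is discarded
as_float = odds.astype(float)  # the values are kept as they are
```

With the int cast, values such as 8.1 and 8.9 would both become 8 before clustering, which is why the float cast is the safer choice for this data.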

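Best answer

I fixed the code while guessing that this is roughly what you want to do. I don't really understand the requirements, so it may be off the mark.

Rather than only saying that an error occurred, it helps to write what you are trying to achieve.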

python
import pandas as pd
import numpy as np
import matplotlib.lines as mlines
import codecs
from sklearn.cluster import KMeans

f1 = codecs.open('claster_panda_1_1.csv', 'w', 'utf-8')
# Load the dataset

cust_df = pd.read_csv("odds_test_1_1.csv" , sep=",")

print(cust_df)

# Convert the pandas DataFrame to a NumPy array

cust_array = cust_df.as_matrix().astype(np.int)


# Transpose the array
cust_array = cust_array.T

# Run the cluster analysis (number of clusters = 4)
pred = KMeans(n_clusters=4).fit_predict(cust_array)
pred = np.array(pred)
print(pred)

# Add the cluster number to the pandas DataFrame
columns = cust_df.columns.tolist()
print(columns)
# Distribution of the number of samples in each cluster
id_df = pd.DataFrame([pred])
print(id_df.ix[0].value_counts())

# Column names that fall in each cluster (pred holds one label per column)
id0 = list(filter(lambda b: b is not None, np.where(pred==0, columns, None)))
id1 = list(filter(lambda b: b is not None, np.where(pred==1, columns, None)))
id2 = list(filter(lambda b: b is not None, np.where(pred==2, columns, None)))
id3 = list(filter(lambda b: b is not None, np.where(pred==3, columns, None)))
# Mean of the data in each cluster

print(cust_df[id0].mean()) # cluster number = 0
print(cust_df[id1].mean()) # cluster number = 1
print(cust_df[id2].mean()) # cluster number = 2
print(cust_df[id3].mean()) # cluster number = 3

# Visualization (stacked bar chart)
import matplotlib.pyplot as plt

clusterinfo = pd.DataFrame()
for i, id_i in enumerate((id0, id1, id2, id3)):
    print(cust_df[id_i].as_matrix().mean())
    clusterinfo['cluster' + str(i)] = [cust_df[id_i].as_matrix().mean()]
# clusterinfo = clusterinfo.drop('cluster_id')
print(clusterinfo)
my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
plt.legend(loc='upper right',
    bbox_to_anchor=(1.05, 0.5, 0.5, 10),
    borderaxespad=0.)
plt.show()

Posted 2017/07/29 12:16

MasashiKimura

Total score 1150

akakage13

2017/07/29 20:17

MasashiKimura, thank you for your guidance, and for such thorough source code. It is almost exactly what I intended, but there is one point I really need help with: writing the pandas DataFrame, with the cluster numbers added, to a CSV file. That is one of the things I want to do this time. I had written that part as

# Add the cluster number to the pandas DataFrame
cust_df['cluster_id']=pred
print cust_df
cust_df.to_csv('claster_panda_1.csv', index=None)

whereas you wrote

# Add the cluster number to the pandas DataFrame
columns = cust_df.columns.tolist()
print(columns)

I wanted to write this columns to a CSV file and tried various things such as

cust_df.columns.tolist().to_csv('claster_panda_1.csv', index=None)

but could not get it to work. I would be very grateful for your advice on this point.

MasashiKimura

2017/07/30 02:03

That was just a comment I forgot to delete. Something like the following might work.

items = [(col, [cid]) for col, cid in zip(columns, pred)]
df = pd.DataFrame.from_items(items)
df.index= ['cluster_id']
cust_df = cust_df.append(df)
print(cust_df)
cust_df.to_csv('cluster_pandas_1_1.csv')

Also, magichan's answer is better than mine.
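
`DataFrame.from_items` and `DataFrame.append` are no longer available in current pandas releases, so the snippet above only runs on older versions. A rough equivalent, assuming a recent pandas, is sketched below. The two-row DataFrame and the `pred` values are hypothetical, and `label_row` and `with_labels` are names chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy frame and per-column labels (one label per column).
cust_df = pd.DataFrame({
    "temp(0)": [0.0, 0.0],
    "temp(1)": [6.8, 3.4],
    "arrived": [1, 0],
})
pred = [2, 0, 1]

# One-row frame whose columns line up with cust_df and whose index is 'cluster_id'.
label_row = pd.DataFrame([pred], columns=cust_df.columns, index=["cluster_id"])

# pd.concat stacks the label row underneath the data rows.
with_labels = pd.concat([cust_df, label_row])
```

Calling `with_labels.to_csv(...)` would then write the data rows followed by a final row labelled cluster_id.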

akakage13

2017/07/30 04:30

MasashiKimura, thank you for your guidance. I inserted the script above right away and ran it. claster_panda_1_1.csv is written, but the data in it has no cluster labels. Also, two claster_panda_1_1.csv files were written: one is empty, and the other contains data without cluster labels. I also cannot find the reason why cluster_pandas_1_1.csv is written twice even though cust_df.to_csv('cluster_pandas_1_1.csv') appears in only one place. Just in case, here is your source code as I ran it. I would appreciate your guidance.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.lines as mlines
import codecs
from sklearn.cluster import KMeans

f1 = codecs.open('claster_panda_1_1.csv', 'w', 'utf-8')

# Load the dataset
cust_df = pd.read_csv("odds_test_1_1.csv" , sep=",")
print(cust_df)

# Convert the pandas DataFrame to a NumPy array
cust_array = cust_df.as_matrix().astype(np.int)

# Transpose the array
cust_array = cust_array.T

# Run the cluster analysis (number of clusters = 4)
pred = KMeans(n_clusters=4).fit_predict(cust_array)
pred = np.array(pred)
print(pred)

# Add the cluster number to the pandas DataFrame
columns = cust_df.columns.tolist()
print(columns)

##################
items = [(col, [cid]) for col, cid in zip(columns, pred)]
df = pd.DataFrame.from_items(items)
df.index= ['cluster_id']
cust_df = cust_df.append(df)
print(cust_df)
cust_df.to_csv('cluster_pandas_1_1.csv')
##################

# Distribution of the number of samples in each cluster
id_df = pd.DataFrame([pred])
print(id_df.ix[0].value_counts())

id0 = list(filter(lambda b: b is not None, np.where(pred==0, columns, None)))
id1 = list(filter(lambda b: b is not None, np.where(pred==1, columns, None)))
id2 = list(filter(lambda b: b is not None, np.where(pred==2, columns, None)))
id3 = list(filter(lambda b: b is not None, np.where(pred==3, columns, None)))

# Mean of the data in each cluster
print(cust_df[id0].mean()) # cluster number = 0
print(cust_df[id1].mean()) # cluster number = 1
print(cust_df[id2].mean()) # cluster number = 2
print(cust_df[id3].mean()) # cluster number = 3

# Visualization (stacked bar chart)
import matplotlib.pyplot as plt

clusterinfo = pd.DataFrame()
for i, id_i in enumerate((id0, id1, id2, id3)):
    print(cust_df[id_i].as_matrix().mean())
    clusterinfo['cluster' + str(i)] = [cust_df[id_i].as_matrix().mean()]
# clusterinfo = clusterinfo.drop('cluster_id')
print(clusterinfo)
my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
plt.legend(loc='uppper right',
    bbox_to_anchor=(1.05, 0.5, 0.5, 10),
    borderaxespad=0.)
plt.show()
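
Judging only from the script quoted above, one possible explanation for the empty file is that codecs.open(..., 'w', ...) creates claster_panda_1_1.csv as soon as it is called, while to_csv writes to the differently spelled cluster_pandas_1_1.csv. This was not confirmed in the thread. The sketch below only illustrates that mechanism in a temporary directory, with the built-in `open` standing in for `codecs.open`.

```python
import os
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
empty_path = os.path.join(workdir, "claster_panda_1_1.csv")   # name given to codecs.open
data_path = os.path.join(workdir, "cluster_pandas_1_1.csv")   # name given to to_csv

# Opening a file in 'w' mode creates it (or truncates it) immediately.
f1 = open(empty_path, "w", encoding="utf-8")
f1.close()

# to_csv writes to a different file name, so a second file appears.
pd.DataFrame({"temp(0)": [0.0], "arrived": [1]}).to_csv(data_path)

created = sorted(os.listdir(workdir))
empty_size = os.path.getsize(empty_path)
```

If this is what happened, dropping the unused `f1 = codecs.open(...)` line would leave only the file written by `to_csv`.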

akakage13

2017/07/30 04:58

MasashiKimura, I managed to get it working with the source code below. Thank you very much for your kind and thorough guidance. I believe I now understand, at least a little, what is better about magichan's answer. I look forward to your continued help.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.lines as mlines
import codecs
from sklearn.cluster import KMeans

f1 = codecs.open('claster_panda.csv', 'w', 'utf-8')

# Load the dataset
cust_df = pd.read_csv("odds_test_1_1.csv" , sep=",")
print cust_df

cust_array = cust_df.as_matrix().astype(np.int)

# Run the cluster analysis (number of clusters = ?)
pred = KMeans(n_clusters=6).fit_predict(cust_array)
print pred

# Add the cluster number to the pandas DataFrame
cust_df['cluster_id']=pred
print cust_df
cust_df.to_csv('claster_panda.csv', index=None)

# Distribution of the number of samples in each cluster
cust_df['cluster_id'].value_counts()
print cust_df['cluster_id'].value_counts()

# Mean of the data in each cluster
cust_df[cust_df['cluster_id']==0].mean() # cluster number = 0
print cust_df[cust_df['cluster_id']==0].mean() # cluster number = 0
cust_df[cust_df['cluster_id']==1].mean() # cluster number = 1
print cust_df[cust_df['cluster_id']==1].mean() # cluster number = 1
cust_df[cust_df['cluster_id']==2].mean() # cluster number = 2
print cust_df[cust_df['cluster_id']==2].mean() # cluster number = 2
cust_df[cust_df['cluster_id']==3].mean() # cluster number = 3
print cust_df[cust_df['cluster_id']==3].mean() # cluster number = 3
cust_df[cust_df['cluster_id']==4].mean() # cluster number = 4
print cust_df[cust_df['cluster_id']==4].mean() # cluster number = 4
cust_df[cust_df['cluster_id']==5].mean() # cluster number = 5
print cust_df[cust_df['cluster_id']==5].mean() # cluster number = 5

# Visualization (stacked bar chart)
import matplotlib.pyplot as plt

clusterinfo = pd.DataFrame()
for i in range(6):
    clusterinfo['cluster' + str(i)] = cust_df[cust_df['cluster_id'] == i].mean()
clusterinfo = clusterinfo.drop('cluster_id')

my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
plt.legend(loc='uppper right',
           bbox_to_anchor=(1.05, 0.5, 0.5, 10),
           borderaxespad=0.,)
plt.show()

my_plot = clusterinfo.T.plot(kind='bar', stacked=True, title="Mean Value of 4 Clusters")
my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
plt.show()
