Error when running PCA in Python

I am trying to run PCA in Python.
I am running it on Google Colaboratory with a TPU runtime.
When I run the code below, I get the following result:

Python
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

df = pd.read_table('all_FPKMs_remove_zero_null.txt',index_col=0)
df_scale = scale(df.T)  # transpose, then standardize each column

pca = PCA()
pca.fit(df_scale)

print('Principal components',pca.components_.round(4))
print('Mean',pca.mean_)
print('Covariance',pca.get_covariance())
print('Explained variance ratio per component',pca.explained_variance_ratio_)
print('Cumulative explained variance ratio',sum(pca.explained_variance_ratio_))
print('Standard deviation',[math.sqrt(u) for u in pca.explained_variance_])

fuka = pca.components_*np.c_[np.sqrt(pca.explained_variance_)].T
print('Loadings\n',fuka.round(4))

transformed = pca.fit_transform(df_scale)
plt.scatter( [u[0] for u in transformed], [u[1] for u in transformed] )
plt.title('PCA of RNA seq')
plt.grid()
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

> Output

Principal components [[ 0.0031  0.0077 -0.0001 ...  0.0009  0.0007  0.0079]
 [ 0.0018 -0.0054  0.0014 ...  0.0023  0.0005  0.0075]
 [-0.0029 -0.0009  0.0076 ... -0.0004  0.0079 -0.0005]
 ...
 [ 0.0034  0.0007 -0.0008 ...  0.003  -0.0045  0.0001]
 [-0.0035  0.0005 -0.0015 ...  0.0007 -0.001   0.0019]
 [-0.0204 -0.0058 -0.0106 ...  0.008  -0.0003  0.0067]]
Mean [-8.43311000e-18  3.04584692e-16  9.11922127e-16 ... -5.46995667e-16
  7.02818867e-17 -4.11093644e-16]
Covariance [[ 1.00073801  0.18042571 -0.10115732 ...  0.02654404  0.01211901
   0.25330053]...
 [ 0.02654404  0.03076802  0.01161014 ...  1.00073801  0.01276057
   0.07838368]
 [ 0.01211901  0.02966463  0.26174796 ...  0.01276057  1.00073801
   0.02457716]
 [ 0.25330053  0.38477469  0.01831294 ...  0.07838368  0.02457716
   1.00073801]]
Explained variance ratio per component [1.57061901e-01 6.23013681e-02 4.14671478e-02 ... 4.44736120e-05
 4.32692769e-05 7.07558086e-32]
Cumulative explained variance ratio 0.9999999999999976
Standard deviation [95.73584852801494, 60.295924396713296, 49.19163393272094, 38.433049498662456, 36.85838948
---------------------

ValueError                                Traceback (most recent call last)
<ipython-input-2-00bfb4e1ed32> in <module>()
     19 print('Standard deviation',[math.sqrt(u) for u in pca.explained_variance_])
     20 
---> 21 fuka = pca.components_*np.c_[np.sqrt(pca.explained_variance_)].T
     22 print('Loadings\n',fuka.round(4))
     23 

ValueError: operands could not be broadcast together with shapes (1356,58312) (1,1356) 

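I suspected this might be a memory problem rather than an error in the code, but I am not sure.
I would appreciate advice from anyone who understands what is going on.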


1 answer

Best answer

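fuka = pca.components_*np.c_[np.sqrt(pca.explained_variance_)].T

I think the cause is that only the second operand is being transposed.

Attribute access binds more tightly than multiplication, so the expression is evaluated as

fuka = pca.components_*(np.c_[np.sqrt(pca.explained_variance_)].T)

Here np.c_[...] produces a column of shape (1356, 1), and .T turns it into shape (1, 1356), which cannot be broadcast against pca.components_ of shape (1356, 58312).

If the transpose is not needed, write

fuka = pca.components_*np.c_[np.sqrt(pca.explained_variance_)]

If you want to transpose the whole result, write

fuka = (pca.components_*np.c_[np.sqrt(pca.explained_variance_)]).T

One of these should be what you want.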
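The shape mismatch described above can be reproduced on a much smaller array. The following is only a sketch: `components` and `variances` are hypothetical stand-ins for `pca.components_` and `pca.explained_variance_`, filled with made-up values, and the sizes (3 components, 5 features) are chosen purely for illustration.

```python
import numpy as np

# Hypothetical stand-ins for pca.components_ and pca.explained_variance_
# (made-up values; 3 components, 5 features).
rng = np.random.default_rng(0)
n_components, n_features = 3, 5
components = rng.normal(size=(n_components, n_features))
variances = np.array([4.0, 1.0, 0.25])

col = np.c_[np.sqrt(variances)]  # column vector, shape (3, 1)
row = col.T                      # row vector, shape (1, 3)

# (3, 5) * (1, 3): the trailing dimensions 5 and 3 differ, so NumPy refuses.
try:
    components * row
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# (3, 5) * (3, 1): each row (one component) is scaled by its own
# standard deviation.
loadings = components * col

# Transposing the whole product gives one row per feature instead.
loadings_T = loadings.T          # shape (5, 3)
```

With the column vector, broadcasting stretches the single column across all features, so row `i` of `loadings` is row `i` of `components` multiplied by `sqrt(variances[i])`.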

Posted 2020/03/30 18:36

hayataka2049

Total score 30935

iziz

2020/04/01 11:34

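Thank you as always. After making the correction above, the code ran correctly. There is still a lot about the code that I do not understand, so I would like to ask one more thing. The textbook has a note saying "transpose to make the data easier to read." I take this to mean swapping rows and columns: in this case the original rows are gene names and the columns are samples, so those are being swapped. But when should rows and columns be swapped? In this case, running PCA without the transpose gives a result that is much easier to read, so I do not understand the purpose of the transpose. I have further questions about the output, which I will post separately.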
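On the transpose question, one relevant fact is that scikit-learn's `PCA` treats each row of its input as a sample and each column as a feature. The sketch below uses a small random matrix as a hypothetical stand-in for the expression table (10 genes by 4 samples, not the real data) to show how the orientation changes what the scores describe.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the expression table:
# rows = genes, columns = samples (random values, not real data).
rng = np.random.default_rng(0)
n_genes, n_samples = 10, 4
expr = rng.normal(size=(n_genes, n_samples))

# Transposed input: rows are samples, so each output row is one sample's scores.
scores_by_sample = PCA().fit_transform(expr.T)  # shape (4, 4)

# Untransposed input: rows are genes, so each output row is one gene's scores.
scores_by_gene = PCA().fit_transform(expr)      # shape (10, 4)
```

A scatter plot of the first two columns would therefore show one point per sample in the first case and one point per gene in the second, so which orientation is appropriate depends on whether samples or genes are the objects being compared.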
