How to fix the error "X has 2 features per sample; expecting 2246"

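I am currently training a logistic regression model with scikit-learn.

I would like to plot the decision regions together with the training samples and test samples.

However, the following error is raised: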

 plot_decision_regions(X_combined_std,Y_combined_std,classifier=lr, test_idx=range(105,150))
  File "sample.py", line 222, in plot_decision_regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
  File "/Library/Python/2.7/site-packages/sklearn/linear_model/base.py", line 324, in predict
    scores = self.decision_function(X)
  File "/Library/Python/2.7/site-packages/sklearn/linear_model/base.py", line 305, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 2 features per sample; expecting 2246

and the plot is not displayed.

Full code

python
# Imports are assumed here; the original post did not show them.
import json

import gensim
import MeCab
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18
from sklearn.preprocessing import StandardScaler


def age_machine_learning(db):
    dates = []       # one dense bag-of-words vector per profile
    age_labels = []  # 1 or 2 (the age bracket divided by 10)
    dictionary = get_dictionary(db)
    for age in range(1, 3):
        descriptions = []
        for data in db.profile.find({"age": age*10}):
            descriptions.append(data['description'].encode(
                'utf-8')+data['screen_name'].encode('utf-8'))
        tagger = MeCab.Tagger('-Ochasen')
        a = list(descriptions)
        for description in a:
            words = []
            nodes = tagger.parseToNode(description)
            while nodes:
                # Keep nouns only.
                if nodes.feature.split(',')[0] == '名詞':
                    word = nodes.surface.decode('utf-8')
                    words.append(json.dumps(word, ensure_ascii=False))
                nodes = nodes.next
            age_labels.append(age)
            tmp = dictionary.doc2bow(words)
            dense = list(gensim.matutils.corpus2dense(
                [tmp], num_terms=len(dictionary)).T[0])
            dates.append(dense)
    X_train, X_test, y_train, y_test = train_test_split(dates, age_labels, test_size=0.3)
    lr = LogisticRegression(C=1000.0, random_state=0)
    sc = StandardScaler()
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    # lr is fitted on len(dictionary) features per sample (2246 in this run).
    lr.fit(X_train_std, y_train)
    X_combined_std = np.vstack((X_train_std, X_test_std))
    Y_combined_std = np.hstack((y_train, y_test))
    plot_decision_regions(X_combined_std, Y_combined_std, classifier=lr, test_idx=range(105, 150))
    plt.xlabel('petal length')
    plt.ylabel('petal width')
    plt.legend(loc='upper left')
    plt.show()
    # compare_classifiers(dates, age_labels)


def plot_decision_regions(X, y, classifier, test_idx, resolution=0.02):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Only the first two columns of X are used to build the grid.
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    print(np.array([xx1.ravel(), xx2.ravel()]).T)
    print(len(np.array([xx1.ravel(), xx2.ravel()]).T))
    # The ValueError is raised here: the grid has 2 columns,
    # but the classifier was fitted on 2246 features.
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot every sample, one marker and color per class.
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # Circle the test samples with unfilled markers.
    if test_idx:
        X_test, Y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none', edgecolors='black',
                    alpha=1.0, linewidths=1, marker='o', s=55, label='test set')

The function I call is age_machine_learning; the actual plotting is done in plot_decision_regions.

Thank you in advance.

I would like to resolve this quickly if possible, so any help is appreciated.

Site I used as a reference


1 answer

Check the size of the array passed to predict in this line:
Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)

X has 2 features per sample; expecting 2246
When the model was fitted, each sample had 2246 values, but the array passed in now has only 2.
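
A minimal sketch of that check, assuming synthetic random data in place of the real bag-of-words matrix (the array sizes and variable names here are illustrative, not taken from the question). Comparing the grid's column count with the number of coefficients held by the fitted model makes the mismatch visible before predict is called:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data: 60 samples, 2246 features (assumption).
rng = np.random.RandomState(0)
X = rng.rand(60, 2246)
y = np.array([0, 1] * 30)

lr = LogisticRegression(C=1000.0, random_state=0, max_iter=1000)
lr.fit(X, y)

# A plotting grid built the same way as in plot_decision_regions.
xx1, xx2 = np.meshgrid(np.arange(0, 1, 0.25), np.arange(0, 1, 0.25))
grid = np.array([xx1.ravel(), xx2.ravel()]).T

n_trained = lr.coef_.shape[1]  # features the model was fitted on
n_grid = grid.shape[1]         # features in the array passed to predict
print(grid.shape)
print(n_trained)

# predict refuses input whose column count differs from the fitted one.
try:
    lr.predict(grid)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

The exact wording of the ValueError differs between scikit-learn versions, but the cause is the same: the second dimension of the input must match the second dimension of coef_.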

Posted 2018/01/30 14:37

mkgrei

Total score: 8562

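Deactivated user

2018/01/31 06:53 (edited)

I was not able to check the size, but np.array([xx1.ravel(), xx2.ravel()]) came out as 2. Looking into it, the samples used to build the model form a two-dimensional array, with len(X_combined_std) = 4697 and len(X_combined_std[0]) = 2246. Is it impossible to show data with X_combined_std.shape[1] = 2246 in a two-dimensional plot?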

mkgrei

2018/01/31 07:03

Ordinarily, no. A 2246-dimensional graph cannot be plotted in 2 dimensions, so the training approach has to change. https://pythondatascience.plavox.info/scikit-learn/scikit-learnで決定木分析 https://fisproject.jp/2016/07/regression-tree-in-python/ A decision tree does not depend on the dimensionality of the input. Alternatively, you can use http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition to compress the data to 2 dimensions and then train on that.
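
The second suggestion could be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic (120 samples, 200 features, two classes), PCA is picked as one of several options in sklearn.decomposition, and the grid step of 0.5 is arbitrary. The point is that the classifier is fitted on the 2 compressed columns, so a 2-column plotting grid is accepted by predict:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dense bag-of-words matrix (assumption).
rng = np.random.RandomState(0)
X = rng.rand(120, 200)
y = np.array([0, 1] * 60)
X[y == 1, :10] += 1.0  # shift a few features so the classes differ

# Standardize, then compress every sample to 2 values.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Fit the classifier on the 2-D representation, not on the original features.
lr = LogisticRegression(C=1000.0, random_state=0, max_iter=1000)
lr.fit(X_2d, y)

# The same kind of grid as in plot_decision_regions now matches the model.
x1_min, x1_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
x2_min, x2_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.5),
                       np.arange(x2_min, x2_max, 0.5))
grid = np.array([xx1.ravel(), xx2.ravel()]).T
Z = lr.predict(grid).reshape(xx1.shape)
print(X_2d.shape, lr.coef_.shape, Z.shape)
```

Z can then be passed to plt.contourf(xx1, xx2, Z) as in the question's code. With a train/test split, the scaler and the PCA would be fitted on the training rows only and then applied to the test rows with transform. Accuracy on 2 components may well be lower than on the full feature set.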

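Deactivated user

2018/01/31 07:55

Thank you. I tried this approach because it gave the best accuracy, so that is a pity. When you say compress to 2 dimensions, do you mean by using a method, or by combining features?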

mkgrei

2018/01/31 08:38

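Taking PCA (principal components) as an example: you first transform X so that each sample is expressed in terms of 2 feature vectors. Taking the coefficients along those vectors gives you 2 dimensions, and you use those as the input to the model. You do not necessarily have to take the top two components. --- If logistic regression gives good results, a decision tree should give good results as well. For BoW, that may be the more natural choice.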
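
To make "express X with 2 feature vectors and take the coefficients" concrete, here is a small NumPy-only sketch. It assumes random data (50 samples, 20 features) and computes the principal axes with an SVD rather than with sklearn's PCA class; the names are illustrative. It also shows picking a pair of components other than the top two:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 20)  # synthetic data (assumption)

# PCA works on centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coefficients of every sample along the top two axes: shape (50, 2).
top_two = Vt[[0, 1]]
coeffs_top = np.dot(X_centered, top_two.T)

# Any other pair of axes can be used the same way, e.g. the 1st and 3rd.
other_pair = Vt[[0, 2]]
coeffs_other = np.dot(X_centered, other_pair.T)

print(coeffs_top.shape, coeffs_other.shape)
```

Either coefficient array has 2 columns per sample, so it can be used as the model input and plotted on two axes.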
