On the recognition accuracy of the k-nearest neighbors algorithm
To examine recognition accuracy on imbalanced data, I relabeled scikit-learn's digits dataset into a binary classification problem ("8" vs. "not 8") and measured each model's accuracy on the test set. The k-nearest neighbors algorithm scored anomalously high, and I would appreciate it if anyone could explain why.
Source code
Creating the training and test data
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
# Relabel the targets as "is 8" / "is not 8"
y = digits.target == 8

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, y, random_state=42, stratify=y)
```
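Before comparing models, it helps to confirm how imbalanced the problem actually is. As a quick sanity check (my addition, not part of the original post):

```python
# stratify=y keeps both splits at the dataset's class ratio:
# "8" makes up roughly 10% of the samples
print('positive rate (train):', y_train.mean())
print('positive rate (test): ', y_test.mean())
```

About 90% of the samples are "not 8", which sets the baseline that the dummy classifier below reproduces.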
Dummy classifier
```python
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
pred_dummy = dummy_clf.predict(X_test)
print('test accuracy:', dummy_clf.score(X_test, y_test))
```
test accuracy: 0.9022222222222223
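This score is exactly the majority-class share: strategy='most_frequent' always predicts "not 8", so its accuracy is just the fraction of negatives in the test split. A one-line check (my addition):

```python
# DummyClassifier(strategy='most_frequent') always answers "not 8",
# so its accuracy equals that class's share of the test set
print('majority-class share:', (~y_test).mean())  # ~0.902
```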
k-nearest neighbors
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [{'n_neighbors': [1, 3, 5, 7, 9]}]

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best validation score: {grid_search.best_score_}")

knn_clf = grid_search.best_estimator_
print('test accuracy:', knn_clf.score(X_test, y_test))
```
Best parameters: {'n_neighbors': 1}
Best validation score: 0.9948032665181886
test accuracy: 0.9933333333333333
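Accuracy alone can be misleading on imbalanced data, so it is worth checking whether KNN actually recognizes the minority class or merely rides the 90% majority. A sketch of that check (my addition, not from the original post):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Rows = true class, columns = predicted class; the bottom-right cell
# counts test "8"s that were correctly recognized
pred_knn = knn_clf.predict(X_test)
print(confusion_matrix(y_test, pred_knn))
print(classification_report(y_test, pred_knn, target_names=['not 8', '8']))
```

If recall for the "8" class is also high, the 0.993 accuracy reflects genuine separation rather than majority-class bias.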
Decision tree
```python
from sklearn.tree import DecisionTreeClassifier

param_grid = [{'max_depth': [1, 3, 5, 7, 9]}]

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best validation score: {grid_search.best_score_}")

tree_clf = grid_search.best_estimator_
print('test accuracy:', tree_clf.score(X_test, y_test))
```
Best parameters: {'max_depth': 5}
Best validation score: 0.9465478841870824
test accuracy: 0.94
Logistic regression
```python
from sklearn.linear_model import LogisticRegression

param_grid = [{'C': [0.001, 0.01, 0.1, 1, 10]}]

grid_search = GridSearchCV(LogisticRegression(random_state=42, solver='lbfgs', max_iter=10000), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best validation score: {grid_search.best_score_}")

logreg_clf = grid_search.best_estimator_
print('test accuracy:', logreg_clf.score(X_test, y_test))
```
Best parameters: {'C': 0.01}
Best validation score: 0.9665924276169265
test accuracy: 0.9644444444444444
Random forest
```python
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=400, random_state=42)
forest_clf.fit(X_train, y_train)

print('test accuracy:', forest_clf.score(X_test, y_test))
```
test accuracy: 0.9688888888888889
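For an imbalance-aware comparison across all of the models above, the F1 score of the minority class is more informative than raw accuracy. A sketch (my addition) reusing the already-fitted estimators:

```python
from sklearn.metrics import f1_score

models = {'dummy': dummy_clf, 'knn': knn_clf, 'tree': tree_clf,
          'logreg': logreg_clf, 'forest': forest_clf}
for name, clf in models.items():
    # F1 of the positive ("is 8") class on the test set
    print(f"{name}: f1 = {f1_score(y_test, clf.predict(X_test)):.3f}")
```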
Addendum
I used t-SNE (manifold learning) to map the training data to a 2-D representation and visualize it. I also measured the accuracy of an SVM with an RBF kernel, which came out close to KNN's.
Transformation and visualization with t-SNE
```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(random_state=42)
X_train_tsne = tsne.fit_transform(X_train)

plt.scatter(X_train_tsne[:, 0], X_train_tsne[:, 1], c=y_train)
plt.show()
```
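One way to read this plot further (my variation, not in the original post) is to color the same embedding by the original ten digit classes instead of the binary label. Since train_test_split is deterministic for a fixed random_state, repeating the split with digits.target included yields labels aligned with X_train:

```python
# Repeat the identical split, also carrying the original 0-9 labels
_, _, _, _, d_train, d_test = train_test_split(
    digits.data, y, digits.target, random_state=42, stratify=y)

plt.scatter(X_train_tsne[:, 0], X_train_tsne[:, 1], c=d_train, cmap='tab10', s=10)
plt.colorbar()
plt.show()
```

If the "8" samples form their own tight cluster, that local separability would be consistent with both the 1-NN result and the RBF-kernel SVM result below.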
SVC
```python
from sklearn.svm import SVC

param_grid = [{'kernel': ['rbf'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              {'kernel': ['linear'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100]}]

grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best validation score: {grid_search.best_score_}")

svc_clf = grid_search.best_estimator_
pred_svc = svc_clf.predict(X_test)
print('test accuracy:', svc_clf.score(X_test, y_test))
```
Best parameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best validation score: 0.9910913140311804
test accuracy: 0.9844444444444445