Concatenating CSV data

Background / what I want to achieve

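I am using VSCode on a Mac and want to concatenate CSV data with numpy ("test2.csv" and the "PassengerId" column of test_data), but it is not working. I would appreciate any advice.

Note: I preprocessed the CSV data with pandas, and the data was converted to numpy when I ran the SVM evaluation, so the data format may be the cause. For that reason I have included part of each dataset in the "Problem / error messages" section below.
If you need anything else to understand the situation, such as the data format, please let me know.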

Problem / error messages

/Library/Frameworks/Python.framework/Versions/3.8/bin/python3 /Users/name/python/実績フォルダ/taitanic_bunseki.py
             Pclass   Age  SibSp  Parch      Fare  Sex_female  Embarked_C  Embarked_Q
PassengerId                                                                          
627               2  57.0      0      0   12.3500           0           0           1
542               3   9.0      4      2   31.2750           1           0           0
809               2  39.0      0      0   13.0000           0           0           0
604               3  44.0      0      0    8.0500           0           0           0
266               2  36.0      0      0   10.5000           0           0           0
...             ...   ...    ...    ...       ...         ...         ...         ...
38                3  21.0      0      0    8.0500           0           0           0
660               1  58.0      0      2  113.2750           0           1           0
535               3  30.0      0      0    8.6625           1           0           0
862               2  21.0      1      0   11.5000           0           0           0
586               1  18.0      0      2   79.6500           1           0           0

[418 rows x 8 columns]
             Survived
PassengerId          
627                 0
542                 0
809                 0
604                 0
266                 0
...               ...
38                  0
660                 0
535                 0
862                 0
586                 1

[418 rows x 1 columns]
             Pclass   Age  SibSp  Parch      Fare  Sex_female  Embarked_C  Embarked_Q
PassengerId                                                                          
892               3  34.5      0      0    7.8292           0           0           1
893               3  47.0      1      0    7.0000           1           0           0
894               2  62.0      0      0    9.6875           0           0           1
895               3  27.0      0      0    8.6625           0           0           0
896               3  22.0      1      1   12.2875           1           0           0
...             ...   ...    ...    ...       ...         ...         ...         ...
1305              3  21.0      0      0    8.0500           0           0           0
1306              1  39.0      0      0  108.9000           1           1           0
1307              3  38.5      0      0    7.2500           0           0           0
1308              3  21.0      0      0    8.0500           0           0           0
1309              3  21.0      1      1   22.3583           0           1           0

[418 rows x 8 columns]
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
テストデータ：             Pclass   Age  SibSp  Parch      Fare  Sex_female  Embarked_C  Embarked_Q
PassengerId                                                                          
892               3  34.5      0      0    7.8292           0           0           1
893               3  47.0      1      0    7.0000           1           0           0
894               2  62.0      0      0    9.6875           0           0           1
895               3  27.0      0      0    8.6625           0           0           0
896               3  22.0      1      1   12.2875           1           0           0
...             ...   ...    ...    ...       ...         ...         ...         ...
1305              3  21.0      0      0    8.0500           0           0           0
1306              1  39.0      0      0  108.9000           1           1           0
1307              3  38.5      0      0    7.2500           0           0           0
1308              3  21.0      0      0    8.0500           0           0           0
1309              3  21.0      1      1   22.3583           0           1           0

[418 rows x 8 columns],予測ラベル：[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
正解率= 0.5909090909090909
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'PassengerId'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/name/python/実績フォルダ/taitanic_bunseki.py", line 32, in <module>
    np.concatenate([test_data1,test_data["PassengerId"]])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'PassengerId'

Relevant source code

Python
from sklearn import svm
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Prepare the training data and labels
train_data=pd.read_csv("train1.csv",index_col=0)
print(train_data)
train_label=pd.read_csv("train_label1.csv",index_col=0)
print(train_label)

# Prepare the test data
test_data = pd.read_csv("test1.csv",index_col=0)
print(test_data)

# Specify the algorithm
clf = svm.SVC(C=1, gamma=10)

# Train
clf.fit(train_data,train_label)

# Test
test_label = clf.predict(test_data)

# Display the test results
print("テストデータ：{0},予測ラベル：{1}".format(test_data,test_label))
print("正解率= {}".format(accuracy_score(train_label, test_label)))

# Concatenate the CSV data
np.savetxt("test2.csv",test_label,fmt="%.0f",header="Survived",comments="")
test_data1 = pd.read_csv("test2.csv")
np.concatenate([test_data1,test_data["PassengerId"]])


2 answers

Best answer

I do not quite see why you need to go out of your way to combine these as numpy arrays.

If you read the test data like this:

Python
# [Prepare the test data]
test_data = pd.read_csv("test1.csv",index_col=0)

and obtain the predictions like this:

Python
# [Test]
test_label = clf.predict(test_data)

then you can simply do the following:

Python
# Store the results in the test data DataFrame
test_data['Survived'] = test_label
# From the DataFrame above, take only the index ('PassengerId') and the results ('Survived') (as a Series)
res = test_data['Survived']
print(res)

Wouldn't that be enough?
If you want the result (res) as an array rather than a Series, it becomes:

Python
res = test_data['Survived'].reset_index().values
print(res)
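
As a side note on the KeyError: 'PassengerId' in the question: with index_col=0, PassengerId becomes the index of the DataFrame rather than a column, so test_data["PassengerId"] has nothing to look up. The following is only a minimal sketch with made-up data; csv_text and the test_label array are assumptions standing in for test1.csv and the output of clf.predict.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for test1.csv (values are made up for illustration).
csv_text = """PassengerId,Pclass,Age
892,3,34.5
893,3,47.0
894,2,62.0
"""

# index_col=0 turns the first column (PassengerId) into the index.
test_data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# PassengerId is the index name, not a column, so a column lookup fails.
has_column = "PassengerId" in test_data.columns
index_name = test_data.index.name

# Hypothetical predictions standing in for clf.predict(test_data).
test_label = np.array([0, 1, 0])

# Assigning the array adds it as a new column, row by row in order.
test_data["Survived"] = test_label

# reset_index() moves PassengerId back into a regular column.
res = test_data["Survived"].reset_index().values
print(has_column, index_name)
print(res)
```

If the IDs themselves are needed, test_data.index (or test_data.reset_index()) gives access to them.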

Addendum

Python
from sklearn import svm
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Prepare the training data and labels
train_data=pd.read_csv("train1.csv",index_col=0)
print(train_data)
train_label=pd.read_csv("train_label1.csv",index_col=0)
print(train_label)

# Prepare the test data
test_data = pd.read_csv("test1.csv",index_col=0)
print(test_data)

# Specify the algorithm
clf = svm.SVC(C=1, gamma=10)

# Train
clf.fit(train_data,train_label)

# Test
test_label = clf.predict(test_data)

# Display the test results
print("テストデータ：{0},予測ラベル：{1}".format(test_data,test_label))
print("正解率= {}".format(accuracy_score(train_label, test_label)))

# Attach the test results to the test data
test_data['Survived'] = test_label

# Option 1: if you just want to write it out to CSV, this is enough
test_data['Survived'].to_csv('out.csv')
# Option 2: if you want an array combining the index and the results, do this
data = test_data['Survived'].reset_index().values
print(data)
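
The log in the question also shows a DataConversionWarning from clf.fit. A likely cause is that train_label is a one-column DataFrame, that is, shape (n_samples, 1), while scikit-learn expects (n_samples,). Below is a rough sketch of the flattening the warning suggests, using made-up labels in place of train_label1.csv.

```python
import pandas as pd

# Hypothetical stand-in for train_label1.csv read with index_col=0.
train_label = pd.DataFrame(
    {"Survived": [0, 0, 1]},
    index=pd.Index([627, 542, 586], name="PassengerId"),
)

# A one-column DataFrame is two-dimensional: (n_samples, 1).
shape_2d = train_label.values.shape

# ravel() flattens it to the 1-D shape (n_samples,) named in the warning.
y = train_label.values.ravel()
shape_1d = y.shape

print(shape_2d, shape_1d)
```

Passing y (or train_label["Survived"]) as the second argument of clf.fit should then avoid the warning.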

Posted 2020/03/30 00:31

Edited 2020/03/30 09:58

magichan

Total score 15898

yukicb

2020/03/30 09:30 (edited)

Thank you for your answer! One correction: I should have said "concatenate" rather than "join". My apologies.

Regarding test_label, the format seems to switch from pandas to numpy at the test step (clf.predict), so saving it straight to CSV raises the error below.

Python
# Test
test_label = clf.predict(test_data)
# Save as CSV
test_label.to_csv("test2.csv")
# Error
AttributeError: 'numpy.ndarray' object has no attribute 'to_csv'

The following did not work either:

Python
# Convert to pandas, then save
test_data1=pd.series(test_label)
test_data1=test_data1.to_csv("test2.csv",header="Survived")
# Error
AttributeError: module 'pandas' has no attribute 'series'

So I saved with numpy instead (test_label contains only numbers with no "Survived" header, so I added header="Survived" here) and tried np.concatenate to concatenate in numpy form, but that did not work.

Python
np.savetxt("test2.csv",test_label,fmt="%.0f",header="Survived",comments="")
test_data1 = pd.read_csv("test2.csv")
# Concatenate the CSV data
np.concatenate([test_data1,test_data["PassengerId"]])

If my understanding is off in the first place, I would appreciate it if you could point that out. Sorry to trouble you when you are busy, and thank you in advance.

magichan

2020/03/30 10:02 (edited)

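Hm? I get the feeling that what I wrote in my answer did not come across. To make it more concrete, I have added code to the answer that modifies the code from your question. Isn't this roughly what you want to do?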

yukicb

2020/03/30 10:27

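Thank you again! When I saved to CSV using Option 1, the following error occurred.

Python
# Attach the test results to the test data
test_data['Survived'] = test_label
# Option 1: if you just want to write it out to CSV, this is enough
test_data['Survived'].to_csv('out.csv')
# Error
AttributeError: module 'pandas' has no attribute 'series'

To restate, what I want to do is the following two things:

(1) Save test_label to CSV (adding 'Survived' as the header)
(2) Concatenate (1) with the test_data["PassengerId"] column

Sorry to keep bothering you, and thank you in advance.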

magichan

2020/03/30 10:51

AttributeError: module 'pandas' has no attribute 'series'

Isn't that error coming from somewhere other than the code I wrote? The fix is pd.Series (correct) instead of pd.series (wrong), but nowhere in the code I wrote is series used explicitly.
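
For reference, here is a small sketch of the pd.Series route, covering both goals from the comment above (a 'Survived' header and the PassengerId values alongside it). The passenger_ids index and test_label array are made-up stand-ins for test_data.index and clf.predict(test_data), and the in-memory buffer is only there so the sketch runs without touching the file system.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data.index and clf.predict(test_data).
passenger_ids = pd.Index([892, 893, 894], name="PassengerId")
test_label = np.array([0, 1, 0])

# Note the capital S: pd.Series, not pd.series.
survived = pd.Series(test_label, index=passenger_ids, name="Survived")

# Write to an in-memory buffer; a file path such as "test2.csv" works as well.
buffer = io.StringIO()
survived.to_csv(buffer, header=True)
csv_out = buffer.getvalue()
print(csv_out)
```

Because the Series carries the PassengerId index and the name 'Survived', both end up in the CSV without any separate concatenation step.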

yukicb

2020/03/30 14:38 (edited)

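My sincere apologies. It worked with the following!

Python
# Attach the test results to the test data
test_data['Survived'] = test_label
# Option 1: if you just want to write it out to CSV, this is enough
test_data['Survived'].to_csv('out.csv')

Thank you very much for your repeated and thorough help despite being busy!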
