Preprocessing a categorical variable with sklearn's LabelEncoder

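I am doing data analysis in Google Colaboratory.

As part of that, I want to convert a categorical variable to numbers, but I get an error and I'm stuck.

Specifically, I am trying to use scikit-learn's LabelEncoder to convert compass directions such as '北' (north), '南' (south), '東' (east), and '西' (west). The column contains missing values.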

Data to convert

python3
>> train['direction']
0         南東
1        NaN
2          南
3          南
4          南
      ... 
96         南
97         西
98         南
99         南
100       南東
Name: direction, Length: 101, dtype: object

Code

python3
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train['direction'])
train['direction'] = le.transform(train['direction'])

Error raised

python3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    104         try:
--> 105             res = _encode_python(values, uniques, encode)
    106         except TypeError:

3 frames
TypeError: '<' not supported between instances of 'str' and 'float'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    105             res = _encode_python(values, uniques, encode)
    106         except TypeError:
--> 107             raise TypeError("argument must be a string or number")
    108         return res
    109     else:

TypeError: argument must be a string or number

I would appreciate any help. Thank you in advance.


1 answer

Best answer

The cause is that the Series object train['direction'] contains NaN.
Before using LabelEncoder, either replace the missing values with a string or remove them.
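
As a rough illustration of why NaN triggers the first error in the traceback: NaN is a float, so an object column that mixes it with strings cannot be sorted, and the traceback suggests the encoder in that sklearn version compares the values while collecting the unique labels. The sketch below reproduces the comparison failure with plain `sorted`. The `values` list is a made-up stand-in for the column, not the asker's data, and newer sklearn releases may treat NaN differently.

```python
import numpy as np

# Hypothetical stand-in for an object column: strings mixed with a float NaN.
values = ["南東", np.nan, "南"]

try:
    sorted(values)  # needs '<' between str and float, which Python does not define
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Once every value is a string (or the NaN entries are dropped), the comparison is well defined and the encoding can proceed.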

python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["東", "西", "北", "南", None])

# Replace NaN with a placeholder string.
s = s.fillna("missing")

# Alternatively, drop the NaN entries.
# s = s.dropna()

label = LabelEncoder().fit_transform(s)
print(label)
15print(label)