I want to load a CSV containing Japanese text from S3 into a pandas DataFrame without garbled characters

###Background and goal
I want to fetch a CSV file stored in AWS S3 and process it, but the file contains Japanese text that comes out garbled, and I would like to fix that.

###Relevant source code

python2
import boto3
import pandas as pd

s3 = boto3.resource('s3')
client = s3.meta.client
response = client.get_object(Bucket=bucket_name, Key=file_key)
df = pd.read_csv(response['Body'])
# Printing df here shows the Japanese text garbled

###What I tried
body = response['Body'].read()
df = pd.read_csv(body)
I changed the code to the above, but it never finished and just kept loading.

###Additional information (language/framework/tool versions, etc.)
I am running Python 2.7 in an IPython notebook.


1 answer

Best answer

I am answering on the assumption that the character encoding is probably the cause.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

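I have no environment to verify this in, so please note that the following is untested. It looks like the character encoding can be specified with the encoding parameter as shown below, so how about specifying the encoding of the data in the CSV file?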

python
df = pd.read_csv(response['Body'], encoding='utf-8')

The list of encodings that can be specified appears to be here:
https://docs.python.org/3/library/codecs.html#standard-encodings

If the CSV file was saved as shift_jis, it would look something like this:

python
df = pd.read_csv(response['Body'], encoding='sjis')
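To make the effect of the encoding argument concrete without touching S3, here is a minimal sketch that parses in-memory Shift_JIS bytes. It assumes Python 3 and a reasonably recent pandas (the question uses Python 2.7), and the sample CSV contents are invented for illustration.

```python
import io

import pandas as pd

# Sample data invented for illustration: a small CSV encoded as Shift_JIS.
csv_text = "名前,都市\n山田,東京\n佐藤,大阪\n"
sjis_bytes = csv_text.encode("shift_jis")

# Wrap the raw bytes in a file-like object and tell pandas how to decode them.
df = pd.read_csv(io.BytesIO(sjis_bytes), encoding="shift_jis")
print(df)
```

If the encoding argument does not match how the bytes were actually written, the result is either garbled text or a UnicodeDecodeError, depending on the codec.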

The approach described here may also be helpful:
http://qiita.com/sokutou-metsu/items/5ba7531117224ee5e8af#%E4%BD%8E%E3%83%AC%E3%83%99%E3%83%ABapi%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E6%93%8D%E4%BD%9C

Posted 2016/09/28 04:46

Edited 2016/09/28 05:36

Deleted user

Total score 0

ShouYama

2016/09/28 05:01

I tried it, but I got an error and it did not work.

Deleted user

2016/09/28 05:02

It would help if you also posted what the error is.

ShouYama

2016/09/28 05:14

---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
<ipython-input-172-faed54eb4763> in <module>()
----> 1 df = pd.read_csv(response['Body'])

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561
--> 562         return _read(filepath_or_buffer, kwds)
    563
    564     parser_f.__name__ = name

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    313
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316
    317     if (nrows is not None) and (chunksize is not None):

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644
--> 645         self._make_engine(self.engine)
    646
    647     def close(self):

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1211         kwds['allow_leading_cols'] = self.index_col is not False
   1212
-> 1213         self._reader = _parser.TextReader(src, **kwds)
   1214
   1215         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5214)()

EmptyDataError: No columns to parse from file

ShouYama

2016/09/28 05:14

That is the kind of error I get.

Deleted user

2016/09/28 05:18

----> 1 df = pd.read_csv(response['Body'])
Did you try specifying encoding?

ShouYama

2016/09/28 05:27

---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
<ipython-input-173-6e006a3f9135> in <module>()
----> 1 df = pd.read_csv(response['Body'],encoding='sjis')

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561
--> 562         return _read(filepath_or_buffer, kwds)
    563
    564     parser_f.__name__ = name

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    313
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316
    317     if (nrows is not None) and (chunksize is not None):

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644
--> 645         self._make_engine(self.engine)
    646
    647     def close(self):

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1211         kwds['allow_leading_cols'] = self.index_col is not False
   1212
-> 1213         self._reader = _parser.TextReader(src, **kwds)
   1214
   1215         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5214)()

EmptyDataError: No columns to parse from file

Sorry, the previous one was a different error. This is the real one.
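The thread never establishes why read_csv reported "No columns to parse from file". One possible explanation, offered here only as an assumption, is that the body had already been consumed by an earlier .read() call, because a file-like object that is read sequentially returns nothing once it has been read to the end. The sketch below uses io.BytesIO as a stand-in for response['Body'] to show that behavior.

```python
import io

# Stand-in for response['Body']: any file-like object that is read sequentially.
stream = io.BytesIO(b"col_a,col_b\n1,2\n")

first = stream.read()   # consumes the whole stream
second = stream.read()  # nothing is left, so this is empty

print(len(first), len(second))
```

If that is what happened, calling get_object again before parsing would give read_csv a fresh, unread body.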

Deleted user

2016/09/28 05:38 (edited)

Links are not clickable in the comment section, so I have also added this to my answer. How about the approach described here? http://qiita.com/sokutou-metsu/items/5ba7531117224ee5e8af#%E4%BD%8E%E3%83%AC%E3%83%99%E3%83%ABapi%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E6%93%8D%E4%BD%9C

ShouYama

2016/09/28 05:51

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-176-3cdf6cbd42ab> in <module>()
      1 response = client.get_object(Bucket=bucket_name, Key=file_key)
      2 body = response['Body'].read()
----> 3 print(body.decode('utf-8'))

/Users/yamamshou/.pyenv/versions/anaconda3-4.0.0/envs/py2/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position 58: invalid start byte

This is the error I got when following that article.

Deleted user

2016/09/28 06:02

print body.decode('sjis')
What happens if you change the encoding to sjis?

ShouYama

2016/09/28 06:11

UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 1546620-1546621: illegal multibyte sequence
That is what I got.
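When a decode fails at a specific position like this, it can help to look at the offending bytes directly. The following is a minimal sketch, assuming Python 3, that uses the start, end, and reason attributes of UnicodeDecodeError. The helper name describe_decode_error and the sample bytes are hypothetical and do not come from the thread.

```python
def describe_decode_error(data, encoding, context=8):
    """Return None if data decodes cleanly, else details about the first bad bytes."""
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        hi = exc.end + context
        return {
            "start": exc.start,                      # index of the first bad byte
            "end": exc.end,                          # index just past the bad bytes
            "bad_bytes": data[exc.start:exc.end],    # the bytes the codec rejected
            "surrounding": data[lo:hi],              # a little context on both sides
            "reason": exc.reason,
        }


# Invented sample: 0x87 0x40 is not assigned in Python's strict shift_jis codec.
sample = b"abc," + b"\x87\x40" + b",xyz"
info = describe_decode_error(sample, "shift_jis")
print(info)
```

Applied to the real body with the position from the error message, the surrounding bytes often hint at which character set the file actually uses.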

Deleted user

2016/09/28 06:15

print body.decode('shift_jisx0213')
How about this one?
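Rather than trying encodings by hand one at a time, the same trial and error can be sketched as a small loop. This assumes Python 3. The helper name first_working_encoding is hypothetical, and cp932 is included as a candidate on my own assumption (it is not mentioned in the thread) that files labelled Shift_JIS are sometimes written with Windows vendor extensions.

```python
def first_working_encoding(data, candidates):
    """Return the first encoding in candidates that decodes data without error, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Invented sample: the circled digit is a vendor extension character
# that cp932 can encode but the strict shift_jis codec rejects.
sample = "番号,値\n①,10\n".encode("cp932")
print(first_working_encoding(sample, ["utf-8", "shift_jis", "cp932", "shift_jisx0213"]))
```

A decode that succeeds is not proof that the encoding is right, since some codecs accept almost any byte sequence, so the decoded text should still be checked by eye.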

ShouYama

2016/09/28 06:31

The error no longer appears, but df = pd.read_csv(t) goes on forever without finishing, and I also found that t contains no Japanese strings.

Deleted user

2016/09/28 06:50

Where did t come from?

ShouYama

2016/09/28 15:39

t is body.decode('shift_jisx0213')!

Deleted user

2016/09/29 03:09

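The value returned by body.decode(...) is a string. According to the documentation, passing a string directly to pd.read_csv() makes it be interpreted as a file path, and since no such file exists, I suspect that is why the read never completes. I think df = body.decode('shift_jisx0213') already gives you the processing you are after.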
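Because read_csv expects a path or a file-like object as its first argument, already decoded text needs to be wrapped before it can be parsed. Here is a minimal sketch using io.StringIO, assuming Python 3, with sample data invented for illustration.

```python
import io

import pandas as pd

# Invented sample bytes standing in for the downloaded body.
raw = "商品,価格\nりんご,120\nみかん,80\n".encode("shift_jisx0213")

text = raw.decode("shift_jisx0213")  # a str holding the CSV contents
df = pd.read_csv(io.StringIO(text))  # StringIO makes the text file-like
print(df)
```

No encoding argument is needed in this form, since the text has already been decoded before pandas sees it.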

Deleted user

2016/09/29 03:19

To get it as a DataFrame, it should have been this:
df = pd.read_csv(response['Body'], encoding='shift_jisx0213')
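Putting the pieces together, one way to structure this is a small helper that reads the body exactly once and hands the bytes to pandas. This is only a sketch under stated assumptions: Python 3, a recent pandas, and a client object that behaves like a boto3 S3 client. The function name read_s3_csv is hypothetical and is not part of boto3 or pandas.

```python
import io

import pandas as pd


def read_s3_csv(client, bucket, key, encoding):
    """Fetch an object with client.get_object and parse it as CSV.

    `client` is assumed to behave like a boto3 S3 client; this helper is
    hypothetical and not part of boto3 or pandas.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    raw = response["Body"].read()  # read the body exactly once
    return pd.read_csv(io.BytesIO(raw), encoding=encoding)


# Usage with a real client (not run here):
#   import boto3
#   df = read_s3_csv(boto3.client("s3"), bucket_name, file_key, "shift_jisx0213")
```

Keeping the raw bytes in a variable also makes it possible to retry with a different encoding without downloading the object again.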
