Python: Calculating a moving average (processing each ID group separately)

Background / What I want to achieve

I am processing and analyzing structured data (CSV) in Python.
(The data runs to several hundred thousand rows or more, which is hard to handle in Excel, so I am using Python.)

I mainly use pandas and numpy for the processing and analysis.

This time, as a data-processing step, I want to take a moving average to smooth out noise.

The problem

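As shown below, I want to take a moving average of the Score column in data that also has an ID column.
I found out how to compute a moving average uniformly over the whole Score column,
but I do not know how to handle the points where the ID column changes,
that is, where the moving average should start over for a new group.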

Data used

The data is saved as a CSV file.
The ID column holds str values and the Score column holds int values.

ID	Score	moving_average
A	1
A	1
A	2
A	3
A	4
A	5
A	6
A	5
A	1
A	1
B	1
B	2
B	5
B	1
C	1
C	1
D	1
D	8
D	7
…	…	…

Expected data after processing

ID	Score	moving_average
A	1
A	1
A	2	1.33
A	3	2
A	4	3
A	5	4
A	6	5
A	5	5.33
A	1	4
A	1	2.33
B	1
B	2
B	5	2.67
B	1	2.67
C	1
C	1
D	1
D	8
D	7	5.33
…	…	…

Relevant source code / What I tried

For now, I can compute a moving average that ignores the ID, as follows.

Python
import numpy as np
import pandas as pd

# Load the CSV; the first row is the header (ID, Score).
df = pd.read_csv("20190404_testdata_movingaverage.csv", header=0)

# 3-row moving average over the whole Score column, ignoring ID.
df["moving_average"] = df["Score"].rolling(window=3).mean()

df.to_csv("20190404_testdata_movingaverage_Processed.csv")

ID	Score	moving_average
A	1
A	1
A	2	1.33
A	3	2
A	4	3
A	5	4
A	6	5
A	5	5.33
A	1	4
A	1	2.33
B	1	1
B	2	1.33
B	5	2.67
B	1	2.67
C	1	2.33
C	1	1
D	1	1
D	8	3.33
D	7	5.33
…	…	…

From here, I want the calculation to be handled separately whenever the ID changes.

Additional information (framework/tool versions, etc.)

If you have comments such as
"Wouldn't a different smoothing method be better?" or "We need more information to begin with",
please let me know.

_Victorique__

2019/04/04 07:40

> The moving-average width appears to differ from ID to ID. How is that supposed to work?

shu_magi

2019/04/04 08:00

_Victorique__, the number of records differs from ID to ID, so it varies.

shu_magi

2019/04/04 08:51

_Victorique__, I had slightly misunderstood. I intend to use the same smoothing width for every ID.


3 answers

Best answer

Basically, as bamboo-nova wrote, using groupby().rolling() is a good approach.
However, if you write

Python
df.groupby('ID')['Score'].rolling(window=3).mean()

the result has a MultiIndex (the ID plus the original row index), so the assignment

Python
df['moving_average'] = df.groupby('ID')['Score'].rolling(window=3).mean()

raises an error.
To avoid this, I think it works well to pass group_keys=False to groupby() and write

Python
df['moving_average'] = df.groupby('ID',group_keys=False).rolling(window=3).mean()

like this.
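In case group_keys=False does not flatten the index in your pandas version (its effect on rolling() may differ between releases), a per-ID moving average can also be sketched with transform(), which returns a Series aligned to the original row index. The DataFrame below is rebuilt inline from the question's sample data instead of being read from the CSV; that is an assumption made only so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question (the real data would come from read_csv).
df = pd.DataFrame({
    "ID": list("AAAAAAAAAA") + list("BBBB") + list("CC") + list("DDD"),
    "Score": [1, 1, 2, 3, 4, 5, 6, 5, 1, 1, 1, 2, 5, 1, 1, 1, 1, 8, 7],
})

# transform() applies the rolling mean inside each ID group and returns
# a Series with the same index as df, so it can be assigned directly.
df["moving_average"] = df.groupby("ID")["Score"].transform(
    lambda s: s.rolling(window=3).mean()
)

print(df)
```

With this form the first two rows of every ID stay NaN, which matches the expected output in the question.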

One additional note:

if the original DataFrame has columns other than ID and Score, you may need to write

Python
df['moving_average'] = df[['ID','Score']].groupby('ID',group_keys=False).rolling(window=3).mean()

or

Python
df['moving_average'] = df.groupby('ID',group_keys=False).rolling(window=3).mean()['Score']

instead.
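For a frame with extra columns, another option is to select Score before rolling and then drop the ID level from the resulting MultiIndex, so that the Series lines up with the original rows again. This is a minimal sketch; the Memo column and the sample values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B"],
    "Score": [1, 1, 2, 3, 1, 2, 5],
    "Memo": ["x"] * 7,  # hypothetical extra column that must not be averaged
})

# Rolling mean per ID on the Score column only; the index is (ID, original row index).
rolled = df.groupby("ID")["Score"].rolling(window=3).mean()

# Drop the ID level so the index matches df.index, then assign.
df["moving_average"] = rolled.reset_index(level=0, drop=True)

print(df)
```

Because the assignment aligns on the index, this should also work when the rows are not sorted by ID.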

Posted 2019/04/04 11:44

Edited 2019/04/04 11:45

magichan

Total score 15898

shu_magi

2019/04/05 01:27

magichan, thank you! With the sample data, the method you described worked without any problems! As you said, the real data has other columns, so I will try that form.

shu_magi

2019/04/05 04:35

magichan, the first of your supplementary forms, df['moving_average'] = df[['ID','Score']].groupby('ID',group_keys=False).rolling(window=3).mean(), also worked on the real data! Thank you very much!


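I think it depends on what kind of data you have and how you want to use it.

・Tip 1: Handling missing values
[Current]
df["Score"].rolling(window=3).mean()
⇒ With the default min_periods (equal to the window size), the result is NaN if the window contains even one NaN.

[Fix]
df["Score"].rolling(window=3, min_periods=1).mean()

⇒ With min_periods=1, the mean is computed whenever the window contains at least one valid value, so I think it is a good idea to set it.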
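The difference between the two calls can be seen on a small example. This is only a sketch; the five values in the Series are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Default: every window needs 3 valid values, otherwise the result is NaN.
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever valid values the window contains.
lenient = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"Score": s, "strict": strict, "lenient": lenient}))
```

Note that min_periods=1 also fills the first rows, where fewer than three values are available, so the output differs from the expected table in the question, which leaves those rows blank.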

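・Tip 2: Processing per ID
It depends on what kind of device, animal, or sensor this is, and on what the IDs identify in the first place, but if the data come from different individuals, the standard practice is to always process them separately.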

df_id_a = df.groupby('ID').get_group('A')
df_id_b = df.groupby('ID').get_group('B')
df_id_c = df.groupby('ID').get_group('C')
df_id_d = df.groupby('ID').get_group('D')

I think splitting the data into separate frames first, as above, is the easiest approach to follow.
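When there are many IDs, the same split can be done in a loop rather than by writing one get_group() call per ID. A minimal sketch under that assumption; the sample values and the names frames and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B"],
    "Score": [1, 1, 2, 3, 1, 2, 5],
})

# Iterating over a groupby yields (ID value, sub-frame) pairs,
# so no ID has to be spelled out by hand.
frames = {}
for id_value, group in df.groupby("ID"):
    group = group.copy()  # work on a copy, not a view of df
    group["moving_average"] = group["Score"].rolling(window=3).mean()
    frames[id_value] = group

# Put the per-ID frames back together in the original row order.
result = pd.concat(frames.values()).sort_index()
print(result)
```

Keeping the per-ID frames in a dict makes it easy to inspect one ID at a time, although for several thousand IDs a single groupby-based expression is likely to be faster.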

Posted 2019/04/04 07:51

mi2

Total score 63

shu_magi

2019/04/04 08:55

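mi23, thank you for your answer. The NaN values in the moving-average result are not a concern this time. As for processing per ID, you are right, but there are several thousand target IDs, so rather than specifying each DataFrame by hand I would like to process them all together or have it done automatically.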


groupby(~).rolling(~).mean(~)
should do it, I think. For details, see for example the following.
https://stackoverflow.com/questions/13996302/python-rolling-functions-for-groupby-object
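As a rough illustration of what that chain returns, the sketch below runs it on a few hypothetical rows. The result is indexed by both the ID and the original row position, which is worth keeping in mind before assigning it back to the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B"],
    "Score": [1, 1, 2, 3, 1, 2, 5],
})

# Rolling mean computed separately inside each ID group.
rolled = df.groupby("ID")["Score"].rolling(window=3).mean()

print(rolled)
print(rolled.index.nlevels)  # number of index levels in the result
```

Since the result's index has two levels while df.index has one, a direct df["col"] = rolled assignment may fail unless the ID level is dropped first.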

Posted 2019/04/04 07:47

bamboo-nova

Total score 1408

shu_magi

2019/04/04 08:55 (edited)

bamboo-nova, thank you. I will look into it! If anything is unclear, I will comment again.

shu_magi

2019/04/04 09:08

I ran the following, but it raised an error.

import numpy as np
import pandas as pd
df = pd.read_csv("20190404_testdata_movingaverage.csv")
df["moving_average_ver3"] = df.groupby(['ID'])["Score"].rolling(window=3).mean()
df.to_csv("20190404_testdata_movingaverage_Processed.csv")

>>>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in reindexer(value)
   3352 try:
-> 3353 value = value.reindex(self.index)._values
   3354 except Exception as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in reindex(self, index, **kwargs)
   3324 def reindex(self, index=None, **kwargs):
-> 3325 return super(Series, self).reindex(index=index, **kwargs)
   3326

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in reindex(self, *args, **kwargs)
   3688 return self._reindex_axes(axes, level, limit, tolerance, method,
-> 3689 fill_value, copy).__finalize__(self)
   3690

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   3701 new_index, indexer = ax.reindex(labels, level=level, limit=limit,
-> 3702 tolerance=tolerance, method=method)
   3703

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in reindex(self, target, method, level, limit, tolerance)
   2076 # hopefully?
-> 2077 target = MultiIndex.from_tuples(target)
   2078

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in from_tuples(cls, tuples, sortorder, names)
   1323
-> 1324 arrays = list(lib.tuples_to_object_array(tuples).T)
   1325 elif isinstance(tuples, list):

pandas/_libs/src\inference.pyx in pandas._libs.lib.tuples_to_object_array()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
<ipython-input-10-c116fa2142f9> in <module>()
      6 df["moving_average"] = df["Score"].rolling(window=3).mean()
      7 df["moving_average_ver2"] =df["Score"].rolling(window=3, min_periods=1).mean()
----> 8 df["moving_average_ver3"] = df.groupby(['ID'])["Score"].rolling(window=3).mean()
      9
     10 df.to_csv("20190404_testdata_movingaverage_Processed.csv")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3117 else:
   3118 # set column
-> 3119 self._set_item(key, value)
   3120
   3121 def _setitem_slice(self, key, value):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3192
   3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
   3195 NDFrame._set_item(self, key, value)
   3196

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3364
   3365 if isinstance(value, Series):
-> 3366 value = reindexer(value)
   3367
   3368 elif isinstance(value, DataFrame):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in reindexer(value)
   3359
   3360 # other
-> 3361 raise TypeError('incompatible index of inserted column '
   3362 'with frame index')
   3363 return value

TypeError: incompatible index of inserted column with frame index
>>>

I am still looking into it, but wanted to share this for now.