Python: I want to build the data for a cohort analysis

What I want to achieve

I want to reshape my data so that I can run a cohort analysis.
→ I am currently stuck on this. Any help would be greatly appreciated.

Data used

python
#   FirstDate   UserId  TotalCharges  LastDate    LifeTime
# 0 2009-01-11  47      50.67         2010-01-12  12
# 1 2009-01-20  48      26.60         2009-08-20  7
# 2 2009-02-03  49      38.71         2009-07-03  5
# 3 2009-04-06  50      53.38         2011-02-09  22
# 4 2009-05-06  51      14.28         2019-09-25  124

What I tried

python
import pandas as pd
import numpy as np
import datetime as dt

# Compute the number of months the user stayed (lifetime in months)
df['LifeTime'] = (df['LastPeriod'].dt.year - df['FirstPeriod'].dt.year)*12 + df['LastPeriod'].dt.month - df['FirstPeriod'].dt.month
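
As a self-contained check of the month-difference formula, here is a minimal sketch run against the first three sample rows. It assumes the date columns are the `FirstDate` and `LastDate` shown in the sample data, not the `FirstPeriod` / `LastPeriod` names used in the snippet.

```python
import pandas as pd

# First three rows of the sample data; column names are assumed to be
# FirstDate / LastDate as in the posted table.
df = pd.DataFrame({
    'FirstDate': pd.to_datetime(['2009-01-11', '2009-01-20', '2009-02-03']),
    'LastDate': pd.to_datetime(['2010-01-12', '2009-08-20', '2009-07-03']),
})

# Whole calendar months between the two dates (day of month is ignored).
df['LifeTime'] = ((df['LastDate'].dt.year - df['FirstDate'].dt.year) * 12
                  + (df['LastDate'].dt.month - df['FirstDate'].dt.month))

print(df['LifeTime'].tolist())  # [12, 7, 5]
```

The result matches the LifeTime values in the posted table for these rows.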

Desired data format

python
#    FirstDate   UserId  TotalCharges  LastDate    LifeTime  OrderDate
# 0  2009-01-11  47      50.67         2010-01-12  12        2009-01-11
# 1  2009-01-11  47      50.67         2010-01-12  12        2009-02-11
# 2  2009-01-11  47      50.67         2010-01-12  12        2009-03-11
# 3  2009-01-11  47      50.67         2010-01-12  12        2009-04-11
# 4  2009-01-11  47      50.67         2010-01-12  12        2009-05-11
# 5  2009-01-11  47      50.67         2010-01-12  12        2009-06-11
# 6  2009-01-11  47      50.67         2010-01-12  12        2009-07-11
# 7  2009-01-11  47      50.67         2010-01-12  12        2009-08-11
# 8  2009-01-11  47      50.67         2010-01-12  12        2009-09-11
# 9  2009-01-11  47      50.67         2010-01-12  12        2009-10-11
# 10 2009-01-11  47      50.67         2010-01-12  12        2009-11-11
# 11 2009-01-11  47      50.67         2010-01-12  12        2009-12-11
# 12 2009-01-11  47      50.67         2010-01-12  12        2010-01-11

→ For the cohort analysis, I want to take the records that are unique per UserId and expand each one into multiple records by adding an OrderDate column, using LifeTime as the number of months.

What I want to do next

python
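import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Label each user with the month of their first order (the cohort).
df.set_index('UserId', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)

# Assumption: OrderPeriod is never defined in this post; it is taken here
# to be the year-month of OrderDate.
df['OrderPeriod'] = df['OrderDate'].dt.strftime('%Y-%m')

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# Count distinct users and sum charges per cohort and order month.
cohorts = grouped.agg({'UserId': 'nunique',
                       'TotalCharges': 'sum'})
cohorts.rename(columns={'UserId': 'TotalUsers'}, inplace=True)

def cohort_period(df):
    # Assumption: the function body was missing from the post. CohortPeriod
    # is used below, so number each cohort's months 1, 2, 3, ...
    df = df.copy()
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

# group_keys=False keeps the original (CohortGroup, OrderPeriod) index.
cohorts = cohorts.groupby(level=0, group_keys=False).apply(cohort_period)
cohorts

# reindex the DataFrame
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

# create a Series holding the total size of each CohortGroup
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohort_group_size.head()

cohorts['TotalUsers'].unstack(0).head()

# Share of each cohort still active in each period.
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)
user_retention.head(10)

sns.set(style='white')

plt.figure(figsize=(12, 8))
plt.title('Cohorts: User Retention')
sns.heatmap(user_retention.T, cmap='RdBu_r', mask=user_retention.T.isnull(), annot=True, fmt='.0%')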
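
To make the target of this pipeline concrete, here is a minimal, self-contained sketch of the retention table on a tiny made-up dataset. The three users and their dates are hypothetical, and CohortPeriod is computed directly from month differences instead of the `cohort_period` helper, so treat it as an illustration of the idea and not as the exact code above.

```python
import pandas as pd

# Hypothetical expanded data: one row per user per active month.
orders = pd.DataFrame({
    'UserId': [1, 1, 1, 2, 2, 3],
    'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-11', '2009-03-11',
                                 '2009-01-20', '2009-02-20',
                                 '2009-02-03']),
})

# Each user's first order date defines the cohort (year-month).
first = orders.groupby('UserId')['OrderDate'].transform('min')
orders['CohortGroup'] = first.dt.strftime('%Y-%m')

# 1-based number of months since the user's first order.
orders['CohortPeriod'] = ((orders['OrderDate'].dt.year - first.dt.year) * 12
                          + (orders['OrderDate'].dt.month - first.dt.month) + 1)

# Distinct users per cohort and period; rows = cohort, columns = period.
counts = (orders.groupby(['CohortGroup', 'CohortPeriod'])['UserId']
          .nunique()
          .unstack('CohortPeriod'))

# Divide each row by the cohort size (the user count in period 1).
retention = counts.divide(counts[1], axis=0)
print(retention)
```

In this toy data the 2009-01 cohort has two users, one of whom is still active in period 3, so its retention there is 0.5. The 2009-02 cohort has no rows after period 1, so those cells are NaN.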

meg_

2019/09/25 10:28

'LastPeriod' and 'FirstPeriod' do not appear in the data you posted. Is there some other dataset? Please provide data that can be used to reproduce and verify the problem.


1 answer

Best answer

It is a bit brute-force, but you can implement this by using groupby().apply() to copy each row as many times as needed, like this.

Python
import pandas as pd
import io

data = """
FirstDate,UserId,TotalCharges,LastDate,LifeTime
2009-01-11,47,50.67,2010-01-12, 12
2009-01-20,48,26.60,2009-08-20,7
2009-02-03,49, 38.71,2009-07-03,5
2009-04-06,50,53.38,2011-02-09,22
2009-05-06,51,14.28,2019-09-25,124
"""

df = pd.read_csv(io.StringIO(data), parse_dates=['FirstDate','LastDate'])

def f(d):
    # d is a one-row DataFrame (one group per original row).
    start_date = d.iloc[0, d.columns.get_loc('FirstDate')]
    life_time = d.iloc[0, d.columns.get_loc('LifeTime')]
    # Copy the row LifeTime times, then attach one date per copy.
    result = pd.concat([d] * life_time)
    # '1M' is month-end frequency (newer pandas versions spell it 'ME'),
    # so every OrderDate falls on the last day of a month.
    result['OrderDate'] = pd.date_range(start_date, freq='1M', periods=life_time)
    return result

res = df.groupby(df.index).apply(f).reset_index(drop=True)
#     FirstDate  UserId  TotalCharges   LastDate  LifeTime  OrderDate
#0   2009-01-11      47         50.67 2010-01-12        12 2009-01-31
#1   2009-01-11      47         50.67 2010-01-12        12 2009-02-28
#2   2009-01-11      47         50.67 2010-01-12        12 2009-03-31
#3   2009-01-11      47         50.67 2010-01-12        12 2009-04-30
#4   2009-01-11      47         50.67 2010-01-12        12 2009-05-31
#5   2009-01-11      47         50.67 2010-01-12        12 2009-06-30
#6   2009-01-11      47         50.67 2010-01-12        12 2009-07-31
#7   2009-01-11      47         50.67 2010-01-12        12 2009-08-31
#8   2009-01-11      47         50.67 2010-01-12        12 2009-09-30
#9   2009-01-11      47         50.67 2010-01-12        12 2009-10-31
#10  2009-01-11      47         50.67 2010-01-12        12 2009-11-30
#11  2009-01-11      47         50.67 2010-01-12        12 2009-12-31
#12  2009-01-20      48         26.60 2009-08-20         7 2009-01-31
#13  2009-01-20      48         26.60 2009-08-20         7 2009-02-28
#14  2009-01-20      48         26.60 2009-08-20         7 2009-03-31
#..         ...     ...           ...        ...       ...        ...
#155 2009-05-06      51         14.28 2019-09-25       124 2018-06-30
#156 2009-05-06      51         14.28 2019-09-25       124 2018-07-31
#157 2009-05-06      51         14.28 2019-09-25       124 2018-08-31
#158 2009-05-06      51         14.28 2019-09-25       124 2018-09-30
#159 2009-05-06      51         14.28 2019-09-25       124 2018-10-31
#160 2009-05-06      51         14.28 2019-09-25       124 2018-11-30
#161 2009-05-06      51         14.28 2019-09-25       124 2018-12-31
#162 2009-05-06      51         14.28 2019-09-25       124 2019-01-31
#163 2009-05-06      51         14.28 2019-09-25       124 2019-02-28
#164 2009-05-06      51         14.28 2019-09-25       124 2019-03-31
#165 2009-05-06      51         14.28 2019-09-25       124 2019-04-30
#166 2009-05-06      51         14.28 2019-09-25       124 2019-05-31
#167 2009-05-06      51         14.28 2019-09-25       124 2019-06-30
#168 2009-05-06      51         14.28 2019-09-25       124 2019-07-31
#169 2009-05-06      51         14.28 2019-09-25       124 2019-08-31
#
#[170 rows x 6 columns]
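
Note that a month-end frequency puts every OrderDate on the last day of a month, whereas the question asks for the same day of the month as FirstDate (2009-01-11, 2009-02-11, ...). Below is a minimal sketch of one alternative, assuming it is acceptable to step FirstDate forward with `pd.DateOffset(months=k)`. The `MonthOffset` helper column is a name introduced here for illustration, and only two of the sample rows are used.

```python
import io
import pandas as pd

data = """FirstDate,UserId,TotalCharges,LastDate,LifeTime
2009-01-11,47,50.67,2010-01-12,12
2009-01-20,48,26.60,2009-08-20,7
"""
df = pd.read_csv(io.StringIO(data), parse_dates=['FirstDate', 'LastDate'])

# Repeat each row LifeTime times; the repeated rows keep their original
# index label, which identifies the source row.
res = df.loc[df.index.repeat(df['LifeTime'])].copy()

# 0, 1, 2, ... within each source row (hypothetical helper column).
res['MonthOffset'] = res.groupby(level=0).cumcount()

# Shift FirstDate by k whole months; the day of month is preserved
# (and clipped to the month's last day when it does not exist).
res['OrderDate'] = [d + pd.DateOffset(months=int(k))
                    for d, k in zip(res['FirstDate'], res['MonthOffset'])]

res = res.drop(columns='MonthOffset').reset_index(drop=True)
print(res[['UserId', 'LifeTime', 'OrderDate']])
```

The desired output in the question lists 13 rows for LifeTime 12 (through 2010-01-11). If that final month is wanted, repeating each row LifeTime + 1 times should produce it.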