Combine several rows of time series into a single row with Pandas

I am using a recurrent neural network to consume time-series events (click stream). My data needs to be formatted so that each row contains all the events for an id. The data is one-hot encoded, and I have already grouped it by id. I also limit the total number of events per id (e.g. 2), so the final width will always be known (#one-hot cols x #events). I need to maintain the order of the events, because they are ordered by time.

Current data state:

     id   page.A   page.B   page.C
0   001        0        1        0
1   001        1        0        0
2   002        0        0        1
3   002        1        0        0

Required data state:

     id   page.A1   page.B1   page.C1   page.A2   page.B2   page.C2
0   001        0         1         0         1         0         0
1   002        0         0         1         1         0         0

This looks like a pivot problem to me, but my resulting dataframes are not in the format I need. Any suggestions on how I should approach this?


The idea here is to call reset_index within each 'id' group, which gives every row a counter (0, 1, ...) for its position within that 'id'. Then unstack moves that counter into the columns, and sort_index on the counter level puts the columns in event order.

Finally, flatten the resulting MultiIndex columns into single names.

# Number the rows within each id (0, 1, ...), then move that counter into the columns.
df1 = df.set_index('id').groupby(level=0) \
    .apply(lambda g: g.reset_index(drop=True)) \
    .unstack().sort_index(axis=1, level=1)  # Thx @jezrael for sort reminder

df1.columns = ['{}{}'.format(x[0], int(x[1]) + 1) for x in df1.columns]

df1
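
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea on the sample data from the question. It swaps the groupby/apply step for groupby(...).cumcount(), which is a different way to get the per-id row counter and not what the snippet above does. The names step and wide are only illustrative.

```python
import pandas as pd

# Sample data from the question; ids are kept as strings to preserve the leading zeros.
df = pd.DataFrame({
    "id": ["001", "001", "002", "002"],
    "page.A": [0, 1, 0, 1],
    "page.B": [1, 0, 0, 0],
    "page.C": [0, 0, 1, 0],
})

# Position of each event within its id (0, 1, ...), following the existing row order.
step = df.groupby("id").cumcount()

wide = (
    df.set_index(["id", step])      # index becomes (id, position)
      .unstack()                    # move the position level into the columns
      .sort_index(axis=1, level=1)  # order columns by position, then by page name
)

# Flatten the (page, position) column pairs into names like "page.A1".
wide.columns = [f"{name}{pos + 1}" for name, pos in wide.columns]
wide = wide.reset_index()

print(wide)
```

If some ids have fewer events than the limit, unstack leaves NaN in the missing slots; passing fill_value=0 to unstack is one way to pad them, assuming an all-zero row is an acceptable "no event" encoding for your model.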