Propagate time series events with Pandas merge_asof

Posted on Sat 13 April 2019 • 5 min read

Imagine the following situation: you have two tables, one with event logs, and the other with status changes. To make this concrete, let’s take our events to be purchases by our users, and the status changes to be when the users attained a Gold Membership (wow so shiny). You want to use their membership status at the time of each purchase as a feature for some model, but you only have records of when the status changed, so you can’t just naively join the tables.

After inheriting some code that was performing this operation manually using groupbys and fancy indexing, I decided to check if Pandas had a built in function for it, and I was pleasantly surprised: pandas.merge_asof. Let’s check out how it works.

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# https://johnpaton.net/posts/custom-color-schemes-in-matplotlib/
plt.style.use('~/johnpaton.mplstyle')

We’ll begin by generating some fake sales data. We have three users, Alice, Bob, and Charlie, who made purchases over the last year. Sorry about the verbosity, there doesn’t seem to be any easier way to generate random dates in a range.

n_rows = 15
epoch = datetime.utcfromtimestamp(0)
end = (datetime.now() - epoch).days
start = end - 365

users = ['alice','bob','charlie']
df_sales = pd.DataFrame(data={
    'timestamp': pd.to_datetime(np.random.randint(start, end, n_rows), unit='D'),
    'user': np.random.choice(users, n_rows),
    'amount': np.random.randint(0,100, n_rows)
}).sort_values('timestamp').reset_index(drop=True)
df_sales
timestamp user amount
0 2018-04-14 charlie 74
1 2018-05-02 alice 61
2 2018-05-17 charlie 85
3 2018-06-25 bob 71
4 2018-06-30 alice 50
5 2018-07-04 charlie 40
6 2018-09-22 bob 64
7 2018-11-02 alice 7
8 2018-11-18 bob 57
9 2018-11-21 alice 37
10 2018-11-27 charlie 42
11 2019-01-19 alice 29
12 2019-01-22 alice 46
13 2019-03-17 bob 11
14 2019-04-02 alice 77

Next up, we need the records of when each of them attained their coveted Gold Membership:

df_flags = pd.DataFrame(data={
    'user': users,
    'timestamp': pd.to_datetime(np.random.randint(start, end, len(users)), unit='D'),
    'status': 'gold_member'
}).sort_values('timestamp').reset_index(drop=True)
df_flags
user timestamp status
0 charlie 2018-04-18 gold_member
1 bob 2018-06-09 gold_member
2 alice 2018-09-24 gold_member

Now it’s time to do our merge. We call pd.merge_asof, specifying the following arguments:

  • The left and right DataFrames
  • The column to merge on (this is generally a time column or some other ordering field)
  • The column(s) to group by. The by keyword is very nice, as we can use it to make sure that the status changes only get propagated across the correct users’ purchases. Otherwise we would have to ensure this manually.
df_merged = pd.merge_asof(
    df_sales,
    df_flags,
    on='timestamp',
    by='user',
)
df_merged
timestamp user amount status
0 2018-04-14 charlie 74 NaN
1 2018-05-02 alice 61 NaN
2 2018-05-17 charlie 85 gold_member
3 2018-06-25 bob 71 gold_member
4 2018-06-30 alice 50 NaN
5 2018-07-04 charlie 40 gold_member
6 2018-09-22 bob 64 gold_member
7 2018-11-02 alice 7 gold_member
8 2018-11-18 bob 57 gold_member
9 2018-11-21 alice 37 gold_member
10 2018-11-27 charlie 42 gold_member
11 2019-01-19 alice 29 gold_member
12 2019-01-22 alice 46 gold_member
13 2019-03-17 bob 11 gold_member
14 2019-04-02 alice 77 gold_member

The result is kind of like a left join, in that all “matching” rows are filled and unmatched rows are NaN. However, unlike in a left join (which looks for an exact match), in this case the join applied to all rows on the left that have a timestamp that comes after the paired rows on the right. To ensure that this before/after comparison is possible, the DataFrames must be sorted by the on column.

To create our Gold Membership flag, all we do is just replace the name of the event with a 1 and fill the NaNs (which are rows that came before the status change) with 0s.

df_merged['gold_member'] = df_merged['status']\
    .replace('gold_member', 1)\
    .fillna(0)\
    .astype(int)
df_merged
timestamp user amount status gold_member
0 2018-04-14 charlie 74 NaN 0
1 2018-05-02 alice 61 NaN 0
2 2018-05-17 charlie 85 gold_member 1
3 2018-06-25 bob 71 gold_member 1
4 2018-06-30 alice 50 NaN 0
5 2018-07-04 charlie 40 gold_member 1
6 2018-09-22 bob 64 gold_member 1
7 2018-11-02 alice 7 gold_member 1
8 2018-11-18 bob 57 gold_member 1
9 2018-11-21 alice 37 gold_member 1
10 2018-11-27 charlie 42 gold_member 1
11 2019-01-19 alice 29 gold_member 1
12 2019-01-22 alice 46 gold_member 1
13 2019-03-17 bob 11 gold_member 1
14 2019-04-02 alice 77 gold_member 1

To make sure there’s no funny business going on, we can also visualize exactly what happened:

fig, ax = plt.subplots(facecolor='white', figsize=(12,5))
for user in users:
    df_tmp = df_merged[df_merged['user'] == user]
    plt.scatter(df_tmp['timestamp'], df_tmp['gold_member'], label=user)

for i,row in df_flags.sort_values('user').reset_index().iterrows():
    plt.axvline(row['timestamp'], c=f'C{i}')
    plt.annotate(
        f" {row['user']} gets gold", (row['timestamp'], np.random.uniform(0.25,0.75))
    )
plt.yticks([0,1], ['regular', 'gold'])
plt.ylabel('Membership status')
plt.xlabel('Sale date')
plt.title('Sales by membership status')
plt.legend();

png

So now we have a nice feature that can be used in a training set, without any data leakage (no events from the future are visible before they happened in the training set).

This also works for multiple status changes. For example, say Bob got demoted on February 1st back to being a normal user.

df_flags = pd.concat([
    df_flags,
    pd.DataFrame(data={
        'user': ['bob'],
        'status': 'normal',
        'timestamp': pd.to_datetime('2019-02-01')
    })
])
df_flags
status timestamp user
0 gold_member 2018-04-18 charlie
1 gold_member 2018-06-09 bob
2 gold_member 2018-09-24 alice
0 normal 2019-02-01 bob

If we perform the merge_asof again, we see that Bob’s status changes twice, just how we would expect:

pd.merge_asof(
    df_sales,
    df_flags,
    on='timestamp',
    by='user',
)
timestamp user amount status
0 2018-04-14 charlie 74 NaN
1 2018-05-02 alice 61 NaN
2 2018-05-17 charlie 85 gold_member
3 2018-06-25 bob 71 gold_member
4 2018-06-30 alice 50 NaN
5 2018-07-04 charlie 40 gold_member
6 2018-09-22 bob 64 gold_member
7 2018-11-02 alice 7 gold_member
8 2018-11-18 bob 57 gold_member
9 2018-11-21 alice 37 gold_member
10 2018-11-27 charlie 42 gold_member
11 2019-01-19 alice 29 gold_member
12 2019-01-22 alice 46 gold_member
13 2019-03-17 bob 11 normal
14 2019-04-02 alice 77 gold_member

Finally, these have all been examples of so-called “backwards searches”. From the Pandas docs:

A “backward” search selects the last row in the right DataFrame whose on key is less than or equal to the left’s key.

Somewhat counter-intuitively, this causes the status changes to propagate forward in time. This is the default behavior of merge_asof. To do the reverse (a forward search, a.k.a. propagate changes backwards in time), you can provide the keyword argument direction="forward".