Groupby without aggregation in Pandas

Posted on Mon 17 July 2017 • 2 min read

Pandas has a useful feature that I didn't appreciate enough when I first started using it: groupbys without aggregation. What do I mean by that? Let's look at an example.

We'll borrow the data structure from my previous post about counting the periods since an event: company accident data. We have a list of workplace accidents for some company since 1980, including the time and location of each accident (no, it's not real; I generated it, so please don't send your lawyers to investigate a data breach):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# df holds the generated accident data: a DatetimeIndex named 'time'
# and columns 'location' and 'severity'
df.head()
                       location  severity
time
1980-02-28 22:05:39  Birmingham         1
1980-03-01 02:12:20  Birmingham         3
1980-03-07 07:30:30   Amsterdam         1
1980-05-15 03:23:01   Amsterdam         1
1980-05-29 21:21:39  Birmingham         1

Say we want to add the total number of accidents at each location as a column in the dataset. We could start off by doing a regular groupby to get the total number of accidents per location:

gb = df.groupby('location').count()
gb
            severity
location
Amsterdam        129
Birmingham       121

But now we still have to join this information back onto the original dataframe in a separate step.
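That join could look something like this (sketched on a small, hypothetical stand-in frame rather than the full dataset):

```python
import pandas as pd

# Hypothetical stand-in for the full accident data
df = pd.DataFrame(
    {"location": ["Birmingham", "Birmingham", "Amsterdam"],
     "severity": [1, 3, 1]},
    index=pd.to_datetime(["1980-02-28 22:05:39",
                          "1980-03-01 02:12:20",
                          "1980-03-07 07:30:30"]).rename("time"),
)

# Count accidents per location, then join the counts back onto each row
counts = df.groupby("location").count().rename(columns={"severity": "num_accidents"})
df = df.join(counts, on="location")
```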

Instead, we have the option to directly operate on the whole group:

def accident_count(group):
    # count this group's rows and attach the total to every row
    c = group['severity'].count()
    group['num_accidents'] = c

    return group

df = df.groupby('location', group_keys=False).apply(accident_count)
df.head()
                       location  severity  num_accidents
time
1980-02-28 22:05:39  Birmingham         1            121
1980-03-01 02:12:20  Birmingham         3            121
1980-03-07 07:30:30   Amsterdam         1            129
1980-05-15 03:23:01   Amsterdam         1            129
1980-05-29 21:21:39  Birmingham         1            121

Now, in this simple case we could have just performed a left join. However, this kind of groupby becomes especially handy when you have more complex operations you want to do within the group, without interference from other groups.
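Worth noting: for a simple per-group scalar like the accident count, pandas also offers transform, which computes one value per group and broadcasts it back to every row in one line. A minimal sketch with a stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the accident data
df = pd.DataFrame({"location": ["Birmingham", "Birmingham", "Amsterdam"],
                   "severity": [1, 3, 1]})

# transform computes one value per group and repeats it for every row
df["num_accidents"] = df.groupby("location")["severity"].transform("count")
```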

As a more complex example, consider calculating the time between accidents at each location. Our dataframe is already sorted by accident time, so all we have to do is make a series out of the group's index (time) and take the difference between the rows to get the time differences between incidents. We insert this information directly into the group as a new column and return it:

def time_difference(group):
    # get the time differences and put them directly into the group
    group['time_since_previous'] = group.index.to_series().diff()

    return group

df.groupby('location', group_keys=False).apply(time_difference).head()
                       location  severity  num_accidents time_since_previous
time
1980-02-28 22:05:39  Birmingham         1            121                 NaT
1980-03-01 02:12:20  Birmingham         3            121     1 days 04:06:41
1980-03-07 07:30:30   Amsterdam         1            129                 NaT
1980-05-15 03:23:01   Amsterdam         1            129    68 days 19:52:31
1980-05-29 21:21:39  Birmingham         1            121    89 days 19:09:19

We see that our dataframe keeps its original structure, but it now carries information about each location that was calculated using only the data points from that location.
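For the record, the same per-group time differences can also be computed without apply, by grouping the index series itself and diffing within each group. A sketch on a small stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in frame with a datetime index, like the accident data
df = pd.DataFrame(
    {"location": ["Birmingham", "Birmingham", "Amsterdam", "Amsterdam"]},
    index=pd.to_datetime(["1980-02-28 22:05:39",
                          "1980-03-01 02:12:20",
                          "1980-03-07 07:30:30",
                          "1980-05-15 03:23:01"]).rename("time"),
)

# Turn the index into a series, group it by location, and diff within each group;
# the first row of each group gets NaT
df["time_since_previous"] = df.index.to_series().groupby(df["location"]).diff()
```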