Groupby without aggregation in Pandas
Posted on Mon 17 July 2017 • 2 min read
Pandas has a useful feature that I didn't appreciate enough when I first started using it: `groupby`s without aggregation. What do I mean by that? Let's look at an example.
We'll borrow the data structure from my previous post about counting the periods since an event: company accident data. We have a list of workplace accidents at some company since 1980, including the time and location of each accident (no, it's not real; I generated it, so please don't send your lawyers to investigate a data breach):
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df.head()
```
| time | location | severity |
| --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 |
| 1980-03-01 02:12:20 | Birmingham | 3 |
| 1980-03-07 07:30:30 | Amsterdam | 1 |
| 1980-05-15 03:23:01 | Amsterdam | 1 |
| 1980-05-29 21:21:39 | Birmingham | 1 |
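The data-generation step isn't shown here. If you want to follow along, here's a minimal sketch that builds a dataframe with the same shape; the locations, date range, seed, and row count are my assumptions, not the original data:

```python
import numpy as np
import pandas as pd

# Assumption: this is NOT the original generator, just a stand-in that
# produces the same structure: a datetime index named "time" plus
# "location" and "severity" columns, sorted chronologically.
rng = np.random.default_rng(0)
n = 250
times = pd.to_datetime("1980-01-01") + pd.to_timedelta(
    np.sort(rng.integers(0, 40 * 365 * 24 * 3600, size=n)), unit="s"
)
df = pd.DataFrame(
    {
        "location": rng.choice(["Amsterdam", "Birmingham"], size=n),
        "severity": rng.integers(1, 4, size=n),
    },
    index=pd.Index(times, name="time"),
)
```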
Say we want to add the total number of accidents at each location as a column in the dataset. We could start off by doing a regular `groupby` to get the total number of accidents per location:
```python
gb = df.groupby('location').count()
gb
```
| location | severity |
| --- | --- |
| Amsterdam | 129 |
| Birmingham | 121 |
But now we'd have to join this information back onto the dataframe in a separate step. Instead, we can operate directly on each group:
```python
def accident_count(group):
    # count the accidents in this group and attach the total as a new column
    c = group['severity'].count()
    group['num_accidents'] = c
    return group

df = df.groupby('location').apply(accident_count)
df.head()
```
| time | location | severity | num_accidents |
| --- | --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 | 121 |
| 1980-03-01 02:12:20 | Birmingham | 3 | 121 |
| 1980-03-07 07:30:30 | Amsterdam | 1 | 129 |
| 1980-05-15 03:23:01 | Amsterdam | 1 | 129 |
| 1980-05-29 21:21:39 | Birmingham | 1 | 121 |
Now, in this simple case we could have just performed a left join. However, this kind of `groupby` becomes especially handy when you have more complex operations you want to perform within a group, without interference from the other groups.
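For comparison, here's roughly what the join route would look like, plus pandas' built-in `transform` shortcut for this particular case. A sketch with a small stand-in frame (the real post uses the full accident data):

```python
import pandas as pd

# Stand-in frame with the same columns as the accident data.
df = pd.DataFrame(
    {"location": ["Birmingham", "Amsterdam", "Birmingham"],
     "severity": [1, 1, 3]}
)

# Option 1: aggregate, then left-join the counts back onto the frame.
counts = df.groupby("location").size().rename("num_accidents")
joined = df.join(counts, on="location")

# Option 2: transform broadcasts the per-group count back to every row,
# skipping the separate join entirely.
df["num_accidents"] = df.groupby("location")["severity"].transform("count")
```

Both produce the same `num_accidents` column; the `apply` version from the post generalizes to operations that `transform` can't express.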
As a more complex example, consider calculating the time between accidents at each location. Our dataframe is already sorted by accident time, so all we have to do is make a series out of the group's index (`time`) and take the difference between consecutive rows to get the time between incidents. We insert this information directly into the group as a new column and return it:
```python
def time_difference(group):
    # get the time differences and put them directly into the group
    group['time_since_previous'] = group.index.to_series().diff()
    return group

df.groupby('location').apply(time_difference).head()
```
| time | location | severity | num_accidents | time_since_previous |
| --- | --- | --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 | 121 | NaT |
| 1980-03-01 02:12:20 | Birmingham | 3 | 121 | 1 days 04:06:41 |
| 1980-03-07 07:30:30 | Amsterdam | 1 | 129 | NaT |
| 1980-05-15 03:23:01 | Amsterdam | 1 | 129 | 68 days 19:52:31 |
| 1980-05-29 21:21:39 | Birmingham | 1 | 121 | 89 days 19:09:19 |
We see that our dataframe keeps its original structure, but each row now carries information that was calculated using only the other data points from the same location.
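To see why the grouping matters, compare a frame-wide `diff` with the per-group one. A toy sketch with made-up timestamps and locations (not the article's data):

```python
import pandas as pd

times = pd.to_datetime(["1980-01-01", "1980-01-03", "1980-01-04", "1980-01-10"])
df = pd.DataFrame(
    {"location": ["A", "B", "A", "B"]}, index=pd.Index(times, name="time")
)

# Frame-wide diff mixes locations: B's first gap is measured from A's row.
mixed = df.index.to_series().diff()

# Per-group diff only looks at rows from the same location.
def time_difference(group):
    group = group.copy()
    group["time_since_previous"] = group.index.to_series().diff()
    return group

per_group = df.groupby("location", group_keys=False).apply(time_difference)
```

The frame-wide version reports a 2-day gap before B's first accident (measured from A's row), while the grouped version correctly reports `NaT` for each location's first accident and a 7-day gap for B's second.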