Groupby without aggregation in Pandas
Posted on Mon 17 July 2017 • 2 min read
Pandas has a useful feature that I didn't appreciate enough when I first started using it: `groupby`s without aggregation. What do I mean by that? Let's look at an example.
We'll borrow the data structure from my previous post about counting the periods since an event: company accident data. We have a list of workplace accidents at some company since 1980, including the time and location of each accident (no, it's not real; I generated it, so please don't send your lawyers to investigate a data breach):
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df.head()
```
| time | location | severity |
| --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 |
| 1980-03-01 02:12:20 | Birmingham | 3 |
| 1980-03-07 07:30:30 | Amsterdam | 1 |
| 1980-05-15 03:23:01 | Amsterdam | 1 |
| 1980-05-29 21:21:39 | Birmingham | 1 |
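The data-generation step isn't shown here. If you want to follow along, here's a minimal sketch that builds a dataframe with the same shape; the locations, date range, seed, and row count are my assumptions, not the original data:

```python
import numpy as np
import pandas as pd

# Assumption: this is NOT the original generator, just a stand-in that
# produces the same structure: a datetime index named "time" plus
# "location" and "severity" columns, sorted chronologically.
rng = np.random.default_rng(0)
n = 250
times = pd.to_datetime("1980-01-01") + pd.to_timedelta(
    np.sort(rng.integers(0, 40 * 365 * 24 * 3600, size=n)), unit="s"
)
df = pd.DataFrame(
    {
        "location": rng.choice(["Amsterdam", "Birmingham"], size=n),
        "severity": rng.integers(1, 4, size=n),
    },
    index=pd.Index(times, name="time"),
)
```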
Say we want to add the total number of accidents at each location as a column in the dataset. We could start off by doing a regular `groupby` to get the total number of accidents per location:
```python
gb = df.groupby('location').count()
gb
```
| location | severity |
| --- | --- |
| Amsterdam | 129 |
| Birmingham | 121 |
But now we'd have to join this information back onto the dataframe in a separate step. Instead, we can operate directly on each group:
```python
def accident_count(group):
    # count the accidents in this group and attach the total as a new column
    c = group['severity'].count()
    group['num_accidents'] = c
    return group

df = df.groupby('location').apply(accident_count)
df.head()
```
| time | location | severity | num_accidents |
| --- | --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 | 121 |
| 1980-03-01 02:12:20 | Birmingham | 3 | 121 |
| 1980-03-07 07:30:30 | Amsterdam | 1 | 129 |
| 1980-05-15 03:23:01 | Amsterdam | 1 | 129 |
| 1980-05-29 21:21:39 | Birmingham | 1 | 121 |
Now, in this simple case we could have just performed a left join. However, this kind of `groupby` becomes especially handy when you have more complex operations you want to perform within a group, without interference from the other groups.
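For comparison, here's roughly what the join route would look like, plus pandas' built-in `transform` shortcut for this particular case. A sketch with a small stand-in frame (the real post uses the full accident data):

```python
import pandas as pd

# Stand-in frame with the same columns as the accident data.
df = pd.DataFrame(
    {"location": ["Birmingham", "Amsterdam", "Birmingham"],
     "severity": [1, 1, 3]}
)

# Option 1: aggregate, then left-join the counts back onto the frame.
counts = df.groupby("location").size().rename("num_accidents")
joined = df.join(counts, on="location")

# Option 2: transform broadcasts the per-group count back to every row,
# skipping the separate join entirely.
df["num_accidents"] = df.groupby("location")["severity"].transform("count")
```

Both produce the same `num_accidents` column; the `apply` version from the post generalizes to operations that `transform` can't express.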
As a more complex example, consider calculating the time between accidents at each location. Our dataframe is already sorted by accident time, so all we have to do is make a series out of the group's index (`time`) and take the difference between consecutive rows to get the time between incidents. We insert this information directly into the group as a new column and return it:
```python
def time_difference(group):
    # get the time differences and put them directly into the group
    group['time_since_previous'] = group.index.to_series().diff()
    return group

df.groupby('location').apply(time_difference).head()
```
| time | location | severity | num_accidents | time_since_previous |
| --- | --- | --- | --- | --- |
| 1980-02-28 22:05:39 | Birmingham | 1 | 121 | NaT |
| 1980-03-01 02:12:20 | Birmingham | 3 | 121 | 1 days 04:06:41 |
| 1980-03-07 07:30:30 | Amsterdam | 1 | 129 | NaT |
| 1980-05-15 03:23:01 | Amsterdam | 1 | 129 | 68 days 19:52:31 |
| 1980-05-29 21:21:39 | Birmingham | 1 | 121 | 89 days 19:09:19 |
We see that our dataframe keeps its original structure, but each row now carries information that was calculated using only the other data points from the same location.
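To see why the grouping matters, compare a frame-wide `diff` with the per-group one. A toy sketch with made-up timestamps and locations (not the article's data):

```python
import pandas as pd

times = pd.to_datetime(["1980-01-01", "1980-01-03", "1980-01-04", "1980-01-10"])
df = pd.DataFrame(
    {"location": ["A", "B", "A", "B"]}, index=pd.Index(times, name="time")
)

# Frame-wide diff mixes locations: B's first gap is measured from A's row.
mixed = df.index.to_series().diff()

# Per-group diff only looks at rows from the same location.
def time_difference(group):
    group = group.copy()
    group["time_since_previous"] = group.index.to_series().diff()
    return group

per_group = df.groupby("location", group_keys=False).apply(time_difference)
```

The frame-wide version reports a 2-day gap before B's first accident (measured from A's row), while the grouped version correctly reports `NaT` for each location's first accident and a 7-day gap for B's second.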