Pandas has a useful feature that I didn't appreciate enough when I first started using it: groupbys without aggregation. What do I mean by that? Let's look at an example.
We'll borrow the data structure from my previous post about counting the periods since an event: company accident data. We have a list of workplace accidents for some company since 1980, including the time and location of each accident (no, it's not real; I generated it, so please don't send your lawyers to investigate a data breach):

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

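If you want to follow along without the original data, something like the following could serve as a stand-in. It's only a sketch: the `location` and `severity` columns and the `time` index are taken from the code used later in this post, while the seed, the row count `n`, the severity range of 1 to 5 and the uniformly random timestamps are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the accident data; not the original generation code.
rng = np.random.default_rng(0)
n = 500  # assumed number of accidents

# Random accident times between 1980 and 2020, sorted so the index is chronological.
start = pd.Timestamp("1980-01-01")
end = pd.Timestamp("2020-01-01")
seconds = rng.integers(0, int((end - start).total_seconds()), size=n)
times = (start + pd.to_timedelta(np.sort(seconds), unit="s")).rename("time")

df = pd.DataFrame(
    {
        # the two locations that appear in the output further down
        "location": rng.choice(["Amsterdam", "Birmingham"], size=n),
        # assumed severity scale of 1 to 5
        "severity": rng.integers(1, 6, size=n),
    },
    index=times,
)
print(df.head())
```

Sorting the timestamps up front matters, because the time-difference example later on relies on the rows being in chronological order.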
Say we want to add the total number of accidents at each location as a column in the dataset. We could start off by doing a regular groupby to get the total number of accidents per location:

    gb = df.groupby('location').count()
    gb

But now we have to separately add this information to the dataframe.
Instead, we have the option to directly operate on the whole group:
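
    def accident_count(group):
        # count the accidents in this group and write the total onto every row
        c = group['severity'].count()
        group['num_accidents'] = c
        return group

    # group_keys=False keeps the original time index; pandas 2.0+ would otherwise
    # prepend the location to the index
    df = df.groupby('location', group_keys=False).apply(accident_count)
    df.head()
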
Now, in this simple case we could have just performed a left join of the per-location counts back onto the dataframe. However, this kind of groupby becomes especially handy when you want to do more complex operations within each group, without interference from other groups.
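For comparison, here is roughly what the left-join route could look like. This is a sketch on a tiny hand-made dataframe (the five rows are invented for illustration, not taken from the real data), and the names `counts`, `joined` and `via_transform` are just placeholders.

```python
import pandas as pd

# Tiny hypothetical dataset standing in for the accident data.
df = pd.DataFrame(
    {
        "location": ["Amsterdam", "Birmingham", "Birmingham", "Amsterdam", "Birmingham"],
        "severity": [1, 3, 2, 4, 1],
    },
    index=pd.to_datetime(
        ["1980-01-05", "1980-02-01", "1980-03-01", "1980-05-15", "1980-05-29"]
    ).rename("time"),
)

# Aggregate first: one row per location.
counts = (
    df.groupby("location")["severity"]
    .count()
    .rename("num_accidents")
    .reset_index()
)

# Then left-join the counts back onto every accident row.
# reset_index/set_index carries the time index through the merge,
# which would otherwise replace it with a fresh integer index.
joined = (
    df.reset_index()
    .merge(counts, on="location", how="left")
    .set_index("time")
)
print(joined)

# For a plain count, transform also broadcasts the per-group value to each row.
via_transform = df.groupby("location")["severity"].transform("count")
```

Both routes work for a single aggregated number per group. They get awkward once the new column depends on the order of rows inside the group, which is where operating on the whole group pays off.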
As a more complex example, consider calculating the time between accidents at each location. Our dataframe is already sorted by accident time, so all we have to do is make a series out of the group's index (time) and take the difference between the rows to get the time differences between incidents. We insert this information directly into the group as a new column and return it:

    def time_difference(group):
        # get the time differences and put them directly into the group
        group['time_since_previous'] = group.index.to_series().diff()
        return group

    # group_keys=False keeps the original time index on pandas 2.0+
    df.groupby('location', group_keys=False).apply(time_difference).head()

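| time | location | severity | num_accidents | time_since_previous |
|---|---|---|---|---|
| 1980-03-01 02:12:20 | Birmingham | 3 | 121 | 1 days 04:06:41 |
| 1980-05-15 03:23:01 | Amsterdam | 1 | 129 | 68 days 19:52:31 |
| 1980-05-29 21:21:39 | Birmingham | 1 | 121 | 89 days 19:09:19 |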
We see that our dataframe maintains its original structure, but we now have information about each location that was calculated using only other datapoints from that location.
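One way to convince yourself that the differences really stay within each location is to compare them against a naive difference over the whole dataframe. The sketch below does that on a tiny invented dataset. Note that it swaps in `groupby(...).diff()` on the index series instead of the `apply` approach above, purely to keep the check short; the variable names are placeholders.

```python
import pandas as pd

# Tiny hypothetical dataset, already sorted by time.
df = pd.DataFrame(
    {
        "location": ["Amsterdam", "Birmingham", "Birmingham", "Amsterdam", "Birmingham"],
        "severity": [1, 3, 2, 4, 1],
    },
    index=pd.to_datetime(
        ["1980-01-05", "1980-02-01", "1980-03-01", "1980-05-15", "1980-05-29"]
    ).rename("time"),
)

times = df.index.to_series()

# Naive version: difference to the previous row, whatever its location.
naive = times.diff()

# Grouped version: difference to the previous row at the same location.
df["time_since_previous"] = times.groupby(df["location"]).diff()

print(df)
print(naive)
```

The first accident at each location has no predecessor, so its `time_since_previous` is `NaT`. The last Birmingham row comes out as 89 days (measured from the previous Birmingham accident), whereas the naive difference is only 14 days because it measures from the Amsterdam accident in between.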