John Paton

Introducing airbase: A Python client for the European Air Quality e-Reporting Database

Posted on Mon 03 February 2020 • Tagged with python, open source, data, time series • 3 min read

The European Environment Agency (EEA) provides a selection of datasets about air quality in Europe. The data is available for download at the portal, but the interface makes bulk downloads a bit time-consuming. Hence airbase, an easy Python-based interface.

Propagate time series events with Pandas merge_asof

Posted on Sat 13 April 2019 • Tagged with python, data, time series, pandas • 5 min read

I recently discovered that Pandas has a function to propagate time series events forward (or backward) in time across a DataFrame. Here’s how it works.
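As a rough illustration of the idea, here is a minimal sketch using `pd.merge_asof`. The two frames and their column names are made up for this example and are not taken from the post itself.

```python
import pandas as pd

# Hypothetical data: regularly spaced readings and a sparser set of events.
readings = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]),
    "value": [10, 11, 12, 13],
})
events = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-02", "2019-01-04"]),
    "event": ["a", "b"],
})

# Both frames must be sorted by the key column.
# direction="backward" (the default) attaches to each reading the most
# recent event at or before it, which propagates events forward in time.
merged = pd.merge_asof(readings, events, on="time", direction="backward")
print(merged)
```

Readings that precede the first event get a missing value in the `event` column; passing `direction="forward"` would propagate events backward in time instead.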

Cleaner Spark UDF definitions with a little decorator

Posted on Thu 16 November 2017 • Tagged with spark, python, data, snippets • 3 min read

One of the handy features that make (Py)Spark more flexible than database tools like Hive, even for just transforming tabular data, is the ease of creating User Defined Functions (UDFs). However, one thing that remains a little annoying is that you have to separately define a function and declare it as a UDF. With four lines of code you can clean those definitions right up.

Forward-fill missing data in Spark

Posted on Fri 22 September 2017 • Tagged with python, spark, data, pandas, time series • 4 min read

Since I started using Apache Spark, one of the frequent annoyances I’ve come up against is having an idea that would be very easy to implement in Pandas but that turns out to require a really verbose workaround in Spark. A recent example of this is doing a forward fill (filling null values with the last known non-null value).
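For reference, the Pandas side of that comparison is a one-liner. The sketch below uses made-up values and shows only the Pandas forward fill, not the Spark workaround.

```python
import pandas as pd

# Hypothetical series with gaps (None becomes NaN).
s = pd.Series([1.0, None, None, 4.0, None])

# Each missing value is replaced by the last known non-null value.
filled = s.ffill()
print(filled.tolist())
```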

Groupby without aggregation in Pandas

Posted on Mon 17 July 2017 • Tagged with python, pandas, data, time series • 2 min read

Pandas has a useful feature that I didn’t appreciate enough when I first started using it: groupbys without aggregation. What do I mean by that? Let’s look at an example.
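One possible reading of "groupby without aggregation" is a grouped operation that returns one row per input row instead of one row per group. The sketch below assumes that reading and uses a hypothetical `store`/`sales` frame; the post may use a different example.

```python
import pandas as pd

# Hypothetical data: sales records for two stores, interleaved.
df = pd.DataFrame({
    "store": ["a", "a", "b", "b", "a"],
    "sales": [1, 2, 10, 20, 3],
})

# cumsum runs separately within each store, but the result keeps the
# original index and length, so it can be assigned straight back.
df["running_sales"] = df.groupby("store")["sales"].cumsum()
print(df)
```

Other non-aggregating group operations such as `shift`, `rank`, and `transform` follow the same pattern.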

Counting the number of periods since time-series events with Pandas

Posted on Sat 15 July 2017 • Tagged with python, pandas, data, time series • 4 min read

This is a cute trick I discovered the other day for quickly computing the time since an event on regularly spaced time series data (like monthly reporting), without looping over the data.
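One loop-free way to do this, which may or may not be the exact trick from the post, is to number the events with a cumulative sum and then count rows within each numbered block. The boolean `event` series below is hypothetical, with one row per regularly spaced period.

```python
import pandas as pd

# Hypothetical flags: True marks a period in which an event occurred.
event = pd.Series([False, True, False, False, True, False])

# Each event starts a new block: 0 before the first event, then 1, 2, ...
event_id = event.cumsum()

# Position within each block equals the number of periods since that event.
periods_since = event.groupby(event_id).cumcount()

# Before the first event there is nothing to count from, so mask those rows.
periods_since = periods_since.where(event_id > 0)
print(periods_since.tolist())
```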