Receiving a Google Open Source Peer Bonus award

Posted on Thu 07 May 2020 • Tagged with python, open source, pandas • 3 min read

Over the past few years I’ve increasingly tried to make small contributions to open source projects that I use. I’m not on the core team of any one project, so my contributions are usually small. That’s why I was so surprised when I got an email from Google’s Open Source Peer Bonus program letting me know that I had been nominated!


Introducing airbase: A Python client for the European Air Quality e-Reporting Database

Posted on Mon 03 February 2020 • Tagged with python, open source, data, time series • 3 min read

The European Environment Agency (EEA) provides a selection of datasets about air quality in Europe. The data is available for download at the portal, but the interface makes bulk downloads a bit time-consuming. Hence airbase, an easy Python-based interface.


Schedule the interruption of hung Python processes with signals

Posted on Sat 13 July 2019 • Tagged with python, snippets • 3 min read

A lightweight method to interrupt (hung) Python processes after a set time using the signal module from the standard library.
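
To give a flavor of the idea, here is a minimal sketch rather than the exact code from the post. It assumes a Unix-like system (SIGALRM is not available on Windows) and code running in the main thread. The names `time_limit` and `run_with_limit` are made up for this illustration.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`."""

    def _handler(signum, frame):
        raise TimeoutError(f"timed out after {seconds} s")

    # Install our handler and remember the previous one so we can restore it.
    previous = signal.signal(signal.SIGALRM, _handler)
    # setitimer accepts fractional seconds, unlike signal.alarm.
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


def run_with_limit(func, seconds):
    """Hypothetical helper: return func() or None if it exceeds the limit."""
    try:
        with time_limit(seconds):
            return func()
    except TimeoutError:
        return None


if __name__ == "__main__":
    print(run_with_limit(lambda: time.sleep(2), 0.2))  # interrupted
    print(run_with_limit(lambda: "done", 1))  # finishes in time
```

Because the handler raises inside whatever the main thread is doing, this only helps with code that returns control to the interpreter; a call stuck inside a C extension may not be interrupted until it returns.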


Generating fake whiskey reviews with GPT-2

Posted on Sun 23 June 2019 • Tagged with python, deep learning, natural language • 11 min read

I’ve enjoyed whiskey for a while now, but I can never vocalize all the flavors present in a bottle. I read all these flowery reviews and tasting notes online and I just have no idea how these people come up with descriptions like “caramels, dried peats, elegant cigar smoke, seeds scraped from vanilla beans, brand new pencils, peppercorn, coriander seeds, and star anise”… until now.


Redirect standard out to Python’s logging module with contextlib

Posted on Wed 22 May 2019 • Tagged with python, snippets • 5 min read

Python’s logging functionality is very nice once you get the hang of it, but many people either disagree or don’t bother to use it. Learn how to redirect other people’s pesky print statements into your nice logging setup.
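
As a rough sketch of the approach (not necessarily the code from the post), `contextlib.redirect_stdout` accepts any object with a `write` method, so a small adapter can forward printed lines to a logger. `LoggerWriter`, `noisy_function`, and `captured_messages` are names invented for this example.

```python
import contextlib
import logging


class LoggerWriter:
    """File-like object that forwards each complete line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level
        self._buffer = ""

    def write(self, text):
        # print() may send the message and the newline in separate writes,
        # so buffer until a full line is available.
        self._buffer += text
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line:
                self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        if self._buffer:
            self.logger.log(self.level, self._buffer)
            self._buffer = ""


def noisy_function():
    """Stand-in for third-party code that prints instead of logging."""
    print("hello from print")


def captured_messages():
    """Run noisy_function with stdout redirected; return the logged messages."""
    records = []

    class ListHandler(logging.Handler):
        def emit(self, record):
            records.append(record.getMessage())

    logger = logging.getLogger("redirect_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    writer = LoggerWriter(logger)
    try:
        with contextlib.redirect_stdout(writer):
            noisy_function()
        writer.flush()
    finally:
        logger.removeHandler(handler)
    return records
```

In real use you would attach your normal handlers instead of the list-collecting one, which is only here to make the result easy to inspect.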


Propagate time series events with Pandas merge_asof

Posted on Sat 13 April 2019 • Tagged with python, data, time series, pandas • 5 min read

I recently discovered that Pandas has a function to propagate time series events forward (or backward) in time across a DataFrame. Here’s how it works.
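
A small illustration with made-up data, assuming pandas is installed: each reading picks up the most recent event at or before its timestamp. The frame and column names are hypothetical.

```python
import pandas as pd

# Regularly spaced readings (hypothetical data).
readings = pd.DataFrame({
    "time": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]
    ),
    "value": [10, 11, 12, 13],
})

# Sparse events that should be carried forward onto the readings.
events = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-02", "2019-01-04"]),
    "event": ["a", "b"],
})

# Both frames must be sorted on the key. direction="backward" matches each
# reading with the last event at or before it; "forward" does the opposite.
merged = pd.merge_asof(readings, events, on="time", direction="backward")
print(merged)
```

The first reading has no earlier event, so its `event` value is missing.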


Getting Calvin home on time: a statistics puzzle

Posted on Thu 19 July 2018 • Tagged with python, statistics, puzzles • 10 min read

I found this puzzle a while ago and couldn’t get it out of my head, so I decided to write up a solution. “Calvin has to cross several signals when he walks from his home to school. Each of these signals operate independently. They alternate every 80 seconds between green light and red light. At each signal, there is a counter display that tells him how long it will be before the current signal light changes. Calvin has a magic wand which lets him turn a signal from red to green instantaneously. However, this wand comes with limited battery life, so he can use it only for a specified number of times.”


Cleaner Spark UDF definitions with a little decorator

Posted on Thu 16 November 2017 • Tagged with spark, python, data, snippets • 3 min read

One of the handy features that makes (Py)Spark more flexible than database tools like Hive, even for just transforming tabular data, is the ease of creating User Defined Functions (UDFs). However, one thing that remains a little annoying is that you have to define a function and declare it as a UDF separately. With four lines of code you can clean those definitions right up.


Forward-fill missing data in Spark

Posted on Fri 22 September 2017 • Tagged with python, spark, data, pandas, time series • 4 min read

Since I started using Apache Spark, one of the frequent annoyances I’ve come up against is having an idea that would be very easy to implement in Pandas, but that turns out to require a really verbose workaround in Spark. A recent example of this is doing a forward fill (filling null values with the last known non-null value).


Groupby without aggregation in Pandas

Posted on Mon 17 July 2017 • Tagged with python, pandas, data, time series • 2 min read

Pandas has a useful feature that I didn’t appreciate enough when I first started using it: groupbys without aggregation. What do I mean by that? Let’s look at an example.
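
One plausible reading of "groupby without aggregation", sketched with made-up data and assuming pandas is installed: apply an operation within each group while keeping one output row per input row. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 3, 10, 30],
})

# transform broadcasts the per-group result back to every row of the group.
df["group_mean"] = df.groupby("group")["value"].transform("mean")

# Cumulative operations also run within each group and keep the original shape.
df["cumulative"] = df.groupby("group")["value"].cumsum()

print(df)
```

The result has the same four rows as the input, with the group-level values attached to each row.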


Counting the number of periods since time-series events with Pandas

Posted on Sat 15 July 2017 • Tagged with python, pandas, data, time series • 4 min read

This is a cute trick I discovered the other day for quickly computing the time since an event on regularly spaced time series data (like monthly reporting), without looping over the data.
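
One way to do this without a loop, which may or may not be the exact trick from the post, is to number the stretches between events with a cumulative sum and then count positions within each stretch. The sketch assumes pandas is installed; the data and variable names are made up.

```python
import pandas as pd

# Monthly series where 1 marks a period in which the event occurred.
index = pd.date_range("2017-01-01", periods=6, freq="MS")
events = pd.Series([0, 1, 0, 0, 1, 0], index=index)

# Each event starts a new block: 0, 1, 1, 1, 2, 2.
blocks = events.cumsum()

# Position within each block is the number of periods since the last event.
periods_since = events.groupby(blocks).cumcount()

# Block 0 comes before any event, so the count is undefined there.
periods_since = periods_since.where(blocks > 0)

print(periods_since)
```

Masking with `where` turns the series into floats so the leading periods can hold NaN.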


Custom color schemes in Matplotlib

Posted on Mon 01 May 2017 • Tagged with python, dataviz, matplotlib • 2 min read

At KPMG, like (I imagine) at most companies, we have a custom color palette that presentations and other materials are supposed to conform to. I actually quite like it when things I produce have a consistent look and feel, so I decided to find out how to make a custom color palette in matplotlib. Turns out that it’s super easy.


engl_ish: Simulate your language. ish.

Posted on Sat 04 February 2017 • Tagged with python, markov, natural language, open source • 18 min read

Quite a while ago I saw a short film called Skwerl, meant to demonstrate “how English sounds to non-English speakers”. As a native English speaker, watching it is quite surreal. The sounds and accents are totally familiar, and there are definitely words in there that you recognize, but there is no discernible overall meaning whatsoever. It’s actually kind of hard to listen to. All you’ve got to hang onto is that what you’re hearing somehow feels like English. And that’s the point. Skwerl gave me the idea to attempt to create a similar effect, but with reading instead of listening. I wanted to see how English looks to non-English readers. And so I created engl_ish.