Remapping the world with word vectors

Posted on Sat 02 December 2017 • Tagged with natural language, dataviz, d3.js • 12 min read

Everyone is used to the map. Most people could make a reasonable attempt at drawing one from memory (well, sort of). But what would it look like if we positioned the countries not by geographical location, but by our own perceived relationships between them? Armed with ConceptNet Numberbatch, I decided to try just that.

Continue reading...

Cleaner Spark UDF definitions with a little decorator

Posted on Thu 16 November 2017 • Tagged with spark, python, data, snippets • 3 min read

One of the handy features that make (Py)Spark more flexible than database tools like Hive, even for just transforming tabular data, is the ease of creating User Defined Functions (UDFs). However, one thing that remains a little annoying is that you have to define a function and declare it as a UDF separately. With four lines of code you can clean those definitions right up.

Continue reading...

Forward-fill missing data in Spark

Posted on Fri 22 September 2017 • Tagged with python, spark, data, pandas, time series • 4 min read

Since I started using Apache Spark, one of the frequent annoyances I've come up against is having an idea that would be very easy to implement in Pandas, but that turns out to require a really verbose workaround in Spark. A recent example of this is doing a forward fill (filling null values with the last known non-null value).
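To make the definition concrete, here is a minimal plain-Python sketch of what a forward fill does to a single column. It is only an illustration of the idea, not the Spark or Pandas approach the post goes on to describe; the function name `forward_fill` and the sample data are made up for this example, and `None` stands in for a null.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last_seen = None  # stays None until the first real value appears
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical column with gaps; leading nulls have nothing to fill from.
readings = [None, 3, None, None, 7, None]
print(forward_fill(readings))  # [None, 3, 3, 3, 7, 7]
```

Note that a null at the very start stays null, since there is no earlier value to carry forward.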

Continue reading...

Creating a responsive bar chart for my tags

Posted on Fri 21 July 2017 • Tagged with web, css, pelican, jinja, html • 5 min read

Today I decided that, since I'm a data kind of guy, I would like my tags page to show a bar chart of how many posts per tag I've made. The idea was to have a list of tags on the left, with a bar chart on the right showing how many articles carry each tag, and the bars scaling to the window size. It turned out to be more complicated than I was expecting.

Continue reading...

Groupby without aggregation in Pandas

Posted on Mon 17 July 2017 • Tagged with python, pandas, data, time series • 2 min read

Pandas has a useful feature that I didn't appreciate enough when I first started using it: groupbys without aggregation. What do I mean by that? Let's look at an example.

Continue reading...

Counting the number of periods since time-series events with Pandas

Posted on Sat 15 July 2017 • Tagged with python, pandas, data, time series • 4 min read

This is a cute trick I discovered the other day for quickly computing the time since an event on regularly spaced time series data (like monthly reporting), without looping over the data.
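For reference, this is roughly the output such a computation should produce, written as a plain-Python sketch. It deliberately uses a loop, so it is not the trick itself, just a simple baseline to check a vectorised version against. The function name `periods_since_event` and the sample flags are assumptions made up for this illustration.

```python
def periods_since_event(flags):
    """For each period, count how many periods have passed since the last event.

    `flags` is a sequence of booleans, one per regularly spaced period,
    where True marks a period in which the event happened.
    """
    counts = []
    since = None  # None until the first event has been seen
    for flag in flags:
        if flag:
            since = 0
        elif since is not None:
            since += 1
        counts.append(since)
    return counts


# Hypothetical monthly series: events in the 2nd and 5th months.
monthly_events = [False, True, False, False, True, False]
print(periods_since_event(monthly_events))  # [None, 0, 1, 2, 0, 1]
```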

Continue reading...

Custom color schemes in Matplotlib

Posted on Thu 11 May 2017 • Tagged with python, dataviz, matplotlib • 2 min read

At KPMG, like (I imagine) at most companies, we have a custom color palette that presentations and other materials are supposed to conform to. I actually quite like it when things I produce have a consistent look and feel, so I decided to find out how to make a custom color palette in Matplotlib. Turns out that it's super easy.

Continue reading...

engl_ish: Simulate your language. ish.

Posted on Sat 04 February 2017 • Tagged with python, markov, natural language • 18 min read

Quite a while ago I saw a short film called Skwerl, meant to demonstrate "how English sounds to non-English speakers". As a native English speaker, watching it is quite surreal. The sounds and accents are totally familiar, and there are definitely words in there that you recognize, but there is no discernible overall meaning whatsoever. It's actually kind of hard to listen to. All you've got to hang onto is that what you're hearing somehow feels like English. And that's the point. Skwerl gave me the idea to attempt to create a similar effect, but with reading instead of listening. I wanted to see how English looks to non-English readers. And so I created engl_ish.

Continue reading...