Ibis and BigQuery: Scalable Analytics with the Comfort of Python

Two Sigma engineer Phillip Cloud and Google’s Tim Swast recently collaborated on improving BigQuery support in Ibis, a flexible Python interface for composing SQL queries. Below, they summarize their experiences.

Today, live or up-to-date third-party datasets provide a wealth of information that needs to be correctly analyzed at a scale that has traditionally required custom solutions and many person-hours of work. On top of that, sharing these analyses is critical for avoiding duplicate work.

This blog post demonstrates how Ibis works with BigQuery to provide an elegant and flexible Python interface for composing SQL queries.

As an example, consider the following SQL query that computes the acquisition rate of a number of different companies. Here the acquisition rate is defined as the number of acquisitions (or IPOs) divided by the number of investments a particular venture capital (VC) firm has made. It’s a complex query that proves to be as difficult to debug as it is to compose:

There’s a bug in this query, and it’s pretty difficult to spot.

Spoiler alert: the tables in the “t” subquery’s FROM clause are missing aliases that are referenced elsewhere in the query.

The error in this query is difficult to spot because it’s a single character in a sea of text. SQL has evolved through years of engineering to take on more capabilities with more syntax and while it has grown more powerful, it has also grown in complexity.

However, SQL isn’t going anywhere anytime soon. There are some extremely powerful analytic SQL engines available including Apache Hive, Cloudera Impala, Amazon’s Redshift, and of course: Google BigQuery. The common thread among all of these systems is SQL.

One of the downsides of using SQL is that it’s a closed system: you can’t easily use another language to solve the same problem at the same scale in the same amount of time. This leads to code that passes opaque SQL strings around, often to the chagrin of the folks reading and writing it. Imagine the query above but with template parameters and additional string processing. It quickly becomes hard to manage this kind of code, let alone turn it into something that can be trusted and widely used across a large company.

Over in the Python world, libraries such as pandas provide convenient and performant APIs for wrangling data. From joins to time series analysis, pandas makes it convenient to do complex data wrangling and analysis from start to finish. One downside of pandas is that it’s challenging to use as you encounter the need to distribute your computations.

How Ibis helps you scale

Ibis is a Python analytics library designed to provide the convenience of pandas’ APIs with the scalability of analytic SQL engines like BigQuery. It does this in a type-safe way, letting you build analytics expressions that compile to SQL and run on your favorite large-scale SQL engine. When you execute Ibis expressions, they turn into pandas DataFrames, which gives you access to the ecosystem of Python data libraries once you don’t need the scale that distributed SQL provides.

With Ibis, you no longer need to think about certain classes of bugs like the aliasing problem we saw earlier. Invalid operations (like taking the sum of a string column) are simply not available in the API.

Passing large, templated SQL strings around is a thing of the past.

Another goal of Ibis is that the expressions it creates should be composable. In this scenario, composable means that expression methods return other expressions. The following Ibis code chains together expressions to compute the percentage of GitHub repositories created between the hours of midnight and 4 AM, relative to the time zone reported in the created_at field of the table:

Looking at df—the result of executing the expression—gives the following result:

This example shows the composability of Ibis and just how concisely you can write queries.

BigQuery and Ibis comprise an extremely powerful combination. The Ibis team closely collaborates with the folks working on the BigQuery Python interface and we aim to keep it that way!

Check out the documentation here, and the community tutorial to get a feel for what it’s like to use Ibis.

A version of this post originally appeared on the Google Cloud blog.

This article is not an endorsement by Two Sigma of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Two Sigma makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever. Click here for other important disclaimers and disclosures.

Related Articles

This section links out to multiple articles. To read the article, click the headline.