Introducing Pandas UDFs for PySpark

A Two Sigma researcher introduces the Pandas UDFs feature in the upcoming Apache Spark 2.3 release, which substantially improves the performance and usability of user-defined functions (UDFs) in Python.

Rademacher Averages: Theory and Practice

An overview of Rademacher Averages, a fundamental concept from statistical learning theory that can be used to derive uniform, sample-dependent bounds on the deviation of sample averages from their expectations.

Graph Summarization with Quality Guarantees

Given a large graph, the authors aim to produce a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than with the exact representation.

Life at Two Sigma

We’re rigorous about our work and developing our people.
