Introducing Pandas UDFs for PySpark

A Two Sigma researcher introduces the Pandas UDFs feature in the upcoming Apache Spark 2.3 release, which substantially improves the performance and usability of user-defined functions (UDFs) in Python.

Responsive and Scalable Real-time Data Analytics

Designing a system that can extract immediate insights from large amounts of data in real time requires a special way of thinking. This talk presents a “reactive” approach to designing real-time, responsive, and scalable data applications that can continuously compute analytics on the fly. It also presents a case study of reactive design in action.

Archival Storage at Two Sigma

The author presents CelFS, Two Sigma’s geo-distributed file system. Although CelFS has scaled to serve tens of petabytes of data, it relies on physical partitioning to provide quality-of-service guarantees, has a high replication overhead, and cannot take advantage of outsourced cold storage. The talk further describes our response to these limitations in Jaks, a new storage system designed to reduce the total cost of ownership (TCO) of CelFS and serve as the backend for other systems at Two Sigma.

JSM 2017 Experiences

A group of Two Sigma statisticians highlights a selection of interesting talks and presentations from the 2017 Joint Statistical Meetings (JSM).