The authors provide a solution for the Efficient Server Audit Problem based on several new techniques, including simultaneous replay and efficient verification of concurrent executions, implemented for PHP web applications.
Designing a system that can extract immediate insights from large amounts of data in real-time requires a special way of thinking. This talk presents a “reactive” approach to designing real-time, responsive, and scalable data applications that can continuously compute analytics on-the-fly. It also highlights a case study as an example of reactive design in action.
The author presents CelFS, Two Sigma’s geo-distributed file system. Although CelFS has scaled to serve tens of petabytes of data, it relies on physical partitioning to provide quality-of-service guarantees, carries a high replication overhead, and cannot take advantage of outsourced cold storage. The talk further describes our response to these limitations in Jaks, a new storage system intended to reduce the TCO of CelFS and serve as the backend for other systems at Two Sigma.
This presentation discusses the design and implementation of Smooth at Two Sigma, our experience running it over the past two years, ongoing challenges, and future directions.
The Vera Institute of Justice (Vera) partnered with Two Sigma’s Data Clinic, a volunteer-based program that leverages employees’ data science expertise, to uncover the factors contributing to continued jail growth in rural areas.
The authors introduce a novel context-dependent simplification technique that improves the scalability of string solvers on challenging constraints coming from real-world problems.
The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
This presentation discusses each part of the durable storage stack, from the hardware on up, and how usage numbers can take on different meanings at each layer. It covers what's important to know at each layer, and how to think about and talk about concepts like compression, fragmentation, write amplification, and wear leveling. Finally, it examines different ways benchmarketers present data deceptively, and provides some techniques for identifying and cutting through those kinds of misrepresentations.
Apache Arrow-based interconnection between various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently.