As research and trading platforms grow and must process ever-larger volumes of data, traditional approaches to data handling quickly reveal their limitations. At Two Sigma, we’ve experienced this firsthand as data scientists faced onerous waits for new datasets, in part due to the increased complexity of our data landscape.
Such challenges are common at any growing, data-centric company, and different organizations approach them in various ways. At Two Sigma, we recognized the value of an approach that has been gaining traction in data engineering circles: treating our data as code.
Doing so has helped transform how we operate and reflects a broader industry evolution in applying proven software development principles—version control, automated testing, reproducibility, and continuous integration and deployment (CI/CD)—to data management.
In practice, this means using infrastructure-as-code tools like Terraform to define pipelines declaratively, versioning both datasets and processing infrastructure so that pipelines can be replayed, using dbt (data build tool) to test SQL, and implementing CI/CD workflows and data quality platforms whose data checks yield coverage metrics as a natural byproduct.
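To give a concrete flavor of a data check expressed as code, here is a minimal sketch in the style of a dbt singular test: a version-controlled SQL file that passes when the query returns no rows. The `trades` model and `trade_id` column are hypothetical names used for illustration, not a description of our actual pipelines.

```sql
-- tests/assert_trade_ids_are_unique.sql
-- Sketch of a dbt singular test: the check is plain SQL under version control,
-- and it fails if the query returns any rows (here, duplicated trade IDs).
-- `trades` and `trade_id` are hypothetical names used only for illustration.
select
    trade_id,
    count(*) as n_rows
from {{ ref('trades') }}
group by trade_id
having count(*) > 1
```

Because a check like this lives in the same repository as the transformation it guards, it is reviewed, versioned, and run in CI like any other code.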
This article explains what treating data as code means, why it’s effective, and the lessons we’ve learned while adopting it.
The Shift to Data as Code
Over the past decade, data engineering has evolved from a primarily infrastructure-focused discipline to a comprehensive practice centered on data reliability, observability, and governance. While the recent emergence of large language models (LLMs) and AI applications has intensified the demand for high-quality data, this transformation began much earlier with the recognition that data systems require the same engineering rigor as software systems.
A decade ago, many companies’ data engineering efforts concentrated on infrastructure: building pipelines to extract, transform, and load data (ETLs), with data quality a secondary focus. Today’s data engineering emphasizes treating data as a product, implementing comprehensive testing frameworks, establishing data contracts between data engineers and consumers, and building robust monitoring systems that can detect issues before they impact downstream consumers. In other words, treating data the way you’d treat code.
However, many organizations still struggle with this transition, which is admittedly highly complex. Companies depending on external data sources frequently encounter unpredictable schema changes, delivery delays, and quality issues that can cascade through their systems.
Without proper data platform capabilities—including automated quality checks, data lineage tracking, and proactive alerting—these issues can result in silent failures or broken downstream applications that erode trust in data-driven decision making.

Two Sigma’s Transformation Journey
As a company that relies on thousands of data sources, Two Sigma has not been immune to such challenges. In recent years, as some trading platforms scaled, our data engineers needed to manage operational costs while handling growing data volumes and complexity. Traditional approaches, such as relying on database snapshots to reproduce and move data, created bottlenecks and maintenance overhead that ultimately couldn’t keep pace with business demands.
Migrating to BigQuery helped eliminate our fragmented infrastructure, enabling automatic scaling and cost optimization through its serverless architecture. More importantly, it shifted our focus from managing infrastructure to building centralized analytics-ready datasets.
With data centralized, we still needed to manage a large number of SQL transformations scattered across various systems. We implemented dbt, a framework that brings software engineering practices to SQL. This allowed us to version-control our transformations, automatically test data quality, and treat our data pipelines as code, making them reliable and maintainable.
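As an illustration of what this looks like in practice (the model and column names below are hypothetical, not our production code), a dbt model is just a version-controlled SELECT statement, and the `ref()` calls it contains are what let dbt build and test the pipeline as a dependency graph:

```sql
-- models/daily_avg_prices.sql
-- Hypothetical dbt model: a version-controlled SELECT statement. The
-- {{ ref('stg_raw_prices') }} call declares a dependency on another model,
-- which dbt uses to order builds and tests across the pipeline DAG.
select
    ticker,
    date(price_ts) as price_date,
    avg(price)     as avg_price
from {{ ref('stg_raw_prices') }}
group by
    ticker,
    date(price_ts)
```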
Developing Internal Tools
We developed several critical internal tools that were pivotal in this transformation journey: automated data quality monitoring systems that detect anomalies before they impact downstream users, streamlined platforms for data discovery and documentation that help teams understand available datasets, and sophisticated orchestration systems for managing complex computational workflows.
Together, these capabilities fundamentally changed how data engineering operates at Two Sigma. Instead of spending cycles on infrastructure provisioning and manual data movement, teams could focus on data modeling, quality assurance, and building self-service capabilities that empower business users to derive insights independently.
Key Learnings
Through our transformation, we’ve identified several key dimensions of producing reliable information in the modern data era:
Lifecycle Management and Quality Assurance
Traditional lifecycle management was comparatively naive, focusing on basic questions such as “Did the data arrive?” and “Did it arrive on time?” Today, as datasets are enriched, transformed, and consumed by a growing number of users, attestation should consider how the data evolves over time and, more importantly, the insights it delivers. Modern data observability platforms provide automated data quality monitoring with statistical anomaly detection.
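The statistical checks themselves need not be elaborate. As a rough sketch, assuming a hypothetical table `analytics.daily_prices` with a `load_date` column, a query like the following flags days whose row count drifts far from the trailing 30-day norm:

```sql
-- Minimal sketch of statistical anomaly detection on load volume.
-- `analytics.daily_prices` and `load_date` are hypothetical names.
with daily_counts as (
    select load_date, count(*) as n_rows
    from analytics.daily_prices
    group by load_date
),
scored as (
    select
        load_date,
        n_rows,
        avg(n_rows) over w    as mean_30d,
        stddev(n_rows) over w as sd_30d
    from daily_counts
    window w as (
        order by load_date
        rows between 30 preceding and 1 preceding
    )
)
select load_date, n_rows, mean_30d, sd_30d
from scored
where sd_30d > 0
  and abs(n_rows - mean_30d) > 3 * sd_30d  -- simple z-score threshold
order by load_date
```

Real observability platforms track many more signals (freshness, null rates, distribution shifts), but the principle is the same: the expectation lives in code and runs on every load.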
Context and Lineage
Understanding how data pipeline dependencies (DAGs, or directed acyclic graphs) change over time and tracking data lineage have become foundational. By treating data as code and validating the process using software development practices like CI/CD, we gain visibility into how data is built, what it depends on, and whether it remains stable over time.
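Tools like dbt derive the DAG from `ref()` declarations, but lineage can also be cross-checked from the warehouse itself. As a hedged, BigQuery-specific sketch (the region qualifier, retention window, and filtering of temporary destination tables will vary by setup), query history can be mined for coarse table-level lineage:

```sql
-- Sketch: recover table-level lineage from BigQuery query history.
-- Assumes access to the JOBS_BY_PROJECT view in the `region-us` region;
-- anonymous/temporary destination datasets may need additional filtering.
select
    concat(j.destination_table.dataset_id, '.', j.destination_table.table_id) as target_table,
    concat(src.dataset_id, '.', src.table_id)                                 as source_table,
    count(*) as n_jobs
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT as j,
     unnest(j.referenced_tables) as src
where j.job_type = 'QUERY'
  and j.destination_table.table_id is not null
group by target_table, source_table
order by n_jobs desc
```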
Beyond Infrastructure
While the migration to modern cloud platforms provided the scalable foundation for treating data as code, the next frontier involves two key developments.
Data contracts are becoming essential: formal agreements between teams that prevent breaking changes and ensure reliability as organizations scale. Because a contract makes schemas and expectations explicit, it also makes the data easier for LLMs to consume downstream.
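One minimal sketch of how a contract check might be expressed in SQL: compare the columns a downstream team was promised against what the warehouse actually exposes. The `analytics.trades` table and its columns are hypothetical, and real contracts would also cover semantics, SLAs, and ownership.

```sql
-- Sketch of a contract check run in CI: any row returned is a violation
-- (a promised column is missing or has changed type). Names are hypothetical.
with expected as (
    select 'trade_id'    as column_name, 'STRING'    as data_type union all
    select 'ticker',                     'STRING'                 union all
    select 'executed_at',                'TIMESTAMP'              union all
    select 'quantity',                   'NUMERIC'
),
actual as (
    select column_name, data_type
    from analytics.INFORMATION_SCHEMA.COLUMNS
    where table_name = 'trades'
)
select
    e.column_name,
    e.data_type as expected_type,
    a.data_type as actual_type
from expected as e
left join actual as a using (column_name)
where a.column_name is null
   or a.data_type != e.data_type
```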
Meanwhile, LLMs are accelerating data preparation, transformation, and analysis, enabling natural language queries and automated documentation that help bridge the gap between technical and business users.

Summing Up
Treating data as code has enabled Two Sigma to build resilient, adaptable data systems while reducing operational overhead and improving quality. This evolution—from data as a technical artifact to a strategic product with clear ownership and quality metrics—represents the future of data engineering, where teams enable insights rather than manage infrastructure.
Curious about roles in data engineering or related fields at Two Sigma? Visit our Careers page to learn more about opportunities, our culture, and how to apply.