Treating Data as Code at Two Sigma

Applying software engineering principles to data management has transformed how Two Sigma handles growing datasets, writes Effie Baram, Head of Foundational Data Engineering

As research and trading platforms grow in size and need to process exponentially increasing amounts of data, traditional approaches to data handling can quickly reveal their limitations. At Two Sigma, we’ve experienced this firsthand as data scientists faced onerous waits for new datasets, in part due to the increased complexity of our data landscape.

Such challenges are common at any growing, data-centric company, and different organizations approach them in various ways. At Two Sigma, we recognized the value of an approach that has been gaining traction in data engineering circles: treating our data as code.

Doing so has helped transform how we operate and reflects a broader industry evolution in applying proven software development principles—version control, automated testing, reproducibility, and continuous integration and deployment (CI/CD)—to data management.

In practice, this means using infrastructure-as-code tools like Terraform to declaratively define pipelines; versioning both datasets and processing infrastructure to enable replay; using dbt (data build tool) to test SQL; and implementing CI/CD workflows and data quality platforms that run data checks and provide coverage metrics as a natural byproduct.
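
To make the versioning idea concrete, here is a minimal sketch in Python (the tools named above would handle this via Terraform, dbt, and a CI system; the dataset names and hashing scheme below are illustrative assumptions, not our internal design): a materialized dataset is keyed by the versions of its inputs and of the code that produced it, which is what makes replay possible.

```python
import hashlib
import json

def dataset_version(input_versions: dict[str, str], code_version: str) -> str:
    """Derive a dataset's identity from its inputs and the code that built it.

    Rebuilding with the same inputs and the same code yields the same version
    string, so any historical output can be reproduced ("replayed") on demand.
    """
    payload = json.dumps({"inputs": input_versions, "code": code_version},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Hypothetical example: identical inputs and code always map to the same version.
v1 = dataset_version({"raw.trades": "2024-06-01"}, code_version="a1b2c3d")
v2 = dataset_version({"raw.trades": "2024-06-01"}, code_version="a1b2c3d")
assert v1 == v2  # deterministic identity is what enables replay
```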

This article explains what treating data as code means, why it’s effective, and the lessons we’ve learned while adopting it.

The Shift to Data as Code

Over the past decade, data engineering has evolved from a primarily infrastructure-focused discipline to a comprehensive practice centered on data reliability, observability, and governance. While the recent emergence of large language models (LLMs) and AI applications has intensified the demand for high-quality data, this transformation began much earlier with the recognition that data systems require the same engineering rigor as software systems.

A decade ago, many companies’ data engineering efforts concentrated on infrastructure: building pipelines to extract, transform, and load data (ETL), with data quality a secondary focus. Today’s data engineering emphasizes treating data as a product, implementing comprehensive testing frameworks, establishing data contracts between data engineers and consumers, and building robust monitoring systems that can detect issues before they impact downstream consumers. In other words, treating data the way you’d treat code.

However, many organizations still struggle with this transition, which is admittedly highly complex. Companies depending on external data sources frequently encounter unpredictable schema changes, delivery delays, and quality issues that can cascade through their systems.

Without proper data platform capabilities—including automated quality checks, data lineage tracking, and proactive alerting—these issues can result in silent failures or broken downstream applications that erode trust in data-driven decision making.

Two Sigma’s Transformation Journey

As a company that relies on thousands of data sources, Two Sigma has not been immune to such challenges. In recent years, as some trading platforms scaled, our data engineers needed to manage operational costs while handling growing data volume and complexity. Traditional approaches, such as relying on database snapshots to reproduce and move data, created bottlenecks and maintenance overhead that ultimately couldn’t keep pace with business demands.

Migrating to BigQuery helped eliminate our fragmented infrastructure, enabling automatic scaling and cost optimization through its serverless architecture. More importantly, it shifted our focus from managing infrastructure to building centralized analytics-ready datasets.

With data centralized, we needed to manage large numbers of SQL transformations scattered across various systems. We implemented dbt—a framework that brings software engineering practices to SQL. This allowed us to version-control our transformations, automatically test data quality, and treat our data pipelines as code, making them reliable and maintainable.
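
dbt itself declares such tests in YAML files that live alongside the SQL models; purely as a language-neutral sketch of the pattern (the model and column names below are hypothetical), the idea is to keep expectations next to the transformation and let CI run them on every proposed change:

```python
def build_daily_positions(trades: list[dict]) -> list[dict]:
    """The 'model': derive net positions from raw trades (hypothetical schema)."""
    positions: dict[str, int] = {}
    for t in trades:
        positions[t["symbol"]] = positions.get(t["symbol"], 0) + t["quantity"]
    return [{"symbol": s, "quantity": q} for s, q in positions.items()]

def test_daily_positions(rows: list[dict]) -> None:
    """Expectations declared alongside the model, in the spirit of dbt tests."""
    symbols = [r["symbol"] for r in rows]
    assert len(symbols) == len(set(symbols)), "symbol must be unique"
    assert all(r["quantity"] is not None for r in rows), "quantity must be not_null"

rows = build_daily_positions([{"symbol": "ABC", "quantity": 5},
                              {"symbol": "ABC", "quantity": -2}])
test_daily_positions(rows)  # CI runs this on every proposed change
```

Because the tests live in the same repository as the transformation, a breaking change fails in review rather than in production.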

Developing Internal Tools

We developed several internal tools that proved pivotal in this transformation: automated data quality monitoring systems that detect anomalies before they impact downstream users, streamlined platforms for data discovery and documentation that help teams understand available datasets, and sophisticated orchestration systems for managing complex computational workflows.

Together, these capabilities fundamentally changed how data engineering operates at Two Sigma. Instead of spending cycles on infrastructure provisioning and manual data movement, teams could focus on data modeling, quality assurance, and building self-service capabilities that empower business users to derive insights independently.

Key Learnings

Through our transformation, we’ve identified several key dimensions of producing reliable data in the modern era:

Lifecycle Management and Quality Assurance

Traditional lifecycle management was comparatively naive, focusing on basic questions such as “Did the data arrive?” and “Did it arrive on time?” Today, as datasets are enriched, transformed, and consumed by a growing number of users, attesting to data quality must account for how the data evolves over time and, more importantly, for the insights it delivers. Modern data observability platforms provide automated data quality monitoring with statistical anomaly detection.
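
As a minimal sketch of what statistical anomaly detection can mean in this context (the metric, window, and threshold below are assumptions for illustration, not a description of any particular platform), a monitor might flag a day whose row count is a z-score outlier against its trailing history:

```python
import statistics

def is_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it is a z-score outlier vs. trailing history."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean  # any deviation from a constant history is anomalous
    return abs(today - mean) / stdev > z_threshold

# Hypothetical daily row counts for one dataset; today's load is suspiciously small.
history = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]
print(is_anomalous(history, today=4_300))  # True: alert before consumers notice
```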

Context and Lineage

Understanding how data pipeline dependencies (DAGs, or directed acyclic graphs) change over time and tracking data lineage have become foundational. By treating data as code and validating the process with software development practices like CI/CD, we gain visibility into how data is built, what it depends on, and whether it remains stable over time.
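
As a toy illustration of lineage (with hypothetical dataset names), each dataset declares its upstream inputs; from that graph, a CI check can compute the blast radius of any proposed change before it merges:

```python
# Toy lineage graph: each dataset declares the upstream datasets it reads.
deps = {
    "raw_trades": [],
    "stg_trades": ["raw_trades"],
    "daily_positions": ["stg_trades"],
    "pnl_report": ["daily_positions", "stg_trades"],
}

def downstream_of(changed: str) -> set[str]:
    """Everything that transitively reads from `changed`: the blast radius a
    CI check would re-validate when that dataset's definition changes."""
    out: set[str] = set()
    frontier = [changed]
    while frontier:
        node = frontier.pop()
        for name, inputs in deps.items():
            if node in inputs and name not in out:
                out.add(name)
                frontier.append(name)
    return out

print(downstream_of("stg_trades"))  # {'daily_positions', 'pnl_report'}
```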

Beyond Infrastructure

While the migration to modern cloud platforms provided the scalable foundation for treating data as code, the next frontier involves two key developments.

Data contracts are becoming essential: formal agreements between producing and consuming teams that prevent breaking changes and ensure reliability as organizations scale. Because a contract makes schema and semantics explicit, it also enables reliable downstream use of the data by LLMs.
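
As a minimal sketch of the idea (the field names and types are hypothetical, and real implementations range from dbt model contracts to schema-registry tooling), a contract pins the schema a producer promises, so a breaking change fails loudly at publish time rather than silently downstream:

```python
# A hypothetical contract: the producer promises these fields and types.
CONTRACT = {"trade_id": str, "symbol": str, "quantity": int}

def validate(record: dict) -> None:
    """Reject records that would break the promised schema."""
    missing = CONTRACT.keys() - record.keys()
    if missing:
        raise ValueError(f"contract violation: missing fields {sorted(missing)}")
    for field, expected in CONTRACT.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"contract violation: {field} is not {expected.__name__}")

validate({"trade_id": "T-1", "symbol": "ABC", "quantity": 5})  # ok
# validate({"trade_id": "T-2", "symbol": "ABC"})               # would raise
```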

Meanwhile, LLMs are accelerating data preparation, transformation, and analysis, enabling natural language queries and automated documentation that help bridge the gap between technical and business users.

Summing Up

Treating data as code has enabled Two Sigma to build resilient, adaptable data systems while reducing operational overhead and improving quality. This evolution—from data as a technical artifact to a strategic product with clear ownership and quality metrics—represents the future of data engineering, where teams enable insights rather than manage infrastructure.

Curious about roles in data engineering or related fields at Two Sigma? Visit our Careers page to learn more about opportunities, our culture, and how to apply.
