If you’ve spent the last two years building with LLMs or agents, you’ve probably noticed the emerging gap between models and systems. With the rise of agents like Claude Code, and competitors like Cursor, Codex, OpenHands, Hermes, OpenClaw, and more, there is a growing awareness that across domains, there is a lot of power wielded by the so-called harness, or the compound AI system around the model itself. With each passing month, there is more and more systems work happening around these models, but no single venue where the people building agentic systems and the people studying how to build them well are in the same room.
That systems-first view is also how we think about AI at Two Sigma. Platforms are, by their nature, compound systems, and an AI-native platform is a platform made of compound AI systems. The systems-around-the-model layer is where most of the durable engineering questions live. It’s a big part of why I’m excited to be helping build a home for this research community.
ACM (Association for Computing Machinery) CAIS 2026 is an attempt to give the people building agentic systems and the people studying them a shared venue. The inaugural ACM Conference on AI and Agentic Systems takes place May 27-29 in San Jose, with me and Matei Zaharia (UC Berkeley, Databricks) serving as its first general chairs. I’ve also been serving as one of the program chairs alongside Avi Sil (Oracle) and Omar Khattab (MIT). Read on to learn about some of the work I’m most excited about – from the 61 accepted research papers, 46 system demos, and a new paper category we’re calling operational experience papers.

Why a new conference?
The research community building and studying AI systems and agents is currently scattered across ML conferences (NeurIPS, ICML, ICLR), NLP conferences (ACL, EMNLP, NAACL), systems conferences (OSDI, SOSP, NSDI), software engineering venues (ICSE, FSE), and a growing number of workshops. Each of those venues evaluates agent work through its own lens, and the communities’ standards differ sharply from one another. A systems paper about agent infrastructure gets reviewed against OSDI novelty standards. An evaluation paper about agent reliability gets reviewed against NeurIPS methodology norms. Neither set of norms is wrong, but neither fully fits work that sits at the intersection.
Instead of traditional ML or systems tracks, CAIS is organized around five pillars:
- Architectural Patterns & Composition – How multiple models, tools, and retrievers are composed into coherent systems. Research in this space advances inference-time scaling, studies the generation/verification asymmetry behind verifier-based architectures, explores retrieval-augmented, multi-agent, and tool-augmented designs, and asks what principled modular composability looks like in practice, among other directions.
- Evaluation & Benchmarking – Characterizing the behavior of compound AI systems in realistic conditions, including the failure modes that are hardest to detect. Topics include new benchmarks, end-to-end metrics, evaluation methodology that holds up as underlying models grow more capable, and more.
- Security & Privacy – Understanding and mitigating threats when agents execute tools on real systems. Topics include threat models for prompt injection and tool misuse, defenses against adaptive attackers, alignment for AI systems with real-world consequences, and more.
- System Optimization & Efficiency – Making compound AI systems faster and cheaper while preserving their capabilities. Topics include end-to-end optimization of non-differentiable pipelines, principled caching, routing, serving at agent scale, and more.
- Engineering & Operations – Making compound AI systems reliable in production. Topics include observability for long agent traces, deployment pipelines for compound AI systems, developer tools the field still needs to invent, and more.
The goal was to evaluate each paper on the merits most relevant to what it’s actually contributing.
A review process designed for the work that actually exists
Most conferences tell their reviewers to evaluate novelty, soundness, and significance, then leave the interpretation up to field norms. CAIS couldn’t do that. The program committee includes professors and industry practitioners from at least six different academic disciplines. The submissions range from GPU kernel optimizations to user studies to production retrospectives. No single community’s reviewing norms apply.
So the reviewer instructions try to make the expectations explicit. They define eight contribution types: novel architectures, system optimizations, benchmarks, production deployment reports, formal guarantees about compound AI system behavior, frameworks, empirical studies, and security attacks or defenses in an AI system. For each one of these types of contributions, we lay out what “good” might look like in terms specific to that kind of work. The idea is that a reviewer looking at a production deployment report knows they should be asking “are these lessons transferable and concrete?” rather than reaching for the novelty bar they’d apply at SOSP.
The guidance for production deployment reports has possibly the most important line in the instructions: “Do NOT penalize for ‘lacking novelty.’ The novelty is in the evidence and the lessons.” Over time, different academic sub-communities have lamented that the standard academic novelty bar can actively filter out some of the most useful knowledge in the field. Google’s MapReduce, GFS, and Bigtable papers were fundamentally descriptions of production infrastructure, and they became some of the most influential systems papers ever published. The AI systems and agent systems field doesn’t have its equivalents yet, and CAIS is trying to make a space for work like this to exist.
The instructions also say something worth quoting in full: “A paper from an industry team describing deployment lessons learned is not ‘just engineering,’ and neither is an academic paper formalizing agent communication ‘just theory.’” The ambition is a venue where both of those contributions are evaluated on their own terms.
Whether this actually works in practice is something we’ll learn over time. Calibrating reviewers across six different research communities is genuinely hard. But the explicit attempt to name distinct contribution types, and to tell reviewers who may come from a different community what “good” looks like for each one, is arguably the most interesting structural experiment in the review process.
Keynotes
Before I get into the new operational experience track, and some of the research and demos that I find exciting, I first have to say a few words about the keynotes headlining the conference.
Thariq Shihipar is a Member of Technical Staff on the Claude Code team at Anthropic. His “Lessons from Building Claude Code” series has become something of a practitioner canon for agent design. His viral posts cover how Anthropic iteratively redesigned Claude Code’s tool interfaces, moved from RAG to agent-driven context building through progressive disclosure, and evolved from todo lists to task-based coordination as model capabilities improved. A recurring theme: the right tool design depends on the model’s current abilities, and those abilities keep changing, so your tools need to change with them.
Andy Konwinski co-founded Databricks (he was also part of the team that created Apache Spark), Perplexity AI, and Laude Institute, a $100M effort to fund open-source AI research. Laude’s model is distinctive: rather than traditional grants, they organize Moonshot, Slingshot, and Open Frontier programs designed to get research artifacts shipped as usable open-source tools. Their first Slingshot project, Terminal-Bench, is an agent benchmark for command-line tasks that reached Anthropic’s Claude 4 model card 126 days after its inception and has since become a standard evaluation for measuring terminal-based agent performance. Andy’s perspective bridges the gap between academic research and production impact in a way that’s directly relevant to what CAIS is trying to do as a venue.
Percy Liang is a Professor of Computer Science at Stanford, founding director of the Center for Research on Foundation Models (CRFM), and co-founder of Together AI. There’s a through-line in Percy’s work: building infrastructure that forces the AI ecosystem to be more rigorous and more open. His HELM framework became the standard for holistic evaluation of language models. His Foundation Model Transparency Index, now in its third year, scores every major AI lab against 100 transparency indicators, and the results have visibly changed how companies disclose. His latest project, Marin, takes this even further: an open lab where every experiment is declared in code, tracked as a GitHub issue, and watchable live as it runs. Failed experiments stay public. You can see their 8B model beat Llama 3.1 8B Base on 14 of 19 standard evals, and you can step through every decision that led to that result.
Operational experience reports: hearing from the builders
The part of the program I’m personally most excited about is new to academic conferences. We invited teams from companies building production agent systems to present operational experience reports. These aren’t traditional research papers; rather, they’re structured accounts of what it actually looks like to operate these systems at scale.
Three teams accepted, including:
- Lance Martin from Anthropic on Claude’s managed agents infrastructure.
- Raluca Popa from Google on security for the Gemini team.
These are talks I’d have killed to attend two years ago. The agent systems field is still waiting for its MapReduce papers – the detailed, honest accounts of production infrastructure that an entire generation of engineers and researchers ends up being inspired by, and building on. We’re hoping these are the start.
Papers worth reading
Rather than trying to summarize all 61 papers, here are a few clusters that stuck out to me while watching the review process unfold.
How should agents use tools?
Tool calling is the interface between an agent’s reasoning and the outside world, and it turns out we’re still figuring out the basics.
“Do Agents Need to Plan Step-by-Step?” (Otani et al., Megagon Labs) runs a clean experiment overturning a default assumption in most agent frameworks: for data-centric tasks, generating a complete plan before executing any tool calls consistently outperforms the incremental think-then-act loop. If you’ve been building agents with ReAct-style step-by-step execution, this paper is worth reading before your next design decision.
“OpaqueToolsBench” (Hallinan et al., USC/Samaya AI) asks what happens when tool documentation is incomplete or wrong. Most benchmarks assume tools are well-specified. Real-world tools are not. The paper finds that current agents are surprisingly bad at learning tool behavior through interaction, even when they can update their own documentation.
“XGrammar++” (Li et al., SJTU/CMU) tackles the infrastructure layer: when an agent needs to call tools with structured output, how do you enforce schema compliance without destroying latency? Their engine handles dynamic, variable tool-calling schemas with tag-triggered structure switching, which matters as soon as your agent is switching between different tool providers within a single conversation.
When agents go wrong
Several papers address the question of what happens when agent autonomy meets real-world consequences.
“The Verifier Tax” (Sah et al.) quantifies something practitioners have felt intuitively: adding runtime safety enforcement to tool-using agents reduces task success rates, and the effect gets worse as interaction horizons grow. The paper identifies a model-dependent “Safety-Capability Gap” at horizons of 15-30 turns. If you’re building agents that need to be both safe and effective over long task horizons, this gives you concrete numbers for the tradeoff.
“Willful Disobedience” (Sharma et al., UW/Microsoft Research) introduces AgentPex, which extracts behavioral rules from agent prompts and checks entire execution traces against them. It catches failures that outcome-only evaluation misses. The practical insight: your agent might be producing correct final answers while violating the process constraints you thought you were enforcing.
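To make the idea of checking a whole trace against process rules concrete, here is a minimal sketch. It is not AgentPex: the rules below are hand-written rather than extracted from a prompt, and `Step`, `confirm_before_delete`, `at_most_three_searches`, and `check_trace` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    tool: str  # name of the tool the agent called at this step


# A rule takes the full trace and returns True when the trace complies.
Rule = Callable[[List[Step]], bool]


def confirm_before_delete(trace: List[Step]) -> bool:
    """Every `delete` call must come after at least one `confirm` call."""
    confirmed = False
    for step in trace:
        if step.tool == "confirm":
            confirmed = True
        elif step.tool == "delete" and not confirmed:
            return False
    return True


def at_most_three_searches(trace: List[Step]) -> bool:
    """The agent may call `search` no more than three times."""
    return sum(step.tool == "search" for step in trace) <= 3


# Hand-written rule set (an assumption; a real system might derive these from the prompt).
RULES: Dict[str, Rule] = {
    "confirm_before_delete": confirm_before_delete,
    "at_most_three_searches": at_most_three_searches,
}


def check_trace(trace: List[Step], rules: Dict[str, Rule] = RULES) -> List[str]:
    """Return the names of all violated rules, sorted alphabetically."""
    return sorted(name for name, rule in rules.items() if not rule(trace))


# The final state may look fine, but the delete happened before the confirm.
trace = [Step("search"), Step("delete"), Step("confirm")]
print(check_trace(trace))  # ['confirm_before_delete']
```

An outcome-only check would pass this trace if the deletion was the right one; the trace-level check flags the ordering violation regardless of the final answer.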
“Malice in Agentland” (Boisvert et al., ServiceNow/Mila) demonstrates backdoor attacks at three layers of the agent supply chain, including a novel environment poisoning vector where an agent’s deployment environment is the attack surface. If you’re deploying agents that interact with user-provided data or environments, the threat model here is worth understanding.
Making agents cheaper and faster
“Constant-Memory Retrieval via Koopman Operator Estimation for Mamba-3” (Johansen & Sridhar, Stanford) eliminates the memory cliff where retrieval accuracy collapses for long sequences in state-space models, while maintaining constant memory. If you’re running agents on extended traces where KV cache memory is the bottleneck, this is directly relevant.
“AgentStop” (Pham et al., UMass/Brave) predicts task completion likelihood mid-execution and terminates unproductive branches early, saving substantial energy on consumer devices. The core idea applies beyond consumer hardware: knowing when to stop is underappreciated in agent system design.
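The control flow behind "knowing when to stop" can be sketched in a few lines. This is an illustration under stated assumptions, not the AgentStop method: `run_with_early_stop`, `toy_estimator`, the threshold, and the minimum step count are all hypothetical, and the estimator is a stand-in for whatever learned predictor a real system would use.

```python
from typing import Callable, List, Sequence, Tuple


def run_with_early_stop(
    steps: Sequence[str],
    estimate_success: Callable[[List[str]], float],
    threshold: float = 0.2,
    min_steps: int = 2,
) -> Tuple[List[str], bool]:
    """Consume steps in order and abort once the estimated chance of success
    drops below `threshold`. Returns (executed steps, stopped_early)."""
    executed: List[str] = []
    for step in steps:
        executed.append(step)
        # Only consider stopping after a minimum number of observations.
        if len(executed) >= min_steps and estimate_success(executed) < threshold:
            return executed, True
    return executed, False


def toy_estimator(history: List[str]) -> float:
    """Stand-in predictor: the estimate shrinks with every observed error."""
    errors = sum(1 for outcome in history if outcome == "error")
    return 0.9 * (0.4 ** errors)


executed, stopped = run_with_early_stop(
    ["ok", "error", "error", "ok", "ok"], toy_estimator
)
print(executed, stopped)  # ['ok', 'error', 'error'] True
```

The two remaining steps are never run, which is where the savings come from; how much is saved in practice depends entirely on how good the predictor is.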
“Robust Batch-Level Query Routing” (Markovic-Voronov et al., LinkedIn) jointly assigns models to an entire request batch under cost, GPU, and concurrency constraints, rather than routing each query independently. The robust variant explicitly accounts for uncertainty in predicted model quality. If you’re running heterogeneous LLM deployments, batch-level routing is likely how you should be thinking about cost optimization.
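A toy example shows why routing the batch jointly differs from routing each query on its own. This sketch assumes made-up models, costs, concurrency caps, and predicted quality scores, and it uses brute-force enumeration rather than the paper's optimization method; it also leaves out the robust variant's handling of uncertainty. `route_batch`, `MODELS`, and `QUALITY` are hypothetical names.

```python
from itertools import product

# Hypothetical models with a per-query cost and a concurrency cap.
MODELS = {
    "small": {"cost": 1.0, "max_concurrent": 3},
    "large": {"cost": 4.0, "max_concurrent": 1},
}

# Hypothetical predicted quality of each model on each query in the batch.
QUALITY = [
    {"small": 0.70, "large": 0.75},
    {"small": 0.40, "large": 0.90},
    {"small": 0.60, "large": 0.80},
]


def route_batch(quality, models, budget):
    """Enumerate every assignment of models to queries and keep the feasible
    one with the highest total predicted quality.
    Returns (assignment, score), or (None, -inf) if nothing is feasible."""
    names = list(models)
    best, best_score = None, float("-inf")
    for assignment in product(names, repeat=len(quality)):
        cost = sum(models[m]["cost"] for m in assignment)
        if cost > budget:
            continue  # violates the batch-wide cost budget
        if any(assignment.count(m) > models[m]["max_concurrent"] for m in names):
            continue  # violates a per-model concurrency cap
        score = sum(q[m] for q, m in zip(quality, assignment))
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


assignment, score = route_batch(QUALITY, MODELS, budget=6.0)
print(assignment)  # ('small', 'large', 'small')
```

Routing each query independently by predicted quality would send all three to `large`, which breaks both the budget and the concurrency cap. The joint assignment spends the single `large` slot on the query where it helps most. Brute force is exponential in batch size, so this only illustrates the objective and constraints.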
Evaluation as a first-class problem
“ViBench” (Zhong et al., Replit/CMU) and “Vibe Code Bench” (Tran et al., Vals AI) both tackle end-to-end coding agent evaluation. ViBench is derived from production traces across 15 applications. Vibe Code Bench uses an autonomous browser agent to verify deployed applications against behavioral specifications. Both find that frontier models complete only about half of realistic application development tasks. The benchmarks themselves are open-source and useful for anyone evaluating coding agents.
“Trace-Level Analysis of Information Contamination” (Mazhar et al., Cornell/UIUC) shows that uncertainty in input artifacts (PDFs, spreadsheets, slide decks) propagates and amplifies through multi-agent workflows in ways that outcome-only evaluation misses. If you’re building multi-agent pipelines that process heterogeneous documents, this paper explains why your end-to-end accuracy numbers might be misleading.
The demos
The 46 accepted demos include several that are worth seeing live:
- Sherlock (Navan) traces production errors from Jira through New Relic to GitHub and opens fix PRs autonomously, resolving 41% of tickets in about 9 minutes versus a 4.2-hour manual baseline.
- SkyDiscover (UC Berkeley) is a modular framework for AI-driven algorithmic discovery that matches AlphaEvolve on many tasks.
- Agent 4 (Replit) demonstrates multi-agent vibe coding with DAG-based task decomposition and isolated forked environments.
- Context Viewer (CMU/nilenso) is a visual analytics tool for inspecting LLM contexts and debugging agent failures, which is the kind of tooling the field desperately needs.
Looking ahead
This is year one. We don’t know yet whether a single venue can hold together six research communities, production engineers, and a reviewing process that’s trying to be fair to all of them. But the submissions surprised us: 219 papers from teams that didn’t have an obvious place to send this work suggest the demand is real, and the quality of the 61 accepted papers kind of blew us away. ACM CAIS really seems to have a place in this field’s future. Year two is going to be something.
This framing, the idea that the systems around the model are where most of the durable value gets built, is a foundation for how we think about AI at Two Sigma and guides us as we build the future. Two Sigma is a platform company at its core, and platforms are compound AI systems by definition. An AI-native platform is one made of compound AI systems, which means the questions CAIS is organized around are the same questions we deeply care about at Two Sigma.
If you’re building agent systems, I’d encourage you to look through the accepted papers even if you can’t attend. The review process surfaced real signal in a space that’s still figuring out its values.