ICML 2025: Key Ideas on LLMs, Human-AI Alignment, and More

A Two Sigma researcher highlights a selection of standout papers, talks, and tutorials from ICML 2025.

The 42nd International Conference on Machine Learning (ICML 2025) convened at the Vancouver Convention Centre in July, drawing researchers, engineers, and practitioners from around the world to discuss the latest developments in machine learning. One of the field’s “Big Three” conferences, alongside NeurIPS and ICLR, ICML remains central to AI research.

This year’s event emphasized cutting-edge areas such as large language models, diffusion models, and robotics, with 3,339 accepted papers. Topics such as AI phenomenology and AI-human alignment highlighted the field’s shifting priorities, while classical areas like algorithms and inference were less prominent, reflecting their relative maturity.

As it has for many years, Two Sigma proudly sponsored ICML, illustrating our long-term commitment to advancing machine learning and engaging with the academic community. Our attendees were impressed by the event’s quality and diversity.

This post highlights a selection of standout papers and sessions from the conference, offering insights into emerging and future machine learning developments. Where noted, abstracts have been abridged for length.

Noteworthy Tutorials

Game-theoretic Statistics and Sequential Anytime-Valid Inference

Aaditya Ramdas

This tutorial provided an interesting perspective on statistics as a gambling problem. The speaker discussed using results from game theory and betting/gambling theory (e-values and e-processes) as an alternative to classical statistical tests (such as p-value tests) to conduct valid inference in sequential decision-making problems. He also described a variety of tests, including tests for equal means (p-value type), stationarity, symmetry, and independence.

Abstract (abridged): Sequential anytime-valid inference (SAVI) provides measures of statistical evidence and uncertainty — e-values and e-processes for testing and confidence sequences for estimation — that remain valid at all stopping times. These allow for continuous monitoring and analysis of accumulating data and optional stopping for any reason. These methods crucially rely on nonnegative martingales, which are wealth processes of a player in a betting game, thus yielding the area of “game-theoretic statistics”. This tutorial presents the game-theoretic philosophy, intuition, language and mathematics behind SAVI, as summarized in https://arxiv.org/pdf/2410.23614.
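The betting viewpoint is easy to demonstrate on a toy problem. The sketch below is my own illustration (not code from the tutorial), with the coin-flip setup and the fixed betting fraction chosen for simplicity: we repeatedly bet a fraction of our wealth that a coin lands heads. Under the null hypothesis that the coin is fair, the wealth process is a nonnegative martingale, so Ville’s inequality lets us stop and reject at any time the wealth crosses 1/alpha.

```python
import random

def e_process_fair_coin(xs, lam=0.3):
    """Wealth process for betting against the null 'the coin is fair'.

    Each round we bet a fraction lam of current wealth on heads.
    Under the null (p = 0.5) the wealth is a nonnegative martingale
    (the final wealth is an e-value), so by Ville's inequality
    P(sup_t W_t >= 1/alpha) <= alpha under the null: we may monitor
    continuously and reject the moment wealth crosses 1/alpha.
    """
    wealth = 1.0
    path = []
    for x in xs:  # x in {0, 1}: 1 = heads
        # payoff factor: (1 + lam) on heads, (1 - lam) on tails
        wealth *= 1.0 + lam * (2 * x - 1)
        path.append(wealth)
    return path

random.seed(0)
# a biased coin (p = 0.8): the bettor's wealth should grow exponentially
biased = [1 if random.random() < 0.8 else 0 for _ in range(200)]
path = e_process_fair_coin(biased)
alpha = 0.05
crossed = any(w >= 1 / alpha for w in path)
print(f"rejected fair-coin null at alpha={alpha}: {crossed}")
```

Unlike a fixed-sample p-value test, this procedure remains valid under optional stopping: peeking at the wealth after every flip costs nothing.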

Modern Methods in Associative Memory

Dmitry Krotov, Benjamin Hoover, Parikshit Ram

This was a good primer on and review of the latest research in associative memory (going beyond Hopfield networks and transformers), and may be worthwhile for anyone who has an interest in these topics.

Abstract (abridged): Associative Memories, such as Hopfield Networks, model fully recurrent neural networks for information storage and retrieval. Recent theoretical advancements have renewed interest in them, highlighting their links to leading AI architectures like Transformers and Diffusion Models. These associations allow for reinterpretation of traditional AI computations via Associative Memories. Novel Lagrangian formulations enable the creation of robust distributed models and guide new architectural designs. This tutorial offers a clear introduction to Associative Memories, with an emphasis on contemporary research methods and practical mathematical derivations and coding exercises.
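To make the link to Transformers concrete, here is a minimal sketch of the dense (modern Hopfield) associative memory update. The tiny ±1 patterns and the fixed inverse temperature beta are my own choices for illustration; the key point is that a single retrieval step is the same computation as one attention head with the stored patterns serving as both keys and values.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def hopfield_retrieve(memories, query, beta=8.0, steps=3):
    """Dense associative memory update: x <- M^T softmax(beta * M x).

    With a large beta, a corrupted query snaps to the nearest stored
    pattern in very few steps. Note the attention correspondence:
    query = x, keys = values = rows of M (the stored patterns).
    """
    x = list(query)
    for _ in range(steps):
        scores = [sum(mi * xi for mi, xi in zip(m, x)) for m in memories]
        p = softmax([beta * s for s in scores])
        x = [sum(p[k] * memories[k][d] for k in range(len(memories)))
             for d in range(len(x))]
    return x

memories = [[1, 1, -1, -1], [-1, -1, 1, 1], [1, -1, 1, -1]]
noisy = [1, 1, -1, 1]  # corrupted copy of the first pattern
out = hopfield_retrieve(memories, noisy)
print([round(v, 3) for v in out])  # recovers the first stored pattern
```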

Training Neural Networks at Any Scale

Leena Chennuru Vankadara, Volkan Cevher

This tutorial provided a review of scaling laws and of different approaches for training and optimizing large-scale ML systems, including hyperparameter tuning for such systems. It could serve as good reference material for a system architect embarking on the expensive (in both time and money) process of training such models, though the content is less oriented toward applied practitioners.

Abstract (abridged): Deep learning’s impact pivots on scale, involving data, computational resources, and their integration with network architectures. This scaling poses challenges like training instability and costly tuning. Addressing these requires high-confidence scaling hypotheses, built on rigorous research. This tutorial covers advances in scaling theory, its history, breakthroughs, and training implications for large models. It connects theory to practice by examining the numerical algorithms vital to deep learning across various domains, organized under a master template to clarify foundational principles. This discussion highlights strategic scaling, aiming to advance the field while optimizing resource use.

Vancouver, British Columbia

Invited Talks

Adaptive Alignment: Designing AI for a Changing World

Frauke Kreuter

This was an eye-opening talk on the active field of survey research, the challenges of building and conducting high quality surveys, and how AI may be able to help. The speaker also reviewed existing sources of high-quality economic and demographic surveys that have been under-utilized by the AI/ML community and that could have significant impact on the models and their understanding of human preferences (i.e., alignment).

Abstract (abridged): As AI becomes integrated into our lives, aligning these systems with human values and societal norms is both crucial and complicated. To achieve this, we can draw from historical measures of public preferences, despite challenges like measurement errors and unrepresentative samples. This talk highlights underused datasets and stresses learning from social science to improve human feedback loops in AI, avoiding common pitfalls. The focus is on developing adaptive alignment strategies for evolving societal norms, necessitating collaboration between social scientists and machine learning experts. This interdisciplinary effort aims to ensure that updates to human values are constantly reflected in AI system design.

Closing the Loop: Machine Learning for Optimization and Discovery

Andreas Krause

This talk provided a good overview of applying (classical) ML to the design of experiments and to hyperparameter optimization using active learning (also known as sequential design of experiments).

Abstract (abridged): How can we accelerate scientific discovery when experiments are costly and uncertainty high? From protein engineering to robotics, data efficiency is key, yet lab automation and foundation models offer fresh opportunities for exploration. This talk discusses work on closing the loop between learning and experimentation using active learning, Bayesian optimization, and reinforcement learning. It covers guiding exploration in high-dimensional spaces, using meta-learned generative priors for quick adaptation from simulation to reality, and employing foundation models to reduce uncertainty. The talk concludes by addressing challenges and opportunities for machine learning in optimizing and advancing science and engineering.
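As a concrete illustration of closing the loop between learning and experimentation, the sketch below (my own toy example, not the speaker’s code) runs GP-UCB Bayesian optimization on a one-dimensional objective: a Gaussian-process surrogate is refit after each “experiment,” and the next experiment is placed where the upper confidence bound (posterior mean plus scaled uncertainty) is largest.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(xtr, ytr, xq, noise=1e-6):
    """Zero-mean GP posterior mean and std at query points xq."""
    K = rbf(xtr, xtr) + noise * np.eye(len(xtr))
    Ks = rbf(xtr, xq)
    mu = Ks.T @ np.linalg.solve(K, ytr)
    cov = rbf(xq, xq) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def objective(x):  # the "costly experiment", unknown to the optimizer
    return np.sin(3 * x) + 0.5 * x

grid = np.linspace(0, 2, 200)
xs = [0.1, 1.9]                      # two seed experiments
ys = [float(objective(x)) for x in xs]
for _ in range(10):                  # GP-UCB active-learning loop
    mu, sd = gp_posterior(np.array(xs), np.array(ys), grid)
    x_next = float(grid[np.argmax(mu + 2.0 * sd)])  # optimism under uncertainty
    xs.append(x_next)
    ys.append(float(objective(x_next)))
best = max(ys)
print(f"best observed value after {len(xs)} experiments: {best:.3f}")
```

The loop first explores regions of high posterior uncertainty, then concentrates evaluations near the emerging optimum, which is exactly the data-efficiency argument for sequential design of experiments.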

Test of Time Award

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

Batch normalization has become a workhorse technology for neural network training, and this paper justifiably deserves the 2025 Test of Time award.

Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
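The method itself is simple to state. Below is a minimal sketch of the batch-norm forward pass at training time (the learned scale/shift follow the paper; inference-time running statistics are omitted for brevity):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    x: (batch, features). gamma and beta are learned per-feature
    parameters, so the layer can still represent the identity transform.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # badly scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

Because each layer now sees inputs with a stable distribution regardless of how earlier layers’ parameters move, much larger learning rates become usable, which is the source of the paper’s speedups.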

Notable Papers

Score Matching with Missing Data

Josh Givens, Song Liu, Henry Reeve

The paper adapts score matching to scenarios with missing data, proposing importance weighting (IW) and variational methods. The authors suggest that IW excels in small, low-dimensional cases, while the variational approach performs better in high-dimensional settings. From the examples, it was hard to say how broadly applicable the approach is, but it might be worth testing.

Abstract (abridged): Score matching is essential for learning data distributions used in various fields like diffusion processes and energy-based modeling. However, its application to incomplete data has been understudied. We address this by adapting score matching and its extensions for scenarios with partially missing data. We introduce two variations: an importance weighting (IW) approach and a variational approach. The IW method shows strong performance in small sample, lower-dimensional settings, while the variational approach excels in high-dimensional contexts. These methods are demonstrated on graphical model estimation tasks, using both real and simulated datasets.
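The importance-weighting idea is easiest to see in a setting simpler than score matching. The sketch below is my own illustration (not the paper’s estimator): when data go missing with a probability that depends on the value itself, the naive mean of observed entries is biased, and reweighting each observation by the inverse of its (here, known) observation probability corrects it.

```python
import random

random.seed(42)
# True data from N(2, 1); entries go missing more often when the value is
# large, so the observed sample is not representative of the population.
xs = [random.gauss(2.0, 1.0) for _ in range(20000)]

def p_observe(x):
    return 0.9 if x < 2.0 else 0.3  # known observation probabilities

observed = [x for x in xs if random.random() < p_observe(x)]
naive = sum(observed) / len(observed)  # biased low: large values underrepresented
# importance weighting: reweight each observed point by 1 / p(observe | x)
iw = (sum(x / p_observe(x) for x in observed)
      / sum(1 / p_observe(x) for x in observed))
print(f"true mean=2.0, naive={naive:.2f}, importance-weighted={iw:.2f}")
```

The paper applies the same correction inside the score-matching objective, where the weights trade bias for variance, which is consistent with IW working best in small, low-dimensional problems.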

Conformal Prediction as Bayesian Quadrature

Jake Snell, Thomas Griffiths

The paper formulates conformal prediction through a Bayesian lens, critiquing frequentist limitations and proposing Bayesian quadrature as an alternative for uncertainty quantification. They claim it offers richer, interpretable guarantees, but it’s unclear whether their approach provides new insights or is a case of torturing an idea into Bayesian formalisms.

Abstract: As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
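For reference, the frequentist guarantee under discussion is the split conformal one, sketched below (a standard textbook construction, not the paper’s Bayesian method): a finite-sample-corrected quantile of calibration residuals yields a prediction radius with marginal coverage at least 1 - alpha for exchangeable data.

```python
import math
import random

def split_conformal_radius(cal_residuals, alpha=0.1):
    """Split conformal prediction from held-out residuals.

    Returns the radius q equal to the ceil((n+1)(1-alpha))-th smallest
    calibration residual; for exchangeable data this guarantees
    P(|y_new - yhat_new| <= q) >= 1 - alpha, whatever the model.
    """
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return sorted(cal_residuals)[min(k, n) - 1]

random.seed(1)
# toy black box: model predicts 0, true y ~ N(0, 1), so residual = |y|
cal = [abs(random.gauss(0, 1)) for _ in range(500)]
q = split_conformal_radius(cal, alpha=0.1)
test = [abs(random.gauss(0, 1)) for _ in range(2000)]
coverage = sum(r <= q for r in test) / len(test)
print(f"radius={q:.2f}, empirical coverage={coverage:.3f}")
```

The paper’s critique is that this guarantee is marginal over calibration sets; the Bayesian quadrature view instead aims to represent the full distribution of losses likely to be observed at test time.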

Roll the Dice & Look Before you Leap: Going Beyond the Creative Limits of Next-token Prediction

Vaishnavh Nagarajan, Chen Wu, Charles Ding, Aditi Raghunathan

The paper proposes a set of tasks designed to assess the creative limits of LLMs in open-ended scenarios requiring abstract connections or pattern creation. The authors argue that next-token prediction is myopic, while multi-token approaches, such as teacherless training and diffusion models, produce more diverse outputs.

Abstract (abridged): The authors design minimal algorithmic tasks to quantify the creative limits of language models, mirroring open-ended real-world tasks. These tasks require a stochastic planning step to discover abstract connections or construct new patterns, demonstrating that next-token learning is limited and overly reliant on memorization. In contrast, multi-token approaches like teacherless training and diffusion models excel in generating diverse outputs. The authors introduce seed-conditioning, which adds noise at the input layer, effectively enhancing randomness while preserving coherence. This work offers a principled test-bed for evaluating open-ended creativity and highlights the potential for moving beyond next-token learning. Part of the code is available at https://github.com/chenwu98/algorithmic-creativity.

MetaOptimize: A Framework for Optimizing Step Sizes and Other Meta-parameters

Arsalan Sharifnassab, Saber Salehkaleybar, Rich Sutton

This paper offers some leads toward automating learning-rate tuning based on intrinsic properties of the optimization process (i.e., without using any external data). The authors pose this as a nested meta-optimization problem and arrive at an online algorithm with exponentiated updates for learning-rate tuning that effectively tracks the autocorrelation of the gradients, increasing the learning rate when autocorrelation rises.

Abstract (abridged): The authors tackle the challenge of optimizing machine learning hyperparameters, crucial for efficient training and high performance, by presenting MetaOptimize. This dynamic approach adjusts meta-parameters, like learning rates, during training without expensive search methods. MetaOptimize can integrate with any first-order optimization algorithm, tuning step sizes dynamically to minimize training regret by considering the cumulative impact of these adjustments. The authors introduce simpler MetaOptimize variants that adapt to various optimization algorithms and perform comparably to the best manually crafted learning rate schedules across diverse tasks. This framework offers an efficient alternative for hyperparameter optimization.
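The flavor of such step-size meta-updates can be sketched as follows. This is an illustrative toy in the spirit of the idea, not the paper’s actual algorithm: multiplicatively grow the learning rate when successive gradients are positively correlated (we are undershooting) and shrink it when they oppose (we are oscillating).

```python
import numpy as np

def sgd_autocorr_lr(grad_fn, w, lr=0.001, meta=0.02, steps=200):
    """SGD with an exponentiated online step-size update.

    The meta-signal is the cosine similarity between consecutive
    gradients: positive autocorrelation raises lr, negative lowers it.
    (Toy sketch; not the MetaOptimize algorithm itself.)
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        corr = np.dot(g, g_prev) / (
            np.linalg.norm(g) * np.linalg.norm(g_prev) + 1e-12)
        lr *= np.exp(meta * corr)  # multiplicative (exponentiated) update
        w = w - lr * g
        g_prev = g
    return w, lr

# toy quadratic: loss = 0.5 * ||w||^2, so the gradient is simply w
w, lr = sgd_autocorr_lr(lambda w: w, np.array([5.0, -3.0]))
print(np.linalg.norm(w), lr)
```

On this quadratic the gradients stay perfectly correlated while the tiny initial step size is too small, so the learning rate grows automatically and the iterate reaches the optimum far faster than fixed-step SGD at the same initial rate.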

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Cameron Jakub, Mihai Nica

This paper provides experimental confirmation of, and a theoretical framework to explain, a phenomenon many ML practitioners have observed: deep networks failing to train and remaining stuck close to initialization, especially as the networks get deeper.

Abstract (abridged): The authors explore the depth degeneracy phenomenon in deep neural networks, where deeper networks tend toward constant functions at initialization. This paper examines how the angle between two inputs in a ReLU network evolves with layers, using combinatorial expansions to derive formulas showing how the angle diminishes as depth increases. These formulas reveal microscopic fluctuations not captured by infinite width limits and lead to different predictions. Theoretical results are validated with Monte Carlo experiments, accurately reflecting finite network behavior. The impact of depth degeneracy on training is empirically studied, with derived formulas expressed via mixed moments of correlated Gaussians and linked to Bessel numbers.
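The phenomenon is easy to reproduce numerically. The sketch below (my own Monte Carlo, with width and depth chosen arbitrarily) pushes two orthogonal inputs through a random He-initialized fully connected ReLU network and tracks the angle between their hidden representations, which collapses toward zero as depth grows: deep networks at initialization map distinct inputs to nearly identical outputs.

```python
import numpy as np

def angle_through_relu_net(depth, width=256, seed=0):
    """Track the angle (degrees) between two inputs layer by layer.

    The inputs start at 90 degrees; each layer is a random Gaussian
    matrix with He-initialization variance 2/width followed by ReLU.
    """
    rng = np.random.default_rng(seed)
    x, y = np.zeros(width), np.zeros(width)
    x[0], y[1] = 1.0, 1.0  # orthogonal inputs
    angles = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        x, y = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
        cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return angles

angles = angle_through_relu_net(depth=30)
print(f"angle after 1 layer: {angles[0]:.1f} deg, "
      f"after 30 layers: {angles[-1]:.1f} deg")
```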

Summing Up

ICML 2025 provided invaluable insights into the future of machine learning, with significant focus on large language models, AI-human alignment, and other burgeoning areas like game-theoretic statistics. By addressing both foundational advancements and emerging challenges, the conference underscored the evolving priorities of the AI community and the need for interdisciplinary collaboration in designing effective systems.

This article is not an endorsement by Two Sigma of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Two Sigma makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.
