Two Sigma researchers share highlights from NIPS 2017

As in previous years, several Two Sigma researchers attended 2017’s Conference on Neural Information Processing Systems (better known as NIPS), a once-sleepy annual event that now boasts upwards of 8,000 attendees. As has increasingly been the case, machine learning in general, and deep learning in particular, were prominent themes.

NIPS 2017 featured 3,200 submissions, with 679 accepted papers. Of these, 40 were accepted for full-length presentations, and roughly another 100 for five-minute spotlights. Of course, the event was also brimming with various posters, workshops, invited talks, and more.

Needless to say, it was (pardon the pun) a lot of information to process. To help narrow it down, Two Sigma researchers Firdaus Janoos and Eric Allen selected 25 of their favorite papers and presentations from NIPS 2017 and its accompanying workshops. Their selection criteria were an informal combination of relevance, fundamental insight, compelling technique and/or results, and ingenuity.

So, without further ado, 25 highlights from NIPS 2017:

## Deep Learning and Reinforcement Learning

**1. **Mastering Games with Deep Reinforcement Learning (Silver): After discussing a history of AlphaZero, including an overview of Monte-Carlo Tree Search (MCTS), David Silver went on to discuss AlphaZero’s application to Go, Chess, and Shogi. AlphaZero outperformed Stockfish (the 2016 Chess world champion, based on minimax search) after four hours of training. It outperformed Elmo (the Shogi world champion) in two hours. Amazingly, AlphaZero analyzes only 80,000 chess positions per second, vs 70,000,000 for Stockfish. Silver speculates that minimax propagates the worst errors in an evaluator to the root of the tree, whereas MCTS tends to cancel out the worst errors.

**2.** Preventing Gradient Explosions in Gated Recurrent Units (Kanai et al.): One of the many papers presenting solutions that mitigate the exploding gradients problem in deep networks. They find a condition under which the dynamics of GRUs change and propose a learning method to address the exploding gradient problem.

**3.** Deep Exploration Via Randomized Value Functions (Van Roy): This work proposes a principled, scalable approach to deep exploration via Thompson Sampling. Reinforcement learning can be thought of as active learning for Markov Decision Processes (MDPs). Thompson Sampling (from the 1930s) selects the posterior of the optimal action; it works very well and very broadly. Of course, it is not practical to sample from the posterior distributions over a neural network, but we can effectively approximate these distributions. We can also apply Thompson Sampling to episodic reinforcement learning by considering the actions to be full policies.

**4. **Fast-Slow Recurrent Neural Networks (Mujika et al.): This architecture incorporates multiscale RNNs and deep transition RNNs to process sequential data on different timescales and learn complex transition functions.

**5. Artificial Intelligence Goes All-In** (Bowling): (A recording of this talk at another venue is available here.) DeepStack is a system for heads-up no-limit hold 'em poker. Distributions over game states are modeled by estimating, for each set of cards an opponent might hold, the likelihood that she would have folded before getting to that state. An evaluation function is learned using seven layers and a small differentiable unit to ensure zero sum values. DeepStack has beaten professional players by an average of 486 milli big blinds (mbb) per game. In contrast, professional players aim to average 50 mbb per game. The hardware used: A gaming laptop trained for less than a week on a single GPU.

**6.** Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization (Noh et al.): This paper attempts to explain why noise-injection techniques such as gradient clipping, drop-out, etc seem to improve the performance of many DNNs and then use this theory to propose a technology that they claim works even better.

**7.** Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (Gu et al.): This poster proposed some cool techniques to carry over the sample-efficiency of off-policy techniques to on-policy learning.

**8.** Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks (Hoffer, et al.) The goal of this work is to parallelize models during training. Natural approaches are (1) split the model (2) split the data. With the latter approach, we might be able to increase the batch size. But it has been observed that large batch size has hurt generalization. (See Keskar et al., 2017.) But this work shows that the generalization is actually just a function of the number of epochs, which goes down with increased batch size. A series of tricks combined with fixing the number of iterations for all batch sizes eliminates the generalization gap. An explanation for this result is suggested by comparing training to a random walk over a random potential.

**9.** Self-Normalizing Neural Networks (Klambauer et al.): In an attempt to demystify some of the alchemy in neural networks, this paper presented a modification to the ReLU unit called a "self normalizing neural network" such that the activations of each internal layer of a DNN preserve their statistics, thereby obviating the need for batch normalization. To quote the authors: " … we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance -- even under the presence of noise and perturbations. This convergence property of SNNs allows us to (1) train deep networks with many layers, (2) employ strong regularization, and (3) to make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance, thus, vanishing and exploding gradients are impossible."

**10.** Deep Reinforcement Learning with Subgoals (Silver): In feudal reinforcement learning, each agent tries to achieve a goal by selecting subgoals for its worker agents to achieve. The lowest level modules select raw actions. So this gives us a full framework for hierarchical reinforcement learning. FuN is Feudal RL with Direction Subgoals. There is a manager and a worker: The manager selects subgoal directions that maximize rewards, while the worker selects actions that maximize cosine similarity for a given direction. Subgoals can represent value and policy via universal value and policy function approximators. Benchmarks were shown for the Atari game Montezuma’s Revenge.

**11.** Meta-Learning Shared Hierarchies (Abbeel): Abbeel discussed three recent projects:

*Stochastic Neural Networks (SNN)*--Can we pre-train in such a way that we learn skills, and transfer what is learned to new environments?*The RL^2 framework*--We would like to learn not a policy but an RL agent that is able to learn effectively in a variety of environments drawn from a distribution.*Meta-Learning Shared Hierarchies (MLSH)*--A master action is chosen at a low time scale, and a low level policy acts at full speed, with end-to-end optimization finding sub-policies that enable fast learning of the master policy.

## Optimization

**12.** On Quadratic Convergence of DC Proximal Newton Algorithm in Nonconvex Sparse Learning (Li et al.): This paper presents a second-order method for non-convex optimization using difference-of-convex relaxations (DC programming) of non-convex problems and proximal Newton methods - along with proofs of quadratic convergence rates.

**13.** Robust Optimization for Non-Convex Objectives (Chen et al.): In this talk Robert Chen describes a pretty neat algorithm for robust optimization of non-convex objective functions by using approximate distributions over objectives.

**14.** Variance-based Regularization with Convex Objectives (Namkoong and Duchi): This Best Paper award-winner proposes a technique for variance reduced optimization using a convex surrogate for variance. They prove optimality along with faster rates of convergence than empirical risk minimization by virtue of balancing bias and variance.

**15.** The Marginal Value of Adaptive Gradient Methods in Machine Learning (Wilson et al): In continuation of Ben Recht's mission to demolish ML mythology, he and his collaborators claim in this paper that adaptive and accelerated gradient descent methods, while converging faster, tend to have significantly worse generalization properties than vanilla SGD.

## Other Algorithms

**16.** Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models (Ioffe): Batch normalization acts differently in training and inference. If we think of performance at training as an estimator for performance at inference, it turns out that this estimator is biased. This work provides a way to eliminate the bias. The approach involves including a correcting term at training time to account for the discrepancy between minibatch and full-batch moments. This approach is now available as part of TensorFlow.

**17.** Harmonica (Hazan et al.): A spectral (i.e., Fourier domain) method for hyper-parameter optimization with finite sample guarantees. They claim that their technique is distributable, provably correct, has finite sample size and computational complexity guarantees, and runs much faster than existing hyper-parameter search algorithms.

**18.** Soft-DTW, a Differentiable Loss for Time-Series (Cuturi and Blondel): In this talk, Mathieu Blondel described a soft-min relaxation of the classical dynamic time warping algorithm that allows making it into a differentiable loss function, and which can be used for a variety of purposes, including prediction / forecasting. He also described some pretty neat algorithms for computing the gradient making use of the fact that the induced loss is a exponential-family distribution, and can be computed using dynamic programming. Very cool.

**19.** A Unified Approach to Interpreting Model Predictions (Lundberg and Lee): This paper presents a method for measuring the importance of features in a complex ML system by approximating its Shapley value. The authors claim that their technique generalizes six existing methods for feature importance measurement. Based on discussions with one of the authors, this method can also be used for studying multi-way interactions.

## Online Learning

**20.** Learning Linear Dynamical Systems via Spectral Filtering (Hazan et al.): In this paper, the authors present an online algorithm for learning the parameters of a linear dynamical system that has provable guarantees and bypasses the non-convexity problem by moving into the spectral domain.

**21.** Stochastic and Adversarial Online Learning without Hyperparameters (Cutkosky and Boahen): The paper proposes a class of online optimization algorithms that achieve O(log4(T)) regret in a wide class of stochastic settings while simultaneously maintaining O(sqrt(T) ) adversarial bounds. Furthermore the authors claim that it does not require any prior knowledge about the data or tuning of parameters to achieve superior performance.

## Theory

**22.** A New Theory for Matrix Completion: This * very *interesting poster claimed that non-convex matrix completion algorithms (such as ALS) are, in fact,

*better than*convex relaxations (such as spectral norm) in the sense that they recovered correct solutions under weaker assumptions than convex methods.

**23.** Generalization Properties of Learning with Random Features (Rudi and Rosasco): In this paper the authors show that learning with random features has better generalization properties than previously believed--and their theory explains why, in practice, people can get away with using far fewer features than the original theory suggested.

**24.** Approximation and Convergence Properties of Generative Adversarial Learning (Liu et al.): This paper attempts to answer a fundamental open questions about GANs - namely how well they can approximate the target distribution and whether convergence to global minima implies convergence in distribution. They define the notion of "adversarial divergence" and show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence (i.e., in distribution).

**25.** Spectrally-normalized Margin Bounds for Neural Networks (Bartlett et al): This paper presents a new method to measure the complexity of neural networks using its local Lipschitz constant and margin-normalized "spectral complexity" and uses this to derive new generalization bounds. The authors also show experimental evidence to suggest that these bounds line up well in practice.