Several Two Sigma researchers and engineers recently attended the 35th International Conference on Machine Learning (ICML 2018) in Stockholm. As in past years, Two Sigma also sponsored the event, reflecting a strong belief in the value of embracing the state of the art, challenging our own methodological assumptions, and maintaining our ties to the academic community.

ICML 2018 featured 621 excellent papers, for which we are grateful not only to the authors, but to the 160 area chairs and 1,776 reviewers who made the event possible. This year, we were impressed not only with the strong keynotes and tutorials but also with the large number of diverse workshops, thanks to co-location with the International Joint Conference on Artificial Intelligence (IJCAI) and the International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Enthusiasm surrounding machine learning and related topics remains palpable, and it continues to grow.

Given the wealth of high quality information on offer at ICML 2018, absorbing and evaluating everything such a comprehensive conference has to offer can be a challenge. Below, Two Sigma researchers Xiang Zhou, Travis Wang, and Steven Kim highlight several papers they found particularly novel, practical, or otherwise compelling.

### Adversarial

**Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples** – Anish Athalye, Nicholas Carlini, David Wagner

This paper won one of the best paper awards at this year’s ICML. It is widely acknowledged that although neural networks have achieved great performance (in some tasks even surpassing human performance), they are still susceptible to various types of adversarial examples. There has been significant recent interest in constructing defenses in order to increase the robustness of neural networks. This paper introduces the notion of “obfuscated gradients” as a phenomenon that leads to a false sense of security in defenses against adversarial examples.

This paper is, in some sense, a review paper. The authors examined nine defenses from ICLR 2018, found that seven of them rely on obfuscated gradients, and then constructed successful attacks to circumvent those seven defenses. They consider three types of obfuscated gradients, caused either intentionally or unintentionally:

- Shattered gradients are caused when a defense is non-differentiable, and can be attacked by the proposed Backward Pass Differentiable Approximation (BPDA) method, which essentially replaces the non-differentiable layer with a differentiable approximation;
- Stochastic gradients are caused by randomized defenses, and can be attacked by Expectation Over Transformations (EOT);
- Exploding and vanishing gradients are caused by defenses that consist of multiple iterations of neural network evaluation, and can be attacked by re-parameterization.
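The first of these, BPDA, can be sketched in a few lines. The toy "defense" below (input quantization), the function names, and the scoring function are all illustrative stand-ins, not the paper's models: the forward pass applies the real, non-differentiable defense, while the backward pass pretends the defense is the identity, restoring a usable gradient for the attacker.

```python
# Minimal, framework-free sketch of BPDA (Backward Pass Differentiable
# Approximation). Names and the toy "defense" are illustrative.

def g(x):
    """Toy 'defense': quantize the input (non-differentiable, zero gradient a.e.)."""
    return round(x * 10) / 10

def f(x):
    """Smooth model loss on the (possibly quantized) input."""
    return (x - 0.5) ** 2

def true_grad_f(x):
    """Analytic derivative of the smooth part, f'(x) = 2(x - 0.5)."""
    return 2 * (x - 0.5)

def bpda_grad(x):
    # Forward pass uses the real defense g; backward pass pretends g'(x) = 1,
    # so d f(g(x)) / dx  is approximated by  f'(g(x)) * 1.
    return true_grad_f(g(x))

# One step of a gradient-ascent attack against the defended model: the
# quantization no longer blocks the gradient signal.
x = 0.83
step = 0.1
x_adv = x + step * bpda_grad(x)
```

The same pattern generalizes: in a deep network, only the obfuscating layer's backward pass is replaced by a differentiable approximation, and standard attacks (e.g., projected gradient descent) proceed as usual.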

The paper also gives high-level guidelines about building and evaluating defenses. A good defense should make the threat model clear (i.e., the setting under which the defense is meant to work). A good defense should also make precise and testable claims. These guidelines will hopefully benefit future research on both attacks and defenses.

### Architectures

**Conditional Neural Processes** – Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, S. M. Ali Eslami

Deep neural networks excel at function approximation, yet they typically require training “from scratch” for each new function. Bayesian methods, on the other hand, exploit prior knowledge to quickly infer the shape of a new function at test time. These are computationally expensive, however, and designing appropriate priors can be difficult.

This paper by DeepMind proposes a family of neural models, Conditional Neural Processes (CNPs), which are inspired by the flexibility of stochastic processes such as Gaussian processes but are structured as neural networks and trained via gradient descent. Thus CNPs combine the benefits of both deep neural networks and Bayesian methods.
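The CNP forward pass has a simple shape: encode each observed (x, y) context pair, aggregate the encodings with a permutation-invariant operation, and decode a prediction for each target input conditioned on that aggregate. The sketch below is schematic and untrained; the tiny linear encoder/decoder and their fixed weights are illustrative placeholders for the MLPs a real CNP learns end-to-end by gradient descent.

```python
# Schematic (untrained) forward pass of a Conditional Neural Process.
# The linear maps and their weights are illustrative placeholders.

def encode(x, y, w=(0.5, -0.3, 0.1)):
    """Map one context pair (x, y) to a scalar representation r_i."""
    return w[0] * x + w[1] * y + w[2]

def aggregate(rs):
    """Permutation-invariant aggregation (here, the mean) of representations."""
    return sum(rs) / len(rs)

def decode(x_target, r, w=(0.8, 0.4)):
    """Predict a (mean, variance) for a target input, conditioned on r."""
    mean = w[0] * x_target + w[1] * r
    var = 1.0  # a real CNP outputs a learned, input-dependent variance
    return mean, var

def cnp_forward(context, x_targets):
    r = aggregate([encode(x, y) for x, y in context])
    return [decode(xt, r) for xt in x_targets]

preds = cnp_forward(context=[(0.0, 0.0), (1.0, 1.0)], x_targets=[0.5])
```

Because the aggregation is a mean, the prediction is invariant to the ordering of the context set, and a forward pass costs O(n + m) for n context points and m targets, rather than the O((n + m)^3) of a Gaussian process.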

The experiments conducted on a range of canonical machine learning tasks (including regression, classification, and image completion) indicate that CNPs make accurate predictions after observing only a handful of training data points, yet scale to complex functions and large datasets.

This work is a step towards learning high-level abstractions, one of the grand challenges of contemporary machine learning. Functions learned by most conventional deep learning models are tied to a specific, constrained statistical context at any stage of training. A trained CNP is more general. It encapsulates the high-level statistics of a family of functions and thus can be reused for multiple tasks. The work could help in tackling the many key machine learning problems that seem to hinge on abstraction, such as transfer learning, meta-learning, and data efficiency.

**Efficient Neural Architecture Search via Parameter Sharing** – Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean

This paper from a group at Google Brain builds on previous work on Neural Architecture Search (NAS) (Zoph & Le, 2017) for automatic model design. Despite its impressive empirical performance, NAS is both computationally expensive and time consuming.

Recent work on transfer learning and multitask learning (Razavian et al., 2014; Zoph et al., 2016; Luong et al., 2016) established that parameters learned for a particular model on a particular task can be used for other models on other tasks, with little to no modification. Encouraged by this progress, this paper proposes Efficient Neural Architecture Search (ENAS), which improves the efficiency of NAS by forcing all child models to share weights, eschewing the lengthy process of training each child model from scratch to convergence.
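The weight-sharing idea can be sketched as a single parameter pool indexed by architectural position and operation, from which every sampled child architecture borrows its weights. Everything below (the scalar "weights," the op names, the toy update) is illustrative, not the paper's actual search space or training procedure.

```python
# Minimal sketch of ENAS-style weight sharing: one shared pool of
# parameters indexed by (layer, op); each sampled child borrows from it.

import random

OPS = ["conv3", "conv5"]
NUM_LAYERS = 2

# One shared scalar "weight" per (layer, op) pair stands in for a real tensor.
shared = {(l, op): 0.1 for l in range(NUM_LAYERS) for op in OPS}

def sample_child(rng):
    """The controller samples one op per layer to define a child architecture."""
    return [rng.choice(OPS) for _ in range(NUM_LAYERS)]

def child_forward(arch, x):
    for l, op in enumerate(arch):
        x = x * shared[(l, op)]  # each layer reads the *shared* weight
    return x

def train_child_step(arch, lr=0.01):
    """Stand-in update: training any child mutates the shared pool in place."""
    for l, op in enumerate(arch):
        shared[(l, op)] += lr

rng = random.Random(0)
a1, a2 = sample_child(rng), sample_child(rng)
train_child_step(a1)  # a2 benefits wherever it picks the same (layer, op) pair
```

The key point is that no child ever starts from scratch: whenever two architectures select the same operation at the same position, they read and write the same parameters, so compute spent on one child is amortized across the whole search.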

The paper claims that ENAS delivers strong empirical performance, while using far fewer GPU-hours than existing automatic model design approaches; specifically, ENAS is 1000x less expensive than standard NAS. This work could certainly speed up Google’s Cloud AutoML. For automatic architecture selection, many questions still remain, such as working with small datasets, or proving optimality of the search result.

**Differentiable Dynamic Programming for Structured Prediction and Attention** – Arthur Mensch, Mathieu Blondel

Empirical successes in deep learning and modern computational infrastructure have inspired a push for end-to-end learning, but these efforts have largely been limited to learning models with stacked layers of elementary linear algebraic operations (due to the ease of automatic differentiation). The authors of this paper observe a growing desire for modeling pipelines that can process the optimal value and/or argument of a dynamic program (in domains like sequence prediction, time-series alignment, and machine translation).

Unfortunately, the optimal value and argument of a dynamic program are typically non-differentiable, hence cannot be directly inserted as a layer of a neural network. There is a wealth of prior literature on probabilistic graphical models that instead approaches dynamic programs by moving from the so-called (max, +) semiring to the (+, x) semiring; here, the sum-product algorithm (also known as belief propagation, message-passing, etc.) is the typical computational workhorse.

In this paper, the authors observe that existing probabilistic methods can be reframed as a particular instance of a general class of max operators with strongly convex regularizers (examples include the entropy and the l2 norm). Elementary results from convex duality show that this regularized max is smooth, with its gradient given by the maximizing argument. The authors go further: first, they show that the optimal value and argument of the regularized dynamic program are differentiable; next, they demonstrate that their gradients are explicitly computable; finally, they prove bounds on the distance between the solutions of the regularized max and the original dynamic program. The results in this paper serve as a substantial primer on how to reliably incorporate non-trivial dynamic programming layers into a neural network architecture.
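The entropy-regularized case is the easiest to see concretely: the regularized max becomes the log-sum-exp function (scaled by a temperature), and its gradient is exactly a softmax over the arguments. A small self-contained sketch, with illustrative names:

```python
# Entropy-regularized max: max_Ω(x) = m + γ·log Σ exp((x_i − m)/γ),
# a smooth surrogate for max whose gradient is the softmax distribution.

import math

def smoothed_max(xs, gamma=1.0):
    m = max(xs)  # subtract the max for numerical stability
    return m + gamma * math.log(sum(math.exp((x - m) / gamma) for x in xs))

def smoothed_argmax(xs, gamma=1.0):
    """Gradient of smoothed_max w.r.t. xs: a softmax, i.e. a 'soft' argmax."""
    m = max(xs)
    es = [math.exp((x - m) / gamma) for x in xs]
    z = sum(es)
    return [e / z for e in es]
```

As the temperature gamma shrinks toward zero, `smoothed_max` approaches the hard max and its gradient concentrates on the true argmax; with an l2 regularizer instead of entropy, the gradient becomes sparse rather than dense. Replacing every max in a dynamic-programming recursion with such an operator is what makes the whole program differentiable end to end.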

### Generative Adversarial Networks (GANs)

**GAIN: Missing Data Imputation using Generative Adversarial Nets** – Jinsung Yoon, James Jordon, Mihaela van der Schaar

This paper presents a novel method for missing data imputation inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014). The method is built under the assumption that data is missing completely at random. If the model-imputed data is drawn from a distribution close to the underlying distribution of the observed original data, it should not be possible to discriminate between the imputed data and the original data.

Different from the standard GAN architecture, which uses a generator to create fake samples and a discriminator to distinguish between fake and real samples, the new architecture, the generative adversarial imputation network (GAIN), uses a generator to impute the missing components of each sample, and a discriminator to predict which components of a sample are imputed. A hint mechanism is introduced in order to reveal to the discriminator partial information about the missingness of the original sample, which the discriminator uses to focus its attention on the imputation quality of particular components. This hint ensures that the generator does in fact learn to generate according to the true data distribution.
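One common form of the hint is easy to sketch: given the mask marking observed (1) versus imputed (0) components, reveal the true mask everywhere except at one randomly chosen component, which is set to 0.5 ("unknown"). The discriminator must then decide, at that withheld component, whether the value is real or imputed. The code below is a minimal illustration with made-up names, not the paper's implementation:

```python
# Sketch of a GAIN-style hint: reveal the mask at all but one component.

import random

def make_hint(mask, rng):
    """mask[i] = 1 if component i was observed, 0 if it was imputed."""
    j = rng.randrange(len(mask))  # component whose mask value is withheld
    hint = list(mask)             # copy; the true mask is left untouched
    hint[j] = 0.5                 # 0.5 signals "unknown" to the discriminator
    return hint

rng = random.Random(0)
hint = make_hint([1, 0, 1, 1, 0], rng)
```

With full hints the discriminator's task is trivial and the generator learns nothing; with no hints the adversarial game can admit many equilibria. Withholding exactly the right amount of mask information is what pins the generator to the true data distribution.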

Experiments reported in the paper show that GAIN significantly outperforms state-of-the-art data imputation methods. However, as noted above, the test is based on the assumption that data is missing completely at random, so an open question is how GAIN-based imputation will behave when that assumption does not hold.

**A Two-Step Computation of the Exact GAN Wasserstein Distance** – Huidong Liu, Xianfeng David Gu, Dimitris Samaras

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have received tremendous attention in recent years. The idea is to use deep networks to generate samples that cannot be distinguished from real samples. However, previous research has noted that the original GAN architecture suffers from vanishing gradients, training instability, and mode collapse. The invention of Wasserstein GANs (WGANs) (Arjovsky et al., 2017) was a breakthrough that motivated research into computing an appropriate distance metric.

While largely successful in solving the above issues, the WGAN architecture still requires a technical nuance in order to use a deep network to approximate a Lipschitz function class. Specifically, a WGAN needs to clip its network weights in every iteration in order to guarantee the boundedness of the neural network function. However, weight-clipping limits the functional space of the WGAN “critic” and can also cause gradients to explode or vanish if the clipping parameters are not carefully chosen.

To mitigate these problems, this paper proposes a two-step method to compute the Wasserstein distance in WGAN. In the first step, a linear programming problem is formed to compute the exact Wasserstein distance, without the requirement of setting a weight clip parameter. In the second step, the authors apply a deep neural network to approximate the optimal solution of the first step by a differentiable trainable function. The new algorithm is driven by a new formulation of Wasserstein distance based on the discrete Monge-Kantorovich duality. Results on both synthetic data and real data show that the proposed method computes the Wasserstein distance both faster and more accurately.
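The flavor of the first step can be seen in a special case where the Monge-Kantorovich linear program has a closed-form solution: for two one-dimensional empirical distributions with equal weights, the optimal transport plan simply pairs the sorted samples, yielding the exact W1 distance with no weight clipping and no Lipschitz critic. This toy sketch (illustrative names; the paper handles the general discrete LP) shows the idea:

```python
# Exact 1-D Wasserstein-1 distance between two equal-weight empirical
# distributions: the Monge-Kantorovich LP reduces to sorting and pairing.

def exact_w1_1d(xs, ys):
    assert len(xs) == len(ys), "equal-weight case: same number of atoms"
    xs, ys = sorted(xs), sorted(ys)
    # The optimal coupling matches the i-th smallest x to the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Every point of the first sample must move by exactly 1, so W1 = 1.
d = exact_w1_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

In higher dimensions the pairing structure disappears and one must genuinely solve the linear program; the paper's second step then fits a network to the resulting optimal solution so that the distance is differentiable with respect to the generator.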

### Reinforcement Learning

**SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation** – Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, Le Song

Reinforcement learning (RL) continued to be a hot topic at ICML 2018. Recent years have witnessed the successful application of reinforcement learning in many real-world challenges, including Atari games, AlphaGo, and much more. Without a doubt, this is a result of a deepening understanding of the RL framework and important progress in developing RL algorithms.

In this paper, the authors first revisit the Bellman equation (Puterman, 2014), which is central in RL problems, and reformulate it into a primal-dual optimization problem using the Nesterov smoothing technique and the Legendre-Fenchel transformation. Then a new algorithm, called Smoothed Bellman Error Embedding (SBEED), is proposed to solve this optimization problem, based on stochastic mirror descent (Nemirovski et al., 2009). The algorithm allows for any differentiable function class, and a theoretical guarantee is provided for general nonlinear function approximation (the first, to the best of our knowledge).
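The effect of smoothing the Bellman equation can be illustrated on a toy MDP. The hard max over actions in the Bellman operator is non-smooth; replacing it with a smoothed max (a log-sum-exp is a standard instance of such smoothing, used here for illustration, with all MDP names made up) yields a differentiable backup, which is what makes the primal-dual reformulation and gradient-based training possible:

```python
# Toy smoothed Bellman backup: replace max_a with a log-sum-exp smoothed max.

import math

def smooth_max(xs, lam=0.1):
    m = max(xs)
    return m + lam * math.log(sum(math.exp((x - m) / lam) for x in xs))

# Tiny illustrative MDP: rewards and transitions for state "s0".
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0}
NEXT = {("s0", "a0"): "s0", ("s0", "a1"): "s1"}
V = {"s0": 0.0, "s1": 0.0}
gamma = 0.9

def smoothed_bellman_backup(s):
    """V_new(s) = smooth_max_a [ r(s, a) + gamma * V(next(s, a)) ]."""
    qs = [R[(s, a)] + gamma * V[NEXT[(s, a)]] for a in ("a0", "a1")]
    return smooth_max(qs, lam=0.1)

v0 = smoothed_bellman_backup("s0")
```

As the smoothing parameter shrinks, the smoothed backup approaches the ordinary Bellman backup; the paper's contribution is to control the bias this smoothing introduces while gaining a well-posed optimization problem over arbitrary differentiable function classes.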

The authors decompose the estimation error of value function into four sources: optimization error from mirror descent, generalization error from using finite samples, bias induced by the Nesterov smoothing, and approximation error induced by using function approximation. In addition, the proposed algorithm is closely related to other successful algorithms such as trust region policy optimization (TRPO) (Schulman et al., 2015) and natural policy gradient (Kakade, 2002).

### Theory and Optimization

**On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization** – Sanjeev Arora, Nadav Cohen, Elad Hazan

Traditional theory urges caution when increasing model complexity, whether through generalization bounds in learning theory or variance bounds in statistics, both of which largely depend on bounding dimension (for a definition of “dimension” that suitably encodes expressiveness). Nonetheless, practitioners often find that increasing model complexity reduces test error. Surprisingly, this can be the case even when the practitioner is fitting a model from a perfectly specified class of models. That is, suppose one has a synthetic dataset of labels generated by applying a particular depth-D, width-W, fully-connected network to random input data; even with D and W known, one can often attain better test error by optimizing over networks in the more complex class of depth D+1 and width W+1.

In this work, the authors attempt to shine some theoretical understanding on this practical phenomenon, by quantitatively demonstrating how increased model complexity can ease model training. Note that past work has suggested that some machine learning models may suffer from *underfitting* due to insufficient optimization, hence accelerated optimization procedures can yield not only faster training but also better model performance. This work indicates that increasing model complexity (termed “overparameterization” here) is itself a form of acceleration. Specifically, gradient descent on an overparameterized network induces an “extra” term in each gradient step that amounts to acceleration. Moreover, this paper provides a proof that this extra acceleration term cannot be attained by any sort of weight-based regularization of the loss function, as the line integral (with respect to weights) of the gradient of the overparameterized network does not vanish.
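The scalar linear case gives a concrete feel for the "extra" term (this setup is an illustration in the spirit of the paper's linear-network analysis, not its general result). Overparameterize a single weight as a product w_e = w1 · w2 and take one gradient step on (w1, w2): the end-to-end weight moves by roughly (w1² + w2²) times the plain gradient, a data-dependent preconditioning factor that no fixed penalty added to the loss would produce.

```python
# One gradient step on an overparameterized scalar linear model w_e = w1 * w2,
# for the toy loss L(w) = (w - 1)^2. Names and numbers are illustrative.

def dL(w):
    """Derivative of the toy loss: L'(w) = 2(w - 1)."""
    return 2 * (w - 1)

w1, w2, eta = 2.0, 1.0, 1e-4
w_e = w1 * w2

g = dL(w_e)
w1_new = w1 - eta * w2 * g  # chain rule: dL/dw1 = w2 * L'(w1 * w2)
w2_new = w2 - eta * w1 * g  # chain rule: dL/dw2 = w1 * L'(w1 * w2)

# Effective movement of the end-to-end weight vs. the first-order prediction:
actual_step = w_e - w1_new * w2_new
predicted_step = eta * (w1**2 + w2**2) * g  # agrees up to O(eta^2)
```

Expanding the product shows w1_new · w2_new = w_e − η(w1² + w2²)L′(w_e) + η²w1w2L′(w_e)², so for small step sizes the overparameterized dynamics behave like gradient descent on w_e with an adaptive, history-dependent step — the "implicit acceleration" of the title.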

From a theoretical perspective, this paper offers one more piece of evidence that model formulation and model optimization are crucially linked, rather than being distinct components of data analysis. For the practitioner, the upshot is that model complexity can be embraced, not only for the statistical benefit of expressiveness, but also for the computational benefit of improved trainability.

**Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks** – Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

Deep architectures have allowed increasingly sophisticated modeling in diverse domains, but practical data analysts must often wrestle with the numerical pathologies that arise during training of deep models. Specifically, as depth increases, gradients may vanish or explode, a computational instability that the authors characterize in terms of the well-conditionedness of the input-output Jacobian of the network. Traditionally, practitioners will combat obstacles to training by employing architectural “tricks” like residual/skip connections or batch normalization; while these “tricks” can be effective, it is not clear whether they are crucial from a *computational* perspective (by allowing for reasonably fast training) or from a *statistical* perspective (by enforcing desirable constraints on the model class).

Prior work in the fully-connected setting (Pennington, Schoenholz and Ganguli, 2017) has analyzed the input-output Jacobian by considering the layer-to-layer transformations of a network in the “mean-field” limit (i.e., as width tends to infinity), and then interpreting the fixed point of these dynamics. It was shown that well-conditionedness of the input-output Jacobian can be attained by delicately constructing initial weights to be orthogonal (in a suitable sense), which induces norm-preserving (hence the term “dynamical isometry”) transformations in each layer. In the fully-connected setting, orthogonal weights can simply be constructed via the Gram-Schmidt process. Note that the desired condition here is on *initialization*; this differs from other work involving penalties on weights in order to induce orthogonality throughout training (e.g., Balestriero and Baraniuk, 2018).
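In the fully-connected case, the construction really is this simple: orthonormalize the rows of a random Gaussian matrix. The sketch below uses classical Gram-Schmidt in pure Python for clarity; in practice one would take the Q factor of a QR decomposition of a random matrix.

```python
# Orthogonal weight initialization (fully-connected case) via Gram-Schmidt.

import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(rows):
    """Orthonormalize a list of vectors (assumed linearly independent)."""
    basis = []
    for v in rows:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]  # remove projection
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis

rng = random.Random(0)
n = 4
W = gram_schmidt([[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)])
```

An orthogonal W preserves the norm of every signal it multiplies, so stacking many such layers at initialization keeps singular values of the input-output Jacobian near one. The paper's contribution is the far less obvious analogue of this construction for convolution operators.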

This paper offers a substantial extension of the concept of orthogonal initialization to the setting of convolutional neural networks (CNNs), where orthogonal initial weights are constructed through a non-trivial (but explicitly detailed) scheme drawn from wavelets theory. Empirically, the authors’ orthogonal initialization allows for training of very deep CNNs, without any architectural tricks. This work (along with related work on recurrent neural networks due to Chen, Pennington, and Schoenholz, 2018) further distills the effect of dynamical isometry during training of networks, and ultimately offers some concrete guidance for practitioners: better initialization induces faster training!

**Not All Samples Are Created Equal: Deep Learning with Importance Sampling** – Angelos Katharopoulos, François Fleuret

This paper presents an efficient algorithm for accelerating the training of deep neural networks using importance sampling.

Previous works (Zhao & Zhang, 2015; Needell et al., 2014) have shown that the optimal sampling distribution is proportional to the per-sample gradient norm. However, computing this distribution is very expensive.

This paper derives a tractable upper bound on the per-sample gradient norm, as well as an estimate of the variance reduction achieved with importance sampling. Doing so enables the importance-sampling mechanism to be “switched on” only when it will result in an actual speedup. This approach can reduce computational requirements by more than an order of magnitude compared to recent work (Alain et al., 2015).
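The mechanics of importance sampling for SGD are straightforward to sketch. Below, a per-sample score stands in for the paper's gradient-norm upper bound (the names and the toy scores are illustrative): samples are drawn with probability proportional to their score, and the chosen sample's update is reweighted by 1/(N·p_i) so the gradient estimator remains unbiased.

```python
# Sketch of importance sampling for SGD: draw a sample with probability
# proportional to a per-sample score, and reweight for unbiasedness.

import random

def importance_sample(scores, rng):
    total = sum(scores)
    probs = [s / total for s in scores]
    # Draw one index according to the importance distribution.
    i = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    # Reweight by 1 / (N * p_i) so E[weighted gradient] is unchanged.
    weight = 1.0 / (len(scores) * probs[i])
    return i, weight

rng = random.Random(0)
scores = [0.1, 0.1, 5.0, 0.1]  # one "hard" sample dominates the budget
i, w = importance_sample(scores, rng)
```

When all scores are equal this reduces exactly to uniform sampling with weight 1; the variance reduction relative to that uniform baseline is what the paper estimates online to decide when the extra bookkeeping pays for itself.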

Experiments reported in this paper show that the algorithm is effective in reducing the training time for several tasks, on both image and sequence data. The results also show that not all data points matter equally over the course of training, a fact that can be exploited to gain a computational speedup, better-quality gradients, or both. However, the image data used in the experiments has a high signal-to-noise ratio, so it would be interesting to see how much importance sampling can help for datasets with greater levels of noise.