A senior Two Sigma researcher provides an overview of some of the most interesting Deep Learning research from ICML 2017.
Deep learning--a powerful class of machine learning algorithms--represents an increasingly potent way to uncover patterns in vast datasets. Each year, researchers gather at conferences like the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NIPS) to share new research and gain better awareness of the state of the art.
ICML 2017, which Two Sigma sponsored, was held in August in Sydney, with about 2500 people attending. Spread over the course of several days, the conference featured a wide variety of tutorials, presentations, and workshops. The program covered nine parallel tracks (encompassing about 400 accepted papers out of 1600 submitted), many of which focused on Deep Learning and related topics.
Several of Two Sigma’s deep learning researchers attended this year’s event. Below, senior researcher Satrajit Chatterjee summarizes some noteworthy papers and presentations from the conference.
Causal Inference—Bernhard Scholkopf
Bernhard Scholkopf gave a very interesting talk on Causal Inference. By way of motivation, he provided a compelling example of how current ML systems lack knowledge of causality: When he was looking for a laptop rucksack on Amazon, for example, it recommended a laptop.
The question I find most interesting in causal inference is whether it is possible to detect causal connections from data without intervention. Scholkopf seemed to indicate that it is indeed possible, under certain circumstances (when one has "independent noise"). He mentioned his book Elements of Causal Inference, which covers many of the topics he spoke about in the talk.
Scholkopf covered a number of real-world applications of causal inference, and in particular how his group was able to find more than a dozen new exoplanets from publicly available Kepler data.
He concluded with the thought that statistical information is only an epiphenomenon; what is more primitive is the causal generative process--and that's what we should aim to understand.
Towards Reinforcement Learning in the Real World—Raia Hadsell
Hadsell gave a broad overview of the research being pursued at Deep Mind in pursuit of AGI (artificial general intelligence). Two specific problems they are focusing on are:
- Hierarchical RL for long-term planning and decision-making (based on Feudal RL, an old idea that is now coming back in Neural Network form).
- Continual Learning to allow generalization to new tasks without forgetting how to do the old tasks (using elastic training and progressive nets).
In terms of general techniques, Hadsell said that she and her colleagues have found A3C (Asynchronous Advantage Actor-Critic) to be very powerful and stable. They have also found it useful to make the agent perform auxiliary tasks such as predicting the next state or trying to change the input image as much as possible.
Test of Time Award—David Silver and Sylvain Gelly
David Silver and Sylvain Gelly received the Test of Time Award for their work, "Combining Online and Offline Knowledge in UCT" from ICML 2007. In their acceptance speech, they gave a very nice overview of the development of computer Go in the past decade. An interesting take-away from their retrospective was that, conceptually, Alpha Go is actually simpler than some of the more complicated techniques tried in the intervening years, in that it does a fairly conventional tree-search with a value function. A second take-away was that policies trained with self-play could be bad sources of offline knowledge, since they fail to maintain diversity. (Alpha Go is trained on a database of human games.)
An interesting audience question was whether these ideas have had any impact outside of computer Go. As far as I could tell, the answer was no, but the success of computer Go has encouraged others to try RL on a variety of different problems.
Understanding Deep Nets Better
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio
A standing idea in the community is that gradient descent finds relatively flat minima, and that the flatness of the minima is what leads to good generalization in practice (via robustness to noise in parameters, and therefore minimum description length). The authors show that for ReLU networks, previously published notions of flatness are problematic in the sense that they are not immune to re-parameterization. In other words, what looks like a flat minimum (in one parameterization) can be made to look sharp (in a different parameterization) without changing the function itself, and therefore without altering generalization.
David Krueger, Yoshua Bengio, Stanislaw Jastrzebsk, Maxinder S. Kanwal, Nicolas Ballas, Asja Fischer, Emmanuel Bengio, Devansh Arpit, Tegan Maharaj, Aaron Courville, Simon Lacoste-Julien
The recent work of Zhang, et al. ("Understanding Deep Learning Requires Rethinking Generalization", 2016) shows that deep nets trained with gradient descent have large effective capacity: they are capable of memorizing pure noise without even needing substantially longer training time. This indicates that generalization in deep nets cannot be explained through traditional statistical learning theory, which relies on restricting capacity to prevent shattering to claim generalization.
In this work, the authors build on the work of Zhang, et al. to show that there are qualitative differences in optimization behavior of networks on real data vs. noise. For example, with real data, there are easy examples (learnt in a few epochs) and hard examples (only learnt after many epochs), whereas all examples in the random case are hard. Another example is the difference between random and real data in disparity of influence that each training example has on the loss function after many SGD steps. The authors go on to demonstrate that network optimization is content-aware i.e., networks learn simple patterns first, before resorting to memorization. Finally the authors show that regularization can differentially hinder memorization without affecting what the networks learn about real data.
Although the experiments presented in this paper are fascinating, it is hard to extract a definite conclusion. At best, the takeaway seems to be that generalization in deep nets depends on the dataset, itself.
David Balduzzi, Marcus Frean, Wan-Duo Ma, Brian McWilliams, Lennox Leary, J.P. Lewis
In the authors' telling, careful initialization and batch normalization address the problem of vanishing and exploding gradients. However, Residual Nets (i.e., architectures with skip connections) still perform much better than standard architectures (i.e., those without any skip connections) when the networks are sufficiently deep. The authors ask why, and the answer that they come up with is the shattered gradient problem.
The authors study the gradient of the output of a deep standard architecture (at initialization) with respect to the input in a simple 1-dimensional case and find that the gradient resembles white noise, i.e., it shows no correlation with the input. On the other hand, for a deep residual network and for a shallow standard architecture, it resembles that of brown noise, i.e., there is some correlation with the input. They also explain these empirical results with a theoretical analysis, as well as with an empirical study on real conv nets.
The upshot of the shattered gradients problem is that using gradients that resemble white noise makes training using current techniques (such as ADAM) very difficult (though the astute reader will notice that the gradients used in training are different than the ones studied here).
Finally, as a constructive test of their theory, the authors propose a looks-linear initialization for standard architectures that attempts to preserve the structure of the gradients much as a linear network would, and they show that this is effective in allowing much deeper networks without skip connections to be trained.
David Balduzzi, Brian McWilliams, Tony Butler-Yeoman
Modern networks based on ReLUs and max pooling are neither convex nor smooth. In fact, the number of linear regions grows exponentially with depth leading to more than 10^344 regions even for a network with 80 inputs and 20 hidden layers with 100 neurons each. The gradients in nearby linear regions could be very different, and so it is unclear why Hessian-based methods that rely on gradients being continuous work at all. And this puts into question previous attempts to understand the behavior of adaptive optimizers (such as AdaGrad or ADAM) since those attempts are based at the relationship of the gradient estimates computed by the optimizer to the Hessian.
The authors approximate the net (NOT the loss) with a Taylor approximation, plug that into the loss function to get a convex approximation to the loss and then use online (adversarial) convex optimization techniques to study the behavior of optimizers. They also conduct an empirical study and find that ADAM and other adaptive methods find better local minima, and are better at searching the discrete space of the various linear regions, than plain SGD.
David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris
Model-based RL methods haven't worked as well as model-free methods on hard problems (e.g., Atari games), though intuitively it seems like they should do better, since they can allow an agent to plan (i.e., consider the effect of its actions without interacting with the environment). The authors argue that perhaps one way to improve model-based methods is to learn abstract models that are only good for predicting outcomes of interest but are otherwise not faithful to the environment. These values are not just the reward, but can be auxiliary things as well (e.g., not just the score for a game, but whether the agent is alive or is exploring a new room). The learnt model is essentially a Markov Reward Process that can then be unrolled to aid in planning. (Note that this provides an additional "consistency" signal that does not require interaction with the environment but helps the model to learn.)
Alexander Pritzel, Benigno Uria, Raam Sriram, AdriÃ Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, Charles Blundell
A key challenge for Deep RL is sample complexity: Deep Q-Networks, for instance, take four days (of game time) to learn Pong, whereas a human can master it in a couple of hours. In this work, the authors conjecture that the slow learning is due to the slow update of the Q-network (over many small stochastic gradient steps) and propose augmenting the Q-network with an explicit memory. This is a return to table-based approaches for Q-learning, except this time the table is an approximate associative lookup table per action (like a k-NN but not quite) that maps each state to a value. The states are obtained from a convolutional network that is updated slowly. They store about 500,000 states per action. They show that this does indeed lead to faster learning: I believe they now learn Pong in two days.
Irwan Bello, Barret Zoph, Vijay Vasudevan, Quoc Le
The authors use RL to train an RNN to learn a mathematical equation for weight update (rather than the function itself, since they say that it allows them to try the optimizer on other problems as well without retraining). The equation is expressed in a DSL that is pre-seeded with useful operations, such as the gradient and average of gradients. They show that this found a better update rule than ADAM for CIFAR-10 and that the resulting optimizer also worked well for a different problem (language translation).
Incidentally, the rule they found that did do well was scaling up the gradient by e if the sign of gradient and running average agree and scaling it down by e otherwise.
Jakob Foerster, Nantas Nardelli, Greg Farquhar, Triantafyllos Afouras, Phil Torr, Pushmeet Kohli, Shimon Whiteson
In cooperative multi-agent RL, it is often convenient or necessary to have each agent train independently, treating the other agents as part of the environment. However this leads to non-stationarity, and consequently Deep Q-learning becomes hard. (As Foerster pointed out during the talk, this is a very different situation from Atari: The Atari simulator is very stationary, since it hasn't changed in 30 years!) They propose two techniques to handle the non-stationary: one, a natural way to do importance re-sampling to decay previously stored experiences, and another based on a low-dimensional fingerprint to indicate the age of the data. They apply this to the problem of unit management in Starcraft. In more recent work, they have moved to Actor Critic algorithms and think they have a way to solve it there, too.
Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Weinberger
This is a nice paper that points out that modern nets trained with cross entropy loss are overconfident. I.e., if you look at all examples for which the network assigns 85% probability to the most likely class, only about 50% of those examples are correctly classified (instead of 85%). The solution to calibrate the network properly turns out to be a simple one: use a softmax with a temperature scale that is adjusted to get correct calibration post-training (changing the temperature preserves relative values of the output of the softmax and so does not change accuracy).
Brandon Amos, Zico Kolter
The authors introduce a convex optimization layer that can be used in a network and back-propagated through. The idea is to put this in as a structural prior if you suspect that the function you are modeling is solving an optimization problem as part of its computation; during training, this layer will learn the optimization constraints and objectives (e.g. the constraints for Sudoku from examples of puzzles and solutions). What is notable is that they don't do it the obvious way, i.e., by unrolling an optimization algorithm for a certain number of steps, and they use a parametrization that is always feasible.
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier
Deep Nets are not robust to adversarial examples: one can craft inputs that fool the network by adding a little noise to training examples. Adversarial training address this problem by adding such examples to the training set. To do that, however, it has to make assumptions about the process that generates the adversarial examples. The authors propose a different approach to making a network robust that depends on bounding the Lipschitz constant of the network (by controlling each layer). They do so by putting in a Parseval constraint on the weights (i.e., w wT ~= I, which is a generalization of orthogonality to non-square matrices). They then show that this makes the network more robust in the sense that one needs to add more noise to fool the network.
Robert Bamler Stephan Mandt
The authors track how the meaning of a word changes over time. At each point in time, they fit a word2vec model, but these models are tied together with a latent diffusion process and jointly trained. This allows one to see how the vector associated with a word changes over time.
Friedemann Zenke, Ben Poole, Surya Ganguli
Learning in (most) humans is continuous and not separated into training and inference phases. This allows them to deal with non-stationarity. The authors conjecture that some of this ability is due to more complicated synaptic structures in humans compared to simple scalar weights in artificial nets.
In artificial nets this boils down to solving the problem of “catastrophic forgetting,” i.e., training a net for a new task causes it to forget how to do an old task. This often not a problem with capacity, since the network can solve all the tasks well if trained jointly on them. Rather, it is more a problem of not simultaneously tracking the loss landscape of the various tasks that the network needs to solve.
The authors propose a method that, during training, tracks locally at each neuron how important it is to solving the task at hand by integrating the gradients seen during training and then uses this information to modulate weight updates when learning a new tasks. This method is similar to elastic weights by Kirpatrick, et al., except, in that work, the importance of a neuron to a task is estimated with a point estimate post-training.
Piotr Bojanowski, Armand Joulin
This is a very interesting paper on unsupervised learning. The authors want to learn features (without looking at labels) that can then help with a later classification problem (think transfer learning, etc.). Amazingly, the method boils down to assigning a random target vector from a large space (in the extreme case, each image is given a unique label) and doing supervised learning. The authors mention specifically that they use L2-loss instead of cross-entropy, but doing so entails regularizing the outputs as per Tygert et al. (2017) and batch normalization to get performance.
Mukund Sundararajan, Ankur Taly, Qiqi Yan
In this work the authors take on the problem of attributing the prediction of a deep network to its inputs, i.e., determining how much each input contributes to a prediction. They set up a list of desiderata (based on previous work in Economics on cost attribution) that a good attribution algorithm should satisfy and show that there is only one algorithm that satisfies it (their method of integrated gradients), and that it is easy to compute in practice.
Their work could be viewed as the analog of studying weights to understand linear models. They note that the obvious extension to nonlinear models through linear approximation (i.e., just gradients) fails since the gradient could be locally zero but the function still sensitive to an input. This limitation of gradients is overcome by their method of integrating gradients along a path from a specified non-informative baseline to the example for which one seeks attribution.
The paper has a number of examples from different domains where this method of attribution has been useful to understand the behavior of deep nets and to help debug and refine the networks.
Does this mean that the attribution problem is solved? Probably not. Their method depends on having a non-informative baseline (such as a black image) which may be difficult to figure out in a new setting. And the result of the attribution depends on the choice of the baseline which (if I understood correctly Mukund's response to an audience question) they appear to consider to be a feature not a bug. A second issue is that the linearity preservation axiom can be confusing (or useless) for very nonlinear functions (such as parity), and so new ideas are needed to understand how to explain the prediction of a network in such cases.
Sam Ritter, David GT Barrett, Adam Santoro, Matthew Botvinick
This was a very interesting talk on using techniques from Cognitive Psychology to understand the behavior of deep neural nets. In line with their inspiration from Psychology, the authors’ goal was to measure the behavior of the network rather than study the internal structure.
Studies in Cognitive Psychology have shown that humans have a shape bias that they use to prune hypothesis when learning: when shown a new object along with a novel name, people associate the name with the shape of the object (rather than say the color).
The authors’ work finds that two published architectures for one-shot learning from images end up displaying a shape bias similar to that exhibited by humans. They do this by essentially applying the same test that psychologists administer to humans to the networks.
One interesting finding was that the shape bias differed quite significantly between different networks (e.g. trained from different initial conditions) of comparable accuracy indicating that networks can have quite significant different behavior along an important axis that is not captured by the training metric alone. (An interesting question is whether a network with a stronger shape bias is more resistant to adversarial attacks than one with a weaker shape bias.) A second interesting finding was that shape bias emerges early on in training, long before convergence; and this is similar to what is observed in humans in early childhood.
More General and Faster Convolutions
David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit
The initial motivation for this work focussing on CPUs comes from looking at 3D convolutions where current GPUs are limited by memory. Another important reason to look at CPUs is when batch size is 1, as may be the case in Reinforcement Learning, where it is not worthwhile to move data between CPU and GPU. The authors provide an algorithm for automatic synthesis of N-dimensional convolution codes that, unlike Winograd style methods, can handle arbitrary geometries. They show that their method beats existing CPU implementations even in the 2D case.
Minsik Cho, Daniel Brand
Direct convolution is slow, due to its poor memory access pattern. im2col convolution changes the input to make it BLAS-friendly. This gives it a better memory access pattern, but unfortunately needs extra memory. The authors propose a new way to convert convolution into BLAS calls that does not suffer from extra memory overhead (and also allows multiple BLAS calls to run in parallel). They are better than Winograd convolution in that they work for arbitrary geometries.
Siamak Ravanbakhsh, Jeff Schneider, Barnabas Paczos
This is a very nice paper on building architectural priors into networks. Researchers often want a network to be invariant to certain transformations of the inputs (think translation invariance for object recognition). A slightly more general notion is that of equivariance, where the action of a group on the input-output pair commutes with any function implemented by the network, i.e., f(g x) = g f(x).
The authors provide an algorithm that, given a group, constructs a sparsely connected layer that is equivariant w.r.t. to that group. They improve upon previous work by ensuring that the layer is only equivariant to that group and NOT to a larger group.
Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine and Chelsea Finn, U. C. Berkeley.
Deep RL is a hot research area right now, and one of the challenges is putting all the recent advances (e.g., A3C or TRPO algorithms) into a common conceptual framework and tying them to the foundational algorithms in the field (e.g. REINFORCE). This tutorial does a very good job on both these counts. The authors put forth a general framework for RL algorithms and show how various algorithms fit into this general framework, and they provide entry points into the literature covering both the classic papers and selected recent advances. Finally, it also has a flow chart to help practitioners decide which RL algorithm to use in practice.
The slides are available here.
Real World Interactive Learning
John Langford and Alekh Agarwal, Microsoft Research
This was a tutorial on Contextual Bandits, which is a setup like RL, where one observes the state of the world, acts, and then immediately receives a reward (the difference with RL being that the reward there may be delayed). Langford takes the position that for many interactive learning problems in the real world, Contextual Bandits are better than the alternatives, such as RL and supervised learning followed by A/B testing for handling non-stationarity; and it is better than active learning since it uses a better signal (doesn't ask the user to label examples explicitly).
A significant focus of the tutorial was on real-world applications (e.g., ad selection and placement, recommendation systems, news, etc.) and the problems that arise in practice. Vowpal Wabbit has a mature implementation of these algorithms, with Langford making the point that the software has been around longer than many of the companies currently using it. There are other implementations, including a hosted service on Azure.
Some key features the authors highlight include performance often competitive with supervised learning; the ability to create counter-factual dashboards; and the ability to evaluate these online algorithms offline by progressive validation (test on next; then train on next).
Finally, they discuss some real-world pitfalls (e.g., some downstream process overriding the actions of the algorithm and detecting such off-policy behavior). Many more such issues were discussed by their colleague Sarah Bird at a workshop later on in the week.
The slides are available here.