Dec. 27, 2017

Among the trends and highlights that emerged from this year’s record-shattering NIPS symposium, a few moments made everyone’s “nice” list.

There was **Kate Crawford**’s equal parts illuminating and unsetting talk on the Trouble with Bias, which may prove the benchmark for how we talk about bias in AI from now on. Then there was **Ali Rahimi**’s appeal to scientific rigor, which sparked strong opinion over his use of a term more commonly found in medieval fiction than in today’s R&D labs. Much has already been written about both.

At the root of these talks was a clear and arguably similar message that if we want to deploy deep learning at a large scale, we need to understand it better. This means ramping up on theory and producing more thorough explanations for the “whys” and “hows”. Right now, we don’t even understand why first-order algorithms are working so well in deep learning and, while making things work in the short-term can be exhilarating, following this track won’t do us any favours in the long term.

Thankfully, as we saw at NIPS, groups of dedicated researchers are digging deeper into various aspects of machine learning theory. Their work will ultimately help us inch toward closing the fundamental knowledge gap that currently eludes us. Below, we’ve highlighted a selection the papers that may not have gotten the lion’s share of attention, but stood out to us for all the right reasons.

Generalization is a key concept in machine learning; it’s the ability of an algorithm to perform well on an unseen dataset. At present, we still don’t know why and when an algorithm will generalize well or not. Two papers presented at the conference attempted to chip away at this black box mystery.

(i) The Marginal Value of Adaptive Gradient Methods in Machine Learning

**Authors:** **Ashia C. Wilson**, **Rebecca Roelofs**, **Mitchell Stern**, **Nati Srebro**, **Benjamin Recht**

It has long been conjectured that different optimization algorithms in deep learning (stochastic gradient descent, AdaGrad/Adam, natural gradient) have different generalization performance abilities. This paper studied two main classes of optimization algorithms: SGD and its variations (as opposed to adaptive optimizers such as AdaGrad/Adam.) Through an extensive empirical treatment on various datasets and architectures, the authors found that despite the faster training performance of adaptive optimizers, they often tended to generalize more poorly than SGD and its variations. While the authors do not provide any theoretical justification for why adaptive optimizers didn’t generalize as well as SGD, their insights and observations have definitely made impact in the deep learning theory community.

**Our take:** This is interesting because while adaptive algorithms are widely used in practice, the finding suggests that different optimizers have different generalization capabilities. So, while we’re building increasingly complex architectures, we’re still using basic optimization algorithms. It’s important to understand the inherent reasons behind these algorithmic abilities in order to design more efficient ones. In essence, the choice of algorithm actually matters.

**Authors:** **Elad Hoffer**, **Itay Hubara**, **Daniel Soudry**

One of the central ideas of this paper was to explore the generalization differences between large-batch and small-batch training for a chosen optimizer. There have been various papers in the last year which argue that, through empirical findings, small-batch training tends to generalize better than large-batch training. The contribution of this paper is that there is perhaps no inherent difference between generalization performance of small-batch vs large-batch. By training with respect to the number of total gradient updates for varying batch sizes (this is in contrast with earlier experiments in other papers where they train different batch-sizes with the same number of epochs), the authors find that the so-called “generalization gap” can be closed. In addition, the authors proposed a new version of batch normalization, called the “ghost-batch normalization,” which improves generalization performance.

**Our take:** Aside from the practical viewpoint (easier parallelization rather than small-batch SGD), we found this paper interesting because it successfully breaks down a common practice in machine learning and moves the area forward, while suggesting a new technique.

Word embedding algorithms are a fundamental tool allowing for computers to reason about natural language. Their function is to represent words by a vector of numerical values whose mathematical properties reflect some of the word’s semantics.

(iii) Poincaré Embeddings for Learning Hierarchical Representations

**Authors:** **Maximillian Nickel**, **Douwe Kiela**

In order to train a machine learning algorithm on a text dataset, you need to learn what a good representation of words in a continuous space looks like. In this representational space, the distance must reflect the semantic similarity between the words. Usually, it’s a Euclidean space and the words are represented by a vector: for example, each number can be represented by the number of documents where the word is present, or the neighbour context, etc.

If the representation is good, the inner product (between two of these vectors) should determine how close they are semantically. But this result depends on the representation that the network has learned. For example, Paris + London are similar but Paris + Eiffel tower could also be considered similar. So which one is right?

For now, the representation is very high-dimensional while the vectors are usually sparse. To date, some people have invented techniques to lower the dimensionality of the representation space and, instead, represent words by dense vectors. This technique is called word embeddings.

**Our take: **This paper got a lot of attention post-NIPS in part because it proposed a novel approach that outperforms the classical approach in characterizing hierarchical language relationships by computing the embeddings in a hyperbolic space: the Poincaré ball.

Generative adversarial networks are still hot, and not just because **Ian Goodfellow** was seen walking around the Long Beach Convention Center in a “Yes, we GAN!” t-shirt and the perfection of that moment nearly shattered the time-space continuum.

But while GANs attract a ton of attention, they’re still rather poorly understood. From a theoretical viewpoint, many questions arise:

i) What is the true influence of the cost function on the quality of the generative samples and on mode collapse?

ii) Are we sampling from the data distribution?

iii) What insights from dynamical systems can we get to design a new regularizer to stabilize the training dynamics?

iv) Does it converge to a locally stable equilibrium point?

v) How does the choice of optimizer influence the learning dynamics?

vi) How do we avoid situations where the dynamics get close to reaching

equilibrium, but then diverge?

Numerous machine learning researchers are trying to partially answer these questions from both a theoretical and practical perspective.

In Gradient descent GAN optimization is locally stable, the authors studied the deterministic version of the GAN dynamics and showed how under a set of restrictive assumptions, the dynamics were locally asymptotically stable. Their local analysis helped them derive a regularization term that stabilized the training.

**Our take: **We found this paper interesting because it tried to understand the stability of GANs from a theoretical viewpoint and the authors managed to propose a way to improve training. However, we felt the assumptions made in the paper were not realistic and the theory did not quite match with the experiments. At the same time, this is the first step toward a better understanding of GAN dynamics and we’re certain this is a very interesting research direction.