Aug. 16, 2017

Earlier this month, we had the opportunity to attend ICML in Sydney, Australia and present our work. We thought it would be useful to discuss some of the highlights from this year’s conference; however, please note this is by no means a comprehensive summary of all we found interesting and worth discussing as that would require its own separate website and approximately 400,000,000 words (WordPress caps our posts at 399,999,999). With these limitations established, some abridged thoughts.

For a long time one of the biggest criticisms about deep learning has centred around what we call the “black box” – or the absence of theory around how everything actually works. While we’ve made incredible progress over the past few years, we’ve lacked much of the theoretical underpinning to explain what’s going on under the hood. All those amazing hacks and tricks? Even the people who figured them out haven’t been entirely sure about the reasons why they work so well.

That’s finally changing. This year, our understanding of deep learning theory has advanced to the point where there was a full-day track at ICML dedicated to it exclusively. That meant at any given moment our team had to choose between the DL theory track and the multiple other tracks happening simultaneously, which, in some cases, was as traumatizing as having to choose between mom and dad.

So, what has been the theoretical tipping point? Well, the most obvious answer is we now have way more people studying it. That’ll move things. And speaking of numbers, we’ve also helped close the theory gap by inviting more math to the party, like we’re doing here at Borealis AI with our own in-house mathematicians. With these elements in motion, it’s only a matter of time before our grasp of theory catches up to all the fun bits.

One example of the momentum happening in this area centres around our understanding of **rectifier networks**. These days, most feed-forward architectures use some kind of rectifier nonlinearity and could be said to fall into this category. They compute what are called **piecewise linear functions**, which chunk the input domain into small pieces and are exactly linear on each piece. Easy enough to understand, as piecewise linear functions are conceptually quite simple.

However, in their Neural Taylor Approximations paper, **Balduzzi**, **McWilliams**, and **Butler-Yeoman** argued that the actual functions computed by modern convnets are extremely unsmooth, since the number of these regions grows exponentially with depth. As a result, many of the intuitions derived from thinking about convnets as piecewise linear functions have been misleading, because typical modern convnet architectures range anywhere from tens to thousands of layers.
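To make the piecewise linear picture a little more concrete, here is a minimal sketch in plain Python. It is not the authors' code: the helper names (`make_net`, `forward`, `count_linear_pieces`), the random initialization, and the grid are all our own assumptions. It builds a small random rectifier net with a scalar input, walks along the input axis, and counts how often the on/off pattern of the hidden units changes, which is a rough proxy for the number of linear pieces.

```python
import random

def make_net(depth, width, seed=0):
    """Random fully connected ReLU net: scalar input, scalar output (illustrative only)."""
    rng = random.Random(seed)
    sizes = [1] + [width] * depth
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style scaling, an assumption of this sketch
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        layers.append((weights, biases))
    readout = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return layers, readout

def forward(net, x):
    """Return (output, activation pattern) for a scalar input x."""
    layers, readout = net
    hidden = [x]
    pattern = []
    for weights, biases in layers:
        pre = [sum(w * h for w, h in zip(row, hidden)) + b
               for row, b in zip(weights, biases)]
        pattern.extend(p > 0.0 for p in pre)   # which rectifiers are "on"
        hidden = [max(p, 0.0) for p in pre]    # the rectifier itself
    output = sum(r * h for r, h in zip(readout, hidden))
    return output, tuple(pattern)

def count_linear_pieces(net, lo=-3.0, hi=3.0, steps=2000):
    """Count activation-pattern changes seen on a grid over [lo, hi].

    This is only a proxy: a coarse grid can miss very thin pieces.
    """
    pieces = 1
    _, previous = forward(net, lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        _, current = forward(net, x)
        if current != previous:
            pieces += 1
            previous = current
    return pieces

if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        net = make_net(depth, width=8, seed=depth)
        print("depth", depth, "pieces seen:", count_linear_pieces(net))
```

The counts it prints depend on the random weights and on the grid resolution, so treat them as a way to poke at the idea rather than as a reproduction of the paper's result. Within any stretch where the pattern stays fixed, the net really is a single linear function of its input.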

Using this key observation as a starting point, the authors decomposed the convergence analysis into vanilla optimization in smooth regions and exploration across nonsmooth boundaries. According to the authors, this analysis yields the first convergence bound for deep learning, which takes the form of a standard bound for convex non-smooth optimization, plus a hard convex approximation loss due to Taylor approximating the neural net.

Although this bound is unlikely to be tight, it provides a new theoretical perspective and intuition about why certain techniques actually work in practice. The authors then described a new intuition about stochastic optimization of rectifier nets: searching the energy landscape with a convex flashlight that does not shine through the kinks of the shattered landscape. Based on this novel intuition, they posed a few questions about stochastic optimization for deep rectifier nets. For example, they conjectured that the reason root-mean-square normalization in **RMSprop** and **Adam** works so well is that it improves exploration across nonsmooth boundaries. This new perspective could potentially yield new optimization algorithms and architecture designs in the future.

The motivation and analysis involved in the work above touch on a phenomenon known as “**shattered gradients**.” **Balduzzi** and colleagues presented another work in the same track called The Shattered Gradients Problem: If resnets are the answer, then what is the question?

Early lessons from deep learning lore suggest that deeper is better – that is, until someone excavates a net hundreds of layers deep and the net fails to learn. Historically, people believed this failure was due to a vanishing gradient problem, and implemented improved initialization to solve it. But while that did improve things, it didn’t fix the problem completely. Then, suddenly, the resnet architecture came along and things seemed to work and the world began anew. Hooray for resnet!

Now, if you look at the frequency of the terms mentioned everywhere, it’s clear resnet is becoming exponentially more popular. In practice, it works really well. And when things work, most people don’t question why, unless you’re dealing with academic researchers and that’s literally the only question they’re asking and all their home and office furniture has the word “Why?” stitched on it in elegant textile cursive. So, when researchers took a closer look at the problem, they discovered it wasn’t the signal magnitude that was decaying to zero; rather, it was the amount of *useful information* hitting empty. Even though the signal strength remained, it became little more than white noise once it hit the bottom – no useful structure and completely random, like an infomercial at 3 a.m. The researchers discovered that when you have the skip connections in place, the gradient that progresses backwards no longer resembles white noise. Instead, it looks like something called “brown noise,” which means it managed to preserve its useful structure and can also be worn after Labour Day.
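If the white-versus-brown distinction feels abstract, the toy sketch below may help. To be clear about what it is: these are synthetic signals, not gradients from any network, and the function names are ours. White noise is a sequence of independent draws; brown noise is what you get by keeping a running sum of white noise. A simple way to see the difference in structure is to measure how strongly each value correlates with its neighbour.

```python
import random

def white_noise(n, seed=0):
    """Independent Gaussian draws: no structure from one sample to the next."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def brown_noise(n, seed=0):
    """Running sum of white noise: neighbouring samples stay close together."""
    total = 0.0
    walk = []
    for step in white_noise(n, seed):
        total += step
        walk.append(total)
    return walk

def lag1_autocorrelation(signal):
    """Correlation between the signal and a copy of itself shifted by one sample."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    variance = sum(c * c for c in centered)
    covariance = sum(a * b for a, b in zip(centered, centered[1:]))
    return covariance / variance

if __name__ == "__main__":
    print("white:", lag1_autocorrelation(white_noise(5000)))
    print("brown:", lag1_autocorrelation(brown_noise(5000)))
```

For the white signal the neighbour correlation sits near zero, while for the brown signal it sits close to one. That neighbour-to-neighbour structure is the kind of thing the shattered gradients analysis is looking at when it compares gradients at nearby inputs.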

**Why this impressed us:** Besides being good science, we found these themes interesting for their immense practical value: when our machine learning system doesn’t work, we want to find out why it doesn’t work (aside from silly bugs and bad code, of course). Many times, models fail simply because we have the wrong intuition about something. Better theoretical understanding allows us to debug our ML system much more efficiently.

One of the big themes to emerge at RLDM earlier this year, meta-learning, or “learning to learn,” made its presence felt once more at ICML. Meta-learning is not meant to replace approaches like unsupervised learning or semi-supervised learning; rather, it provides a new perspective and set of techniques to help solve these existing problems. Today, when you’re given data, you have one model that learns the structure of the data, while higher-level models inform us of the best and most efficient ways to learn that data. This may look like pointing out what to pay attention to, or how quickly to adapt your parameters to accept new information versus relying on prior knowledge. Really, the approaches are endless.

So far, however, none of these meta-optimizers has shown a particularly impressive ability to generalize what it has learned to new tasks and outperform strong baselines such as Adam. With this problem in mind, Model-Agnostic Meta-Learning (MAML), a new work presented by **Chelsea Finn**, **Pieter Abbeel**, and **Sergey Levine**, demonstrated shockingly good results on a different aspect of meta-learning, one focused on representations.

The gist of their work is to make sure any learned representation can be easily adapted to new tasks, simply by requiring that it be easy to fine-tune. In practice, they achieved this through a very smart trick: you compute one-step or multi-step fine-tuned parameters for each task, then compute a meta-learning update direction by averaging the gradients from the tasks, with each gradient evaluated at its own task-specific fine-tuned location. This method is designed so that the representation can be adapted to a specific task quickly, within one or two gradient steps. The most amazing part, however, is that this single algorithm applies to a huge breadth of completely different problem domains, and has improved results by a wide margin in each one. Some even beat the domain-specific state-of-the-art.
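Here is a deliberately tiny sketch of that trick, under heavy assumptions of our own: a single scalar parameter `theta`, one quadratic loss per task, one inner gradient step, and hand-written derivatives. The names (`adapt`, `meta_gradient`, `meta_train`) and step sizes are illustrative, and this is not the authors' implementation, which works with neural networks and automatic differentiation.

```python
def task_loss(theta, target):
    """Toy per-task loss: how far theta is from this task's target."""
    return 0.5 * (theta - target) ** 2

def task_grad(theta, target):
    """Derivative of task_loss with respect to theta."""
    return theta - target

def adapt(theta, target, alpha):
    """Inner loop: one gradient step of fine-tuning on a single task."""
    return theta - alpha * task_grad(theta, target)

def meta_gradient(theta, targets, alpha):
    """Average, over tasks, of d/dtheta task_loss(adapt(theta)).

    Each task's gradient is evaluated at its own fine-tuned location and
    then pushed back through the inner step with the chain rule.
    """
    total = 0.0
    for target in targets:
        adapted = adapt(theta, target, alpha)
        # d(adapted)/d(theta) = 1 - alpha, because the second derivative
        # of this quadratic loss is exactly 1.
        d_adapted_d_theta = 1.0 - alpha
        total += task_grad(adapted, target) * d_adapted_d_theta
    return total / len(targets)

def meta_train(targets, theta=0.0, alpha=0.1, beta=0.5, steps=100):
    """Outer loop: plain gradient descent on the averaged post-adaptation loss."""
    for _ in range(steps):
        theta -= beta * meta_gradient(theta, targets, alpha)
    return theta

if __name__ == "__main__":
    targets = [-1.0, 2.0, 5.0]
    theta = meta_train(targets)
    print("meta-learned start:", theta)
    for target in targets:
        print("task", target, "after one step:", adapt(theta, target, alpha=0.1))
```

In this toy the meta-learned starting point simply settles at the middle of the task targets, which is not very exciting. The part worth noticing is the `d_adapted_d_theta` factor: the outer update differentiates through the inner fine-tuning step, and with a real network that factor involves second derivatives of the loss.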

**Why this didn’t just impress, but actually shocked us:** General learning-to-learn is really what Artificial General Intelligence strives to achieve – to have one learning algorithm that solves all problems. The fact that these researchers found one algorithm that could be applied to very, very different problems with very different sources of difficulty is proof positive these algorithms can actually learn how to learn extremely well. But if we’re being honest, the most shocking bit was how quickly this area has progressed.

While not a research area per se, gradient-on-gradient has become a much more widespread technique, both in the traditional sense and in the form of taking the parameter gradient of a function’s input gradient. The MAML paper (discussed above) is one such example and there were many more presented at the conference this year. It’s like an X-ray machine that allows us to look at our models and algorithms from a completely new perspective.
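For readers who have not seen a gradient taken through another gradient, the sketch below shows the second flavour on a one-line model. Everything in it is our own illustration: a hand-rolled forward-mode differentiator built on tagged dual numbers (`Dual`, `add`, `mul`, `tanh`, `derivative`), a toy function `f(w, x) = tanh(w * x)`, and a penalty on the squared input gradient. Real work would use an automatic differentiation framework instead of forty lines of Python.

```python
import itertools
import math

_tags = itertools.count()  # each derivative() call gets a fresh, larger tag

class Dual:
    """val + eps * d, where d**2 == 0 and `tag` says which derivative d belongs to."""
    def __init__(self, val, eps, tag):
        self.val, self.eps, self.tag = val, eps, tag

def _parts(x, tag):
    """Split x into (value, derivative) with respect to the perturbation `tag`."""
    if isinstance(x, Dual) and x.tag == tag:
        return x.val, x.eps
    return x, 0.0  # anything else is a constant for this perturbation

def _top_tag(a, b):
    return max(x.tag for x in (a, b) if isinstance(x, Dual))

def add(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a + b
    tag = _top_tag(a, b)
    (a_val, a_eps), (b_val, b_eps) = _parts(a, tag), _parts(b, tag)
    return Dual(add(a_val, b_val), add(a_eps, b_eps), tag)

def mul(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a * b
    tag = _top_tag(a, b)
    (a_val, a_eps), (b_val, b_eps) = _parts(a, tag), _parts(b, tag)
    # product rule
    return Dual(mul(a_val, b_val), add(mul(a_val, b_eps), mul(a_eps, b_val)), tag)

def tanh(a):
    if not isinstance(a, Dual):
        return math.tanh(a)
    t = tanh(a.val)
    slope = add(1.0, mul(-1.0, mul(t, t)))  # d tanh(u)/du = 1 - tanh(u)**2
    return Dual(t, mul(a.eps, slope), a.tag)

def derivative(fn, at):
    """Derivative of a one-argument function at `at`; calls can be nested."""
    tag = next(_tags)
    return _parts(fn(Dual(at, 1.0, tag)), tag)[1]

def f(w, x):
    """Toy model with one parameter w and one input x."""
    return tanh(mul(w, x))

def input_grad_penalty(w, x=1.3):
    """Squared gradient of the model output with respect to its input."""
    g = derivative(lambda x_: f(w, x_), x)
    return mul(g, g)

if __name__ == "__main__":
    # Parameter gradient of the input-gradient penalty: a gradient on a gradient.
    print(derivative(input_grad_penalty, 0.7))
```

The outer `derivative` call differentiates with respect to the parameter `w`, while the inner one, hidden inside `input_grad_penalty`, differentiates with respect to the input `x`. The tags are what keep the two from being confused with each other.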

Of course, it’s not as though researchers woke up yesterday, stretched, stumbled bleary-eyed into the kitchen, made a coffee and discovered this approach: Gradient-on-gradient has already been seen as a useful tool for a while. What’s changed is our **capacity for compute**. It’s only recently that software and hardware have advanced to the degree where this application can be carried out on a large-scale basis.

**To sum it all up:** All these breakthroughs are tied to our progress in computational resources – both software and hardware – that provide the capacity to “see” under the proverbial hood. The more rapidly these elements become available, the more people will be able to explore and pioneer the new techniques that can perform all these amazing tasks, and the greater our capacity to be shocked and blown away by what we’ll see at ICML 2018.