Last month, I attended the Compact Deep Neural Network Representation with Industrial Applications (CDNNRIA) workshop, part of the Neural Information Processing Systems (NeurIPS) conference held in Montreal. Much has been written about the future of machine learning, how the field should evolve, and what our community should focus on. Because of my own interest in making machine learning tools more energy efficient, I was eager to see how the conversations at this particular workshop would take shape.
Whether you’re training a model on an enormous dataset with an industrial-scale server or deploying a small model for a cellphone application, energy is often a fundamental bottleneck. I suspect algorithmic innovations that provide greater energy efficiency will be necessary to push forward the next frontier of machine learning. It’s with this mindset (and with my own paper in tow) that I attended this workshop.
In a previous blog post, I discussed Max Welling’s Intelligence per Kilowatthour paper, which he presented at the ICML conference in Stockholm last summer. Machine learning models are helping to solve increasingly difficult tasks, but as a natural result of scale, some of the models are getting enormous. Often, our community creates models that work only for the particular task at hand, but if these techniques are to be widely deployable, we must work to decrease the energy these models consume. Because of this problem, Prof. Welling argued that machine learning should be judged by the intelligence it delivers per unit of energy. The CDNNRIA workshop seemed like a very natural response to Prof. Welling’s ICML presentation: he was one of the workshop organizers, and the focus was on compact (i.e. more energy efficient) neural networks.
Since I’m personally quite passionate about this topic, I’ve spent time exploring it in various forms of scientific inquiry. My workshop paper, On Learning Wire-Length Efficient Neural Networks, which I worked on with co-authors Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, and Marcus Brubaker, attempted to tackle an aspect of this important topic. I was honoured to present it. In this post, I will summarize our paper, highlight other interesting papers that relate to the subject, present the results of an experiment inspired by some of the additional workshop papers, then draw an emergent lesson from the workshop about the value of negative results.
A classic paper in the field, called Optimal Brain Damage, first introduced the basic training and pruning pipeline. The standard technique for creating energy-efficient neural networks involves assuming some initial architecture, initializing the weight and bias values of the network, then modifying those parameters so that the network closely fits the training data. This step is called "training". The next step, "pruning", involves deleting the edges of the network that are somehow deemed "unimportant," then re-training the network after the edges are removed. How "unimportant" gets defined varies with the specific technique. The striking finding is that this pipeline works: when done iteratively, the number of parameters in the model can be reduced by a factor of 50 or more with no decrease in accuracy. In fact, there’s often an increase in accuracy.
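To make the pruning step concrete, here is a minimal sketch in plain Python. It assumes the simplest common definition of "unimportant", namely the edges whose weights have the smallest magnitude. That is a stand-in chosen for illustration, not the saliency criterion used in Optimal Brain Damage, and the function name `magnitude_prune` and the toy layer are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Zero out roughly `fraction` of the edges with the smallest |weight|.

    weights: list of rows of floats, one entry per edge of a layer.
    Returns (pruned_weights, mask); mask[i][j] is 1 for a kept edge, 0 for a pruned one.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * fraction)
    if n_prune == 0:
        mask = [[1 for _ in row] for row in weights]
    else:
        # Assumption: an edge is "unimportant" if its magnitude is at or below
        # this threshold. Ties can prune slightly more than `fraction`.
        threshold = magnitudes[n_prune - 1]
        mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mask_row)]
              for row, mask_row in zip(weights, mask)]
    return pruned, mask


# Toy 2x3 layer with made-up weights, pruned to half its edges.
layer = [[0.9, -0.05, 0.3],
         [0.01, -0.7, 0.2]]
pruned, mask = magnitude_prune(layer, fraction=0.5)
print(mask)    # 1 marks an edge that survives pruning
print(pruned)
```

In a full pipeline, re-training would then update only the edges where the mask is 1, and the prune and re-train steps would be repeated until the target size is reached.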
Most previous work evaluates pruning algorithms by the number of non-zero parameters. But as we shall see, this is not the only criterion. Some of the notable work at the CDNNRIA workshop considered an energy-consumption model that assumes a three-level cache architecture (as discussed below). The existing work, both the cache-architecture work and the non-zero-parameter work, is a good model for energy consumption on general-purpose circuitry with fixed memory hierarchies. However, some machine learning applications (say, image recognition) may need to be more widely deployed.
As with error-control coding, specialized neural networks may directly implement the edges of the neural network as wires. In this case, it is the total wiring length, not the number of memory accesses, that will dominate energy consumption. This is due to the resistive-capacitive effects of wires, but more generally, wiring length is a fundamental energy limitation of all the practical computational techniques that we can conceive. This hinges upon a basic fact: real systems have friction.
With this context established, our paper introduces a simple criterion for analyzing energy, which we call the “wire-length”, or information-friction, model. In this model, energy is proportional to the total length of all the edges connecting the neurons of the network. It is inspired by the works of Thompson, the more recent work of Grover, and my own PhD thesis. The technique involves placing the nodes of the neural network on a three-dimensional grid so that the nodes are at least one unit-distance apart. Then, if two nodes are connected by a wire in the neural network, the length of that wire is the Manhattan distance between the two nodes it connects. The task we define is to find a neural network that is both accurate and has a wire-length-efficient placement of nodes.
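The wire-length criterion is easy to compute once a placement is fixed. The sketch below is only an illustration of the definition above: the node names, the tiny network, and the helper functions `manhattan` and `total_wire_length` are hypothetical, and integer grid coordinates are assumed so that distinct positions are automatically at least one unit apart.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two points on the 3D grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


def total_wire_length(placement, edges):
    """Sum of wire lengths over all edges for a given node placement.

    placement: dict mapping node name -> (x, y, z) integer grid position.
    edges: list of (node, node) pairs that are connected by a wire.
    """
    # Two nodes may not share a grid position.
    if len(set(placement.values())) != len(placement):
        raise ValueError("each node needs its own grid position")
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)


# A made-up pruned network: two inputs, two hidden nodes, one output.
edges = [("x1", "h1"), ("x2", "h1"), ("x2", "h2"), ("h1", "y"), ("h2", "y")]

placement = {"x1": (0, 0, 0), "x2": (0, 1, 0),
             "h1": (1, 0, 0), "h2": (1, 1, 0),
             "y": (2, 0, 0)}
total = total_wire_length(placement, edges)

# Swapping the two hidden nodes keeps the network identical but changes the wiring.
swapped_placement = dict(placement, h1=(1, 1, 0), h2=(1, 0, 0))
swapped = total_wire_length(swapped_placement, edges)

print(total, swapped)  # 7 8
```

The same network costs 7 units of wire under the first placement and 8 under the second, which is why node placement is part of the optimization problem and not just the choice of which edges to keep.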
In our paper, we introduced three algorithms that can be combined and used at the training, pruning, and node placement steps. Our experiments show that each of our techniques is independently effective, and by combining them and using a hyperparameter search we can get even lower energy, which has allowed us to produce benchmarks for some standard problems. We also found that the techniques worked across datasets.
Several workshop papers submitted in parallel to the conference caught my attention. One of them, authored by Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin and Richard Baraniuk, tackled themes similar to ours. What interested me most was that it sought a way to make energy-efficient neural networks using a technique distinct from the standard training-pruning pipeline.
This paper is also interesting for the way it suggests minimizing total energy in a different manner than the standard pruning paradigm. The human brain is hyper-optimized for minimizing energy consumption, and if our machine learning techniques are to mimic the kinds of tasks performed by the brain, I suspect we will have to use all kinds of techniques to keep energy costs under control. The “skip policy” idea of Wang et al. may be one such technique useful on the road to more energy efficient artificial intelligence.
In the normal iterative training, pruning, re-training framework, we keep the weights of the neural network the same post-pruning (except, of course, for the weights associated with pruned edges) then re-train the weights from this point onward. The idea behind this methodology is that the training-pruning process helps the network learn important edges and important weights.
Two similar papers submitted to the workshop added a twist to this paradigm. They showed that if the weights are randomly re-initialized after some training and pruning, and the network is then re-trained from that random re-initialization, a higher accuracy can be obtained. This result suggests that pruning helps find important connections but not important weights. It contradicts what many (including myself) would have intuited: that pruning allows you to learn both important weight values and important connections.
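The re-initialization step can be sketched in a few lines. The idea is to keep the connectivity found by pruning (the mask) and throw away the learned weight values. The Gaussian draw, the `scale` value, and the function name `reinitialize_pruned` are assumptions made for illustration; the post does not specify which initialization scheme the two papers used.

```python
import random


def reinitialize_pruned(mask, scale=0.1, seed=0):
    """Draw fresh random weights for the surviving edges of a pruned layer.

    mask: list of rows of 0/1 flags, 1 meaning the edge survived pruning.
    Pruned edges stay at exactly 0.0; kept edges get a new Gaussian draw
    (the distribution and scale are assumptions, not taken from the papers).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) if m else 0.0 for m in row] for row in mask]


# Connectivity left over from a hypothetical earlier train-and-prune run.
mask = [[1, 0, 1],
        [0, 1, 0]]
fresh = reinitialize_pruned(mask)
print(fresh)  # learned values are gone, the sparsity pattern is unchanged
```

Re-training would then start from `fresh` instead of from the weights that survived pruning, with the mask still holding the pruned edges at zero.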
The experimental evidence across these two papers could provide a very easy-to-implement tool for the machine-learning practitioner, and I’m curious to see if this technique gets widely adopted. There’s a good chance, as one of the papers, “Rethinking the Value of Network Pruning”, won the workshop’s best paper award.
However, while the papers’ results suggest a simple approach toward achieving higher accuracy on low-energy machine learning models, they raise the question of whether this tool will work on the pruned architectures we obtained when optimizing for wire length. The fact that the effect was discovered independently by two different groups gave me more confidence that it would work for us. In the spirit of learning from the workshop, we decided to try it out back at Borealis AI HQ.
We obtained the best-performing model using distance-based regularization at different target accuracies. Then, we re-initialized the weights of the resulting network before we re-trained. The table below shows the results:
| Accuracy Before Re-initialization | Accuracy After Re-initialization | Accuracy After Re-training |
The left column presents the accuracy before re-initialization, the middle column shows the accuracy after re-initialization, and the final column reveals the accuracy after re-training the re-initialized network. As we can see, we consistently get lower accuracy than the network had before re-initialization. This suggests the re-initialization approach does not work for our networks. We also find that the technique becomes less effective as the re-initialized networks get smaller, which we have considered as a possible explanation for why it didn’t work here.
Does this contradict the results of the workshop papers discussed above? No. But it might reveal that the general approach is less likely to work than we might have thought. Indeed, in “Rethinking the Value of Network Pruning,” the authors themselves present negative results, so we might have guessed the technique wouldn’t work on our networks trained for shorter wire-length. In the next section, I’ll discuss why I think having negative results is so useful.
It’s a machine learning truism that any single result is hard to pinpoint as being independently important, but the many pieces of evidence reported across multiple papers allow us to draw an emergent lesson. I view machine learning as a grab-bag of techniques that help solve new classes of computational problems. They don’t always work, but often enough some of them do. This allows us to use the literature to produce rules of thumb about which techniques might work. For example, if we looked at the original Optimal Brain Damage paper in isolation, it might be hard to discern the paper’s broad applicability. But the fact that the standard training-pruning pipeline has been so widely used, and that so many modifications of the technique (including our wire-length pruning work) also work, gives us confidence in the idea’s ability to capture something basic and fundamental: that doing some type of pruning is appropriate if minimizing energy consumption is a major concern.
Given the multiplicity of possible techniques, all machine learning practitioners can do is set good evaluation criteria, test the techniques out, and see if they work. Since engineering and computational resources are limited, this also means judiciously choosing which techniques to take on. This process requires a careful balancing of engineering risk and reward.
So, while it may be worth it to try a technique, the decision depends on the particulars of the problem and the probability that the technique will be successful. Presenting negative results allows our readers to intuit this probability. Moreover, negative results inform researchers about approaches that have already been attempted and save them the effort of re-testing them.
The value of negative results can be further illustrated with a concrete thought experiment. Suppose your goal is to create a neural network in a place where computational resources are free and plentiful, and the tool is not going to be widely deployed. Perhaps such a network is used in an internal tool at a small company. The engineer, in this case, might ask: Should I try the re-initialization and re-training technique in order to get a more accurate small network?
Since the papers (and our experiment) suggest the technique only works some of the time, it may not be worth the effort to give it a try. After all, there’s only a slim chance of it being successful and the reward is small. However, suppose the network were to be deployed to a billion cellphones, for example in a popular social media application. In this case, it makes sense to try this technique, as well as a number of others, to ensure the tool uses as little energy as possible.
Real problems may fit somewhere between these two extremes, and choosing the right approach requires a finely tuned sense of the probability that each technique will work. Having a collection of experimental results in the literature, both positive and negative, helps the engineer make the right judgment call about whether a technique is worth the effort.
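The thought experiment above boils down to an expected-value comparison, which can be written out as a rough sketch. Every number below is invented purely for illustration (the success probability, the per-device gain, and the cost of an attempt are not measurements from any paper or from our experiment), and `worth_trying` is a hypothetical helper.

```python
def worth_trying(p_success, gain_per_device, n_devices, cost_of_attempt):
    """Return True if the expected payoff of trying a technique exceeds its cost.

    All quantities are in the same arbitrary unit (say, engineer-hours or
    an energy-cost equivalent); the unit itself is an assumption.
    """
    expected_gain = p_success * gain_per_device * n_devices
    return expected_gain > cost_of_attempt


# Made-up inputs: a 20% chance of success, a small gain per device,
# and a fixed engineering cost for running the experiment.
p, gain, cost = 0.2, 0.01, 1000.0

internal_tool = worth_trying(p, gain, n_devices=1, cost_of_attempt=cost)
billion_phones = worth_trying(p, gain, n_devices=1_000_000_000, cost_of_attempt=cost)

print(internal_tool, billion_phones)  # False True
```

The hard part in practice is the estimate of `p_success`, and that is exactly the quantity that published negative results help readers calibrate.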
Right now, we have enormous potential in the field, but very limited human talent and limited computational resources. We should take on the responsibility to ensure we draw the right lessons from the work we do and present our work in as useful a way as possible. That’s why “Rethinking the Value of Network Pruning” is a strong output: not only does it find a surprising and successful technique, it also presents negative results. The quality of the scientific analysis in the paper makes it, in my opinion, a worthy recipient of the workshop’s Best Paper Award and hopefully sets a precedent for more researchers to explore negative results for the greater good of the field.
*Special thanks to Luyu Wang for running the re-initialization experiment in this post.