Borealis AI has big goals for 2019 and exceeding those goals requires exceptional leadership. Just two weeks into the new year, we’re well on our way with the addition of Dr. Kathryn Hume, who joins our team today as Director of Business Development. In her new role, Kathryn will oversee the application of our academic research within the bank, help inform our strategy and also tap into her broad experience to assist with driving Borealis AI's brand profile among key audiences.
Kathryn brings an unusually rich and varied background to the AI field. In addition to holding prior leadership positions at Integrate AI and Fast Forward Labs (Cloudera), she’s a prolific speaker and author on AI, has mastered seven languages and holds a PhD in Comparative Literature from Stanford. In her spare time, she teaches courses on enterprise adoption of AI, law and ethics at Harvard, MIT, Stanford and University of Calgary (just kidding, she doesn’t have any spare time).
As this introduction barely scratches the surface, we thought it would help to let Kathryn do the talking.
People underestimate how much of our lives are touched by banking as the substrate for the entire economy. Banking has macro impact—with risk management undergirding international market stability—and micro impact, where we all entrust banks with our financial assets to support our daily needs, like food, and our life aspirations, like education.
Now with AI, we’re able to use rich data that’s far more relevant to banking. People may not be aware that one of the first production deep learning applications was the use of computer vision to automatically recognize handwritten digits on cheques. This had been a rate limiter for the ATM; now, just a few years later, a customer can easily insert up to 50 cheques at a time and have the amounts read, analyzed and deposited within seconds. And now that we can also recognize and generate speech, what else might we do? What could payments look like? I’m most interested in AI applications like this, where the tech hides behind the scenes but makes our lives so much easier.
First off, I really love the team and culture. I find there’s a mixture of curiosity and pure research talent. I also love that it’s a culture grounded in integrity. Everybody here takes the time to mean what they say and that’s very important to me. Apart from the culture, it’s exciting to return to my academic roots while pursuing my long-term career ambition, which is to be at the forefront of early commercialization of academic and scientific research.
There are a lot of existing applications the team has already built, and I’m excited to bring them out of the lab and into production across the bank. I’m also looking forward to solidifying our relationships with academic partners and to using the success of Borealis AI as an example of how academia and business can work together to bridge the gaps between both worlds.
My approach to responsible AI comes from a firm belief that ethics occurs in the trenches. There are, obviously, aspects of ethics that tackle large questions about AI’s impact on society, but I think the rubber really hits the road when a group of people collaborating to build a machine learning system have come together from different departments to make a series of tactical choices together. I’m excited to put this into practice here at Borealis AI. What better place to be impacting the future of responsible AI than in one of the world’s largest banks?
I was working in New York in early 2017 and it was common knowledge in the American machine learning research community that Canada was the place to be. The Vector Institute had just been established in Toronto and it was interesting to observe this experiment in building a commercialization leg from a university research department. I originally moved here to join a company called Integrate AI. What’s kept me in Toronto is the excitement of working in an ecosystem that feels similar to what Silicon Valley was like 15 years ago. There are new companies popping up everywhere and I sense the right energy flowing between groups in academia, policy, government and business. It’s a unique place in time to be. I also love Amii (in Edmonton) and Mila (in Montreal). What’s going on in the Canadian ecosystem is just amazing to behold.
I got my PhD in Comparative Literature, but I actually have a strong math and science background. In fact, my dissertation is about the use of habit (or repetitive action) as a technique to generate knowledge in 17th century mathematics, philosophy and literature. I’ve come to believe since then that I inadvertently wrote a history of supervised learning through this work. Supervised learning is an AI technique that starts with a set of labeled training examples. For example, we teach an algorithm to adequately identify that a picture of a cat is a cat by giving the images a “cat” label, then training the system over time. The “supervised learning” I wrote about in my thesis pertains to human self-transformation: that if we want to become a different type of person, we have to think a certain way, then practice those thoughts so we don’t default to our old habits.
Years ago, I gave a talk about why my background as an intellectual historian of math and philosophy actually makes me a great product marketer. My work doesn’t ask whether philosophers like Descartes or Leibniz or Newton were “right”; rather, it asks what did they think they were thinking? So, my task was to read everything they’d read and try to reconstruct what they thought so as to reinterpret what they were saying. It’s an excellent skill set for someone in business development because when you’re working as a translator between academic machine learning researchers and businesspeople, you have to do that work on both sides. How do the researchers think? What are they reading? How do they use language to express their point of view? Similarly, how do the bankers in the various divisions of the bank think? What do they read? How do they see the world? And, most importantly, can we make those two points meet at the intersection? These are the unique translation skills my background has provided, and I’ve seen it unfold to great effect in the boardroom. I’m really looking forward to adapting it to this next chapter of my career at Borealis AI.
Whether you’re training a model on an enormous dataset with an industrial scale server or deploying a small model for a cellphone application, energy is often a fundamental bottleneck. I suspect algorithmic innovations that provide greater energy efficiency will be necessary to push forward the next frontier of machine learning. It’s with this mindset (and with my own paper in tow) that I attended this workshop.
In a previous blog post, I discussed Max Welling’s Intelligence per Kilowatt-Hour paper, which he presented at the ICML conference in Stockholm last summer. Machine learning models are helping to solve increasingly difficult tasks, but as a natural result of scale, some of these models are getting enormous. Often, our community creates models that work just for the particular task at hand; if these techniques are to be widely deployable, we must work to decrease the energy they consume. For this reason, Prof. Welling argued that machine learning should be judged by its intelligence per unit of energy. The CDNNRIA workshop I attended seemed like a natural response to Prof. Welling’s ICML presentation: not only was he one of the workshop organizers, but the focus was on compact (i.e., more energy-efficient) neural networks.
Since I’m personally quite passionate about this topic, I’ve spent time exploring it in various forms of scientific inquiry. My workshop paper, On Learning Wire-Length Efficient Neural Networks, which I worked on with co-authors Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, and Marcus Brubaker, attempted to tackle an aspect of this important topic. I was honoured to present it. In this post, I will summarize our paper, highlight other interesting papers that relate to the subject, present the results of an experiment inspired by some of the additional workshop papers, then draw an emergent lesson from the workshop about the value of negative results.
A classic paper in the field, called Optimal Brain Damage, first introduced the basic training and pruning pipeline. The standard technique for creating energy-efficient neural networks involves assuming some initial architecture, initializing weight and bias values of the network, then modifying those parameters so that the network closely fits training data. This step is called "training". The next step – “pruning” – involves deleting the edges of the network that are somehow deemed "unimportant," then re-training the network after the edges are removed. The way “unimportant” gets defined, in this context, can vary depending on the specific technique. The big revelation of machine learning is that this pipeline works, and when done iteratively, the number of parameters in the model can be reduced by upwards of 50 times with no decrease in accuracy. In fact, there’s often an increase in accuracy.
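As a minimal sketch (not the Optimal Brain Damage criterion itself), the pruning step can be illustrated with magnitude-based pruning, one common way to operationalize "unimportant"; the function name and flat-list representation here are illustrative:

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the smallest-magnitude fraction of weights.

    `weights` is a flat list of floats; returns (pruned, mask), where
    mask[i] is True for surviving weights. Magnitude is one common proxy
    for "importance"; the literature uses several other criteria.
    """
    n_prune = int(len(weights) * fraction)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    mask = [i not in dropped for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned, mask = prune_by_magnitude(w, 0.5)
# The three smallest-magnitude weights (-0.05, 0.01, 0.2) are zeroed.
```

In the full pipeline, the surviving weights would then be re-trained with the mask held fixed, and the train/prune cycle iterated.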
Most previous work evaluates the performance of pruning algorithms using a single metric: the number of non-zero parameters. But as we shall see, this is not the only criterion. Some of the notable work at the CDNNRIA workshop considered energy consumption under an assumed three-level cache architecture (as discussed below). The existing work – both the cache-architecture work and the non-zero-parameter work – models energy consumption well on general-purpose circuitry with fixed memory hierarchies. However, some machine learning applications (say, image recognition) may need to be deployed more widely, beyond such general-purpose hardware.
As with error-control coding, specialized neural networks may directly implement the edges of the neural network as wires. In this case, however, it is the total wiring length and not the number of memory accesses that will dominate energy consumption. The reason for this is due to the resistive-capacitive effects of wires, but more generally, it occurs because wiring length is a fundamental energy limitation of all the practical computational techniques that we can conceive. This hinges upon a basic fact: real systems have friction.
With this context established, our paper seeks to introduce a simple criterion for analyzing energy. We called it the “wire-length”, or information-friction model. Our model is inspired by the works of Thompson, as well as the more recent work of Grover, and by my own PhD thesis, in which energy is proportional to the total length of all the edges connecting the neurons of the network. The technique involves placing the nodes of the neural network on a three-dimensional grid, so the nodes are at least one unit-distance apart. Then, if two nodes are connected by a wire in the neural network, the length of the wire becomes the Manhattan distance between the two nodes they connect. The task we define is to find a neural network that is both accurate and has a placement of nodes that could be considered wire-length efficient.
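The cost model itself is easy to state in code. Below is a small sketch with a hypothetical placement and edge list; in the actual task, the placement would be optimized jointly with accuracy:

```python
def manhattan(p, q):
    """L1 distance between two grid points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_wire_length(placement, edges):
    """Sum of Manhattan wire lengths over all edges of the network.

    `placement` maps node id -> (x, y, z) grid coordinates (nodes are
    at least one unit-distance apart); `edges` lists the wired pairs.
    """
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)

# Hypothetical 4-node network placed on a 3D grid.
placement = {"a": (0, 0, 0), "b": (1, 0, 0), "c": (0, 2, 0), "d": (1, 1, 1)}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
# a-b: 1, a-c: 2, b-d: 2, c-d: 3  ->  total wire-length 8
```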
In our paper, we introduced three algorithms that can be combined and used at the training, pruning, and node placement steps. Our experiments show that each of our techniques is independently effective, and by combining them and using a hyperparameter search we can get even lower energy, which has allowed us to produce benchmarks for some standard problems. We also found that the techniques worked across datasets.
Several workshop papers submitted in parallel to the conference caught my attention. This one, authored by Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin and Richard Baraniuk, seemed to tackle similar themes as ours. What interested me most was that it sought a way to make energy efficient neural networks using a technique distinguished from the standard training-pruning pipeline:
This paper is also interesting for the way it suggests minimizing total energy in a different manner than the standard pruning paradigm. The human brain is hyper-optimized for minimizing energy consumption, and if our machine learning techniques are to mimic the kinds of tasks performed by the brain, I suspect we will have to use all kinds of techniques to keep energy costs under control. The “skip policy” idea of Wang et al. may be one such technique useful on the road to more energy efficient artificial intelligence.
In the normal iterative training, pruning, re-training framework, we keep the weights of the neural network the same post-pruning (except, of course, for the weights associated with pruned edges) then re-train the weights from this point onward. The idea behind this methodology is that the training-pruning process helps the network learn important edges and important weights.
Two similar papers submitted to the workshop added a twist to this paradigm. They showed that if the weights are randomly re-initialized after some training and pruning, and the network is then re-trained from this random re-initialization, a higher accuracy can be obtained. This result suggests that pruning helps find important connections, but not important weights. It contradicts what many (including myself) would have intuited: that pruning lets you learn both important weight values and important connections.
The experimental evidence across these two papers could provide a very easy-to-implement tool for the machine-learning practitioner, and I’m curious to see whether the technique gets widely adopted. There’s a good chance it will, as one of the papers, “Rethinking the Value of Network Pruning”, won the workshop’s best paper award.
However, while the papers’ results suggest a simple approach to achieving higher accuracy on low-energy machine learning models, it raises the question of whether this tool will work on the pruned architectures we obtained when optimizing for wire-length. The fact that the result was discovered independently by two different groups gave me more confidence that it would work for us. In the spirit of learning from the workshop, we decided to try it out back at Borealis AI HQ.
We obtained the best-performing model using distance-based regularization at different target accuracies. Then, we re-initialized the weights of the resulting network before we re-trained. The table below shows the results:
| Accuracy before re-initialization (%) | Accuracy after re-initialization (%) | Accuracy after re-training (%) |
|---|---|---|
| 98 | 8.92 | 97.08 |
| 97 | 10.1 | 84.64 |
| 95 | 10.32 | 61.94 |
| 90 | 10.09 | 30.76 |
The left column presents the accuracy before re-initialization; the middle column shows the accuracy immediately after re-initialization; and the final column gives the accuracy after re-training the re-initialized network. As we can see, we consistently get lower accuracy than the network had before re-initialization, which suggests the re-initialization approach does not work here. We also find the technique becomes less effective as the re-initialized networks get smaller, which may explain why it failed in our setting.
Does this contradict the results of the workshop papers discussed above? No. But it does suggest the general approach is less likely to work than we might have thought. Notably, the authors of “Rethinking the Value of Network Pruning” present negative results of their own, so we might have guessed the technique wouldn’t work on our networks trained for shorter wire-length. In the next section, I’ll discuss why I think having negative results is so useful.
It’s a machine learning truism that any single result is hard to pinpoint as being independently important, but the many pieces of evidence reported across multiple papers allows us to draw an emergent lesson. I view machine learning as a grab-bag of techniques that help solve new classes of computational problems. They don’t always work, but all-too-often some of them do. This allows us to use the literature in a way that produces rules-of-thumb to inform techniques that might work. For example, if we looked at the original Optimal Brain Damage paper in isolation, it might be hard to discern the paper’s broad applicability. But the fact that the standard training-pruning pipeline has been so widely used, and that so many modifications of the technique (including our wire-length pruning work) also work, gives us confidence in the idea’s ability to capture something basic and fundamental – that doing some type of pruning is appropriate if minimizing energy consumption is a major concern.
Due to the multiplicity of possible techniques, the only thing machine learning practitioners can do is test them out in the first place, set good evaluation criteria, and see if they work. Since engineering and computational resources are limited, this also means judiciously choosing which techniques to take on. This process requires a careful balancing of engineering risk and reward.
So, while it may be worth it to try a technique, the decision depends on the particulars of the problem and the probability that the technique will be successful. Presenting negative results allows our readers to intuit this probability. Moreover, negative results inform researchers about areas that have already been attempted and saves them the effort of re-testing them.
The value of negative results can be further illustrated with a concrete thought experiment. Suppose your goal is to create a neural network in a place where computational resources are free and plentiful, and the tool is not going to be widely deployed. Perhaps such a network is used in an internal tool at a small company. The engineer, in this case, might ask: Should I try the re-initialization and re-training technique in order to get a more accurate small network?
Since the papers (and our experiment) suggest the technique only works some of the time, it may not be worth the effort to try: there’s only a slim chance of success and the reward margin is small. However, suppose the network were to be deployed to a billion cellphones – say, in a popular social media application. In that case, it makes sense to try this technique, along with a number of others, to ensure the tool uses as little energy as possible.
Real problems may fit somewhere between these two extremes, and choosing the right approach requires having a finely tuned sense of the probabilities that they will work. Having a collection of experimental results in the literature, both positive and negative, helps the engineer make the right judgment call about whether a technique is worth the effort.
Right now, we have enormous potential in the field, but we have very limited human talent and limited computational resources. We should take on the responsibility to ensure we draw the right lessons from the work we do and present our work in as useful a way as possible. That’s why “Rethinking the Value of Network Pruning” is a strong output: not only does it find a surprising and successful technique, it also presents negative results. The quality of the scientific analysis in the paper makes it, in my opinion, a worthy recipient of the workshop’s Best Paper Award, and it hopefully sets a precedent for more researchers to publish negative results for the greater good of the field.
*Special thanks to Luyu Wang for running the re-initialization experiment in this post.
The Pommerman environment [1] is based on the classic Nintendo console game Bomberman. It was set up by a group of machine learning researchers to explore multi-agent learning and to push the state of the art in reinforcement learning through competitive play.
The team competition was held on December 8, 2018 during the NeurIPS conference in Montreal. It involved 25 participants from all over the world. The Borealis AI team, consisting of Edmonton researchers Chao Gao, Pablo Hernandez-Leal, Bilal Kartal and research director, Matt Taylor, won 2nd place in the learning agents category, and 5th place in the global ranking including (non-learning) heuristic agents. As a reward, we got to haul a sweet NVIDIA Titan V GPU CEO Edition home. Here’s how we pulled it off.
The Pommerman team competition consists of four bomber agents placed at the corners of an 11 x 11 symmetrical board. There are two teams, each consisting of two agents.
Competition rules work like this:
At every timestep, each agent has the ability to execute one of six actions: they can move in any one of four cardinal directions, remain in place, or plant a bomb.
Each cell on the board can serve as a passage (the agent can walk over it), a rigid wall (the cell cannot be destroyed), or a plank of wood (the cell can be destroyed with a bomb).
The game maps, which function as individual levels, are randomly generated; however, there is always a path between any two agents, so the procedurally generated maps are guaranteed to be playable.
Whenever an agent plants a bomb it explodes after 10 timesteps, producing flames that have a lifetime of two timesteps. Flames destroy wood and kill any agents within their blast radius. When wood is destroyed, the fallout reveals either a passage or a power-up (see below).
Power-ups, which are items that impact a player’s abilities during the game, can be of three types: i) they increase the blast radius of bombs; ii) they increase the number of bombs the agent can place; or iii) they give the ability to kick bombs.
Each game episode lasts up to 800 timesteps. There are two ways to end a game: if a team wins before reaching this upper bound, the game is over. If not, a tie is called at 800 timesteps and the game ends that way.
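As an illustrative summary of the rules above (the constants come from the rules; the function and names are my own, not the actual Pommerman code), the episode logic looks like this:

```python
# Illustrative constants matching the rules described above.
ACTIONS = ["up", "down", "left", "right", "stop", "bomb"]  # six actions
BOMB_LIFE = 10       # timesteps until a planted bomb explodes
FLAME_LIFE = 2       # timesteps flames persist after the explosion
MAX_TIMESTEPS = 800  # episode length cap

def episode_result(timestep, alive_teams):
    """Return the outcome string, or None while the game continues."""
    if len(alive_teams) == 1:
        return f"team {alive_teams[0]} wins"  # a team won before the cap
    if timestep >= MAX_TIMESTEPS:
        return "tie"                          # cap reached: tie is called
    return None

episode_result(120, [0, 1])  # both teams alive, game continues
episode_result(130, [1])     # only team 1 remains: it wins
episode_result(800, [0, 1])  # timestep cap reached: tie
```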
An example of a Pommerman team game.
The Pommerman team competition is a very challenging benchmark for reinforcement learning methods. Here’s why:
When an agent is in the early stage of training, it commits suicide many times.
After some training, the agent learns to place bombs near the opponent and move away from the blast.
Our learning agent (white) is highly skilled against a SimpleAgent. It avoids the blasts and also learns how to trick SimpleAgent to commit suicide in order to win without having to place any bombs.
When we examined the behavior of our learning agent against SimpleAgent, we discovered that our agent had learned how to force SimpleAgent to commit suicide. The pattern starts when SimpleAgent places a bomb and then moves toward a neighboring cell X. Our agent, having learned this opponent behaviour, simultaneously moves toward the same cell X; by the game engine’s forward model, both agents are then sent back to their original locations on the next timestep. This repeats until the bomb goes off, blasting SimpleAgent. In other words, our agent had found a flaw in SimpleAgent and exploited it to win games. This policy is optimal against SimpleAgent, but it generalizes poorly to other opponents: agents trained this way learned to stop placing bombs, making themselves easy targets for exploitation.
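The collision rule behind this exploit can be sketched as follows; the agent names and positions are hypothetical, but the bounce-back behaviour mirrors the forward model described above:

```python
def resolve_moves(positions, intents):
    """Bounce agents back when they contend for the same target cell.

    `positions` maps agent -> (row, col); `intents` maps agent -> the
    cell it tries to enter. If two agents target the same cell, both
    stay where they were - the behaviour our agent learned to exploit.
    """
    targets = list(intents.values())
    new_positions = {}
    for agent, target in intents.items():
        if targets.count(target) > 1:
            new_positions[agent] = positions[agent]  # bounced back
        else:
            new_positions[agent] = target
    return new_positions

pos = {"ours": (3, 4), "simple": (3, 6)}
# Both agents step toward the cell (3, 5): neither moves, so
# SimpleAgent stays next to its own bomb until it explodes.
moved = resolve_moves(pos, {"ours": (3, 5), "simple": (3, 5)})
```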
Note: Generalization over opponent policies is of utmost importance when dealing with dynamic multiagent environments; similar problems have also been encountered in Laser Tag [6].
Faulty (and strange) behaviors have also been observed in single-agent tasks [7]. An RL agent trained on CoastRunners discovered a spot in the game where, due to an unexpected mismatch between the maximum possible reward and the intended behaviour, it could obtain higher scores by lingering there rather than finishing the game.
A Skynet team is composed of a single neural network and is based on five building blocks:
We use parameter sharing: the two agents share the parameters of a single network, which is therefore trained on the experiences of both agents. Diverse behavior between the agents is still possible because each receives different observations.
Additionally, we added dense rewards to improve the agents’ learning performance. We took inspiration from the difference-reward mechanism to give each agent a more meaningful measure of its individual contribution, in contrast to simply using a single global reward.
Our third block, the ActionFilter module, builds on the philosophy of instilling prior knowledge by telling the agent what it should not do, then letting it discover what to do by trial and error, i.e., learning. The benefit is twofold: 1) the learning problem is simplified; and 2) superficial skills, such as avoiding flames or evading bombs in simple cases, are acquired perfectly.
It is worth mentioning that the ActionFilter is extremely fast and does not significantly slow down RL training. Together with the neural net evaluation, each action still takes only a few milliseconds – almost the speed of pure neural-net forward inference. For context, the time limit in the competition is 100 ms per move.
The neural net is trained by $\mathit{PPO}$, minimizing the following objective:
\begin{equation}
\begin{split}
o(\theta;\mathcal{D}) & = \sum_{(s_t, a_t, R_t) \in \mathcal{D}} \Bigg[ -\mathit{clip}(\frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)}, 1-\epsilon, 1+\epsilon) A(s_t, a_t) + \\
& \frac{\alpha}{2} \max\Big[ (v_\theta(s_t) -R_t)^2, (v_\theta^{old}(s_t) + \mathit{clip}(v_\theta(s_t) - v_\theta^{old}(s_t), -\epsilon, \epsilon)-R_t)^2 \Big] \Bigg],
\end{split}
\end{equation}
where $\theta$ denotes the neural net parameters, $\mathcal{D}$ is sampled by $\pi_\theta^{old}$, and $\epsilon$ is a tuning parameter. Refer to the PPO paper [13] for details.
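A per-sample sketch of this objective in plain Python (function and argument names are my own; this mirrors the formula as written, including the clipped value term):

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_sample_loss(pi, pi_old, adv, v, v_old, ret, eps=0.2, alpha=0.5):
    """Per-sample loss mirroring the objective above.

    pi / pi_old: action probabilities under the current and old policies;
    adv: advantage estimate A(s_t, a_t); v / v_old: current and old value
    predictions; ret: empirical return R_t.
    """
    ratio = pi / pi_old
    # Policy term: clipped probability ratio weighted by the advantage.
    policy_term = -clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Value term: max of the unclipped and clipped squared errors.
    v_clipped = v_old + clip(v - v_old, -eps, eps)
    value_term = (alpha / 2.0) * max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return policy_term + value_term

# A ratio of 1.5 with eps = 0.2 is clipped to 1.2 before weighting.
loss = ppo_sample_loss(pi=0.3, pi_old=0.2, adv=1.0, v=0.5, v_old=0.9, ret=0.4)
```

In training, this quantity is summed over the minibatch $\mathcal{D}$ and minimized with respect to $\theta$.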
We let our team compete against a set of curriculum opponents:
The reason we allowed the opponent to not place bombs is that the neural net can then focus on learning true “blasting” skills, rather than a skill that relies solely on the opponent’s mistakenly suicidal actions. This strategy also avoids training on “false positive” reward signals caused by an opponent’s involuntary suicide.
As shown in the figure below, the architecture stacks four convolution layers, followed by a policy head and a value head.
Instead of using an LSTM to track the observation history, we used a “retrospective board” that keeps the most recently observed value of each cell on the board; cells outside the agent’s current view simply retain their last observed values. The input features comprise 14 planes in total: the first 10 are extracted from the agent’s current observation, while the remaining four come from the “retrospective board”.
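The bookkeeping can be sketched in a few lines (the cell encodings and names here are illustrative, not the actual feature planes):

```python
def update_retrospective_board(memory, observation):
    """Overwrite remembered cells with freshly observed ones.

    `memory` and `observation` map (row, col) -> cell value; cells
    outside the agent's current view are absent from `observation`
    and keep their most recently observed value in `memory`.
    """
    memory = dict(memory)  # keep the update functional
    memory.update(observation)
    return memory

memory = {(0, 0): "wood", (0, 1): "passage", (5, 5): "rigid"}
# The agent currently sees only two cells; (5, 5) stays as remembered.
seen = {(0, 0): "passage", (0, 1): "bomb"}
memory = update_retrospective_board(memory, seen)
```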
An example of a game between Skynet Team (Red) vs a team composed of two SimpleAgents (Blue).
The Pommerman team competition used a double-elimination format. The top three agents used tree-search methods, i.e., they actively employed the game’s forward model, guided by heuristics, to look ahead for each decision. During the competition, these agents seemed to perform more bomb-kicking, which increased their chances of survival.
As mentioned in the introduction, our Skynet team won 2nd place on the category of learning agents, and 5th place on the global ranking, including (non-learning) heuristic agents. It is worth noting that scripted agents were not among the top players in this competition, which shows the high quality level amongst the tree search and learning methods.
Another one of our submissions, CautiousTeam, was based on SimpleAgent and – interestingly enough – wound up ranking 7th overall in the competition. CautiousTeam was submitted primarily to verify our suspicion that a SimpleAgent that never places a bomb could be as strong as (or perhaps even stronger than) the winner [3] of the first competition held in June, i.e., a fully observable free-for-all scenario. The competition results seem to support this suspicion.
Aside from being an interesting and (most importantly) fun environment, the Pommerman simulator was also designed as a benchmark for multiagent learning. We are currently exploring multi-agent deep reinforcement learning methods [5] by using Pommerman as a testbed.
We would like to thank the creators of the Pommerman testbed, the competition organizers and the growing Pommerman community on Discord. We look forward to future competitions.
[1] Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multiagent playground. arXiv preprint arXiv:1809.07124, 2018.
[2] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[3] Zhou, Hongwei, et al. "A hybrid search agent in Pommerman." Proceedings of the 13th International Conference on the Foundations of Digital Games. ACM, 2018.
[4] Bilal Kartal, Pablo Hernandez-Leal, and Matthew E. Taylor. "Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL." arXiv preprint arXiv:1812.00045(2018).
[5] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.
[6] Lanctot, Marc, et al. "A unified game-theoretic approach to multiagent reinforcement learning." Advances in Neural Information Processing Systems. 2017.
[7] Open AI. Faulty Reward Functions in the Wild. https://blog.openai.com/faulty-reward-functions/
[8] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Skill Reuse in Partially Observable Multiagent Environments. LatinX in AI Workshop @ NeurIPS 2018
[9] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
[10] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. "Learning to communicate with deep multi-agent reinforcement learning." In: Advances in Neural Information Processing Systems, 2016.
[11] J. N. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, S. Whiteson, Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning., in: International Conference on Machine Learning, 2017.
[12] Devlin, Sam, et al. "Potential-based difference rewards for multiagent reinforcement learning." Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems.
[13] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[14] Bansal, Trapit, et al. "Emergent complexity via multi-agent competition." arXiv preprint arXiv:1710.03748 (2017).
In this section, we provide more details on some of the aforementioned concepts as follows:
Difference rewards: This is a method to better address the credit-assignment challenge in multi-agent teams. Relying only on the external reward, both agents receive the same team (global) reward regardless of what they did in the episode. This makes multi-agent learning harder, as spurious actions can occasionally be rewarded, and some agents can learn to do most of the work while the rest learn to be lazy. Difference rewards [12] compute each agent’s individual contribution without hurting coordination performance. The main idea is very tidy: you compute an individual reward by subtracting from the external global reward a counterfactual reward computed with the corresponding agent’s action removed. Team members are thus encouraged to optimize overall team performance as well as their own contribution, so no lazy agents can arise.
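A toy sketch of the difference-reward idea, with a hypothetical team-reward function (the real global reward would come from the environment):

```python
def difference_reward(global_reward_fn, joint_actions, agent):
    """D_i = G(z) - G(z_{-i}): the team reward minus the counterfactual
    team reward with agent i's action removed (replaced by a no-op)."""
    g = global_reward_fn(joint_actions)
    counterfactual = dict(joint_actions, **{agent: None})  # no-op for agent i
    return g - global_reward_fn(counterfactual)

# Hypothetical team reward: +1 per agent that actually contributes.
def team_reward(actions):
    return sum(1.0 for a in actions.values() if a is not None)

actions = {"agent_0": "plant_bomb", "agent_1": None}  # agent_1 is lazy
# agent_0's difference reward is 1.0; the lazy agent_1 gets 0.0,
# even though both would receive the same global reward.
```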
Centralized training and decentralized execution: Even though execution in the team game is partially observable, the game simulator can be hacked to provide access to the full state during training. With full-state access, you can train a centralized value function for the actor-critic setting; the deployed agents, however, only use the policy network, which is trained on partial observations.
Dense rewards: There can be cases where one agent commits suicide and the remaining team member terminates the opposing team by itself. In the simplest case, both team members would still get a +1 reward – i.e., the suicidal agent is reinforced! To address this, we altered the single sparse external reward so that the first team member to die receives a smaller reward than the last survivor. This helped our agent improve its game-play, although the modification comes with some expected and some unforeseen consequences. For instance, under this setting we can never fully reward a team where one agent sacrifices itself (dying simultaneously with an enemy) for the team to win, and we may allot less credit to a hard-working agent that terminated one enemy but died while battling the second.
Action filter description: We implement a filter with two categories:
For avoiding suicide
For placing bombs
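A minimal sketch of such a two-category filter, assuming a Pommerman-style grid world (the move encoding, helper names, and danger-cell representation are ours):

```python
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0), "stop": (0, 0)}

def filter_actions(pos, actions, danger_cells, has_escape_route):
    """Keep only 'safe' actions: (1) never step onto a cell covered by flames
    or an imminent blast (suicide avoidance), and (2) only place a bomb when
    an escape route exists from the current position (bomb placement)."""
    safe = []
    for a in actions:
        if a == "bomb":
            if has_escape_route(pos):        # category 2: bomb placement
                safe.append(a)
        else:
            dx, dy = MOVES[a]
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in danger_cells:      # category 1: suicide avoidance
                safe.append(a)
    return safe

# Standing at (1, 1) with flames about to cover (1, 0) and no escape route:
print(filter_actions((1, 1), ["up", "down", "bomb"], {(1, 0)}, lambda p: False))  # ['down']
```

The filter simply masks the action set before sampling from the policy, so learning still happens over the remaining safe actions.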
Authors: Yu-An Chung · Wei-Hung Weng · Schrasing Tong · James Glass
Last year, I was surprised by a paper that introduced a technique to perform word translation between two languages without parallel corpora. To clarify: corpora are non-parallel when text exists in two languages (e.g., a set of English words and a set of French words) but there is no information about which English word corresponds to which French translation.
Previously, state-of-the-art methods for learning cross-lingual word embeddings relied mainly on bilingual dictionaries, with some help from character-level information for languages that share a common alphabet. None of this was competitive with supervised machine translation techniques. The authors of the paper posed a different question: is unsupervised word translation possible? They answered their own question by introducing a new technique that worked quite well.
Their model worked by obtaining the word embeddings space for both languages, independently, and introducing a technique for unsupervised alignment between the two embedding spaces that can achieve translations without parallel corpora. The intuition behind this technique is to rotate one embedding space to the point that the two embedding spaces are virtually indistinguishable to a classifier (i.e., adversarial training).
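One standard way to solve for such a rotation, once word pairs are (even noisily) matched, is orthogonal Procrustes; the sketch below is our own construction, not the paper's code, and uses a synthetic hidden rotation in place of real embeddings:

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Closed-form orthogonal W minimizing ||X W^T - Y||_F over orthogonal W,
    i.e. the rotation mapping source embeddings X onto target embeddings Y
    (rows are matched word pairs): W = U V^T from the SVD of Y^T X."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))             # "source language" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ Q.T                               # target space: a hidden rotation of the source
W = procrustes_rotation(X, Y)
print(np.allclose(X @ W.T, Y))            # the hidden rotation is recovered
```

In the fully unsupervised setting, the adversarial game provides the initial (noisy) matching that makes a refinement step like this possible.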
This year, a new set of authors presented work on automatic speech recognition without parallel data. This, again, means that independent sets of speech data and text data exist, but the correspondence between them is unknown. The work stood out as the first successful attempt to apply the unsupervised alignment technique introduced last year across multiple modalities of data. The task involved taking a dataset of written words from one language and a dataset of spoken words from either the same or a different language, and automatically identifying the spoken words without parallel information.
The authors first trained an embedding space for written words and another embedding space for spoken words. They then applied the unsupervised alignment technique to the embedding spaces to align them so that spoken words could automatically be classified and translated. At test time, a speech segment is first mapped into its respective embedding space, aligned to the text embedding space, then the nearest neighbors of the text embedding are picked as the translation. The same procedure can be used for the text-to-speech conversion task.
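The test-time retrieval step (embed, map through the learned alignment, take nearest neighbours) can be sketched as follows; the toy embeddings, the rotation `W`, and all names are illustrative assumptions, not the paper's API:

```python
import numpy as np

def translate(speech_vec, W, text_vecs, text_words, k=1):
    """Map an aligned speech embedding into the text space with the learned
    rotation W, then return the k nearest text embeddings (cosine
    similarity) as candidate transcriptions."""
    q = W @ speech_vec
    q = q / np.linalg.norm(q)
    T = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    idx = np.argsort(-(T @ q))[:k]
    return [text_words[i] for i in idx]

words = ["cat", "dog", "house"]
T = np.eye(3)                      # toy text embeddings
W = np.eye(3)                      # identity alignment for the demo
print(translate(np.array([0.1, 0.9, 0.2]), W, T, words))  # ['dog']
```

Swapping the roles of the two spaces gives the text-to-speech direction mentioned above.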
The authors present experiments on the Spoken Wikipedia and LibriSpeech datasets showing that unsupervised alignments are still not as good as supervised ones – but they're close. Some challenges remain before unsupervised cross-modal alignments can be competitive with supervised ones; still, this work shows promise for improving automatic speech recognition (ASR), text-to-speech (TTS) and even translation systems, especially for languages with little parallel data available. (/HS)
Authors: Rad Niazadeh · Tim Roughgarden · Joshua Wang
This paper was accepted as an oral presentation. The authors gave an approximation algorithm for maximizing continuous non-monotone submodular functions.
To give a brief recap, submodular functions arise in several important areas of machine learning and, in particular, around the intersection of economics and learning. They can be used to model the problem of maximizing multi-platform ad revenue, where a buyer wants to maximize their profit (revenue minus cost) by advertising on different platforms and there is a diminishing return to advertising on more platforms. This diminishing return is precisely the property captured by submodular functions. Mathematically, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is submodular if $f(S \cup \{e\}) - f(S) \geq f(T \cup \{e\}) - f(T)$ for every $S \subseteq T$ and $e \notin T$. In this setting, there is an information-theoretic lower bound of $1/2$-approximation [Feige et al.'11] and an optimal algorithm that matches this bound [Buchbinder et al.'15].
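The diminishing-returns condition can be checked by brute force for a small ground set; the ad-platform toy below (our own example) is a coverage function, a classic submodular family:

```python
from itertools import combinations

def marginal(f, S, e):
    """Marginal gain of adding element e to set S."""
    return f(S | {e}) - f(S)

def is_submodular(f, ground):
    """Brute-force check of diminishing returns:
    f(S + e) - f(S) >= f(T + e) - f(T) for all S ⊆ T, e ∉ T."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for S in subsets:
        for T in subsets:
            if S <= T:
                for e in ground - T:
                    if marginal(f, S, e) < marginal(f, T, e) - 1e-12:
                        return False
    return True

# Toy ad-revenue model: value of a set of platforms = audience reached.
reach = {"search": {1, 2}, "social": {2, 3}, "video": {3, 4}}
f = lambda S: len(set().union(*(reach[p] for p in S))) if S else 0
print(is_submodular(f, set(reach)))  # True
```

A second platform overlapping an existing audience adds less value than it would alone, which is exactly the diminishing-returns inequality.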
This paper considered the continuous submodular function where, instead of maximizing on the vertices of the hypercube $\{0,1\}^n$, we want to maximize over the full hypercube $[0,1]^n$. The main result of the paper is that they obtained a randomized algorithm for maximizing a continuous submodular and $L$-Lipschitz function over the hypercube that guarantees a $1/2$-approximation. Note that this is currently the best possible ratio that is information-theoretically achievable.
The reason this paper stood out is that the authors used the double greedy framework of Buchbinder et al.'15 to solve a coordinate-wise zero-sum game, and then used the geometry of this game to bound the value at its equilibrium. This is a nice application of game theory to maximizing the value of the function. The authors also conducted experiments on 100-dimensional synthetic data and achieved results comparable to the previous work they referenced. One thing we hoped to see was the better approximation ratios and faster algorithms translating into a significant advantage in the experiments, but that was not the case.
In terms of open problems, I am really excited to see the development of parallel and online algorithms for continuous submodular optimization. In fact, there is recent work on parallel algorithms by Chen et al.'18 which achieves a tight $(1/2 - \epsilon)$-approximation guarantee using $\tilde{O}(\epsilon^{-1})$ adaptive rounds. (/KJ)
Authors: Jiantao Jiao · Weihao Gao · Yanjun Han
This paper focused on estimating the differential entropy of a continuous distribution $f$ given $n$ i.i.d. samples. Entropy has been a core concept of information theoretic measures and has engendered numerous important applications, such as goodness-of-fit tests, feature selection, and tests of independence. In the vast body of literature around this concept, most of the measures have appeared to take on an asymptotic flavor – that is, until several recent works.
This paper is one of those works. The authors focused on the fixed-$k$ nearest neighbor (fixed-kNN) estimator, also called the Kozachenko-Leonenko estimator. This estimator is simple; there is only one parameter to tune, and it requires no knowledge of the smoothness degree $s$ of the target distribution $f$. Moreover, it is computationally efficient, since $k$ is fixed (compared to other methods with similar finite-sample bounds), and statistically efficient: as shown in this paper, it has a finite-sample bound that is close to optimal. All of these properties make the estimator realistic and attractive in practice.
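A minimal implementation of the fixed-k estimator (our own sketch of the Kozachenko-Leonenko formula, in nats; the brute-force distance matrix is for clarity, not efficiency):

```python
import numpy as np
from math import gamma, log, pi

EULER = 0.5772156649015329

def psi_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    return -EULER + sum(1.0 / j for j in range(1, m))

def kl_entropy(X, k=1):
    """Fixed-k (Kozachenko-Leonenko) differential entropy estimate in nats:
    H ≈ psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log r_k(i),
    where r_k(i) is the distance from x_i to its k-th nearest neighbour
    and V_d is the volume of the d-dimensional unit ball."""
    n, d = X.shape
    V_d = pi ** (d / 2) / gamma(d / 2 + 1)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r_k = np.sort(dists, axis=1)[:, k]   # column 0 is the point itself
    return psi_int(n) - psi_int(k) + log(V_d) + d * np.mean(np.log(r_k))

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 1))          # Uniform[0,1]: true entropy is 0
print(kl_entropy(X, k=1))                # ≈ 0
```

Note the single tuning parameter $k$ and the absence of any smoothness input – the two practical virtues highlighted above.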
I found the paper also carried some interesting technical results. One direct approach to estimating the differential entropy is to plug a consistent estimator of the density $f$ – for example, one based on kNN distance statistics – into the formula for entropy. However, such estimators usually come with an impractical computational demand. For instance, in the kNN-based estimator, $k$ has to approach $\infty$ as the number of samples $n$ approaches $\infty$.
In a recent paper, Han et al. [2017] constructed a complicated estimator that achieves a finite-sample bound at the rate $(n\log n)^{-\frac{s}{s+d}} + n^{-\frac{1}{2}}$ (the optimal rate). One caveat, though, is that it requires knowledge of the smoothness degree $s$ of the target distribution $f$. The last challenging part is to deal with the region where $f$ is small. A major difficulty in achieving such bounds for an entropy estimator is that the nearest neighbor estimator exhibits a huge bias in low-density regions. Most papers make assumptions on $f$ such that this bias is well controlled. However, this paper did not presume similar assumptions. Given all these constraints – fixed $k$, no knowledge of $s$, and no assumption that $f$ is bounded from below – the authors managed to prove a nearly optimal finite-sample bound for a simple estimator. According to the authors, the key new technical tools are the Besicovitch covering lemma and a generalized Hardy-Littlewood maximal inequality. This part is not yet clear to me.
Lastly, the authors also pointed out several weaknesses in their paper and their plans for future work. For example, they conjectured that both the upper bound and the lower bound in the paper could be further improved. They also hypothesized a way to extend the constraint on $s$ in the theorem so that the result can be applied to a more general setting. (/RH)
Authors: Kevin Scaman · Francis Bach · Sebastien Bubeck · Laurent Massoulié · Yin Tat Lee
This paper considered distributed optimization of non-smooth convex functions using a network of computing units. The objective of this work was to study the impact of the communication network on learning, and the tradeoff between the structure of the network and algorithmic efficiency. The network consists of a connected simple graph of nodes, each having access to a local function (such as a loss function). The optimization problem is to minimize the average of these local functions; communication between nodes takes a given length of time and computation takes one unit of time. Under a decentralized scenario, local communication is performed through gossip.
The authors give bounds on the time to reach a given precision, then provide an optimal algorithm that uses a primal-dual reformulation. They show that the error due to limited communication resources then decreases at a fast rate. In the centralized setting, the authors provide an algorithm whose convergence rate is within a factor of $d^{1/4}$ of optimal, where $d$ is the underlying dimension.
I found this paper intriguing because it considers the impact of communication and computation resources in learning, which will be increasingly important as systems we learn on become larger. It received one of the best paper awards and is one of few papers that consider such impacts. There’s an argument to be made that these two things are related; as learning systems scale up and get distributed through IoT and mobile devices, the importance of distributed learning in a setting where there is tension between communication and computation has also increased. The elegant analytical tools used in this paper – gossip methods, primal-dual formulation, Chambolle-Pock algorithm for saddle-point optimization, the combined use of optimization and graph theory, and the bounds that give insight into which resources are important at which stage of convergence – show that the award places well-deserved attention toward a growing area. (/NH)
Invited talk: Jon Kleinberg - Fairness, Simplicity, and Ranking
In Jon Kleinberg's fascinating invited talk, he addressed the effect of implicit bias on producing adverse outcomes. The specific application he referred to is bias in activities such as hiring, promotion, and admissions. The setting is as follows: a recruitment committee is tasked with selecting a shortlist of final candidates from a given pool of applicants, but their estimates of skill, used in the selection, may be skewed by implicit bias.
The Rooney Rule is an NFL policy in effect since 2003 that requires teams to interview ethnic-minority candidates for coaching and operations positions. (Note: There is no quota or preference given in the actual hiring). Kleinberg and his co-authors showed that measures such as the Rooney Rule lead to higher payoffs for the organization. Their model is as follows: a recruiting committee must select a list of k candidates for final interviews; the set of applicants is then divided into two groups, X and Y, X being the minority group; there are $n$ Y applicants and $n\alpha$ X applicants with $\alpha \le 1$. Each candidate has a numerical value representing their skill, and there is a common distribution from which these skills are drawn.
Based on empirical studies of skills in creative and skilled workforces, the authors modeled this distribution as a Pareto distribution (power law). The utility the recruiting committee aims to maximize is the sum of the true skills of the candidates selected to the list. The authors modeled the bias as multiplicative in the estimation of X-candidates' skills: Y candidates are estimated at their true value, while an X candidate $i$ with skill $X_i$ is estimated at $X_i/\beta$, where $\beta > 1$. The authors then analyzed the utility of a list of $k$ candidates where at least one must be an X candidate. Their analysis showed an increase in utility even when the list was of size 2, and for a large range of values of the bias, power-law, and population parameters.
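A Monte-Carlo sketch of this model (our own simplification; all parameter values are illustrative, not the paper's) shows the constrained list achieving higher expected true utility when the bias suppresses strong X candidates:

```python
import numpy as np

def shortlist_utility(n=1000, alpha=0.5, beta=2.0, k=4, tail=3.0, trials=1000, seed=0):
    """Skills are Pareto draws; the committee ranks by biased estimates
    (X skills divided by beta); utility is the sum of the *true* skills of
    the selected list. Compares unconstrained selection with a
    Rooney-style rule forcing at least one X candidate onto the list."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(2)
    for _ in range(trials):
        y = rng.pareto(tail, n) + 1.0                 # majority group Y
        x = rng.pareto(tail, int(alpha * n)) + 1.0    # minority group X
        est = np.concatenate([y, x / beta])           # biased skill estimates
        true = np.concatenate([y, x])
        top = np.argsort(-est)[:k]                    # unconstrained shortlist
        best_x = n + np.argmax(x)                     # best-estimated X candidate
        rooney = top.copy()
        if best_x not in rooney:
            rooney[-1] = best_x                       # swap out the weakest pick
        totals += [true[top].sum(), true[rooney].sum()]
    return totals / trials

u_plain, u_rooney = shortlist_utility()
print(u_rooney > u_plain)   # the constrained list has higher expected true utility
```

Intuitively, with a heavy-tailed skill distribution the strongest X candidate often has higher true skill than the marginal Y pick, but the bias $\beta$ hides this from the unconstrained ranking.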
I found this to be another very interesting and important paper because it tackles the question of fairness at a very practical level and provides a tangible algorithmic framework with which to expose and then analyze the outcomes. Furthermore, the modelling assumptions are very realistic, and the results demonstrate the potential for significant impact. The particular scenario considered here may be hiring or admissions, but the result has consequences for machine learning models.
In this work, we present a simple yet surprisingly effective way to prevent catastrophic forgetting. Our method, called Few-Shot Self Reminder (FSR), regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from the previous tasks.
Surprisingly, this simple approach requires retraining on only a small amount of data in order to outperform previous knowledge retention methods. We demonstrate the superiority of our method over previous ones on popular benchmarks, as well as on a new continual learning problem where tasks are designed to be more dissimilar.
We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural nets trained on MNIST and CIFAR10 variants achieve comparable clean accuracies, but they exhibit very different robustness when adversarially trained. This counter-intuitive phenomenon suggests that input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss practical implications on evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find improved scores across the board on all R@K-based metrics; the technique is also significantly more sample-efficient and converges in fewer iterations.
Authors: Kry Lui, Gavin Weiguang Ding, Ruitong Huang, Robert McCann
Poster: December 5, 10:45 am – 12:45 pm @ Room 210 & 230 AB #103
Dimensionality reduction occurs frequently in machine learning. It is widely believed that reducing more dimensions will often result in a greater loss of information. However, the phenomenon remains a conceptual mystery in theory. In this work, we try to rigorously quantify such phenomena in an information retrieval setting by using geometric techniques. To the best of our knowledge, these are the first provable information loss rates due to dimensionality reduction.
Authors: Christopher Blake, Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, Marcus Brubaker
Workshop: Compact Deep Neural Network Representation (Spotlight Paper); December 7, 2:50 pm
When seeking energy-efficient neural networks, we argue that wire length is an important metric to consider. Based on this idea, we develop and test new techniques to train neural networks that are both accurate and wire-length-efficient. This contrasts with previous techniques that minimize the number of weights in the network, and suggests these techniques may be useful for creating specialized neural network circuits that consume less energy.
Authors: Junfeng Wen, Yanshuai Cao, Ruitong Huang
Workshop: Continual Learning; December 7
We present a simple, yet surprisingly effective, way to prevent catastrophic forgetting. Our method regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from previous tasks. As little as one data point per class is found to be effective. With similar storage, our algorithm outperforms previous state-of-the-art methods.
Authors: *A.J. Bose, *Huan Ling, Yanshuai Cao
Workshop: ViGIL; December 7, 8 am – 6:30 pm
We present a new technique for hard negative mining for learning visual-semantic embeddings. The technique uses an adversary that is learned in a min-max game with the cross-modal embedding model. The adversary exploits compositionality of images and texts and is able to compose harder negatives through a novel combination of objects and regions across different images for a given caption. We show new state-of-the-art results on MS-COCO.
Authors: Gavin Weiguang Ding, Yik Chau (Kry) Lui, Xiaomeng Jin, Luyu Wang, Ruitong Huang
Workshop: Security in Machine Learning; December 7, 8:45 am – 5:30 pm
We demonstrate an intriguing phenomenon about adversarial training – that adversarial robustness, unlike clean accuracy, is highly sensitive to the input data distribution. In theory, we show this by analyzing the Bayes classifier’s robustness. In experiments, we further show that transformed variants of MNIST and CIFAR10 achieve comparable clean accuracies under standard training but significantly different robust accuracies under adversarial training.
Authors: Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor
Workshop: Latinx in AI Coalition; December 2, 8 am – 6:30 pm
Our goal is to tackle partially observable multiagent scenarios by proposing a framework based on learning robust best responses (i.e., skills) and Bayesian inference for opponent detection. In order to reduce long training periods, we propose to intelligently reuse policies (skills) by quickly identifying the opponent we are playing with.
Authors: Yash Sharma, Gavin Weiguang Ding
Workshop: NeurIPS 2018 Competition Track Day 1; 8 am – 6:30 pm
This challenge pitted submitted adversarial attacks against submitted defenses. The challenge was unique in that it allowed a limited number of queries outputting the decision of the defense, rewarded minimizing the $L_2$ distortion instead of using an $L_\infty$ distortion constraint, and used TinyImageNet instead of the full ImageNet dataset, making it tractable for competitors to train their own models. Our attack solution placed top-10 overall in the challenge, in particular placing 5th in the targeted attack track – a more difficult setting. We based our solution on performing a binary search to find the minimal successful distortion, then optimizing the procedure while still performing the necessary number of iterations to meet the computational constraints.
Northern Frontier sat down with Abhishek Gupta, AI ethics researcher at McGill University and founder of the Montreal AI Ethics Institute, to dive into some of the key themes of the day, including the threat automation poses to jobs given the current science, whether bias is the biggest problem we face in responsible AI, the dangers of 'mathwashing', and what we should consider reasonable trade-offs for improving fairness.
Dimensionality reduction occurs very naturally and very frequently within many machine learning applications. While the phenomenon remains, for the most part, a conceptual mystery, one thing many researchers believe is that reducing more dimensions will often result in a greater loss of information. What’s even harder to confirm is the rate at which this information loss occurs, as well as how to formulate the problem (even for very simple data distributions and nonlinear reduction mappings).
In this work, we try to rigorously quantify such empirical observations from an information retrieval perspective by using geometric techniques. We begin by formulating the problem through an adaptation of two fundamental information retrieval measures – precision and recall – to the (continuous) function analytic setting. This shift in perspective allows us to borrow tools from quantitative topology in order to establish the first provable information loss rate induced by dimension-reduction.
We were surprised to discover that when we began reducing dimensions, the precision would decay exponentially. This discovery should raise red flags for practitioners and experimentalists attempting to interpret their dimension reduction maps. For example, it may not be possible to design information retrieval systems that enjoy high precision and recall at the same time. This realization should keep us mindful of the limitations of even the very best dimension reduction algorithms, such as t-SNE.
While precision and recall are natural information retrieval measures, they do not directly take advantage of the distance information between data (e.g. in data visualization). We therefore propose an alternative dimension-reduction measure based on Wasserstein distances, which also provably captures the dimension reduction effect. To obtain this theoretical guarantee, we solve the following iterated-optimization problem:
\[
\inf_{W:\,\text{Vol}_n(W) = M} W_{2}(\mathbb{P}_{B_r}, \mathbb{P}_{W})
=
\inf_{W:\,\text{Vol}_n(W) = M} \inf_{\gamma \in \Gamma (\mathbb{P}_{B_r} , \mathbb{P}_{W})} \mathbb{E}_{(a, b) \sim \gamma} [ \| a - b \|^{2}_{2} ]^{1/2} ,
\]
by using recent results from optimal partial transport literature.
While precision and recall for supervised learning problems are familiar concepts, let’s do a quick review of it before we adapt it to the dimensionality reduction context.
In a supervised learning setting – say, classifying cats vs. dogs – first select your favorite neural net classifier $f_W$, then collect 1,000 test images.
\[
Precision = \frac{1}{2} \frac{\text{How many are cats}}{\text{Among the ones predicted as cats}} + \frac{1}{2} \frac{\text{How many are dogs}}{\text{Among the ones predicted as dogs}}
\]
\[
Recall = \frac{1}{2} \frac{\text{How many are predicted as cats}}{\text{Among the cats}} + \frac{1}{2} \frac{\text{How many are predicted as dogs}}{\text{Among the dogs}}
\]
Generally speaking, we can average precision and recall over n classes.
The formulation we used for precision and recall was inspired by the supervised learning setting. Since dimensionality reduction can actually happen in an unsupervised setting, we needed to change a few things around. In a typical dimensionality reduction map $f: X \rightarrow Y$, we often care about preserving the local structure post-reduction.
Since the unsupervised setting means there are no more labels, the first thing we did was to replace “label for an input x” by “neighboring points for an input x in high dimension.” We didn’t have the predictions either, but we felt it made sense to replace “prediction for an input x” by “neighboring points for an input y = f(x) in low dimension.”
When computing precision and recall in the supervised cases, we averaged across the labels. So, the second thing we did was to average over each data point.
\[
Precision = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~x~\text{in high dimension}}{\text{Among the low-dimensional neighbors of}~y = f(x)}
\]
\[
Recall = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~y = f(x)~\text{in low dimension}}{\text{Among the high-dimensional neighbors of}~x}
\]
But it was still hard to prove anything, even with the settings detailed above. One difficulty we faced is that $f$ is a map between continuous spaces, while the data points are finite samples. This motivated us to look for continuous analogues of these finite-sample notions.
Finally, we arrived at one of the paper’s key observations: Precision is roughly injectivity; recall is roughly continuity.
Let’s build some intuition with linear maps. In linear algebra, we learn that a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ must have null space of dimension at least $n - m$. One may interpret this as the “how” and “why” of when linear maps lose information: distant points in high-dimension are projected together in low-dimension. This process leads to very poor precision.
In practice, DR maps can be much more flexible than linear maps. So, can this expressivity circumvent the linear-algebraic dimensional mismatch? To study dimension reduction under continuous maps, we turned to the corresponding study of topological dimension: the waist inequality from quantitative topology. It turns out that a continuous map from high to low dimension still fails to circumvent the issue that plagues linear maps – many continuous maps collapse points together. For most $x$, we have, for $y = f(x)$: $\text{Vol}_{n-m}(f^{-1}(y)) \ge C(x)\, \text{Vol}_{n-m}(B^{n-m})$.
Roughly speaking, the relevant neighborhood $U$ of $x$ is typically small, in all $n$ directions, while the retrieval neighborhood $f^{-1}(V)$ is big in n-m directions. This quantitative mismatch makes it very difficult to achieve high precision for a continuous DR map. It’s this mismatch that leads to the exponential decay of precision:
\[
Precision^{f}(U, V)
\leq
D(n, m)\,\left(\frac{r_U}{R}\right)^{n-m}\,\frac{r_U^{m}}{p^{m}(r_V/L)}
\]
The above trade-off/information loss phenomenon has been widely observed by experimentalists. Naturally, practitioners have developed various tools to measure the imperfections. What was less clear in this regard is what led to the trade-off, so having clarified this a bit more, we can now design better measurement devices.
When we naively compute sample precision and recall:
\[
Precision =
\frac{\mathrm{~How~many~points~are~high~dimensional~neighbors~of}~x }{\mathrm{Among~the~low~dimensional~neighbors~of}~y}
\]
\[
Recall =
\frac{\mathrm{~How~many~points~are~low~dimensional~neighbors~of}~y }{\mathrm{Among~the~high~dimensional~neighbors~of}~x}
\]
These two quantities are equal when we fix the number of neighboring points. (The numerators are the same; when we fix the number of neighboring points, the denominators are equal as well.) Fixing the number of neighboring points is one of the reasons behind t-SNE’s success, since some data points are quite far from the others, and without fixing it some outliers wouldn’t have any neighboring points at all.
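A small numeric check of this equality, using a naive linear projection as the DR map (the data, the map, and all names are our own illustration):

```python
import numpy as np

def dr_precision_recall(X, Y, k=10):
    """Average precision/recall of a DR map, with 'labels' replaced by the
    k nearest neighbours of x in high dimension and 'predictions' by the
    k nearest neighbours of y = f(x) in low dimension. With the neighbour
    count fixed at k, the two quantities coincide."""
    def knn(Z):
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        return np.argsort(d, axis=1)[:, 1:k + 1]   # drop column 0 (the point itself)
    hi, lo = knn(X), knn(Y)
    overlap = np.array([len(set(hi[i]) & set(lo[i])) for i in range(len(X))])
    return (overlap / k).mean(), (overlap / k).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = X[:, :2]                     # a naive linear projection as the DR map
p, r = dr_precision_recall(X, Y)
print(p == r)                    # equal once the neighbour count is fixed
```

With both neighbourhoods capped at the same $k$, the overlap set is shared, so the two ratios are identical by construction.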
We can alternatively compute them by discretizing the continuous precision and recall shown above.
\[
Precision =
\frac{\mathrm{How~many~points~are~within~r_U~from}~x~\mathrm{and~within~r_V~from}~y}{\mathrm{How~many~points~are~within~r_V~from}~y}
\]
\[
Recall = \frac{\mathrm{How~many~points~are~within~r_U~from}~x~\mathrm{and~within~r_V~from}~y}{\mathrm{How~many~points~are~within~r_U~from}~x}
\]
But not only will this create an unequal number of neighboring points, it will result in quite a few data points ending up with very few neighbors. This is partially caused by high-dimensional geometry. Either way it can appear as though precision and recall are difficult quantities to manage in a practical situation.
Let’s revisit the problem from an alternative perspective. From the proof of the precision decay rate, it’s clear that the mismatch comes from $f^{-1}(V)$ and $U$ – this corresponds to the injectivity imperfection. Heuristically, the parallel quantities for the continuity imperfection are $f(U)$ and $V$.
We therefore proposed the following Wasserstein measures:
\[
W_{2}(\mathbb{P}_U, \mathbb{P}_{f^{-1}(V)});
W_{2}(\mathbb{P}_{f(U)}, \mathbb{P}_V)
\]
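In one dimension the empirical 2-Wasserstein distance has a closed form via sorted matching, which gives a minimal feel for measures of this kind (toy data and naming are ours, not the paper's estimator):

```python
import numpy as np

def w2_empirical_1d(a, b):
    """Empirical 2-Wasserstein distance between two equal-size 1-D samples:
    in one dimension the optimal coupling is the monotone (sorted) matching."""
    a, b = np.sort(a), np.sort(b)
    return np.sqrt(np.mean((a - b) ** 2))

# Two point clouds standing in for samples from U and f^{-1}(V):
u = np.array([0.0, 0.1, 0.2, 0.3])
v = np.array([1.0, 1.1, 1.2, 1.3])
print(w2_empirical_1d(u, v))  # 1.0 — the shift between the clouds
```

Unlike the overlap counts behind precision and recall, the Wasserstein measure reflects how far the two sets are from each other, not just whether they intersect.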
Like precision and recall, we associate the two Wasserstein measures with each point in the DR visualization map.
i) On a theoretical level, our work sheds light on the inherent trade-offs in any dimensionality reduction mapping model (e.g., visualization embedding).
ii) On a practical level, the implication is that practitioners now have a measurement tool with which to improve their data-exploration practice. To date, people have put too much trust in low-dimensional visualizations. Low-dimensional visualizations can at best imperfectly reflect high-dimensional data structures; at worst, they can produce incorrect representations, which then degrade any subsequent analysis built upon them. We strongly suggest that practitioners incorporate a reliability measure for each data point in all their data visualizations.
Deep reinforcement learning (DRL) is a recent yet very active area of research that joins forces between deep learning (the use of neural networks) and reinforcement learning (solving sequential decision tasks). In DRL, the goal is to learn an optimal policy (behavior) for an agent acting in an environment, with deep neural networks as the function approximator (for example, in the value function). DRL has achieved outstanding results so far, including surpassing human-level performance in Atari games [11] and DeepMind’s now famous Go victory over Lee Sedol [12]. This has led to a dramatic increase in the number of applications that use this technique, most notably in video games and robotics.
Much of the successful DRL research to date has only considered single-agent environments. For instance, in Atari games, there is only one player to control. However, recent works have started to up the ante beyond single-agent scenarios and have begun exploring multi-agent scenarios, where the environment is populated with several learning agents at once. The presence of multiple agents causes additional dynamicity in the environment and in the agents themselves, which makes learning more complicated.
Recent works have reported successes in multiagent domains, such as DOTA 2 [14] or Capture the flag [21], in which many agents learn to compete and/or cooperate in the same environment. Despite these promising results, however, there are still many open challenges to be addressed. This article aims to (i) provide a clear overview of current multiagent deep reinforcement learning (MDRL) trends and (ii) share both examples and lessons learned on how certain methods and algorithms from DRL and multiagent learning can be used in complementary ways to solve problems in this emerging area.
Almost 20 years ago, Stone and Veloso's seminal survey used very intuitive and practical examples [1] to lay the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of machine learning:
“AI researchers have earned the right to start examining the implications of multiple autonomous agents interacting in the real world. In fact, they have rendered this examination indispensable. If there is one self-steering car, there will surely be more. And although each may be able to drive individually, if several autonomous vehicles meet on the highway, we must know how their behaviors interact”.
Roughly ten years later, Shoham, Powers, and Grenager [2] noted that the literature on learning in multiagent systems, or multiagent learning (MAL), was on the rise and it was no longer possible to enumerate all relevant articles. In the decade since, the number of published MAL works continues to rise, resulting in a series of different surveys (reviews) that showcase everything from analyzing the basics of MAL and their challenges [3] to addressing specific subareas (e.g., cooperative settings and evolutionary dynamics of MAL) [4-10].
Research interest in MAL has been accompanied by a number of successes: first in single-agent Atari games [11], and more recently in two-player games such as Go and poker [12-13], as well as games involving two competing teams.
Deep reinforcement learning [15] plays a key role in these works and has successfully integrated with other AI techniques like (Monte Carlo tree) search, planning, and more recently, multiagent systems. The result is the emerging area of multiagent deep reinforcement learning (MDRL).
Learning in multiagent settings is fundamentally more difficult than the single-agent case due to problems [3-10] like:
• Non-stationarity: If all agents are learning at the same time, the dynamics become more complicated and break many standard RL assumptions.
• Curse of dimensionality: An exponential growth of state-action space when a learning agent keeps track of all agent actions.
• Multiagent credit assignment: Defining how agents should deduce their contributions when learning in a team; for example, if agents receive a team reward but it’s one agent doing most of the work, others can become “lazy.” It’s just like real life!
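The non-stationarity problem above can be made concrete with a small sketch. In this illustrative example (not from the article), two independent Q-learners play matching pennies; because each agent treats the other as part of its environment, the "environment" each one faces keeps shifting as the other learns:

```python
import random

random.seed(0)

ACTIONS = [0, 1]  # heads, tails
q1 = {a: 0.0 for a in ACTIONS}  # agent 1 wants to match
q2 = {a: 0.0 for a in ACTIONS}  # agent 2 wants to mismatch
alpha, eps = 0.1, 0.2           # learning rate, exploration rate

def pick(q):
    # Epsilon-greedy action selection.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(q, key=q.get)

for step in range(5000):
    a1, a2 = pick(q1), pick(q2)
    r1 = 1.0 if a1 == a2 else -1.0  # zero-sum: agent 2 gets -r1
    # Each agent updates as if facing a stationary environment -- but the
    # opponent's policy is changing underneath it, so the expected reward
    # of each action keeps drifting. This violates the stationarity
    # assumption behind standard single-agent RL convergence results.
    q1[a1] += alpha * (r1 - q1[a1])
    q2[a2] += alpha * (-r1 - q2[a2])

print(q1, q2)
```

Neither agent's Q-values settle to fixed action values the way they would against a fixed opponent, which is precisely why naive independent learning struggles in multiagent settings.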
Despite these complexities, top AI conferences such as AAAI, AAMAS, ICLR, IJCAI, and NIPS have all published works reporting MDRL successes. This validation from top-tier venues convinced us it would be valuable to compile an overview of recent MDRL works and understand how they relate to the existing literature.
In this context, we have identified four prominent categories in which to group recent works, as shown in the following figure.
(a) Analysis of emergent behaviors: evaluate DRL algorithms in multiagent scenarios.
(b) Learning communication: agents learn with actions and through messages.
(c) Learning cooperation: agents learn to cooperate using only actions and local observations.
(d) Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive).
The next objective of this article is to provide guidelines by showcasing how methods and algorithms from DRL and multiagent learning can complement each other to solve problems in MDRL. This occurs, for example, when:
• Dealing with non-stationarity
• Dealing with multiagent credit assignment
We also present general lessons learned from these works, such as the use of:
• Experience replay buffers in MDRL – a key component in many DRL works. These containers serve as explicit memory, storing interactions through which agents learn to improve their behaviors.
• Recurrent neural networks (e.g., LSTMs). These networks serve as implicit memory that improves performance, particularly for partially-observable environments.
• Centralized learning with decentralized execution: Agents can be trained with a central controller that has access to both agents’ actions and observations, but during deployment they will operate based solely on observations.
• Parameter sharing: In many tasks, it is useful to share a network’s internal layers, even when there are many outputs.
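The first lesson above, the experience replay buffer, can be sketched in a few lines. This is a minimal illustrative implementation (names and sizes are ours, not from any of the surveyed papers):

```python
import random
from collections import deque

class ReplayBuffer:
    """Explicit memory of past interactions, as described above."""

    def __init__(self, capacity=10000):
        # deque with maxlen evicts the oldest transitions automatically.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive
        # transitions, which stabilizes learning in many DRL methods.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.add(state=t, action=t % 4, reward=1.0, next_state=t + 1, done=False)

print(len(buf))    # capped at capacity: 100
batch = buf.sample(32)
print(len(batch))  # 32
```

Note that in MDRL, stored transitions can become stale as the *other* agents change their policies, which is one of the reasons replay buffers need extra care in multiagent settings.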
Towards the end of the article, we reflect on some open questions and challenges:
• On the challenge of sparse and delayed rewards.
Recent MAL competitions and environments (e.g., Pommerman [24], Capture the flag [21], MarLÖ, Starcraft II, and Dota 2) have complex scenarios wherein many actions must be taken before a reward signal becomes available. This is already a challenge for RL [16]; in MDRL it becomes even more problematic since the agents not only need to learn basic behaviors (like in DRL), but also need to learn the strategic element (e.g., competitive/collaborative) embedded within the multiagent setting.
• On the role of self-play.
Self-play (when all the agents use the same learning algorithm) is a MAL cornerstone that achieves impressive results [17-19]. While notable results have also occurred in MDRL, recent works have shown that plain self-play does not yield the best results [20, 21].
• On the challenge of the combinatorial nature of MDRL.
Monte Carlo tree search (MCTS) has been the backbone of major breakthroughs for AlphaGo and AlphaGo Zero, both of which used MCTS along with DRL. However, multiagent scenarios pose an additional challenge for centralized methods: the joint action space grows exponentially with the number of agents. Given more scalable planners [22, 23], there is room for research in combining MCTS-like planners with DRL techniques in multi-agent scenarios.
While there are a number of notable works in DRL and MDRL that represent important milestones for AI, we acknowledge there are also open questions in both single-agent learning and multiagent learning that demonstrate how much more work still needs to be done. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.
Full paper: https://arxiv.org/abs/1810.05587
[1] P. Stone, M. M. Veloso, Multiagent Systems - A Survey from a Machine Learning Perspective., Autonomous Robots 8 (2000) 345–383.
[2] Y. Shoham, R. Powers, T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence 171 (2007) 365–377.
[3] K. Tuyls, G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine 33 (2012) 41–52.
[4] L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 38 (2008) 156–172.
[5] A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning, Springer, 2012, pp. 441–470.
[6] L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, Autonomous Agents and Multi-Agent Systems 11 (2005).
[7] L. Matignon, G. J. Laurent, N. Le Fort-Piat, Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowledge Engineering Review 27 (2012) 1–31.
[8] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary Dynamics of Multi-Agent Learning: A Survey., Journal of Artificial Intelligence Research 53 (2015) 659–697.
[9] P. Hernandez-Leal, M. Kaisers, T. Baarslag, E. Munoz de Cote, A Survey of Learning in Multiagent Environments - Dealing with Non-Stationarity (2017). arXiv:1707.09183.
[10] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A comprehensive survey and open problems, Artificial Intelligence 258 (2018) 66–95.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[13] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356 (2017) 508–513.
[14] OpenAI Five, https://blog.openai.com/openai-five, 2018. [Online; accessed 7-September-2018].
[15] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning (2017). arXiv:1708.05866v2.
[16] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, volume 135, MIT press Cambridge, 1998.
[17] J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003) 1039–1069.
[18] M. Bowling, Convergence and no-regret in multiagent learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2004, pp. 209–216.
[19] J. Heinrich, D. Silver, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016). arXiv:1603.01121
[20] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent Complexity via Multi-Agent Competition., in: International Conference on Machine Learning, 2018.
[21] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in first-person multiplayer games with population based deep reinforcement learning (2018).
[22] C. Amato, F. A. Oliehoek, et al., Scalable planning and learning for multiagent POMDPs., in: AAAI, 2015, pp. 1995–2002.
[23] G. Best, O. M. Cliff, T. Patten, R. R. Mettu, R. Fitch, Dec-MCTS: Decentralized planning for multi-robot active perception, The International Journal of Robotics Research (2018).
[24] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, J. Bruna, Pommerman: A Multi-Agent Playground (2018). arXiv:1809.07124.
On October 9, 2018, we threw a party to mark the official launch of our new Borealis AI Montreal research centre and to celebrate our $1M collaboration with the Canadian Institute for Advanced Research (CIFAR) on responsible AI research and initiatives.
Click on the gallery below to see pics from the event.
Today, the RBC Foundation announced it will donate $1 million over three years to the Canadian Institute for Advanced Research (CIFAR). The gift will support research and initiatives aimed at furthering the study of ethical artificial intelligence (AI) practices.
RBC CEO, Dave McKay, made the announcement, in conjunction with CIFAR president and CEO, Dr. Alan Bernstein, at the official launch of Borealis AI’s Montreal research centre. CIFAR is currently leading the country’s Pan-Canadian Artificial Intelligence Strategy.
Borealis AI will collaborate closely with CIFAR in an advisory capacity on key aspects of the strategy, with a particular focus on global thought leadership around the ethical implications of AI advancements.
The investment will help fund initiatives like CIFAR’s Catalyst Grants, which award up to $100,000 per year for two years to support collaborations in novel areas of AI exploration between researchers at any Canadian institution. Two of these awards will be explicitly focused on research in areas like privacy, accountability, transparency and bias in machine learning.
The money will also go toward delivering interdisciplinary AI research workshops in fields such as transportation, environmental science, public health, and energy, while the remainder will support ongoing training opportunities in the social implications of AI, including equity, diversity and inclusion.
Ethical AI is no longer a peripheral topic within the community. As scientific successes in fields like deep learning, computer vision and natural language processing continue to grow, the broad-ranging social impact of these technologies cannot be divorced from their applications and need to be researched with the same academic rigour.
RBC and Borealis AI share a sense of responsibility with the diverse research communities dedicated to developing responsible technologies. With CIFAR as a partner, we are confident we’ll exceed these goals.
A blank canvas, a stunning city, a top-notch team, and an organization on the move. Put it all together and you get Borealis AI’s striking new Montreal research centre, which officially opened its tunnel, er, doors this week.
RBC CEO Dave McKay kicked off the official launch this morning, reinforcing the bank’s commitment to supporting the AI ecosystem through collaboration with Canada’s leading research institutions. He was joined onstage by Nadine Renaud-Tinker, president of RBC in Quebec, and Borealis AI co-founder and head, Foteini Agrafioti, who shared their excitement about expanding the organization’s network of research centres into a city that has shown such dynamic leadership in the field.
Dr. Alan Bernstein, president and CEO of the Canadian Institute for Advanced Research (CIFAR), followed to announce the RBC Foundation’s $1-million investment in the organization’s Pan-Canadian Artificial Intelligence Strategy.
Our new 40-person research centre is located at O Mile-Ex, a former textile factory that is becoming the de facto AI industrial research hub of Montreal. We share easy access to some of the neighbourhood’s best coffee with our new neighbours Element AI, MILA, Thales, and IVADO.
With our Toronto research centre scooping up design awards and getting full-length features in Toronto Life, our Montreal design team had a tough act to follow.
Thankfully, at 6,500 square feet and with no building restrictions, there was plenty of space for us to play. That’s why one of the first features we added was a hockey motif in one of our conference rooms. It’s a nod to the city’s deep hockey heritage and a show of good sportsmanship from the Toronto lab for acknowledging the Canadiens.
A mini soccer pitch sits smack in the middle of the hall for team members to let off a little steam.
And a cinema meeting room pays homage to Quebec’s thriving film industry and specifically to the iconic Snowdon Theatre during its glory days.
The primary theme, however, was the Montreal Metro. Visitors walk through a tiled tunnel throughway to reach the front hall.
Hang a left and you arrive at a meeting room with swinging egg chairs. On the right, a chill out room with grass and some cozy bean bag chairs.
Groceries get delivered each Monday to the kitchen with its intricate tiled mosaic floor. This is yet another nod to Montreal – this time to its design community and the city’s overall exquisite attention to detail.
Our living room is like being in a park, albeit a park with a large garage door. The bright, open, window-filled space offers a panorama view for lectures, presentations, and meetings. These are all overseen by a gigantic balloon guard dog who monitors the proceedings.
Exceptional research deserves an exceptional space. And if you think this is great, just wait until you see what we have in store for Waterloo, Vancouver, and Edmonton.
We took Prof. Taylor for coffee near his old east-end Toronto stomping grounds to hear his thoughts on how deep learning has evolved, whether any pre-2012 techniques still yield promising results, where Software 2.0 fits into the future and how much of a deep-domain expert you have to be in order to give truly valuable advice to businesses.
While it is tempting to now adopt a “one-size-fits-all” mantra, this would be a prematurely limiting methodology. Recently, a whole spate of machine learning models have emerged that involve more than a single objective; rather, they involve multiple objectives which all interact with each other during training. The most prominent model, of course, is Generative Adversarial Networks (GANs). However, other examples include synthetic gradients, proximal-gradient TD learning, and intrinsic curiosity. The appropriate way to think about these types of problems is to interpret them as game-theoretic problems, where one aims to find the Nash equilibrium rather than local minima of each objective. Intuitively speaking, a Nash equilibrium occurs when each player knows the strategy of all the other players, and no player has anything to gain by changing his or her own strategy.
Unfortunately, finding Nash equilibria in games is notoriously difficult. In fact, theoretical computer scientists have long known that finding a Nash equilibrium of a general game is an intractable problem. Nor is it ideal to naively apply gradient descent to games. Firstly, gradient descent has no convergence guarantees and, even in cases where it does converge, it may be highly unstable and slow. But the most severe drawback is that, unlike the traditional setup in supervised machine learning, there is no single objective involved, which means we have no way of measuring any kind of progress.
We can illustrate the complexity of interacting losses with a very simple two-player game example. Consider Player 1 and Player 2 with the respective loss functions

l_{1}(x, y) = xy and l_{2}(x, y) = -xy.

In this game, Player 1 controls the variable x and Player 2 controls the variable y. The dynamics (or simultaneous gradient) is given by

ξ = (∂l_{1}/∂x, ∂l_{2}/∂y) = (y, -x).
If we plot this in the 2D Cartesian plane, the vector field cycles around the origin and no direction points straight to it. The technical problem here is that the dynamics ξ are not a gradient vector field; in other words, there is no function φ such that ▽φ = ξ.
In the ICML paper, The Mechanics of n-Player Differentiable Games [1], the authors use insights from Hamiltonian mechanics to tackle the problem of finding game equilibria. Hamiltonian mechanics is a reformulation of classical mechanics in the following way. Consider a particle moving in the Euclidean space R^{n}. The state of the system at a given time t is determined by the coordinates of the position (q_{1},…,q_{n}) and the coordinates of the momentum (p_{1},…,p_{n}). The space R^{2n} of positions and momenta is called the “phase space”. The “Hamiltonian” H(q,p) is a function on this phase space and it represents the total energy of the system. Hamilton’s equations (also referred to as “equations of motion”) describe the time evolution of the state of the system. These are given by

dq_{i}/dt = ∂H/∂p_{i},   dp_{i}/dt = -∂H/∂q_{i}.
We see how all of these formulations play out in our simple example. If we define the Hamiltonian H(x,y) here to be

H(x,y) = ^{1}/_{2}(x^{2} + y^{2}),

the gradient is then ▽H = (x, y). There are two critical observations to be made here: (1) conservation of energy: the level sets of H are conserved by the dynamics ξ = (y, -x) (hence, ξ cycles around the origin); and (2) gradient descent on the Hamiltonian, rather than the simultaneous gradient on the losses, converges to the origin.
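These two behaviors are easy to check numerically. The sketch below assumes the bilinear losses l1 = x*y and l2 = -x*y: descending the simultaneous gradient spirals around (and, with discrete steps, away from) the origin, while descending the Hamiltonian H = (x² + y²)/2 converges to it:

```python
def simultaneous_gradient(x, y):
    # xi = (dl1/dx, dl2/dy) = (y, -x)
    return y, -x

def run(grad_fn, x, y, lr=0.1, steps=200):
    # Plain gradient descent on whichever vector field grad_fn returns.
    for _ in range(steps):
        gx, gy = grad_fn(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# (1) Descending the simultaneous gradient: discrete steps spiral
# outward around the origin, so the norm actually grows.
x1, y1 = run(simultaneous_gradient, 1.0, 1.0)

# (2) Descending the Hamiltonian instead: grad H = (x, y).
x2, y2 = run(lambda x, y: (x, y), 1.0, 1.0)

print((x1 ** 2 + y1 ** 2) ** 0.5)  # grows: the cycling dynamics diverge
print((x2 ** 2 + y2 ** 2) ** 0.5)  # near zero: Hamiltonian descent converges
```

Each step of the simultaneous-gradient update multiplies the distance to the origin by √(1 + lr²) > 1, which is why the first trajectory drifts outward rather than merely cycling.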
Motivated by this philosophy, the authors in [1] introduce the notion of Hamiltonian games. For an n-player game with parameters w, they define the Hessian of the game to be the Jacobian of the dynamics,

H(w) = ▽ξ(w).
Since this is a square matrix, it always admits a decomposition into symmetric and anti-symmetric components:

H(w) = S(w) + A(w), where S(w) = ^{1}/_{2}(H(w) + H(w)^{T}) and A(w) = ^{1}/_{2}(H(w) - H(w)^{T}).
This leads to a classification: Hamiltonian games are defined as games where the symmetric component is zero, S(w) = 0. Potential games are defined as games where the anti-symmetric component is zero, A(w) = 0. Going back to our simple example, the Hessian is

[[0, 1], [-1, 0]],

which is anti-symmetric, so we have a Hamiltonian game. One of the main theoretical contributions of this paper is that, given an n-player Hamiltonian game with H(w) = ^{1}/_{2}‖ξ(w)‖^{2}, under some conditions gradient descent on H converges to a Nash equilibrium.
Another central contribution made by the authors is the proposal of a new algorithm to find stable fixed points (which, under some conditions, can be considered Nash equilibria). Their Symplectic Gradient Adjustment (SGA) adjusts the game dynamics to

ξ_{λ} = ξ + λ A^{T}ξ,

where λ > 0 controls the strength of the adjustment.
For a potential game, where A(w) = 0, SGA performs the usual gradient descent, finding a local minimum. In contrast, for a Hamiltonian game, where S(w) = 0, SGA finds a local Nash equilibrium. Readers fluent in differential geometry can immediately see the reasoning behind the terminology “symplectic”. For a Hamiltonian game, H(w) is a Hamiltonian function and the gradient of the Hamiltonian is ▽H = A^{T}ξ. The dynamics ξ form a Hamiltonian vector field, since they conserve the level-sets of the Hamiltonian H. In symplectic geometry, the relationship between the symplectic form ω and the Hamiltonian vector field ξ is

ω(ξ, ·) = dH.
The right-hand side of this equation is simply the gradient of our Hamiltonian function. In the context of Hamiltonian games, we see that the antisymmetric matrix A is playing the role of the symplectic form ω, which justifies the terminology.
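SGA can also be checked numerically on the same assumed bilinear game (l1 = x*y, l2 = -x*y). There the game Hessian is purely anti-symmetric, A = [[0, 1], [-1, 0]], so A^Tξ = (x, y) and the adjusted dynamics pull the iterates toward the origin, the game's Nash equilibrium:

```python
def sga_step(x, y, lr=0.1, lam=1.0):
    # Simultaneous gradient of the assumed bilinear game.
    xi = (y, -x)
    # A^T xi with A = [[0, 1], [-1, 0]], so A^T = [[0, -1], [1, 0]]:
    # A^T xi = (x, y), which for this game equals grad H.
    at_xi = (x, y)
    # SGA: descend the adjusted dynamics xi + lam * A^T xi.
    gx = xi[0] + lam * at_xi[0]
    gy = xi[1] + lam * at_xi[1]
    return x - lr * gx, y - lr * gy

x, y = 1.0, 1.0
for _ in range(300):
    x, y = sga_step(x, y)

print((x ** 2 + y ** 2) ** 0.5)  # distance to the origin shrinks to ~0
```

Unadjusted gradient descent on these losses spirals away from the origin; the symplectic adjustment adds exactly the component (here, ▽H) needed to make the iteration contract.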
In the experimental section, the authors compared their SGA method to other recently proposed algorithms for finding stable fixed points in GANs. Moreover, to demonstrate the flexibility of their algorithm, the authors also studied the performance of SGA on general two-player and four-player games. In all cases, SGA was competitive with, if not better than, existing algorithms.
As a summary, this paper provides a glimpse of how a specific class of general game problems may be tackled by borrowing tools from mathematics and physics. Machine learning models featuring multiple interacting losses are becoming increasingly popular. As such, it is necessary for us to come up with new methodologies rather than relying on the crutches of standard gradient descent. Unraveling the mysteries behind these complicated models will have considerable practical impact as it will aid the design of better scalable algorithms in the future.
[1] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
CoT: Cooperative Training for Generative Modeling
For tasks that involve generating natural language, a common practice is to train the model in teacher forcing mode. This means that during training, the model is always asked to predict the next word given the previous ground truth words as its input. However, at test time the model is expected to generate the next word based on its previously generated words. As a result, the mistakes the model has made along the way can quickly accumulate as it has never been exposed to its own errors during training. This phenomenon is known as exposure bias.
These models are predominantly trained via Maximum Likelihood Estimation (MLE), which may not correspond to perceived quality of the generated text. Maximizing the likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the real data distribution P and the estimated model distribution G. However, the KL divergence is asymmetric and has well known limitations when used for training. Thus, this paper proposes to optimize the Jensen-Shannon divergence (JSD) between P and G instead.
The JSD requires an intermediate distribution which is a mixture between the true distribution P and the model distribution G. In this paper, the authors suggest the use of a mediator which approximates this intermediate distribution. The generative model is then trained to minimize the estimated JSD provided by the mediator. This results in an iterative algorithm that alternates between updating the mediator and the generative model. It can also be viewed as a cooperative objective between the mediator and the generator to maximize the expected log likelihood for both P and G, which gives rise to the paper’s name Cooperative Training (CoT).
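The quantity being optimized is easy to see on toy discrete distributions. The sketch below (illustrative distributions, not from the paper) computes the JSD through exactly the mixture that the mediator is trained to approximate:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, g):
    # The mediator's target: the mixture M = (P + G) / 2.
    m = [0.5 * (pi + gi) for pi, gi in zip(p, g)]
    # JSD(P, G) = 0.5 * KL(P || M) + 0.5 * KL(G || M).
    return 0.5 * kl(p, m) + 0.5 * kl(g, m)

P = [0.7, 0.2, 0.1]  # "data" distribution (illustrative)
G = [0.4, 0.4, 0.2]  # "generator" distribution (illustrative)

print(jsd(P, G))                    # positive when G differs from P
print(jsd(P, P))                    # 0.0 -- vanishes when G matches P
print(abs(jsd(P, G) - jsd(G, P)))   # 0.0 -- unlike KL, JSD is symmetric
```

The symmetry shown in the last line is the property that motivates moving from MLE (a KL objective) to a JSD objective in the first place.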
Although experiments in the paper are mostly language related, it is worth mentioning that such a strategy can, in principle, be applied to many other types of data. One caveat, however, is that applying CoT requires a factorized version of the density function, which is trivial for natural language since it can be represented using an RNN language model. For images, for instance, one could opt for a model like PixelRNN, though it may be prohibitively slow.
Paper: “This looks like that: deep learning for interpretable image recognition”
Authors: Chaofan Chen, Oscar Li, Alina Barnett, Jonathan Su, Cynthia Rudin
Presenter: Cynthia Rudin
The aptly titled paper, "This looks like that," involves learning "prototypes" that correspond to common features characterizing object categories. The output of such a classifier is the object category and a set of prototypes that closely matches the sample image. The title-motivating interpretation is that this input image is that particular category because it looks like these other canonical images from that category.
This paper interested me because it was something new; specifically, a new application category that can be demonstrated on a small dataset. Digging deeper, you can see there is some technical machinery involved in figuring out an appropriate cost function for training the parameters of the model. But the key innovation here is the new application category. It’s only recently that we've had access to this set of fully parallel computing architectures, and to me, artificial intelligence is about figuring out what to do with all these parallel computing machines. What we end up doing with them is still totally up in the air and fills me with promise for the field.
One weakness, however, is the authors’ example prototypes. The authors claim their technique mimics the type of analysis performed by a specialist in explaining why certain objects fall within certain categories. In the supplementary materials, for example, we see an automatically classified image of a bird labelled with what is clearly its beak, upper tail, and lower tail. As an analogy, the author shows a human-labelled image of a bird with similarly named components. This is the strongest example in the paper of a set of components that are "interpretable.” Their idea is that the way a human would explain an image of a bird is by pointing to different parts of the object and suggesting “this is an image of a bird because it contains these essential features of a bird.” However, the other examples given, that of a truck placed beside multiple images of vehicles, for instance, don't seem to have the same level of interpretability; they just look like other pictures of vehicles and not “prototypical parts of images from a given class” as suggested in the paper’s introduction. This is, of course, a somewhat subjective criticism: figuring out how to make this more objective seems to be an area of future research.
Overall, I found it notable that this paper only tested the algorithm on CIFAR 10 and 10 classes from CUB-200-2011 (two small datasets), which conceptually may be seen as limiting the scope, but it presented a novel application that was very well received by the audience.
Paper: Robust Physical-World Attacks on Deep Learning Visual Classification
Authors: Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, Dawn Song
Presenter: Dawn Song
In this talk, Dawn Song presented work on what she and her fellow authors call “Robust Physical Perturbations”. The perturbations in this case are part of the adversarial attacks whose purpose is to illuminate weaknesses in or outright break existing classifying techniques. In this case, Dawn showed a video of a stop sign with a type of physical alteration resembling graffiti, then demonstrated how the stop sign was subsequently misinterpreted as a speed limit sign reading "45 mph." Moreover, the stop sign was misclassified as a speed limit sign from many different angles and distances throughout the video. These errors have obvious safety implications, which I will examine below:
There was a concrete connection to a real engineering problem – the autonomous vehicle safety problem. For instance, a nefarious actor could decide to vandalize stop signs so they’d appear as real stop signs to a human observer, but they’d be classified as something that doesn’t require stopping and could cause an accident.
The authors identified another interesting constraint: that the attack had to look like what we consider “normal” graffiti so the vandalism wouldn’t appear suspicious to a human observer. If the graffiti looks fake, it’s easier for a human to recognize these real-world inconsistencies and then identify the situation as an attempted adversarial attack on the classifier, thus motivating the human to fix the sign sooner. To demonstrate this point, the authors show some pictures of their proposed adversarial attacks: one of the attacks appears to be the words LOVE-HATE applied to the stop sign that was successful in causing a misclassification. To me this example looked like real-life graffiti that is occasionally observed on stop signs.
When you dig more deeply into the paper, it’s clear that the authors address the second point by using a masking technique. In this version, the authors perturb an input image to create an adversarial example, then mask small perturbations to zero so that the end result requires only small portions of the image to be perturbed in order to construct the adversarial example (although in this case, the small portion of the image that gets perturbed is perturbed rather significantly). This method is distinguishable from some of the other adversarial techniques used where each pixel is restricted to be perturbed by only a small amount. In this case, a small fraction of pixels are perturbed by a large amount.
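The masking idea can be sketched in a few lines. This is a loose illustration of the principle, not the authors' actual attack (which optimizes the perturbation and mask jointly against the classifier): compute a dense perturbation, then zero out all but the largest-magnitude entries so only a small region is perturbed, but perturbed by a large amount.

```python
def mask_perturbation(delta, keep_fraction):
    """Keep only the largest-magnitude fraction of perturbation entries."""
    magnitudes = sorted((abs(v) for v in delta), reverse=True)
    k = max(1, int(len(delta) * keep_fraction))
    threshold = magnitudes[k - 1]
    # Entries below the threshold are masked to zero.
    return [v if abs(v) >= threshold else 0.0 for v in delta]

# A hypothetical dense perturbation over 10 "pixels" (made-up numbers):
dense = [0.01, -0.02, 0.9, 0.015, -0.8, 0.005, 0.01, -0.01, 0.02, 0.7]
sparse = mask_perturbation(dense, keep_fraction=0.3)

print(sparse)
print(sum(1 for v in sparse if v != 0.0))  # 3 entries survive
```

Only 3 of 10 entries survive, and those entries are large, which is the opposite regime from L-infinity-style attacks, where every pixel moves a little.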
There are several rules of thumb as well as questions I can draw from these adversarial examples:
If an attacker is aware of the internal parameters of the model being attacked, then it is much easier to construct an adversarial example. The authors assume the internal parameters are known, and this is probably reasonable because of the possibility of black box extraction attacks; that is, a set of techniques like those introduced by Ristenpart et al. that allow an adversary to estimate the weights of a neural network by just performing black box queries of the model.
You should ground the analysis in real-world costs and benefits. This way, the constrained problem of stop sign detection makes the task more real, and perhaps more informative.
We lack the language to really discuss "interpretability" and "robustness to adversarial attack" so this seems like an interesting open area for research exploration. However, the papers I've noted here are good examples of how this can be done.
The fact that the “masking technique” to make the adversarial attack resemble typical graffiti worked is notable: it suggests that if we constrain what the attack is to look like, at least somewhat, we can still create successful attacks. It seems like there could be many more ways to make adversarial attacks beyond the existing techniques.
Paper: Intelligence per Kilowatthour
Author: Max Welling
Presenter: Max Welling
In this talk, Prof. Welling demonstrated a pruning algorithm with well-tested robustness. The key idea with pruning algorithms is that you can start with a densely connected neural network, then iteratively prune some of the edges and retrain the network. Sometimes compression rates of 99.5% can be obtained.
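The inner pruning step can be sketched as simple magnitude pruning. This is an illustrative version, not the specific algorithm from the talk: zero the smallest-magnitude weights, then one would retrain and repeat, with high compression rates coming from iterating this loop.

```python
def prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    n_zero = int(len(weights) * sparsity)
    # Indices of weights, ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned

# A hypothetical layer's weights (made-up numbers):
w = [0.5, -0.01, 0.003, 0.8, -0.002, 0.2, 0.004, -0.9]
w = prune(w, sparsity=0.5)

print(w)  # the four smallest-magnitude weights are now zero
print(sum(1 for v in w if v == 0.0) / len(w))  # fraction pruned: 0.5
```

Zeroed weights need not be stored or multiplied at inference time, which is where the energy savings discussed below come from.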
I think the most important point in the talk comes from the slide above: "AI algorithms will be measured by the amount of intelligence they provide per kilowatthour." As the amount of data balloons, and machine learning algorithms use increasingly large numbers of parameters, we will be forced to measure our AI algorithms per unit energy. If not, I suspect many tasks on the frontiers of artificial intelligence will simply be infeasible.
Taking inspiration from Max's talk, I noted that one way of optimizing this metric is by decreasing energy consumption while maintaining performance on a previously identified task. But another way of optimizing this is by finding new ways to measure exactly how intelligent these systems are. I suspect that approaches like that in Cynthia Rudin's talk discussing "this looks like that" may be a way of increasing the "intelligence" part of the equation. Finding ways to energy-efficiently create stop sign detectors that are robust to adversarial graffiti-style attacks like those discussed by Dawn Song could be another way to increase the intelligence of these models.
Movies are one of life’s most pleasurable escapes, an artform that has dominated our cultural experience for over a century. But underneath the costumes, emoted lines and beautiful settings are the mechanics of technology, without which there would be no cinema.
Sanja Fidler, assistant professor at University of Toronto and head of NVIDIA’s new Toronto AI research lab, has made a name, in part, by marrying her expertise in computer vision and natural language processing to fascinating – and even funny – machine learning applications. She sat down with Northern Frontier at the Revue Cinema in Toronto’s west end to explore the link between this core science and popular art form through her research, how AI can enhance human creativity, and how this new frontier in tech could influence a new era of moviemaking.