We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural nets trained on MNIST and CIFAR10 variants achieve comparable clean accuracies, but they exhibit very different robustness when adversarially trained. This counter-intuitive phenomenon suggests that the input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss practical implications for evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
On Friday, March 1, we hosted a reception in honour of our 2018-2019 Borealis AI Fellowship winners. Our fellows span the country from universities across British Columbia, Alberta, Ontario and Quebec. We were thrilled that so many flew in to attend our ceremony, which gave us a chance to congratulate our new fellows in person and share the research we’re doing at Borealis AI.
Most importantly, the event provided a space for everyone to connect with each other. A big motivation for our fellowship awards is our ongoing commitment to supporting students across Canada. A foundational aspect of this support is nurturing networks where researchers can exchange ideas, form fruitful partnerships, and continue our national leadership in the AI space.
So much of our success in the field stems from Canada’s supportive academic culture. It’s just who we are. For every brilliant graduate, there are a host of advisors and mentors who drew out the best in them along a challenging and competitive path.
True to form, Professors Ioannis Mitliagkas, Reihaneh Rabbany, and Jackie Cheung came by to support the winners. “We’ve been a pioneer in machine learning and AI, largely due to smart investments by public funding organizations. It bodes well for the future to see industry join in this effort to help sustain our research community,” said Prof. Mitliagkas, who came to support his new graduate student, Alexia Jolicoeur-Martineau.
Alexia, a statistician by training, was recently accepted to study at Mila on the strength of her independent work and has been gracious about her achievements. “This [fellowship] will bring me a peace of mind, so that I can fully focus on my research,” she said. “This makes me very confident about the future!”
Gauthier Gidel, whose research proposal on Efficient Saddle-Point Optimization for Modern Machine Learning impressed the adjudication committee, explained why championing the work of young researchers matters: “It makes me feel like my research matters, that I’m moving in the right direction,” he said. “To me, winning this award is a sort of peer recognition.”
After a tour of our research centre, a catered lunch and an opportunity to meet the Borealis AI Montreal research team, we ended on a high note, with some of our guests hanging out at the lab well into the afternoon. With new friendships forged, we look forward to seeing what direction our outstanding fellows bring to Canada’s research community.
Borealis AI is thrilled to announce our 2018-2019 Graduate Fellowship winners. The fellowships were awarded to exceptional students pursuing graduate-level studies in machine learning and artificial intelligence at top universities across Canada.
Each of our winners demonstrated outstanding research capabilities, provided strong references, and outlined a clear, thoughtful research focus for the current academic year. We were overwhelmed by the exceptional calibre of this year’s candidates and our adjudication committee had no easy task selecting the 10 finalists.
We’re proud to introduce our inaugural group (below) and look forward to meeting everyone in person on March 1 as we host an event in their honour in Montreal.
School: University of Alberta, Amii, PhD candidate
Research Interests: Reinforcement learning
Research Topic: The Predictive Approach to Knowledge
School: Université de Montréal, Mila, PhD candidate
Research Areas: Deep generative modelling, computational statistics
Research Topic: Understanding, improving, and extending GANs
School: McGill University, Mila/RLLab, PhD candidate
Research Interests: Language and interaction, reinforcement learning
Research Topic: Emergent Communication and Representation Learning
School: University of British Columbia, PhD candidate
Research Interests: ML on knowledge graphs
Research Topic: Improved Knowledge Graph Embedding Using Ontology, Time, and Higher Arity Relations
School: Université de Montréal, Mila, PhD candidate
Research Interest: Optimization, multi-agent learning
Research Topic: Efficient Saddle-Point Optimization for Modern Machine Learning
School: University of British Columbia, MSc candidate
Research Interests: Discrete and continuous optimization, theoretical ML, ML and data mining, design and analysis of algorithms, computational neuroscience, computational biology
Research Topic: Non-homogenous Stochastic Gradient Descent
School: University of Toronto, Vector Institute, PhD candidate
Research Interests: Optimization, regularization, Bayesian neural networks, generative models
Research Topic: Online Hyperparameter Adaptation for Improved Training and Generalization
School: University of Waterloo, PhD candidate
Research Interests: Deep generative models, mixture models, online learning, sum-product networks, optimization
Research Topic: Deep Homogenous Mixture Models: Representation, Separation
School: McGill University, Mila, PhD candidate
Research Interest: Deep reinforcement learning
Research Topic: Unifying Imitation and RL for Data-Efficient Learning
School: University of Toronto, Vector Institute, PhD candidate
Research Areas: Bayesian deep learning, generalization
Research Topic: Reliable Uncertainty Estimation in Bayesian Neural Networks
Last month, we had the chance to attend the 2019 Association for the Advancement of Artificial Intelligence (AAAI) conference in Honolulu. As one of the longest-standing conferences in the field, AAAI distinguishes itself from most leading ML events in that it focuses on far more than just deep learning. This breadth of topics encourages a vigorous interaction between some of the more classical methods in the field and a few of the modern ones. In turn, knowledge about these classical ideas can be used to inform and develop better ML approaches. Below, we’ve compiled a few of our favorite moments and stand-out papers, some of which clearly show the interplay between these two worlds.
AutoML proved to be one of the hot topics at this year’s conference, and one of the more intriguing papers on the topic was Automatic Bayesian Density Analysis by Antonio Vergari (Max Planck Institute for Intelligent Systems); Alejandro Molina (TU Darmstadt); Robert Peharz (University of Cambridge); Zoubin Ghahramani (University of Cambridge); Kristian Kersting (TU Darmstadt); Isabel Valera (MPI-IS). Inspired by advances in automatic model selection in supervised learning, the paper proposed an automated framework for tackling the problem of density estimation in unsupervised learning. A challenge here is that this problem often requires domain expertise in the field the data comes from. Instead of going down this limiting path, the authors suggested learning a sum-product network – an architecture initially proposed by Hoifung Poon and Pedro Domingos at UAI 2011 – to model the data.
The sum and product nodes of the network allowed the authors to capture the data’s feature behaviour as a mixture of heterogeneous distributions. In other words, rather than relying solely on a mixture of Gaussians for a particular feature, the authors allowed the same mixture to contain discrete and continuous distributions and different types of parameterizations drawn from a preset dictionary. Once learned, the structure of the SPN allowed for efficient inference using Gibbs sampling to infer missing feature values for some of the data points in the dataset. The model also provides information about how well data points fit the mixtures and highlights which points are likely to be outliers.
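To make the heterogeneous-mixture idea concrete, here is a minimal sketch (our own toy illustration, not the paper's implementation) of a single feature modelled as a mixture of one continuous and one discrete component; all weights and parameter values are made up:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a univariate Gaussian N(mu, sigma^2).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def poisson_pmf(k, lam):
    # Probability mass of a Poisson(lam) at non-negative integer k.
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mixture_likelihood(x, weights=(0.6, 0.4), mu=2.0, sigma=1.0, lam=3.0):
    # Likelihood under a mixture of one Gaussian and one Poisson component.
    # The discrete (Poisson) term only contributes at non-negative integers.
    p = weights[0] * gaussian_pdf(x, mu, sigma)
    if float(x).is_integer() and x >= 0:
        p += weights[1] * poisson_pmf(int(x), lam)
    return p

print(mixture_likelihood(2.0))   # integer-valued input: both components contribute
print(mixture_likelihood(2.5))   # non-integer input: Gaussian component only
```

A full SPN composes many such leaves with sum and product nodes; the point here is only that a single mixture can blend distributions of different types.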
Automated model selection and analysis is a widely emergent approach at ML and AI conferences, and seeing it applied to the age-old statistical problem of density estimation is a prime example of new approaches being used to tackle long-standing classical problems.
One of the most engaging panels at the conference took the form of a debate on the “Future of AI.” Here, one team, consisting of Peter Stone and Jennifer Neville, argued for the proposition that “The AI community today should continue to focus mostly on ML methods,” while the opposing team, Michael Littman and Oren Etzioni, argued against it.
The panel was hilarious, entertaining and informative. However, the audience rightly pointed out that all the debaters were, in fact, ML experts, which made the contrarian position somewhat of an argument for the sake of debate and suggested that perhaps a bias toward ML is mostly inevitable. One good argument made during the panel was that too strong an ML focus, like the one happening now, will create a bias in current students who, due to industry demands, haven’t learned classical AI and non-ML topics as well as previous graduates did. The challenge will arise when the trend changes from ML to the next big thing, at which point we might end up with an insufficient number of experts in non-ML areas of AI.
Post-debate voting showed that the audience still sided against the proposition that the community should focus mostly on ML research.
One dominant theme throughout the conference was the development and usage of neural networks to address problems involving graphs. Central to this theme was William Hamilton and Jian Tang’s tutorial on Graph Representation Learning. Here, the presenters outlined current advances in Graph Neural Networks (GNNs) and also presented models capable of generating graphs. Example application domains for these approaches included operations research and biology, where the data doesn't lend itself well to traditional architectures. An example of this less-than-ideal fit is convolutional networks, which were designed mainly for vision applications with the image structure in mind.
Of particular note was a paper presented in the Search, Constraint Satisfaction and Optimization session, entitled Learning to Solve NP-Complete Problems - A Graph Neural Network for Decision TSP by Marcelo Prates (Federal University of Rio Grande do Sul); Pedro H C Avelar (Federal University of Rio Grande do Sul); Henrique Lemos (Federal University of Rio Grande do Sul); Luís Lamb (Federal University of Rio Grande do Sul); Moshe Vardi (Rice University). In the paper, the authors proposed using a GNN to tackle the “traveling salesman” problem. They introduced this approach as an extension of Learning a SAT Solver from Single-Bit Supervision by Selsam et al., whose authors used a message-passing NN approach to solve SAT problems.
Unlike that problem, however, the traveling salesman problem (TSP) represents a class of NP-complete problems with weighted relations between the vertices of its graph. To handle this, the authors mapped the weighted graph to a GNN by also allowing for embeddings on the edge weights, rather than just on the vertices. The model was then trained as a classification problem by giving it a constructed graph and asking it the question: “Does a Hamiltonian path of length X or less exist in this graph?”
For each constructed graph, two training examples were fed to the model: one with cost (1-dev)X* and the other with cost (1+dev)X*, where X* is the known minimal length and dev is the user-desired deviation from this optimal length. In the training process, the correct labels for these two instances are “NO” and “YES”, respectively. In addition to extending GNN usage beyond binary SAT problems, this approach is of particular importance because the TSP, along with the rest of Karp’s 21 NP-complete problems, frequently reappears as a reduction of everyday ML problems. Using GNNs – a modern approach – to solve NP-complete problems from the classical computer science literature is another example of how old met new at AAAI.
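As a rough sketch of how the two training examples per graph might be constructed (the function names and the brute-force solver below are our own illustration; the paper trains a GNN rather than enumerating tours):

```python
import itertools
import random

def tsp_optimal_length(dist):
    """Brute-force optimal tour length X* for a small complete graph."""
    n = len(dist)
    best = float("inf")
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        best = min(best, cost)
    return best

def make_training_pair(dist, dev=0.02):
    """Return two (graph, target_cost, label) examples around the optimum X*."""
    x_star = tsp_optimal_length(dist)
    return [
        (dist, (1 - dev) * x_star, "NO"),   # no tour this short exists
        (dist, (1 + dev) * x_star, "YES"),  # a tour this short does exist
    ]

# Build a small random symmetric distance matrix for illustration.
random.seed(0)
n = 6
dist = [[0 if i == j else random.uniform(1, 10) for j in range(n)] for i in range(n)]
dist = [[max(dist[i][j], dist[j][i]) for j in range(n)] for i in range(n)]  # symmetrize
for _, cost, label in make_training_pair(dist):
    print(f"target={cost:.3f} label={label}")
```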
Overall, AAAI 2019 showed that while we are going through current research explosions in DL and RL, classical topics (along with the faculty who have worked on them for decades!) are still alive, well and reincarnating themselves through these new channels. A welcome sight and a sign of what's to come.
Borealis AI has big goals for 2019 and exceeding those goals requires exceptional leadership. Just two weeks into the new year, we’re well on our way with the addition of Dr. Kathryn Hume, who joins our team today as Director of Business Development. In her new role, Kathryn will oversee the application of our academic research within the bank, help inform our strategy and also tap into her broad experience to assist with driving Borealis AI's brand profile among key audiences.
Kathryn brings an unusually rich and varied background to the AI field. In addition to holding prior leadership positions at Integrate AI and Fast Forward Labs (Cloudera), she’s a prolific speaker and author on AI, has mastered seven languages and holds a PhD in Comparative Literature from Stanford. In her spare time, she teaches courses on enterprise adoption of AI, law and ethics at Harvard, MIT, Stanford and University of Calgary (just kidding, she doesn’t have any spare time).
As this introduction barely scratches the surface, we thought it would help to let Kathryn do the talking.
People underestimate how much of our lives are touched by banking as the substrate for the entire economy. Banking has macro impact—with risk management undergirding international market stability—and micro impact, where we all entrust banks with our financial assets to support our daily needs, like food, and our life aspirations, like education.
Now with AI, we’re able to use rich data that’s far more relevant to banking. People may not be aware that one of the first production deep learning applications was the use of computer vision to automatically recognize handwritten digits on cheques. This had been a rate limiter for the ATM and now, just a few years later, a customer can easily insert up to 50 cheques at a time and have the denominations read, analyzed and deposited within seconds. And now that we can also recognize and generate speech, what else might we do? What could payments look like? I’m most interested in AI applications like this where the tech hides behind the scenes but makes our lives so much easier.
First off, I really love the team and culture. I find there’s a mixture of curiosity and pure research talent. I also love that it’s a culture grounded in integrity. Everybody here takes the time to mean what they say and that’s very important to me. Apart from the culture, it’s exciting to return to my academic roots while pursuing my long-term career ambition, which is to be at the forefront of early commercialization of academic and scientific research.
There are a lot of existing applications the team has already built and I’m excited to bring them out of the lab and into production across the bank. I’m also looking forward to solidifying relationships with our academic partners and to using the success of Borealis AI as an example of how academia and business can work together to bridge the gap between both worlds.
My approach to responsible AI comes from a firm belief that ethics occurs in the trenches. There are, obviously, aspects of ethics that tackle large questions about AI’s impact on society, but I think the rubber really hits the road when a group of people collaborating to build a machine learning system have come together from different departments to make a series of tactical choices together. I’m excited to put this into practice here at Borealis AI. What better place to be impacting the future of responsible AI than in one of the world’s largest banks?
I was working in New York in early 2017 and it was common knowledge in the American machine learning research community that Canada was the place to be. The Vector Institute had just been established in Toronto and it was interesting to observe this experiment in building a commercialization leg from a university research department. I originally moved here to join a company called Integrate AI. What’s kept me in Toronto is the excitement of working in an ecosystem that feels similar to what Silicon Valley was like 15 years ago. There are new companies popping up everywhere and I sense the right energy flowing between groups in academia, policy, government and business. It’s a unique place in time to be. I also love Amii (in Edmonton) and Mila (in Montreal). What’s going on in the Canadian ecosystem is just amazing to behold.
I got my PhD in Comparative Literature, but I actually have a strong math and science background. In fact, my dissertation is about the use of habit (or repetitive action) as a technique to generate knowledge in 17th century mathematics, philosophy and literature. I’ve come to believe since then that I inadvertently wrote a history of supervised learning through this work. Supervised learning is an AI technique that starts with a set of labeled training examples. For example, we teach an algorithm to adequately identify that a picture of a cat is a cat by giving the images a “cat” label, then training the system over time. The “supervised learning” I wrote about in my thesis pertains to human self-transformation: that if we want to become a different type of person, we have to think a certain way, then practice those thoughts so we don’t default to our old habits.
Years ago, I gave a talk about why my background as an intellectual historian of math and philosophy actually makes me a great product marketer. My work doesn’t ask whether philosophers like Descartes or Leibniz or Newton were “right”; rather, it asks what did they think they were thinking? So, my task was to read everything they’d read and try to reconstruct what they thought so as to reinterpret what they were saying. It’s an excellent skill set for someone in business development because when you’re working as a translator between academic machine learning researchers and businesspeople, you have to do that work on both sides. How do the researchers think? What are they reading? How do they use language to express their point of view? Similarly, how do the bankers in the various divisions of the bank think? What do they read? How do they see the world? And, most importantly, can we make those two points meet at the intersection? These are the unique translation skills my background has provided, and I’ve seen it unfold to great effect in the boardroom. I’m really looking forward to adapting it to this next chapter of my career at Borealis AI.
Whether you’re training a model on an enormous dataset with an industrial scale server or deploying a small model for a cellphone application, energy is often a fundamental bottleneck. I suspect algorithmic innovations that provide greater energy efficiency will be necessary to push forward the next frontier of machine learning. It’s with this mindset (and with my own paper in tow) that I attended this workshop.
In a previous blog post, I discussed Max Welling’s Intelligence per Kilowatthour paper, which he presented at the ICML conference in Stockholm last summer. Machine learning models are helping to solve increasingly difficult tasks, but as a natural result of scale, some of the models are getting enormous. Often, our community creates models that work just for the particular task at hand; but if these techniques are to be widely deployable, we must work to decrease the energy these models consume. Because of this, Prof. Welling argued that machine learning should be judged by intelligence per unit of energy. The CDNNRIA workshop I attended seemed like a natural response to Prof. Welling’s ICML presentation: not only was he one of the workshop organizers, but the focus was on compact (i.e. more energy-efficient) neural networks.
Since I’m personally quite passionate about this topic, I’ve spent time exploring it in various forms of scientific inquiry. My workshop paper, On Learning Wire-Length Efficient Neural Networks, which I worked on with co-authors Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, and Marcus Brubaker, attempted to tackle an aspect of this important topic. I was honoured to present it. In this post, I will summarize our paper, highlight other interesting papers that relate to the subject, present the results of an experiment inspired by some of the additional workshop papers, then draw an emergent lesson from the workshop about the value of negative results.
A classic paper in the field, Optimal Brain Damage, first introduced the basic training and pruning pipeline. The standard technique for creating energy-efficient neural networks involves assuming some initial architecture, initializing the weight and bias values of the network, then modifying those parameters so that the network closely fits the training data. This step is called "training". The next step – “pruning” – involves deleting the edges of the network that are somehow deemed "unimportant," then re-training the network after the edges are removed. What counts as “unimportant” varies with the specific technique. The big revelation is that this pipeline works: when applied iteratively, the number of parameters in the model can be reduced by upwards of 50 times with no decrease in accuracy. In fact, there’s often an increase in accuracy.
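The pipeline can be sketched in a few lines. The toy below uses magnitude pruning on a logistic-regression model as a stand-in for Optimal Brain Damage's saliency criterion; all sizes, learning rates, and thresholds are illustrative values of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = rng.normal(size=5)                 # only 5 features carry signal
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

def train(X, y, w, mask, steps=500, lr=0.1):
    # Gradient descent on logistic loss; pruned weights are held at zero.
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
        w *= mask
    return w

mask = np.ones(20)
w = train(X, y, rng.normal(size=20) * 0.01, mask)      # 1) train
mask[np.abs(w) < np.quantile(np.abs(w), 0.5)] = 0.0    # 2) prune 50% by magnitude
w = train(X, y, w * mask, mask)                        # 3) re-train the survivors

print("surviving weights:", int(mask.sum()))
```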
Most previous work evaluates the performance of pruning algorithms using the number of non-zero parameters as the metric. But as we shall see, this is not the only sensible criterion. Some notable work at the CDNNRIA workshop considered energy consumption under an assumed three-level cache architecture (as discussed below). The existing work – both the cache-architecture metric and the non-zero-parameter count – models energy consumption well on general-purpose circuitry with fixed memory hierarchies. However, some machine learning applications (say, image recognition) may need to be deployed more widely.
As with error-control coding, specialized neural-network hardware may implement the edges of the network directly as wires. In this case, however, it is the total wiring length, not the number of memory accesses, that dominates energy consumption. This is due to the resistive-capacitive effects of wires but, more generally, it occurs because wiring length is a fundamental energy limitation of every practical computational technique we can conceive. This hinges upon a basic fact: real systems have friction.
With this context established, our paper introduces a simple criterion for analyzing energy: the “wire-length”, or information-friction, model. Our model – inspired by the work of Thompson, the more recent work of Grover, and my own PhD thesis – takes energy to be proportional to the total length of all the edges connecting the neurons of the network. The technique involves placing the nodes of the neural network on a three-dimensional grid, so that the nodes are at least one unit distance apart. Then, if two nodes are connected by a wire in the neural network, the length of the wire is the Manhattan distance between those two nodes. The task we define is to find a neural network that is both accurate and has a placement of nodes that is wire-length efficient.
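Here is a minimal sketch of the wire-length cost itself (the placement and edge list are made-up examples; our paper additionally searches for good placements rather than assuming one):

```python
def manhattan(a, b):
    # L1 distance between two grid coordinates.
    return sum(abs(x - y) for x, y in zip(a, b))

def total_wire_length(placement, edges):
    """placement: node -> (x, y, z) integer grid coordinates; edges: (u, v) pairs."""
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)

# A tiny illustrative network: two inputs, one hidden node, one output.
placement = {"in0": (0, 0, 0), "in1": (1, 0, 0), "h0": (0, 1, 0), "out": (0, 1, 1)}
edges = [("in0", "h0"), ("in1", "h0"), ("h0", "out")]
print(total_wire_length(placement, edges))  # 1 + 2 + 1 = 4
```

Under this cost, two placements of the same network can have very different energies, which is what makes node placement part of the optimization problem.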
In our paper, we introduced three algorithms that can be combined and used at the training, pruning, and node placement steps. Our experiments show that each of our techniques is independently effective, and by combining them and using a hyperparameter search we can get even lower energy, which has allowed us to produce benchmarks for some standard problems. We also found that the techniques worked across datasets.
Several workshop papers submitted in parallel to the conference caught my attention. One of them, authored by Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin and Richard Baraniuk, tackled themes similar to ours. What interested me most was that it sought a way to make energy-efficient neural networks using a technique distinct from the standard training-pruning pipeline.
This paper is also interesting for the way it suggests minimizing total energy in a different manner than the standard pruning paradigm. The human brain is hyper-optimized for minimizing energy consumption, and if our machine learning techniques are to mimic the kinds of tasks performed by the brain, I suspect we will have to use all kinds of techniques to keep energy costs under control. The “skip policy” idea of Wang et al. may be one such technique useful on the road to more energy efficient artificial intelligence.
In the normal iterative training, pruning, re-training framework, we keep the weights of the neural network the same post-pruning (except, of course, for the weights associated with pruned edges) then re-train the weights from this point onward. The idea behind this methodology is that the training-pruning process helps the network learn important edges and important weights.
Two similar papers submitted to the workshop added a twist to this paradigm. They showed that if the weights are randomly re-initialized after some training and pruning, and the network is then re-trained from this random re-initialization, higher accuracy can be obtained. This result suggests that pruning helps find important connections but not important weights. It contradicts what many (including myself) would have intuited: that pruning lets you learn both important weight values and important connections.
The experimental evidence across these two papers could provide a very easy-to-implement tool for the machine-learning practitioner, and I’m curious to see if this technique gets widely adopted. There’s a good chance, as one of the papers, “Rethinking the Value of Network Pruning”, won the workshop’s best paper award.
However, while the papers’ results suggest a simple approach to achieving higher accuracy with low-energy machine learning models, it raises the question of whether this tool will work on the pruned architectures we obtained when optimizing for wire length. The fact that this was discovered independently by two different groups gave me more confidence that it would work for us. In the spirit of learning from the workshop, we decided to try it out back at Borealis AI HQ.
We obtained the best-performing model using distance-based regularization at different target accuracies. Then, we re-initialized the weights of the resulting network before we re-trained. The table below shows the results:
| Accuracy Before Re-initialization (%) | Accuracy After Re-initialization (%) | Accuracy After Re-training (%) |
|---|---|---|
| 98 | 8.92 | 97.08 |
| 97 | 10.1 | 84.64 |
| 95 | 10.32 | 61.94 |
| 90 | 10.09 | 30.76 |
The left column presents the accuracy before re-initialization; the middle column shows the accuracy after re-initialization; and the final column reveals the accuracy after re-training the re-initialized network. As we can see, we consistently get lower accuracy than the network before re-initialization, which suggests the re-initialization approach does not work here. We also find that the technique becomes less effective as the networks being re-initialized get smaller, which may partly explain why it didn’t work.
Does this contradict the results of the workshop papers discussed above? No. But it might reveal that the general approach is less likely to work than we might have thought. Indeed, in “Rethinking the Value of Network Pruning,” the authors also present negative results, so we might have guessed the technique wouldn’t work on our networks trained for shorter wire-length. In the next section, I’ll discuss why I think having negative results is so useful.
It’s a machine learning truism that any single result is hard to pinpoint as being independently important, but the many pieces of evidence reported across multiple papers allow us to draw an emergent lesson. I view machine learning as a grab-bag of techniques that help solve new classes of computational problems. They don’t always work, but all-too-often some of them do. This allows us to use the literature in a way that produces rules-of-thumb to inform techniques that might work. For example, if we looked at the original Optimal Brain Damage paper in isolation, it might be hard to discern the paper’s broad applicability. But the fact that the standard training-pruning pipeline has been so widely used, and that so many modifications of the technique (including our wire-length pruning work) also work, gives us confidence in the idea’s ability to capture something basic and fundamental – that doing some type of pruning is appropriate if minimizing energy consumption is a major concern.
Due to the multiplicity of possible techniques, the only thing machine learning practitioners can do is test them out in the first place, set good evaluation criteria, and see if they work. Since engineering and computational resources are limited, this also means judiciously choosing which techniques to take on. This process requires a careful balancing of engineering risk and reward.
So, while it may be worth it to try a technique, the decision depends on the particulars of the problem and the probability that the technique will be successful. Presenting negative results allows our readers to intuit this probability. Moreover, negative results inform researchers about areas that have already been attempted and saves them the effort of re-testing them.
The value of negative results can be further illustrated with a concrete thought experiment. Suppose your goal is to create a neural network in a place where computational resources are free and plentiful, and the tool is not going to be widely deployed. Perhaps such a network is used in an internal tool at a small company. The engineer, in this case, might ask: Should I try the re-initialization and re-training technique in order to get a more accurate small network?
Since the papers (and our experiment) suggest the techniques only work some of the time, it may not be worth the effort to give it a try. After all, there’s only a slim chance of it being successful and the reward margin is small. However, suppose the network were to be deployed to a billion cellphones, as in some popular social media application. In this case, it makes sense to try this technique, as well as a number of others, to ensure the tool uses as little energy as possible.
Real problems may fit somewhere between these two extremes, and choosing the right approach requires having a finely tuned sense of the probabilities that they will work. Having a collection of experimental results in the literature, both positive and negative, helps the engineer make the right judgment call about whether a technique is worth the effort.
Right now, we have enormous potential in the field, but we have very limited human talent and limited computational resources. We should take on the responsibility to ensure we draw the right lessons from the work we do and present our work in as useful a way as possible. That’s why “Rethinking the Value of Network Pruning” is a strong output – not only does it find a surprising and successful technique, it also presents negative results. The quality of the scientific analysis in the paper makes it, in my opinion, a worthy recipient of the workshop’s Best Paper Award and hopefully sets a precedent for more researchers to explore negative results for the greater good of the field.
*Special thanks to Luyu Wang for running the re-initialization experiment in this post.
]]>The Pommerman environment [1] is based on the classic Nintendo console game, Bomberman. It was set up by a group of machine learning researchers to explore multi-agent learning and push the state of the art in reinforcement learning through competitive play.
The team competition was held on December 8, 2018 during the NeurIPS conference in Montreal. It involved 25 participants from all over the world. The Borealis AI team, consisting of Edmonton researchers Chao Gao, Pablo Hernandez-Leal, Bilal Kartal and research director, Matt Taylor, won 2nd place in the learning agents category, and 5th place in the global ranking including (non-learning) heuristic agents. As a reward, we got to haul a sweet NVIDIA Titan V GPU CEO Edition home. Here’s how we pulled it off.
The Pommerman team competition consists of four bomber agents placed at the corners of an 11 x 11 symmetrical board. There are two teams, each consisting of two agents.
Competition rules work like this:
At every timestep, each agent executes one of six actions: it can move in one of the four cardinal directions, remain in place, or plant a bomb.
Each cell on the board can serve as a passage (the agent can walk over it), a rigid wall (the cell cannot be destroyed), or a plank of wood (the cell can be destroyed with a bomb).
The game maps, which function as individual levels, are procedurally generated; however, there is always a path between any two agents, so every generated map is guaranteed to be playable.
Whenever an agent plants a bomb it explodes after 10 timesteps, producing flames that have a lifetime of two timesteps. Flames destroy wood and kill any agents within their blast radius. When wood is destroyed, the fallout reveals either a passage or a power-up (see below).
Power-ups, which are items that impact a player’s abilities during the game, can be of three types: i) they increase the blast radius of bombs; ii) they increase the number of bombs the agent can place; or iii) they give the ability to kick bombs.
Each game episode lasts up to 800 timesteps and ends in one of two ways: a team wins before reaching this upper bound, or a tie is called at 800 timesteps.
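For concreteness, the rules above can be summarized in a few lines of Python. The encoding below simply mirrors the description; the actual environment defines its own integer action codes.

```python
from enum import IntEnum

# Hypothetical encoding of the six actions described above; the real
# Pommerman environment defines its own integer action codes.
class Action(IntEnum):
    STOP = 0   # remain in place
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4
    BOMB = 5   # plant a bomb

BOMB_LIFE = 10           # a planted bomb explodes after 10 timesteps
FLAME_LIFE = 2           # flames persist for 2 timesteps
MAX_EPISODE_STEPS = 800  # a tie is called if no team has won by then

def is_movement(action: Action) -> bool:
    """True for the four cardinal-direction moves."""
    return action in (Action.UP, Action.DOWN, Action.LEFT, Action.RIGHT)
```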
An example of a Pommerman team game.
The Pommerman team competition is a very challenging benchmark for reinforcement learning methods. Here’s why:
When an agent is in the early stage of training, it commits suicide many times.
After some training, the agent learns to place bombs near the opponent and move away from the blast.
Our learning agent (white) is highly skilled against a SimpleAgent. It avoids the blasts and also learns how to trick SimpleAgent to commit suicide in order to win without having to place any bombs.
When we examined the behavior of our learning agent against SimpleAgent, we discovered that our agent had learned how to force SimpleAgent to commit suicide. The pattern starts when SimpleAgent places a bomb and then takes a movement action toward a neighboring cell X. Our agent, having learned this opponent behaviour, simultaneously moves toward the same cell X; the game engine's forward model then sends both agents back to their original locations in the next timestep. This repeats until the bomb goes off, successfully blasting SimpleAgent. In other words, our agent had learned a flaw in SimpleAgent and exploited it to win games by forcing suicide. This policy is optimal against SimpleAgent; however, it generalizes poorly to other opponents, since agents trained this way stopped placing bombs and made themselves easy targets for exploitation.
Notes: This generalization over opponent policies is of utmost importance when dealing with dynamic multiagent environments and similar problems have also been encountered in Laser Tag [6].
In single-agent tasks, faulty (and strange) behaviors have also been observed [7]. An RL agent trained on the game CoastRunners discovered a spot where, due to a mismatch between the maximum achievable reward and the intended behaviour, it could obtain a higher score by circling that spot than by finishing the game.
A Skynet team is composed of a single neural network and is based on five building blocks:
We make use of a parameter-sharing mechanism; that is, the two agents share the parameters of a single network, which is therefore trained with the experiences of both agents. This still allows for diverse behavior between the agents because each receives different observations.
Additionally, we added dense rewards to improve learning performance. We took inspiration from the difference-rewards mechanism to provide agents with a more meaningful measure of their individual contribution, in contrast to simply using a single global reward.
Our third block, the ActionFilter module, is built on the philosophy of instilling prior knowledge by telling the agent what it should not do, and then letting it discover what to do by trial and error, i.e., learning. The benefit is twofold: 1) the learning problem is simplified; and 2) basic skills, such as avoiding flames or evading bombs in simple cases, are acquired perfectly by the agent.
It is worth mentioning that the ActionFilter does not significantly slow down RL training, as it is extremely fast. Together with the neural-net evaluation, each action still takes only a few milliseconds – almost as fast as pure neural-net forward inference. For context, the time limit in the competition is 100 ms per move.
The neural net is trained by PPO [13], minimizing the following objective:
\begin{equation}
\begin{split}
o(\theta;\mathcal{D}) & = \sum_{(s_t, a_t, R_t) \in \mathcal{D}} \Bigg[ -\mathit{clip}(\frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)}, 1-\epsilon, 1+\epsilon) A(s_t, a_t) + \\
& \frac{\alpha}{2} \max\Big[ (v_\theta(s_t) -R_t)^2, (v_\theta^{old}(s_t) + \mathit{clip}(v_\theta(s_t) - v_\theta^{old}(s_t), -\epsilon, \epsilon)-R_t)^2 \Big] \Bigg],
\end{split}
\end{equation}
where $\theta$ denotes the neural-net parameters, $\mathcal{D}$ is sampled by $\pi_\theta^{old}$, and $\epsilon$ is a tuning parameter. Refer to the PPO paper [13] for details.
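As a sanity check, the objective can be sketched numerically. The following is a minimal NumPy transcription of the formula above, with per-sample probabilities, values, and advantages as flat arrays; it is an illustration, not the training code used in the competition.

```python
import numpy as np

def ppo_objective(pi, pi_old, adv, v, v_old, ret, eps=0.2, alpha=0.5):
    """Transcription of the objective above, summed over a batch D.

    pi, pi_old: pi_theta(a_t|s_t) and pi_theta_old(a_t|s_t)
    adv:        advantage estimates A(s_t, a_t)
    v, v_old:   current and old value predictions v_theta(s_t)
    ret:        returns R_t
    """
    ratio = pi / pi_old
    policy_term = -np.clip(ratio, 1 - eps, 1 + eps) * adv
    v_clipped = v_old + np.clip(v - v_old, -eps, eps)
    value_term = (alpha / 2) * np.maximum((v - ret) ** 2,
                                          (v_clipped - ret) ** 2)
    return np.sum(policy_term + value_term)
```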
We let our team compete against a set of curriculum opponents:
We allowed the opponent to refrain from placing bombs so that the neural net could focus on learning true “blasting” skills, rather than a skill that relies solely on the opponent's mistakenly suicidal actions. This strategy also avoids training on “false positive” reward signals caused by an opponent's involuntary suicide.
As shown in the figure below, the architecture stacks four convolutional layers, followed by a policy head and a value head.
Instead of using an LSTM to track the observation history, we used a “retrospective board” to keep the most recently observed value of each cell on the board: for cells outside the agent's purview, the “retrospective board” retains the value that was observed most recently. The input features comprise 14 planes in total, where the first 10 planes are extracted from the agent's current observation and the remaining four come from the “retrospective board”.
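A minimal sketch of how such a “retrospective board” might be maintained; the array shapes and names are illustrative, not the competition code.

```python
import numpy as np

def update_retrospective_board(retro, obs, visible):
    """Refresh only the cells the agent can currently see.

    retro:   (11, 11) array holding the last observed value of each cell
    obs:     (11, 11) array with the agent's current (partial) observation
    visible: (11, 11) boolean mask, True inside the agent's purview
    """
    retro = retro.copy()
    retro[visible] = obs[visible]  # unseen cells keep their last value
    return retro
```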
An example of a game between Skynet Team (Red) vs a team composed of two SimpleAgents (Blue).
The Pommerman team competition used a double-elimination format. The top three agents used tree-search methods, i.e., they actively employed the game's forward model to look ahead at each decision, guided by heuristics. During the competition, these agents seemed to perform more bomb-kicking, which increased their chances of survival.
As mentioned in the introduction, our Skynet team won 2nd place in the learning-agents category and 5th place in the global ranking that includes (non-learning) heuristic agents. It is worth noting that scripted agents were not among the top players in this competition, which speaks to the quality of the tree-search and learning entries.
Another one of our submissions, CautiousTeam, was based on SimpleAgent and – interestingly enough – wound up ranking 7th overall. We submitted it primarily to verify the suspicion that a SimpleAgent that never places a bomb could be as strong as (or perhaps even stronger than) the winner [3] of the first competition held in June, a fully observable free-for-all scenario. The competition results seem to support this suspicion.
Aside from being an interesting and (most importantly) fun environment, the Pommerman simulator was also designed as a benchmark for multiagent learning. We are currently exploring multi-agent deep reinforcement learning methods [5] by using Pommerman as a testbed.
We would like to thank the creators of the Pommerman testbed, the competition organizers and the growing Pommerman community on Discord. We look forward to future competitions.
[1] Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multiagent playground. arXiv preprint arXiv:1809.07124, 2018.
[2] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[3] Hongwei, Zhou et al. "A hybrid search agent in pommerman." Proceedings of the 13th International Conference on the Foundations of Digital Games. ACM, 2018.
[4] Bilal Kartal, Pablo Hernandez-Leal, and Matthew E. Taylor. "Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL." arXiv preprint arXiv:1812.00045(2018).
[5] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.
[6] Lanctot, Marc, et al. "A unified game-theoretic approach to multiagent reinforcement learning." Advances in Neural Information Processing Systems. 2017.
[7] Open AI. Faulty Reward Functions in the Wild. https://blog.openai.com/faulty-reward-functions/
[8] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Skill Reuse in Partially Observable Multiagent Environments. LatinX in AI Workshop @ NeurIPS 2018
[9] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." nature 521.7553 (2015): 436.
[10] J. N. Foerster, Y. M. Assael, N. De Freitas, S. Whiteson, Learning to communicate with deep multi-agent reinforcement learning, in: Advances in Neural Information Processing Systems, 2016,
[11] J. N. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, S. Whiteson, Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning., in: International Conference on Machine Learning, 2017.
[12] Devlin, Sam, et al. "Potential-based difference rewards for multiagent reinforcement learning." Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems.
[13] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[14] Bansal, Trapit, et al. "Emergent complexity via multi-agent competition." arXiv preprint arXiv:1710.03748 (2017).
In this section, we provide more details on some of the aforementioned concepts as follows:
Difference rewards: This is a method to better address the credit-assignment challenge in multi-agent teams. Relying only on the external reward, both agents receive the same team (global) reward regardless of what each actually did during the episode. This makes multi-agent learning harder, as spurious actions can occasionally be rewarded, or some agents can learn to do most of the work while the rest learn to be lazy. Difference rewards [12] compute each agent's individual contribution without hurting coordination performance. The main idea is very tidy: an agent's individual reward is obtained by subtracting from the external global reward a counterfactual reward computed without that agent's action. Team members are thus encouraged to optimize overall team performance as well as their own contribution, so that no lazy agents arise.
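The idea can be sketched in a few lines; here `global_reward_fn` and the "null" default action are illustrative stand-ins for the environment's team reward and a no-op.

```python
def difference_reward(global_reward_fn, joint_actions, agent, null_action):
    """D_i = G(a) - G(a with agent i's action replaced by a null action).

    global_reward_fn: maps {agent: action} to the team (global) reward
    joint_actions:    the actions actually taken this step
    """
    g = global_reward_fn(joint_actions)
    counterfactual = dict(joint_actions)
    counterfactual[agent] = null_action  # remove agent i's contribution
    return g - global_reward_fn(counterfactual)
```

An agent whose action contributed nothing to the team reward receives a difference reward of zero, so "lazy" behaviour is no longer reinforced.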
Centralized training and decentralized execution: Even though the team game is executed under partial observability, one can hack the game simulator to access the full state during training. With full-state access, one can train a centralized value function for the actor-critic setting; the deployed agents, however, only use the policy network, which is trained on partial observations.
Dense rewards: There can be cases where one agent commits suicide and the remaining team member eliminates the opposing team by itself. In the simplest case, both team members would get a +1 reward – i.e., the suicidal agent is reinforced! To address this, we altered the sparse external reward so that the first team member to die receives a smaller reward than the last surviving one. This helped our agent improve its game-play, although the modification comes with some expected and some unforeseen consequences. For instance, under this setting a team can never learn to have one agent sacrifice itself (dying simultaneously with an enemy) so that the team wins, and it may allot less credit to a hard-working agent that eliminated one enemy but died battling the second.
Action filter description: We implement a filter with two categories:
For avoiding suicide
For placing bombs
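The suicide-avoidance half of the filter can be sketched as follows. Here `danger_cells` stands in for the set of cells about to be covered by flames (precomputed from bomb timers and blast radii); this is an illustration of the idea, not the exact filter we shipped.

```python
MOVES = {"stop": (0, 0), "up": (-1, 0), "down": (1, 0),
         "left": (0, -1), "right": (0, 1)}

def safe_moves(pos, danger_cells, board_size=11):
    """Return the movement actions that stay on the board and do not
    step into a cell that is about to be covered by flames."""
    safe = []
    for name, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < board_size and 0 <= c < board_size \
                and (r, c) not in danger_cells:
            safe.append(name)
    return safe
```

The policy then samples only from the surviving actions, which is how the filter simplifies the learning problem without dictating what the agent should do.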
Authors: Yu-An Chung · Wei-Hung Weng · Schrasing Tong · James Glass
Last year, I was surprised by a paper that introduced a technique to perform word translation between two languages without parallel corpora. To clarify, corpora are non-parallel when text exists in both languages (e.g., a set of English words and a set of French words), but there is no information about which English word corresponds to which French translation.
Previously, the state-of-the-art method for learning cross-lingual word embeddings mainly relied on bilingual dictionaries, along with some help from character-level information for languages that shared a common alphabet. None of this was competitive with supervised machine translation techniques. The authors of the paper managed to pose a different question: Was it possible to do unsupervised word translation? They answered their own question by introducing a new technique that worked quite well.
Their model worked by obtaining the word embeddings space for both languages, independently, and introducing a technique for unsupervised alignment between the two embedding spaces that can achieve translations without parallel corpora. The intuition behind this technique is to rotate one embedding space to the point that the two embedding spaces are virtually indistinguishable to a classifier (i.e., adversarial training).
This year, a new set of authors presented their work regarding the task of automatic speech recognition without parallel data. This, again, means two independent sets of speech data and text data exist, but the correspondence information between them is unclear. This work stood out since it is the first successful attempt to apply the unsupervised alignment technique introduced last year on multiple modalities of data. The task involved taking a dataset of words from one language and a dataset of spoken words from either the same or a different language, and automatically identifying spoken words without parallel information.
The authors first trained an embedding space for written words and another embedding space for spoken words. They then applied the unsupervised alignment technique to the embedding spaces to align them so that spoken words could automatically be classified and translated. At test time, a speech segment is first mapped into its respective embedding space, aligned to the text embedding space, then the nearest neighbors of the text embedding are picked as the translation. The same procedure can be used for the text-to-speech conversion task.
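At a high level, the retrieval step reduces to a nearest-neighbour lookup in the aligned space. A toy sketch, where the alignment matrix `W` would come from the unsupervised (adversarial) alignment step; all names here are illustrative.

```python
import numpy as np

def translate(speech_emb, W, text_embs, vocab):
    """Map a speech embedding into the text space via W, then return
    the nearest word under cosine similarity."""
    q = W @ speech_emb
    q = q / np.linalg.norm(q)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return vocab[int(np.argmax(T @ q))]
```

The text-to-speech direction works symmetrically, mapping a text embedding into the speech space and retrieving the nearest spoken-word embedding.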
The authors present experiments on the Spoken Wikipedia and LibriSpeech datasets showing that unsupervised alignments are still not as good as supervised alignments – but they're close. Some challenges remain to be solved before unsupervised cross-modal alignments can be competitive with supervised ones; however, this work shows promise for improving automatic speech recognition (ASR), text-to-speech (TTS), and even translation systems, especially for languages with little parallel data. (/HS)
Authors: Rad Niazadeh · Tim Roughgarden · Joshua Wang
This paper was accepted as an oral presentation. The authors gave an approximation algorithm for maximizing continuous non-monotone submodular functions.
To give a brief recap, submodular functions arise in several important areas of machine learning and, in particular, around the intersection of economics and learning. They can be used to model the problem of maximizing multi-platform ad revenue, where a buyer wants to maximize their profit (revenue minus cost) by advertising on different platforms, and there is a diminishing return to advertising on more platforms. This diminishing return is precisely the property captured by submodular functions. Mathematically, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is submodular if $f(S \cup \{e\}) - f(S) \geq f(T \cup \{e\}) - f(T)$ for every $S \subseteq T$ and $e \notin T$. In this setting, there is an information-theoretic lower bound of $1/2$-approximation [Feige et al.'11] and an optimal algorithm matching this bound [Buchbinder et al.'15].
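For intuition, the double greedy framework for the discrete setting fits in a few lines. The deterministic variant sketched below guarantees a 1/3-approximation; the randomized version of Buchbinder et al.'15 achieves the optimal 1/2.

```python
def double_greedy(f, n):
    """Deterministic double greedy for unconstrained maximization of a
    non-negative submodular f over subsets of {0, ..., n-1}."""
    X, Y = set(), set(range(n))
    for i in range(n):
        a = f(X | {i}) - f(X)  # gain of adding i to the growing set
        b = f(Y - {i}) - f(Y)  # gain of removing i from the shrinking set
        if a >= b:
            X.add(i)
        else:
            Y.discard(i)
    return X  # X == Y at this point

# Example: f(S) = |S| * (3 - |S|) is submodular (concave in |S|)
best = double_greedy(lambda S: len(S) * (3 - len(S)), 3)
```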
This paper considered the continuous submodular function where, instead of maximizing on the vertices of the hypercube $\{0,1\}^n$, we want to maximize over the full hypercube $[0,1]^n$. The main result of the paper is that they obtained a randomized algorithm for maximizing a continuous submodular and $L$-Lipschitz function over the hypercube that guarantees a $1/2$-approximation. Note that this is currently the best possible ratio that is information-theoretically achievable.
The reason this paper stood out is that the authors used the double greedy framework of Buchbinder et al.'15 to solve the coordinate-wise zero-sum game, and then use the geometry of this game to bound the value at its equilibrium. This is a nice application of game theory to maximize the value of the function. The authors also conducted experiments on 100-dimensional synthetic data and achieved comparable results as the previous work they referenced. One thing we hoped to see was that their achievement of better approximations and faster algorithms would also show a significant advantage in the experiments, but that was not the case.
In terms of open problems, I am really excited to see the development of parallel and online algorithms for continuous submodular optimization. In fact, there is recent work on parallel algorithms by Chen et al.'18 which achieves a tight $(1/2 - \epsilon)$-approximation guarantee using $\tilde{O}(\epsilon^{-1})$ adaptive rounds. (/KJ)
Authors: Jiantao Jiao · Weihao Gao · Yanjun Han
This paper focused on estimating the differential entropy of a continuous distribution $f$ given $n$ i.i.d. samples. Entropy has been a core concept of information theoretic measures and has engendered numerous important applications, such as goodness-of-fit tests, feature selection, and tests of independence. In the vast body of literature around this concept, most of the measures have appeared to take on an asymptotic flavor – that is, until several recent works.
This paper is one of those works. The authors focused on the fixed-$k$ nearest neighbor (fixed-kNN) estimator, also called the Kozachenko–Leonenko estimator. This estimator is simple; there is only one parameter to tune, and it requires no knowledge of the smoothness degree $s$ of the targeted distribution $f$. Moreover, it is computationally efficient, since $k$ is fixed (compared to other methods with similar finite-sample bounds), and statistically efficient: as shown in this paper, it has a finite-sample bound that is close to optimal. All of these properties make the estimator realistic and attractive in practice.
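For reference, here is a brute-force sketch of the Kozachenko–Leonenko estimator (in nats). Integer digamma values are computed via the harmonic-number identity; this is a textbook transcription of the estimator, not the paper's construction.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(m):
    """psi(m) for integer m >= 1, via psi(m) = -gamma + H_{m-1}."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def kl_entropy(samples, k=1):
    """Fixed-k NN (Kozachenko-Leonenko) differential entropy estimate."""
    n, d = len(samples), len(samples[0])
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of unit d-ball
    log_dists = 0.0
    for i, x in enumerate(samples):
        dists = sorted(math.dist(x, y)
                       for j, y in enumerate(samples) if j != i)
        log_dists += math.log(dists[k - 1])  # distance to k-th nearest neighbour
    return (digamma_int(n) - digamma_int(k)
            + math.log(unit_ball) + d * log_dists / n)
```

Scaling every sample by a factor $a$ shifts the estimate by exactly $d\log a$, matching the identity $h(aX) = h(X) + d\log a$.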
I found the paper also carried some interesting technical results. One direct approach in estimating the differential entropy is to plug in a consistent estimator, for example, based on kNN distance statistics of the density function $f$ into the formula of entropy. However, such estimators usually come with an impractical computational demand. For instance, in the kNN-based estimator, $k$ has to approach $\infty$ as the number of samples $n$ approaches $\infty$.
In a recent paper by Han et al. [2017], the authors constructed a complicated estimator that achieves a finite-sample bound at the rate $(n\log n)^{-\frac{s}{s+d}} + n^{-\frac{1}{2}}$ (the optimal rate). One caveat, though, is that it requires knowledge of the smoothness degree $s$ of the targeted distribution $f$. The last challenging part is to deal with the region where $f$ is small: a major difficulty in achieving such bounds for the entropy estimator is that the nearest-neighbor estimator exhibits a huge bias in low-density areas. Most papers make assumptions on $f$ so that this bias is well controlled; this paper, however, does not. Given all these constraints – fixed $k$, no knowledge of $s$, and no assumption that $f$ is bounded from below – the authors managed to prove a nearly optimal finite-sample bound for a simple estimator. According to the authors, the new technical tools are the Besicovitch covering lemma and a generalized Hardy–Littlewood maximal inequality. This part is not yet clear to me.
Lastly, the authors also pointed out several weaknesses in their paper and their plans for future work. For example, they conjectured that both the upper bound and the lower bound in the paper could be further improved. They also hypothesized a way to extend the constraint on $s$ in the theorem so that the result can be applied to a more general setting. (/RH)
Authors: Kevin Scaman · Francis Bach · Sebastien Bubeck · Laurent Massoulié · Yin Tat Lee
This paper considered distributed optimization of non-smooth convex functions using a network of computing units. The objective of this work was to study the impact of the communication network on learning, and the tradeoff between the structure of the network and algorithmic efficiency. The network consists of a connected simple graph of nodes, each with access to a local function (such as a loss function). The optimization problem is to minimize the average of these local functions; communication between nodes takes a given length of time, and computation takes one unit of time. In the decentralized scenario, local communication is performed through gossip.
The authors give bounds on the time needed to reach a given precision, then provide an optimal algorithm that uses a primal-dual reformulation. They show that the error due to limits in communication resources decreases at a fast rate. In the centralized setting, the authors provide an algorithm whose convergence rate is within a factor of $d^{1/4}$ of optimal, where $d$ is the underlying dimension.
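The gossip primitive itself is simple: each node repeatedly averages with its neighbours through a doubly stochastic mixing matrix, driving all local values toward the network-wide mean. A toy sketch on a 3-node ring (illustrative of gossip only, not the paper's primal-dual algorithm):

```python
import numpy as np

def gossip_round(values, W):
    """One synchronous gossip step: values[i] <- sum_j W[i, j] * values[j]."""
    return W @ values

# 3-node ring: each node keeps half its value and takes a quarter
# from each neighbour (W is symmetric and doubly stochastic).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
x = np.array([0.0, 3.0, 6.0])
for _ in range(20):
    x = gossip_round(x, W)  # converges toward the mean, 3.0
```

How quickly the deviation from the mean shrinks is governed by the spectral gap of W, which is exactly where the network structure enters the convergence bounds.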
I found this paper intriguing because it considers the impact of communication and computation resources in learning, which will be increasingly important as systems we learn on become larger. It received one of the best paper awards and is one of few papers that consider such impacts. There’s an argument to be made that these two things are related; as learning systems scale up and get distributed through IoT and mobile devices, the importance of distributed learning in a setting where there is tension between communication and computation has also increased. The elegant analytical tools used in this paper – gossip methods, primal-dual formulation, Chambolle-Pock algorithm for saddle-point optimization, the combined use of optimization and graph theory, and the bounds that give insight into which resources are important at which stage of convergence – show that the award places well-deserved attention toward a growing area. (/NH)
Invited talk: Jon Kleinberg - Fairness, Simplicity, and Ranking
In his fascinating invited talk, Jon Kleinberg addressed the effect of implicit bias on producing adverse outcomes. The specific application he referred to is bias in activities such as hiring, promotion, and admissions. The setting is as follows: a recruitment committee is tasked with selecting a shortlist of final candidates from a given pool of applicants, but the skill estimates used in the selection may be skewed by implicit bias.
The Rooney Rule is an NFL policy in effect since 2003 that requires teams to interview ethnic-minority candidates for coaching and operations positions. (Note: There is no quota or preference given in the actual hiring). Kleinberg and his co-authors showed that measures such as the Rooney Rule lead to higher payoffs for the organization. Their model is as follows: a recruiting committee must select a list of k candidates for final interviews; the set of applicants is then divided into two groups, X and Y, X being the minority group; there are $n$ Y applicants and $n\alpha$ X applicants with $\alpha \le 1$. Each candidate has a numerical value representing their skill, and there is a common distribution from which these skills are drawn.
Based on empirical studies of skill in creative and skilled workforces, the authors modeled this distribution as a Pareto distribution (power law). The utility the recruiting committee aims to maximize is the sum of the skills of the candidates selected for the list. The authors modeled the bias as multiplicative in the estimation of X-candidates' skills: Y candidates are estimated at their true value, while an X candidate $i$'s skill is estimated as $X_i/\beta$ with $\beta > 1$. The authors then analyzed the utility of a list of $k$ candidates of which at least one must be an X candidate. Their analysis showed an increase in utility even when the list was of size 2, and for a large range of values of the bias, power-law, and population parameters.
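A tiny deterministic illustration of the model (not the authors' code): the committee ranks by estimated skill, with X candidates' skills down-weighted by β, and the Rooney-style constraint forces at least one X candidate onto the list.

```python
def shortlist_utility(candidates, k, beta, rooney=False):
    """candidates: list of (group, true_skill); groups are 'X' or 'Y'.
    The committee ranks by estimated skill (X skills divided by beta > 1);
    the utility is the sum of the TRUE skills of those selected."""
    est = sorted(((s / beta if g == "X" else s, g, s)
                  for g, s in candidates), reverse=True)
    picked = est[:k]
    if rooney and all(g != "X" for _, g, _ in picked):
        best_x = max((e for e in est if e[1] == "X"), default=None)
        if best_x is not None:
            picked = picked[:-1] + [best_x]  # swap in the best X candidate
    return sum(s for _, _, s in picked)
```

For example, with candidates [('Y', 1.0), ('Y', 0.9), ('X', 1.5)], k = 2 and β = 2, the biased committee picks the two Y candidates (utility 1.9), while the constrained list swaps in the stronger X candidate (utility 2.5).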
I found this to be another very interesting and important paper because it tackles the question of fairness at a very practical level and provided a tangible algorithmic framework with which to expose, then analyze the outcomes. Furthermore, the modelling assumptions were very realistic, and their results demonstrated the potential for significant impact. The particular scenario considered here may be for activities such as hiring and admissions, but the result has consequences for machine learning models. (/NH)
]]>In this work, we present a simple yet surprisingly effective way to prevent catastrophic forgetting. Our method, called Few-Shot Self Reminder (FSR), regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from the previous tasks.
Surprisingly, this simplistic approach only requires retraining a small amount of data in order to outperform previous knowledge retention methods. We demonstrate the superiority of our method to previous ones on popular benchmarks, as well as a new continual learning problem where tasks are designed to be more dissimilar.
]]>In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find improved scores across the board for all R@K-based metrics; the technique is also significantly more sample-efficient and converges in fewer iterations.
]]>
Authors: Kry Lui, Gavin Weiguang Ding, Ruitong Huang, Robert McCann
Poster: December 5, 10:45 am – 12:45 pm @ Room 210 & 230 AB #103
Dimensionality reduction occurs frequently in machine learning. It is widely believed that reducing more dimensions will often result in a greater loss of information. However, the phenomenon remains a conceptual mystery in theory. In this work, we try to rigorously quantify such phenomena in an information retrieval setting by using geometric techniques. To the best of our knowledge, these are the first provable information loss rates due to dimensionality reduction.
Authors: Christopher Blake, Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, Marcus Brubaker
Workshop: Compact Deep Neural Network Representation (Spotlight Paper); December 7, 2:50 pm
When seeking energy-efficient neural networks, we argue that wire length is an important metric to consider. Based on this idea, we develop and test new techniques to train neural networks that are both accurate and wire-length-efficient. This contrasts with previous techniques that minimize the number of weights in the network, and suggests these techniques may be useful for creating specialized neural-network circuits that consume less energy.
Authors: Junfeng Wen, Yanshuai Cao, Ruitong Huang
Workshop: Continual Learning; December 7
We present a simple, yet surprisingly effective, way to prevent catastrophic forgetting. Our method regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from previous tasks. As little as one data point per class is found to be effective. With similar storage, our algorithm outperforms previous state-of-the-art methods.
Authors: *A.J. Bose, *Huan Ling, Yanshuai Cao
Workshop: ViGIL; December 7, 8 am – 6:30 pm
We present a new technique for hard negative mining for learning visual-semantic embeddings. The technique uses an adversary that is learned in a min-max game with the cross-modal embedding model. The adversary exploits compositionality of images and texts and is able to compose harder negatives through a novel combination of objects and regions across different images for a given caption. We show new state-of-the-art results on MS-COCO.
Authors: Gavin Weiguang Ding, Yik Chau (Kry) Lui, Xiaomeng Jin, Luyu Wang, Ruitong Huang
Workshop: Security in Machine Learning; December 7, 8:45 am – 5:30 pm
We demonstrate an intriguing phenomenon about adversarial training – that adversarial robustness, unlike clean accuracy, is highly sensitive to the input data distribution. In theory, we show this by analyzing the Bayes classifier’s robustness. In experiments, we further show that transformed variants of MNIST and CIFAR10 achieve comparable clean accuracies under standard training but significantly different robust accuracies under adversarial training.
Authors: Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor
Workshop: Latinx in AI Coalition; December 2, 8 am – 6:30 pm
Our goal is to tackle partially observable multiagent scenarios by proposing a framework based on learning robust best responses (i.e., skills) and Bayesian inference for opponent detection. In order to reduce long training periods, we propose to intelligently reuse policies (skills) by quickly identifying the opponent we are playing with.
Authors: Yash Sharma, Gavin Weiguang Ding
Workshop: NeurIPS 2018 Competition Track Day 1; 8 am – 6:30 pm
This challenge pitted submitted adversarial attacks against submitted defenses. It was unique in that it allowed only a limited number of queries that output the defense's decision, rewarded minimizing the L2 distortion instead of imposing an Linf distortion constraint, and used TinyImageNet instead of ImageNet, making it tractable for competitors to train their own models. Our attack solution placed in the top 10 overall, in particular placing 5th in the targeted-attack track – a more difficult setting. We based our solution on a binary search for the minimal successful distortion, then optimized the procedure so that the necessary number of iterations still fit within the computational constraints.
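The core of the distortion search is a standard bisection along a fixed perturbation direction. A minimal sketch, assuming the upper endpoint is already adversarial; the actual submission also optimizes the perturbation direction itself and budgets model queries.

```python
import numpy as np

def minimal_distortion(x, direction, is_adversarial, hi=10.0, iters=20):
    """Binary-search the smallest L2 magnitude along `direction` that
    still fools the model; assumes x + hi * d is already adversarial."""
    d = direction / np.linalg.norm(direction)
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if is_adversarial(x + mid * d):
            hi = mid  # success: shrink the distortion
        else:
            lo = mid  # failure: need a larger distortion
    return hi
```

Each iteration halves the search interval, so 20 iterations pin the minimal L2 distortion down to about one millionth of the initial range.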
Northern Frontier sat down with Abhishek Gupta, AI ethics researcher at McGill University and founder of the Montreal AI Ethics Institute, to dive into some of the key themes of the day, including the threat automation poses to jobs given the current science, whether bias is the biggest problem we face in responsible AI, the dangers of 'mathwashing', and what we should consider reasonable trade-offs for improving fairness.
Dimensionality reduction occurs very naturally and very frequently within many machine learning applications. While the phenomenon remains, for the most part, a conceptual mystery, one thing many researchers believe is that reducing more dimensions will often result in a greater loss of information. What’s even harder to confirm is the rate at which this information loss occurs, as well as how to formulate the problem (even for very simple data distributions and nonlinear reduction mappings).
In this work, we try to rigorously quantify such empirical observations from an information retrieval perspective by using geometric techniques. We begin by formulating the problem through an adaptation of two fundamental information retrieval measures – precision and recall – to the (continuous) function analytic setting. This shift in perspective allows us to borrow tools from quantitative topology in order to establish the first provable information loss rate induced by dimension-reduction.
We were surprised to discover that when we began reducing dimensions, the precision would decay exponentially. This discovery should raise red flags for practitioners and experimentalists attempting to interpret their dimension reduction maps. For example, it may not be possible to design information retrieval systems that enjoy high precision and recall at the same time. This realization should keep us mindful of the limitations of even the very best dimension reduction algorithms, such as t-SNE.
While precision and recall are natural information retrieval measures, they do not directly take advantage of the distance information between data (e.g. in data visualization). We therefore propose an alternative dimension-reduction measure based on Wasserstein distances, which also provably captures the dimension reduction effect. To obtain this theoretical guarantee, we solve the following iterated-optimization problem:
\[
\inf_{W:\,\text{Vol}_n(W) = M} W_{2}(\mathbb{P}_{B_r}, \mathbb{P}_{W})
=
\inf_{W:\,\text{Vol}_n(W) = M} \inf_{\gamma \in \Gamma (\mathbb{P}_{B_r} , \mathbb{P}_{W})} \mathbb{E}_{(a, b) \sim \gamma} [ \| a - b \|^{2}_{2} ]^{1/2} ,
\]
by using recent results from optimal partial transport literature.
While precision and recall are familiar concepts for supervised learning problems, let’s do a quick review of them before we adapt them to the dimensionality reduction context.
In a supervised learning setting – say, the classification of cats vs. dogs – first select your favorite neural net classifier $f_W$, then collect 1,000 test images.
\[
Precision = 1/2 \frac{\text{How many are cats}}{\text{Among the ones predicted as cats}} + 1/2 \frac{\text{How many are dogs}} {\text{Among the ones predicted as dogs}}
\]
\[
Recall = 1/2 \frac{ \text{How many are predicted as cats}}{\text{Among the cats}} + 1/2 \frac{\text{How many are predicted as dogs}}{\text{Among the dogs}}
\]
Generally speaking, we can average precision and recall over n classes.
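As a toy numerical illustration of these macro-averaged formulas (the label arrays below are made up for the example, with 0 = cat and 1 = dog):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # made-up ground-truth labels
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])  # made-up classifier predictions

def macro_precision_recall(y_true, y_pred, classes=(0, 1)):
    precisions, recalls = [], []
    for c in classes:
        predicted_c = y_pred == c
        actual_c = y_true == c
        correct_c = (predicted_c & actual_c).sum()
        # Precision: among the ones predicted as class c, how many are class c?
        precisions.append(correct_c / predicted_c.sum())
        # Recall: among the class-c examples, how many are predicted as c?
        recalls.append(correct_c / actual_c.sum())
    return np.mean(precisions), np.mean(recalls)

prec, rec = macro_precision_recall(y_true, y_pred)
```

Here each class contributes its own ratio, and the two ratios are averaged with weight 1/2 each, exactly as in the formulas above.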
The formulation we used for precision and recall was inspired by the supervised learning setting. Since dimensionality reduction can actually happen in an unsupervised setting, we needed to change a few things around. In a typical dimensionality reduction map $f: X \rightarrow Y$, we often care about preserving the local structure post-reduction.
Since the unsupervised setting means there are no more labels, the first thing we did was to replace “label for an input x” by “neighboring points for an input x in high dimension.” We didn’t have the predictions either, but we felt it made sense to replace “prediction for an input x” by “neighboring points for an input y = f(x) in low dimension.”
When computing precision and recall in the supervised cases, we averaged across the labels. So, the second thing we did was to average over each data point.
\[
Precision = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~x~\text{in high dimension}}{\text{Among the low dimensional neighbors of}~y = f(x)}
\]
\[
Recall = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~y = f(x)~\text{in low dimension}}{\text{Among the high dimensional neighbors of}~x}
\]
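These averaged quantities are easy to estimate numerically. Below is a toy sketch using metric-ball neighborhoods; the "DR map" is just a coordinate projection standing in for a learned map, and the radii are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # toy high-dimensional data
Y = X[:, :2]                   # trivial projection standing in for a DR map

def dr_precision_recall(X, Y, r_U=1.5, r_V=1.0):
    """Average neighborhood precision/recall with ball radii r_U (high-dim)
    and r_V (low-dim)."""
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    np.fill_diagonal(dX, np.inf)  # a point is not its own neighbor
    np.fill_diagonal(dY, np.inf)
    high, low = dX < r_U, dY < r_V  # high/low-dimensional neighborhoods
    both = (high & low).sum(axis=1)
    precision = np.mean(both / np.maximum(low.sum(axis=1), 1))
    recall = np.mean(both / np.maximum(high.sum(axis=1), 1))
    return precision, recall

prec, rec = dr_precision_recall(X, Y)
```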
But it was still hard to prove anything, even with the settings detailed above. One difficulty we faced is that f is a map between continuous spaces, while the data points are finite samples. This motivated us to look for continuous analogues of the quantities above.
Finally, we arrived at one of the paper’s key observations: Precision is roughly injectivity; recall is roughly continuity.
Let’s build some intuition with linear maps. In linear algebra, we learn that a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ must have null space of dimension at least $n - m$. One may interpret this as the “how” and “why” of when linear maps lose information: distant points in high-dimension are projected together in low-dimension. This process leads to very poor precision.
In practice, DR maps can be much more flexible than linear maps. So, can this expressivity circumvent the linear-algebraic dimensional mismatch issue? To study dimension reduction under continuous maps, we turned to the corresponding study of topological dimension: the waist inequality from quantitative topology. It turns out that a continuous map from high to low dimension still fails to circumvent the issue that plagues linear maps – many continuous maps collapse points together. For most $x$, with $y = f(x)$, we have $\mathrm{Vol}_{n-m}\, f^{-1}(y) \ge C(x)\, \mathrm{Vol}_{n-m}\, B^{n-m}$.
Roughly speaking, the relevant neighborhood $U$ of $x$ is typically small in all $n$ directions, while the retrieval neighborhood $f^{-1}(V)$ is big in $n - m$ directions. This quantitative mismatch makes it very difficult to achieve high precision for a continuous DR map. It’s this mismatch that leads to the exponential decay of precision:
\[
Precision^{f}(U, V)
\leq
D(n, m)\,\left(\frac{r_U}{R}\right)^{n-m}\,\frac{ r_U^{m} }{p^{m}(r_V/L)}
\]
The above trade-off/information loss phenomenon has been widely observed by experimentalists. Naturally, practitioners have developed various tools to measure the imperfections. What was less clear in this regard is what led to the trade-off, so having clarified this a bit more, we can now design better measurement devices.
When we naively compute sample precision and recall:
\[
Precision =
\frac{\mathrm{~How~many~points~are~high~dimensional~neighbors~of}~x }{\mathrm{Among~the~low~dimensional~neighbors~of}~y}
\]
\[
Recall =
\frac{\mathrm{~How~many~points~are~low~dimensional~neighbors~of}~y }{\mathrm{Among~the~high~dimensional~neighbors~of}~x}
\]
These two quantities are equal when we fix the number of neighboring points. (Here, the numerators are the same. When we fix the number of neighboring points, the denominators are equal as well). Fixing the number of neighboring points is one of the reasons behind t-SNE’s success, since some data points are quite far away from others and without fixing them some outliers wouldn’t have any neighboring points.
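A quick toy check of this equality with a fixed number of neighbors k: the numerator (the overlap of the two neighbor sets) and the denominator (k) are the same for both quantities, so the two averages coincide exactly. The projection below is a made-up stand-in for a real DR map.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
Y = X @ rng.normal(size=(8, 2))  # made-up linear stand-in for a learned DR map

def knn_sets(Z, k):
    """k-nearest-neighbor index sets for each point (excluding itself)."""
    d = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(row) for row in np.argsort(d, axis=1)[:, :k]]

k = 10
high, low = knn_sets(X, k), knn_sets(Y, k)
# Same numerator (the overlap) and same denominator (k) on both sides:
precision = np.mean([len(high[i] & low[i]) / k for i in range(len(X))])
recall = np.mean([len(high[i] & low[i]) / k for i in range(len(X))])
```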
We can alternatively compute them by discretizing the continuous precision and recall shown above.
\[
Precision =
\frac{\mathrm{How~many~points~are~within}~r_U~\mathrm{from}~x~\mathrm{and~within}~r_V~\mathrm{from}~y}{\mathrm{How~many~points~are~within}~r_V~\mathrm{from}~y}
\]
\[
Recall = \frac{\mathrm{How~many~points~are~within}~r_U~\mathrm{from}~x~\mathrm{and~within}~r_V~\mathrm{from}~y}{\mathrm{How~many~points~are~within}~r_U~\mathrm{from}~x}
\]
But not only will this create an unequal number of neighboring points, it will result in quite a few data points ending up with very few neighbors. This is partially caused by high-dimensional geometry. Either way it can appear as though precision and recall are difficult quantities to manage in a practical situation.
Let’s revisit the problem from an alternative perspective. From the proof of the precision decay rate, it’s clear that the mismatch comes from $f^{-1}(V)$ versus $U$ – and this corresponds to the imperfection of injectivity. Heuristically, the parallel quantities for the imperfection of continuity are $f(U)$ and $V$.
We therefore proposed the following Wasserstein measures:
\[
W_{2}(\mathbb{P}_U, \mathbb{P}_{f^{-1}(V)});
W_{2}(\mathbb{P}_{f(U)}, \mathbb{P}_V)
\]
Like precision and recall, we associate the two Wasserstein measures with each point in the DR visualization map.
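For uniform empirical measures with equally many samples, the W2 distance reduces to an optimal matching problem, solvable exactly with the Hungarian algorithm. The following is a generic sketch (not the paper's implementation); the two point clouds stand in for samples from the measures being compared.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_empirical(a, b):
    """W2 between uniform empirical measures on equal-size point sets a, b."""
    cost = np.linalg.norm(a[:, None] - b[None, :], axis=-1) ** 2
    rows, cols = linear_sum_assignment(cost)  # optimal coupling is a matching
    return np.sqrt(cost[rows, cols].mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 2))
b = a + 3.0               # translate every point by the vector (3, 3)
d = w2_empirical(a, b)    # a pure translation costs exactly ||(3, 3)||
```

As a sanity check, translating a cloud by a fixed vector yields a W2 distance equal to that vector's length, since no coupling can beat matching each point with its own translate.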
i) On a theoretical level, our work sheds light on the inherent trade-offs in any dimensionality reduction mapping model (e.g., visualization embedding).
ii) On a practical level, the implication is that practitioners now have a measurement tool to improve their data exploration practice. To date, people have put too much trust in low-dimensional visualizations. Low-dimensional visualizations can at best poorly reflect high-dimensional data structures; at worst, they can produce incorrect representations, which can then degrade any subsequent analysis built upon them. We strongly suggest that practitioners improve their practice by incorporating a reliability measure for each data point on all their data visualizations.
Deep reinforcement learning (DRL) is a recent yet very active area of research that joins forces between deep learning (the use of neural networks) and reinforcement learning (solving sequential decision tasks). In DRL, the goal is to learn an optimal policy (behavior) of an agent acting in an environment, with deep neural networks as the function approximator (for example in the value function). DRL has achieved outstanding results so far in areas that include beating human-level performance in Atari games [11] and in DeepMind’s now famous Go tournament versus Lee Sedol [12]. This has led to a dramatic increase in the number of applications that use this technique, most notably in video games and robotics.
Much of the successful DRL research to date has only considered single-agent environments. For instance, in Atari games, there is only one player to control. However, recent works have started to up the ante beyond single-agent scenarios and have begun exploring multi-agent scenarios, where the environment is populated with several learning agents at once. The presence of multiple agents causes additional dynamicity in the environment and in the agents themselves, which makes learning more complicated.
Recent works have reported successes in multiagent domains, such as DOTA 2 [14] or Capture the flag [21], in which many agents learn to compete and/or cooperate in the same environment. Despite these promising results, however, there are still many open challenges to be addressed. This article aims to (i) provide a clear overview of current multiagent deep reinforcement learning (MDRL) trends and (ii) share both examples and lessons learned on how certain methods and algorithms from DRL and multiagent learning can be used in complementary ways to solve problems in this emerging area.
Almost 20 years ago, Stone and Veloso's seminal survey used very intuitive and practical examples [1] to lay the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of machine learning:
“AI researchers have earned the right to start examining the implications of multiple autonomous agents interacting in the real world. In fact, they have rendered this examination indispensable. If there is one self-steering car, there will surely be more. And although each may be able to drive individually, if several autonomous vehicles meet on the highway, we must know how their behaviors interact”.
Roughly ten years later, Shoham, Powers, and Grenager [2] noted that the literature on learning in multiagent systems, or multiagent learning (MAL), was on the rise and it was no longer possible to enumerate all relevant articles. In the decade since, the number of published MAL works continues to rise, resulting in a series of different surveys (reviews) that showcase everything from analyzing the basics of MAL and their challenges [3] to addressing specific subareas (e.g., cooperative settings and evolutionary dynamics of MAL) [4-10].
Research interest in MAL has been accompanied by a number of successes; first, in single-agent Atari games [11], and more recently in two-player games [12-13] like Go, poker, and games involving two competing teams.
Deep reinforcement learning [15] plays a key role in these works and has successfully integrated with other AI techniques like (Monte Carlo tree) search, planning, and more recently, multiagent systems. The result is the emerging area of multiagent deep reinforcement learning (MDRL).
Learning in multiagent settings is fundamentally more difficult than the single-agent case due to problems [3-10] like:
• Non-stationarity: If all agents are learning at the same time, the dynamics become more complicated and break many standard RL assumptions.
• Curse of dimensionality: The state–action space grows exponentially when a learning agent keeps track of all agents’ actions.
• Multiagent credit assignment: Defining how agents should deduce their contributions when learning in a team; for example, if agents receive a team reward but it’s one agent doing most of the work, others can become “lazy.” It’s just like real life!
Despite these complexities, top AI conferences such as AAAI, AAMAS, ICLR, IJCAI, and NIPS have all published works reporting MDRL successes. This validation by top-tier venues convinced us it would be valuable to compile an overview of recent MDRL works and understand how they relate to the existing literature.
In this context, we have identified four prominent categories in which to group recent works, as shown in the following figure.
(a) Analysis of emergent behaviors: evaluate DRL algorithms in multiagent scenarios.
(b) Learning communication: agents learn with actions and through messages.
(c) Learning cooperation: agents learn to cooperate using only actions and local observations.
(d) Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive).
The next objective of this article is to provide guidelines by showcasing how methods and algorithms from DRL and multiagent learning can complement each other to solve problems in MDRL. This occurs, for example, when:
• Dealing with non-stationarity
• Dealing with multiagent credit assignment
We also present general lessons learned from these works, such as the use of:
• Experience replay buffers in MDRL – a key component in many DRL works. These containers serve as explicit memory, storing interactions through which agents learn to improve their behaviors.
• Recurrent neural networks (e.g., LSTMs). These networks serve as implicit memory that improves performance, particularly for partially-observable environments.
• Centralized learning with decentralized execution: Agents can be trained with a central controller that has both access to agents’ actions and observations, but during deployment they will operate based solely on observations.
• Parameter sharing: In many tasks, it is useful to share a network’s internal layers, even when there are many outputs.
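The first item above – the experience replay buffer – can be sketched minimally as follows (a generic illustration; MDRL implementations differ in what each agent stores per transition):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer: an explicit memory of interactions."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest interactions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a minibatch of past interactions to learn from."""
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)

# Usage: store transitions while interacting, then sample minibatches to train.
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(state=t, action=t % 4, reward=1.0, next_state=t + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(32)
```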
Towards the end of the article, we reflect on some open questions and challenges:
• On the challenge of sparse and delayed rewards.
Recent MAL competitions and environments (e.g., Pommerman [24], Capture the flag [21], MarLÖ, Starcraft II, and Dota 2) have complex scenarios wherein many actions must be taken before a reward signal becomes available. This is already a challenge for RL [16]; in MDRL it becomes even more problematic since the agents not only need to learn basic behaviors (like in DRL), but also need to learn the strategic element (e.g., competitive/collaborative) embedded within the multiagent setting.
• On the role of self-play.
Self-play (when all the agents use the same learning algorithm) is a MAL cornerstone that achieves impressive results [17-19]. While notable results have also occurred in MDRL, recent works have shown that plain self-play does not yield the best results [20, 21].
• On the challenge of the combinatorial nature of MDRL.
Monte Carlo tree search (MCTS) has been the backbone of major breakthroughs for AlphaGo and AlphaGo Zero, both of which used MCTS along with DRL. For multiagent scenarios, however, there is the additional challenge of the exponential growth of the agents’ joint action space for centralized methods. Given more scalable planners [22, 23], there is room for research in combining MCTS-like planners with DRL techniques in multiagent scenarios.
While there are a number of notable works in DRL and MDRL that represent important milestones for AI, we acknowledge there are also open questions in both single-agent learning and multiagent learning that demonstrate how much more work still needs to be done. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.
Full paper: https://arxiv.org/abs/1810.05587
[1] P. Stone, M. M. Veloso, Multiagent Systems - A Survey from a Machine Learning Perspective., Autonomous Robots 8 (2000) 345–383.
[2] Y. Shoham, R. Powers, T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence 171 (2007) 365–377.
[3] K. Tuyls, G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine 33 (2012) 41–52.
[4] L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 38 (2008) 156–172.
[5] A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning, Springer, 2012, pp. 441–470.
[6] L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, Autonomous Agents and Multi-Agent Systems 11 (2005).
[7] L. Matignon, G. J. Laurent, N. Le Fort-Piat, Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowledge Engineering Review 27 (2012) 1–31.
[8] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary Dynamics of Multi-Agent Learning: A Survey., Journal of Artificial Intelligence Research 53 (2015) 659–697.
[9] P. Hernandez-Leal, M. Kaisers, T. Baarslag, E. Munoz de Cote, A Survey of Learning in Multiagent Environments - Dealing with Non-Stationarity (2017). arXiv:1707.09183.
[10] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A comprehensive survey and open problems, Artificial Intelligence 258 (2018) 66–95.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[13] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356 (2017) 508–513.
[14] Open AI Five, https://blog.openai.com/openai-five, 2018. [Online; accessed 7-September-2018].
[15] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning (2017). arXiv:1708.05866v2.
[16] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, volume 135, MIT press Cambridge, 1998.
[17] J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003) 1039–1069.
[18] M. Bowling, Convergence and no-regret in multiagent learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2004, pp. 209–216.
[19] J. Heinrich, D. Silver, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016). arXiv:1603.01121
[20] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent Complexity via Multi-Agent Competition., in: International Conference on Machine Learning, 2018.
[21] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in first-person multiplayer games with population based deep reinforcement learning (2018).
[22] C. Amato, F. A. Oliehoek, et al., Scalable planning and learning for multiagent POMDPs., in: AAAI, 2015, pp. 1995–2002.
[23] G. Best, O. M. Cliff, T. Patten, R. R. Mettu, R. Fitch, Dec-MCTS: Decentralized planning for multi-robot active perception, The International Journal of Robotics Research (2018).
[24] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, J. Bruna, Pommerman: A Multi-Agent Playground (2018). arXiv:1809.07124.
On October 9, 2018, we threw a party to mark the official launch of our new Borealis AI Montreal research centre and to celebrate our $1M collaboration with the Canadian Institute for Advanced Research (CIFAR) on responsible AI research and initiatives.
Today, the RBC Foundation announced it will donate $1 million over three years to the Canadian Institute for Advanced Research (CIFAR). The gift will support research and initiatives aimed at furthering the study of ethical artificial intelligence (AI) practices.
RBC CEO, Dave McKay, made the announcement, in conjunction with CIFAR president and CEO, Dr. Alan Bernstein, at the official launch of Borealis AI’s Montreal research centre. CIFAR is currently leading the country’s Pan-Canadian Artificial Intelligence Strategy.
Borealis AI will collaborate closely with CIFAR in an advisory capacity on key aspects of the strategy, with a particular focus on global thought leadership around the ethical implications of AI advancements.
The investment will help fund initiatives like CIFAR’s Catalyst Grants, which award up to $100,000 per year for two years to support collaborations in novel areas of AI exploration between researchers at any Canadian institution. Two of these awards will be explicitly focused on research in areas like privacy, accountability, transparency and bias in machine learning.
The money will also go toward delivering interdisciplinary AI research workshops in fields such as transportation, environmental science, public health, and energy, while the remainder will support ongoing training opportunities in the social implications of AI, including equity, diversity and inclusion.
Ethical AI is no longer a peripheral topic within the community. As scientific successes in fields like deep learning, computer vision and natural language processing continue to grow, the broad-ranging social impact of these technologies cannot be divorced from their applications and needs to be researched with the same academic rigour.
RBC and Borealis AI share this sense of responsibility with the diverse research communities dedicated to the proliferation of responsible technologies. With CIFAR as a partner, we are confident we’ll exceed these goals.
A blank canvas, a stunning city, a top-notch team, and an organization on the move. Put it all together and you get Borealis AI’s striking new Montreal research centre, which officially opened its tunnel, er, doors this week.
RBC CEO Dave McKay kicked off the official launch this morning, reinforcing the bank’s commitment to supporting the AI ecosystem through collaboration with Canada’s leading research institutions. He was joined onstage by Nadine Renaud-Tinker, president of RBC in Quebec, and Borealis AI co-founder and head, Foteini Agrafioti, who shared their excitement about expanding the organization’s network of research centres into a city that has shown such dynamic leadership in the field.
Dr. Alan Bernstein, president and CEO of the Canadian Institute for Advanced Research (CIFAR), followed to announce the RBC Foundation’s $1-million investment in the organization’s Pan-Canadian Artificial Intelligence Strategy.
Our new 40-person research centre is located at O Mile-Ex, a former textile factory that is becoming the de facto AI industrial research hub of Montreal. We share easy access to some of the neighbourhood’s best coffee with our new neighbours Element AI, MILA, Thales, and IVADO.
With our Toronto research centre scooping up design awards and getting full-length features in Toronto Life, our Montreal design team had a tough act to follow.
Thankfully, at 6,500 square feet and with no building restrictions, there was plenty of space for us to play. That’s why one of the first features we added was a hockey motif in one of our conference rooms. It’s a nod to the city’s deep hockey heritage and a show of good sportsmanship from the Toronto lab for acknowledging the Canadiens.
A mini soccer pitch sits smack in the middle of the hall for team members to let off a little steam.
And a cinema meeting room pays homage to Quebec’s thriving film industry and specifically to the iconic Snowdon Theatre during its glory days.
The primary theme, however, was the Montreal Metro. Visitors walk through a tiled tunnel throughway to reach the front hall.
Hang a left and you arrive at a meeting room with swinging egg chairs. On the right, a chill out room with grass and some cozy bean bag chairs.
Groceries get delivered each Monday to the kitchen with its intricate tiled mosaic floor. This is yet another nod to Montreal – this time to its design community and the city’s overall exquisite attention to detail.
Our living room is like being in a park, albeit a park with a large garage door. The bright, open, window-filled space offers a panorama view for lectures, presentations, and meetings. These are all overseen by a gigantic balloon guard dog who monitors the proceedings.
Exceptional research deserves an exceptional space. And if you think this is great, just wait until you see what we have in store for Waterloo, Vancouver, and Edmonton.
We took Prof. Taylor for coffee near his old east-end Toronto stomping grounds to hear his thoughts on how deep learning has evolved, whether any pre-2012 techniques still yield promising results, where Software 2.0 fits into the future and how much of a deep-domain expert you have to be in order to give truly valuable advice to businesses.
While it is tempting to now adopt a “one-size-fits-all” mantra, this would be a prematurely limiting methodology. Recently, a whole spate of machine learning models has emerged that involve more than a single objective; rather, they involve multiple objectives that all interact with each other during training. The most prominent model, of course, is Generative Adversarial Networks (GANs). However, other examples include synthetic gradients, proximal-gradient TD learning, and intrinsic curiosity. The appropriate way to think about these types of problems is to interpret them as game-theoretic problems, where one aims to find a Nash equilibrium rather than the local minima of each objective. Intuitively speaking, a Nash equilibrium occurs when each player knows the strategies of all the other players, and no player has anything to gain by changing his or her own strategy.
Unfortunately, finding a Nash equilibrium in games is notoriously difficult. In fact, theoretical computer scientists have long known that finding Nash equilibria for general games is an intractable problem. Nor is it ideal to naively apply gradient descent to games. First, gradient descent has no convergence guarantees here and, even in cases where it does converge, it may be highly unstable and slow. But the most severe drawback is that, unlike the traditional setup in supervised machine learning, there is no single objective involved, which means we have no way of measuring any kind of progress.
We can illustrate the complexity of interacting losses with a very simple two-player game. Player 1 controls the variable x and Player 2 controls the variable y, with the respective loss functions
\[
\ell_{1}(x, y) = xy, \qquad \ell_{2}(x, y) = -xy.
\]
The dynamics (or simultaneous gradient) is given by
\[
\xi = \big( \nabla_{x}\, \ell_{1},\; \nabla_{y}\, \ell_{2} \big) = (y, -x).
\]
If we plot this in the 2D Cartesian plane, the vector field cycles around the origin and no direction points straight at it. The technical problem here is that the dynamics ξ are not a gradient vector field; in other words, there is no function φ such that ▽φ = ξ.
In the ICML paper, The Mechanics of n-Player Differentiable Games [1], the authors use insights from Hamiltonian mechanics to tackle the problem of finding game equilibria. Hamiltonian mechanics is a reformulation of classical mechanics in the following way. Consider a particle moving in the Euclidean space R^{n}. The state of the system at a given time t is determined by the coordinates of the position (q_{1},…,q_{n}) and the coordinates of the momentum (p_{1},…,p_{n}). The space R^{2n} of positions and momenta is called the “phase space”. The “Hamiltonian” H(q,p) is a function on this phase space and it represents the total energy of the system. Hamilton’s equations (also referred to as “equations of motion”) describe the time evolution of the state of the system. These are given by
\[
\frac{dq_{i}}{dt} = \frac{\partial H}{\partial p_{i}}, \qquad \frac{dp_{i}}{dt} = -\frac{\partial H}{\partial q_{i}}.
\]
We can see how these formulations play out in our simple example. If we define the Hamiltonian here to be
\[
H(x, y) = \frac{1}{2}\left(x^{2} + y^{2}\right),
\]
the gradient is then ▽H = (x, y). There are two critical observations to be made here: (1) conservation of energy – the level sets of H are conserved by the dynamics ξ = (y, -x) (hence, ξ cycles around the origin); and (2) gradient descent on the Hamiltonian, rather than the simultaneous gradient on the losses, converges to the origin.
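Both observations are easy to check numerically. The sketch below steps along the negative simultaneous gradient ξ = (y, -x) and, separately, does gradient descent on H; the step size and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Simultaneous gradient descent on the two losses follows xi = (y, -x); it
# cycles around the origin (discretization even pushes it slowly outward).
# Gradient descent on the Hamiltonian H = (x^2 + y^2)/2 follows grad H = (x, y)
# and converges to the origin.
lr = 0.01
v_dyn = np.array([1.0, 1.0])  # updated with the simultaneous gradient
v_ham = np.array([1.0, 1.0])  # updated with grad H

for _ in range(1000):
    x, y = v_dyn
    v_dyn = v_dyn - lr * np.array([y, -x])  # xi = (y, -x)
    v_ham = v_ham - lr * v_ham              # grad H = (x, y)
```

After 1,000 steps `v_ham` sits within about 1e-4 of the origin, while `v_dyn` is still at least its initial distance √2 away.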
Motivated by this philosophy, the authors in [1] introduce the notion of Hamiltonian games. For an n-player game with parameters w, they define the Hessian of the game to be the Jacobian of the simultaneous gradient:
\[
H(w) = \nabla_{w}\, \xi(w).
\]
Since this is a square matrix, it always admits a decomposition into symmetric and anti-symmetric components:
\[
H(w) = S(w) + A(w), \qquad S(w) = \tfrac{1}{2}\big(H(w) + H(w)^{T}\big), \quad A(w) = \tfrac{1}{2}\big(H(w) - H(w)^{T}\big).
\]
This leads to a classification: Hamiltonian games are defined as games where the symmetric component is zero, S(w) = 0. Potential games are defined as games where the anti-symmetric component is zero, A(w) = 0. Going back to our simple example, the Hessian is
\[
H(w) = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix},
\]
which is anti-symmetric, so we have a Hamiltonian game. One of the main theoretical contributions of this paper is that, given an n-player Hamiltonian game with $H(w) = \frac{1}{2}\|\xi(w)\|^{2}$, gradient descent on H converges, under some conditions, to a Nash equilibrium.
Another central contribution made by the authors is the proposal of a new algorithm to find stable fixed points (which under some conditions can be considered Nash equilibria). Their Symplectic Gradient Adjustment (SGA) adjusts the game dynamics by
\[
\xi \;\longmapsto\; \xi + \lambda\, A^{T} \xi,
\]
where λ is a hyperparameter.
For a potential game where A(w) = 0, SGA performs the usual gradient descent, finding a local minimum. In contrast, for a Hamiltonian game where S(w) = 0, SGA finds a local Nash equilibrium. Readers fluent in differential geometry can immediately see the reasoning behind the terminology “symplectic”. For a Hamiltonian game, H(w) is a Hamiltonian function and the gradient of the Hamiltonian is ▽H = A^{T}ξ. The dynamics ξ form a Hamiltonian vector field, since ξ conserves the level-sets of the Hamiltonian H. In symplectic geometry, the relationship between the symplectic form ω and the Hamiltonian vector field ξ is
\[
\omega(\xi, \cdot\,) = dH.
\]
The right-hand side of this equation is simply the gradient of our Hamiltonian function. In the context of Hamiltonian games, we see that the antisymmetric matrix A is playing the role of the symplectic form ω, which justifies the terminology.
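On the toy game, SGA is a one-liner: with A = [[0, 1], [-1, 0]], the adjustment Aᵀξ equals ▽H = (x, y), so the adjusted dynamics gain a component pointing at the equilibrium at the origin. Below is a minimal sketch; λ and the step size are arbitrary illustrative choices.

```python
import numpy as np

def sga_step(v, lr=0.01, lam=1.0):
    """One symplectic-gradient-adjustment step on the toy two-player game."""
    x, y = v
    xi = np.array([y, -x])                    # simultaneous gradient
    hess = np.array([[0.0, 1.0], [-1.0, 0.0]])  # game Hessian (Jacobian of xi)
    A = 0.5 * (hess - hess.T)                 # anti-symmetric part (= hess here)
    return v - lr * (xi + lam * A.T @ xi)     # adjusted dynamics xi + lam*A^T xi

v = np.array([1.0, 1.0])
for _ in range(1000):
    v = sga_step(v)
```

Unlike the plain dynamics, which cycle, the SGA iterates spiral into the origin.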
In the experimental section, the authors compared their SGA method to other recently proposed algorithms for finding stable fixed points in GANs. Moreover, to demonstrate the flexibility of their algorithm, the authors also studied the performance of SGA on general two-player and four-player games. In all cases, SGA was competitive with, if not better than, existing algorithms.
In summary, this paper provides a glimpse of how a specific class of general game problems may be tackled by borrowing tools from mathematics and physics. Machine learning models featuring multiple interacting losses are becoming increasingly popular, so it is necessary for us to come up with new methodologies rather than relying on the crutches of standard gradient descent. Unraveling the mysteries behind these complicated models will have considerable practical impact, as it will aid the design of better, more scalable algorithms in the future.
[1] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.