Authors: Yu-An Chung · Wei-Hung Weng · Schrasing Tong · James Glass
Last year, I was surprised by a paper that introduced a technique to perform word translation between two languages without parallel corpora. To clarify, corpora are non-parallel when two sets of text exist in two languages (e.g., a set of English words and a set of French words), but there is no information about which English word corresponds to which French translation.
Previously, the state-of-the-art method for learning cross-lingual word embeddings mainly relied on bilingual dictionaries, along with some help from character-level information for languages that shared a common alphabet. None of this was competitive with supervised machine translation techniques. The authors of the paper managed to pose a different question: Was it possible to do unsupervised word translation? They answered their own question by introducing a new technique that worked quite well.
Their model worked by obtaining the word embeddings space for both languages, independently, and introducing a technique for unsupervised alignment between the two embedding spaces that can achieve translations without parallel corpora. The intuition behind this technique is to rotate one embedding space to the point that the two embedding spaces are virtually indistinguishable to a classifier (i.e., adversarial training).
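The rotation idea can be sketched numerically. Below is a toy numpy sketch with made-up data, not the paper's adversarial pipeline: assuming candidate word pairs are already known, the best orthogonal map between the two spaces has a closed-form "Procrustes" solution (in the actual method, an adversarial game first aligns the spaces, and a Procrustes-style refinement is applied afterwards).

```python
import numpy as np

# Toy sketch of the alignment idea (hypothetical data): learn an orthogonal
# map W that rotates the source embedding space onto the target space.
# Given candidate pairs (X_i, Y_i), the orthogonal Procrustes problem has
# the closed-form solution W = U V^T, where U S V^T = svd(X^T Y).
rng = np.random.default_rng(0)
n, d = 500, 32
X = rng.standard_normal((n, d))                    # "English" embeddings
R = np.linalg.qr(rng.standard_normal((d, d)))[0]   # unknown true rotation
Y = X @ R                                          # "French" embeddings

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                                         # closed-form solution

# W is orthogonal and maps the source space onto the target space
assert np.allclose(W.T @ W, np.eye(d), atol=1e-8)
assert np.allclose(X @ W, Y, atol=1e-8)
```

At test time, a query embedding is mapped through $W$ and its nearest neighbors in the target space are retrieved as translations.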
This year, a new set of authors presented their work regarding the task of automatic speech recognition without parallel data. This, again, means two independent sets of speech data and text data exist, but the correspondence information between them is unclear. This work stood out since it is the first successful attempt to apply the unsupervised alignment technique introduced last year on multiple modalities of data. The task involved taking a dataset of words from one language and a dataset of spoken words from either the same or a different language, and automatically identifying spoken words without parallel information.
The authors first trained an embedding space for written words and another embedding space for spoken words. They then applied the unsupervised alignment technique to the embedding spaces to align them so that spoken words could automatically be classified and translated. At test time, a speech segment is first mapped into its respective embedding space, aligned to the text embedding space, then the nearest neighbors of the text embedding are picked as the translation. The same procedure can be used for the text-to-speech conversion task.
The authors present some experiments on the spoken Wikipedia and LibriSpeech datasets that show unsupervised alignments are still not as good as supervised ones – but they’re close. Some challenges still remain to be solved before unsupervised cross-modal alignments can be competitive with supervised ones; however, this work shows the promise of improving automatic speech recognition (ASR), text-to-speech (TTS), and even translation systems, especially for languages with little parallel data. (/HS)
Authors: Rad Niazadeh · Tim Roughgarden · Joshua Wang
This paper was accepted as an oral presentation. The authors gave an approximation algorithm for maximizing continuous non-monotone submodular functions.
To give a brief recap, submodular functions arise in several important areas of machine learning and, in particular, around the intersection of economics and learning. They can be used to model the problem of maximizing multi-platform ad revenue, where a buyer wants to maximize their profit (revenue minus cost) by advertising on different platforms and there is a diminishing return to advertising on more platforms. This diminishing return is precisely the property captured by submodular functions. Mathematically, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is submodular if $f(S \cup \{e\}) - f(S) \geq f(T \cup \{e\}) - f(T)$ for every $S \subseteq T$ and $e \notin T$. In this setting, there is an information-theoretic lower bound of $1/2$-approximation [Feige et al.'11] and there is an optimal algorithm which matches this bound [Buchbinder et al.'15].
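As a concrete sanity check of this definition, the toy coverage function below (made-up platforms and users; coverage functions are a standard example of submodular functions) is verified by brute force to satisfy the diminishing-returns inequality:

```python
from itertools import chain, combinations

# Toy coverage function f(S) = number of users reached by the platforms
# in S (hypothetical data; a standard submodular set function).
coverage = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}}

def f(S):
    return len(set().union(*(coverage[i] for i in S))) if S else 0

ground = set(coverage)

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Verify f(S ∪ {e}) − f(S) ≥ f(T ∪ {e}) − f(T) for all S ⊆ T, e ∉ T.
is_submodular = all(
    f(set(S) | {e}) - f(set(S)) >= f(set(T) | {e}) - f(set(T))
    for T in subsets(ground)
    for S in subsets(T)
    for e in ground - set(T)
)
assert is_submodular
```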
This paper considered the continuous submodular function where, instead of maximizing on the vertices of the hypercube $\{0,1\}^n$, we want to maximize over the full hypercube $[0,1]^n$. The main result of the paper is that they obtained a randomized algorithm for maximizing a continuous submodular and $L$-Lipschitz function over the hypercube that guarantees a $1/2$-approximation. Note that this is currently the best possible ratio that is information-theoretically achievable.
The reason this paper stood out is that the authors used the double greedy framework of Buchbinder et al.'15 to solve a coordinate-wise zero-sum game, and then used the geometry of this game to bound the value at its equilibrium. This is a nice application of game theory to maximizing the value of the function. The authors also conducted experiments on 100-dimensional synthetic data and achieved results comparable to the previous work they referenced. One thing we hoped to see was the better approximation ratio and faster algorithm translating into a significant advantage in the experiments, but that was not the case.
In terms of open problems, I am really excited to see the development of parallel and online algorithms for continuous submodular optimization. In fact, there is recent work on parallel algorithms by Chen et al.'18 which achieves a tight $(1/2 - \epsilon)$-approximation guarantee using $\tilde{O}(\epsilon^{-1})$ adaptive rounds. (/KJ)
Authors: Jiantao Jiao · Weihao Gao · Yanjun Han
This paper focused on estimating the differential entropy of a continuous distribution $f$ given $n$ i.i.d. samples. Entropy has been a core concept of information theoretic measures and has engendered numerous important applications, such as goodness-of-fit tests, feature selection, and tests of independence. In the vast body of literature around this concept, most of the measures have appeared to take on an asymptotic flavor – that is, until several recent works.
This paper is one of those works. The authors focused on the fixed-$k$ nearest neighbor (fixed-kNN) estimator, also called the Kozachenko-Leonenko estimator. This estimator is simple; there is only one parameter to tune, and it requires no knowledge of the smoothness degree $s$ of the target distribution $f$. Moreover, it is computationally efficient, since $k$ is fixed (compared to other methods with similar finite-sample bounds), and statistically efficient: as shown in this paper, it has a finite-sample bound that is close to optimal. All of these properties make the estimator realistic and attractive in practice.
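To make the estimator concrete, here is a minimal 1-D numpy sketch (the sample size, $k$, and the Gaussian test distribution are illustrative choices, not from the paper): the estimate combines digamma terms with the log-distances from each sample to its $k$-th nearest neighbor.

```python
import numpy as np

# Fixed-k Kozachenko-Leonenko estimator in 1-D (illustrative parameters):
#   H_hat = psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log(eps_i),
# where eps_i is the distance from x_i to its k-th nearest neighbour and
# V_d is the volume of the d-dimensional unit ball (V_1 = 2).
EULER_GAMMA = 0.5772156649015329

def psi(m):  # digamma at a positive integer, via harmonic numbers
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

rng = np.random.default_rng(0)
n, k = 2000, 3
x = rng.standard_normal(n)                # samples from N(0, 1)

dist = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(dist, np.inf)
eps = np.sort(dist, axis=1)[:, k - 1]     # k-th nearest-neighbour distances

h_est = psi(n) - psi(k) + np.log(2.0) + np.mean(np.log(eps))
h_true = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1)
assert abs(h_est - h_true) < 0.25
```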
I found the paper also carried some interesting technical results. One direct approach to estimating the differential entropy is to plug a consistent estimator of the density function $f$ (for example, one based on kNN distance statistics) into the formula for entropy. However, such estimators usually come with an impractical computational demand. For instance, in the kNN-based estimator, $k$ has to approach $\infty$ as the number of samples $n$ approaches $\infty$.
In a recent paper by Han et al. [2017], the authors constructed a complicated estimator that achieves a finite-sample bound at the rate $(n\log n)^{-\frac{s}{s+d}} + n^{-\frac12}$ (the optimal rate). One caveat, though, is that it requires knowledge of the smoothness degree $s$ of the target distribution $f$. The last challenging part is to deal with the region where $f$ is small. A major difficulty in achieving such bounds for the entropy estimator is that the nearest neighbor estimator exhibits a huge bias in low-density areas. Most papers tend to make assumptions about the properties of $f$ such that this bias is well controlled. However, this paper did not presume similar assumptions. Given all these constraints, including fixed $k$, no knowledge of $s$, and no assumptions on how $f$ is bounded from below, the authors managed to prove a nearly optimal finite-sample bound for a simple estimator. According to the authors, the new technical tools here are the Besicovitch covering lemma and a generalized Hardy-Littlewood maximal inequality. This part is not yet clear to me.
Lastly, the authors also pointed out several weaknesses in their paper and their plans for future work. For example, they conjectured that both the upper bound and the lower bound in the paper could be further improved. They also hypothesized a way to extend the constraint on $s$ in the theorem so that the result can be applied to a more general setting. (/RH)
Authors: Kevin Scaman · Francis Bach · Sebastien Bubeck · Laurent Massoulié · Yin Tat Lee
This paper considered distributed optimization of non-smooth convex functions using a network of computing units. The objective of this work was to study the impact of the communication network on learning, and the tradeoff between the structure of the network and algorithmic efficiency. The network consists of a connected simple graph of nodes, each having access to a function (such as a loss function). The optimization problem is to minimize the average of the local functions; communication between nodes takes a given length of time and computation takes one unit of time. In the decentralized scenario, local communication is performed through gossip.
The authors give bounds on the time to reach a given precision, then provide an optimal algorithm that uses a primal-dual reformulation. They are able to show that the error due to limits in communication resources will then decrease at a fast rate. In the centralized setting, the authors provide an algorithm which achieves convergence rates within $d^{1/4}$ to the optimal, where d is the underlying dimension.
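The gossip primitive itself is easy to illustrate. Below is a toy numpy sketch (the ring graph and parameters are made up, and this is only the communication building block, not the paper's primal-dual algorithm): each node repeatedly averages its local value with its neighbors, and all values converge to the global mean.

```python
import numpy as np

# Gossip averaging on a ring of n nodes: each node keeps a local value
# and repeatedly averages with its two neighbours via a doubly stochastic
# gossip matrix W. All values converge to the global average.
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
x = rng.standard_normal(n)      # each node's local value
target = x.mean()               # consensus value

for _ in range(300):            # gossip rounds
    x = W @ x

assert np.allclose(x, target, atol=1e-6)
```

The spectral gap of $W$ (which depends on the graph) controls how fast consensus is reached, which is exactly the kind of network-dependence the paper's bounds quantify.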
I found this paper intriguing because it considers the impact of communication and computation resources in learning, which will be increasingly important as systems we learn on become larger. It received one of the best paper awards and is one of few papers that consider such impacts. There’s an argument to be made that these two things are related; as learning systems scale up and get distributed through IoT and mobile devices, the importance of distributed learning in a setting where there is tension between communication and computation has also increased. The elegant analytical tools used in this paper – gossip methods, primal-dual formulation, Chambolle-Pock algorithm for saddle-point optimization, the combined use of optimization and graph theory, and the bounds that give insight into which resources are important at which stage of convergence – show that the award places well-deserved attention toward a growing area. (/NH)
Invited talk: Jon Kleinberg - Fairness, Simplicity, and Ranking
In Jon Kleinberg's fascinating invited talk, he addressed the effect of implicit bias on producing adverse outcomes. The specific application he referred to is bias in activities such as hiring, promotion, and admissions. The setting is as follows: a recruitment committee is tasked with selecting a shortlist of final candidates from a given pool of applicants, but their estimates of skill, used in the selection, may be skewed by implicit bias.
The Rooney Rule is an NFL policy in effect since 2003 that requires teams to interview ethnic-minority candidates for coaching and operations positions. (Note: There is no quota or preference given in the actual hiring). Kleinberg and his co-authors showed that measures such as the Rooney Rule lead to higher payoffs for the organization. Their model is as follows: a recruiting committee must select a list of k candidates for final interviews; the set of applicants is then divided into two groups, X and Y, X being the minority group; there are $n$ Y applicants and $n\alpha$ X applicants with $\alpha \le 1$. Each candidate has a numerical value representing their skill, and there is a common distribution from which these skills are drawn.
Based on empirical studies of skills in creative and skilled workforces, the authors then modeled this distribution as a Pareto distribution (power law). The utility that the recruiting committee aims to maximize is the sum of the candidates’ skills that were selected to the list. The authors modeled the bias as a multiplicative bias in the estimation of the skills of X-candidates. So, Y candidates are estimated at their true value and an X-candidate skill is estimated to be $X_i/\beta$ for candidate i where $\beta >1$. The authors then analyzed the utility of a list of $k$ candidates where at least one must be an X candidate. Their analysis showed an increase in utility even when the list was of size 2, and for a large range of values for the bias, power law, and population parameters.
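The model lends itself to a quick Monte-Carlo check. The sketch below uses illustrative parameters (not the talk's): Pareto-distributed skills, a multiplicative bias $\beta$ on X-candidates' estimated skills, and a comparison of the true utility of the biased top-$k$ shortlist against one that must contain at least one X-candidate.

```python
import numpy as np

# Monte-Carlo sketch of the implicit-bias model (all parameters are
# illustrative). Skills are Pareto-distributed; X-candidates' skills are
# estimated as skill / beta. We compare the true utility of the biased
# top-k shortlist with a "Rooney" shortlist containing at least one X.
rng = np.random.default_rng(0)
n_y, n_x, k, beta, shape = 50, 25, 4, 3.0, 3.0
trials = 3000
gain = 0.0
for _ in range(trials):
    y = rng.pareto(shape, n_y) + 1.0     # true skills of Y candidates
    x = rng.pareto(shape, n_x) + 1.0     # true skills of X candidates
    skills = np.concatenate([y, x])
    est = np.concatenate([y, x / beta])  # biased skill estimates
    top_k = np.argsort(est)[::-1][:k]    # biased shortlist
    util_biased = skills[top_k].sum()
    if (top_k >= n_y).any():             # already contains an X candidate
        util_rooney = util_biased
    else:                                # swap k-th pick for the best X
        best_x = n_y + np.argmax(est[n_y:])
        util_rooney = skills[top_k[:-1]].sum() + skills[best_x]
    gain += util_rooney - util_biased

# On average, the constrained shortlist has higher true utility.
assert gain / trials > 0
```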
I found this to be another very interesting and important paper because it tackles the question of fairness at a very practical level and provides a tangible algorithmic framework with which to expose and then analyze the outcomes. Furthermore, the modelling assumptions are very realistic, and the results demonstrate the potential for significant impact. The particular scenario considered here may be about activities such as hiring and admissions, but the result has consequences for machine learning models. (/NH)
In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find improved scores across the board on all R@K-based metrics, and this technique is also significantly more sample efficient, leading to faster convergence in fewer iterations.
In this work, we present a simple yet surprisingly effective way to prevent catastrophic forgetting. Our method, called Few-Shot Self Reminder (FSR), regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from the previous tasks.
Surprisingly, this simplistic approach only requires retraining a small amount of data in order to outperform previous knowledge retention methods. We demonstrate the superiority of our method to previous ones on popular benchmarks, as well as a new continual learning problem where tasks are designed to be more dissimilar.
We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural nets trained on MNIST and CIFAR10 variants achieve comparable clean accuracies, but they exhibit very different robustness when adversarially trained. This counter-intuitive phenomenon suggests that input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss practical implications on evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
Authors: Kry Lui, Gavin Weiguang Ding, Ruitong Huang, Robert McCann
Poster: December 5, 10:45 am – 12:45 pm @ Room 210 & 230 AB #103
Dimensionality reduction occurs frequently in machine learning. It is widely believed that reducing more dimensions will often result in a greater loss of information. However, the phenomenon remains a conceptual mystery in theory. In this work, we try to rigorously quantify such phenomena in an information retrieval setting by using geometric techniques. To the best of our knowledge, these are the first provable information loss rates due to dimensionality reduction.
Authors: Christopher Blake, Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, Marcus Brubaker
Workshop: Compact Deep Neural Network Representation (Spotlight Paper); December 7, 2:50 pm
When seeking energy-efficient neural networks, we argue that wire-length is an important metric to consider. Based on this theory, new techniques are developed and tested to train neural networks that are both accurate and wire-length-efficient. This contrasts to previous techniques that minimize the number of weights in the network, suggesting these techniques may be useful for creating specialized neural network circuits that consume less energy.
Authors: Junfeng Wen, Yanshuai Cao, Ruitong Huang
Workshop: Continual Learning; December 7
We present a simple, yet surprisingly effective, way to prevent catastrophic forgetting. Our method regularizes the neural net from changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from previous tasks. As little as one data point per class is found to be effective. With similar storage, our algorithm outperforms previous state-of-the-art methods.
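The logit-matching idea can be sketched in a few lines (shapes and values below are hypothetical; the full training loop adds this penalty, weighted by a coefficient, to the new task's loss).

```python
import numpy as np

# Sketch of the logit-matching regularizer: while training on a new task,
# penalize deviation of the network's logits on a few stored memory
# examples from the logits recorded before the task switch.
def logit_matching_penalty(new_logits, stored_logits):
    """Mean squared logit-matching penalty over the episodic memory."""
    return np.mean((new_logits - stored_logits) ** 2)

stored = np.array([[2.0, -1.0], [0.5, 0.3]])  # logits saved from old task
unchanged = logit_matching_penalty(stored, stored)
drifted = logit_matching_penalty(stored + 1.0, stored)
assert unchanged == 0.0 and drifted > 0.0
```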
Authors: *A.J. Bose, *Huan Ling, Yanshuai Cao
Workshop: ViGIL; December 7, 8 am – 6:30 pm
We present a new technique for hard negative mining for learning visual-semantic embeddings. The technique uses an adversary that is learned in a min-max game with the cross-modal embedding model. The adversary exploits compositionality of images and texts and is able to compose harder negatives through a novel combination of objects and regions across different images for a given caption. We show new state-of-the-art results on MS-COCO.
Authors: Gavin Weiguang Ding, Yik Chau (Kry) Lui, Xiaomeng Jin, Luyu Wang, Ruitong Huang
Workshop: Security in Machine Learning; December 7, 8:45 am – 5:30 pm
We demonstrate an intriguing phenomenon about adversarial training – that adversarial robustness, unlike clean accuracy, is highly sensitive to the input data distribution. In theory, we show this by analyzing the Bayes classifier’s robustness. In experiments, we further show that transformed variants of MNIST and CIFAR10 achieve comparable clean accuracies under standard training but significantly different robust accuracies under adversarial training.
Authors: Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor
Workshop: Latinx in AI Coalition; December 2, 8 am – 6:30 pm
Our goal is to tackle partially observable multiagent scenarios by proposing a framework based on learning robust best responses (i.e., skills) and Bayesian inference for opponent detection. In order to reduce long training periods, we propose to intelligently reuse policies (skills) by quickly identifying the opponent we are playing with.
Authors: Yash Sharma, Gavin Weiguang Ding
Workshop: NeurIPS 2018 Competition Track Day 1; 8 am – 6:30 pm
This challenge pitted submitted adversarial attacks against submitted defenses. The challenge was unique in that it allowed only a limited number of queries that output just the decision of the defense, rewarded minimizing the $L_2$ distortion instead of using an $L_\infty$ distortion constraint, and used TinyImageNet instead of the ImageNet dataset, making it tractable for competitors to train their own models. Our attack solution placed top-10 overall in the challenge, in particular placing 5th in the targeted attack track – a more difficult setting. We based our solution on performing a binary search to find the minimal successful distortion, then optimizing the procedure while still performing the necessary number of iterations to meet the computational constraints.
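The binary-search component can be sketched as follows, with a toy decision-only oracle standing in for a submitted defense (the oracle, direction, and parameters are hypothetical, not the competition code): given a direction that flips the oracle's decision, binary-search the smallest successful scale along it.

```python
import numpy as np

# Binary search for the minimal successful distortion along a direction,
# using only a decision-level oracle (toy stand-in for a defense).
def is_adversarial(x):              # hypothetical decision-only oracle
    return x.sum() > 5.0

def min_distortion(x, direction, hi=100.0, iters=50):
    """Smallest scale t such that x + t * direction flips the oracle."""
    assert not is_adversarial(x) and is_adversarial(x + hi * direction)
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_adversarial(x + mid * direction):
            hi = mid                # keep shrinking the successful scale
        else:
            lo = mid
    return hi

x = np.zeros(4)
direction = np.ones(4) / 2.0        # x + t*direction sums to 2t
t = min_distortion(x, direction)
assert abs(t - 2.5) < 1e-6          # oracle flips when sum > 5, i.e. t = 2.5
```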
Northern Frontier sat down with Abhishek Gupta, AI ethics researcher at McGill University and founder of the Montreal AI Ethics Institute, to dive into some of the key themes of the day, including the threat automation poses to jobs given the current science, whether bias is the biggest problem we face in responsible AI, the dangers of 'mathwashing', and what we should consider reasonable trade-offs for improving fairness.
Dimensionality reduction occurs very naturally and very frequently within many machine learning applications. While the phenomenon remains, for the most part, a conceptual mystery, one thing many researchers believe is that reducing more dimensions will often result in a greater loss of information. What’s even harder to confirm is the rate at which this information loss occurs, as well as how to formulate the problem (even for very simple data distributions and nonlinear reduction mappings).
In this work, we try to rigorously quantify such empirical observations from an information retrieval perspective by using geometric techniques. We begin by formulating the problem through an adaptation of two fundamental information retrieval measures – precision and recall – to the (continuous) function analytic setting. This shift in perspective allows us to borrow tools from quantitative topology in order to establish the first provable information loss rate induced by dimension-reduction.
We were surprised to discover that when we began reducing dimensions, the precision would decay exponentially. This discovery should raise red flags for practitioners and experimentalists attempting to interpret their dimension reduction maps. For example, it may not be possible to design information retrieval systems that enjoy high precision and recall at the same time. This realization should keep us mindful of the limitations of even the very best dimension reduction algorithms, such as t-SNE.
While precision and recall are natural information retrieval measures, they do not directly take advantage of the distance information between data (e.g. in data visualization). We therefore propose an alternative dimension-reduction measure based on Wasserstein distances, which also provably captures the dimension reduction effect. To obtain this theoretical guarantee, we solve the following iterated-optimization problem:
\[
\inf_{W:\,\text{Vol}_n(W) = M} W_{2}(\mathbb{P}_{B_r}, \mathbb{P}_{W})
=
\inf_{W:\,\text{Vol}_n(W) = M} \inf_{\gamma \in \Gamma (\mathbb{P}_{B_r} , \mathbb{P}_{W})} \mathbb{E}_{(a, b) \sim \gamma} [ \| a - b \|^{2}_{2} ]^{1/2} ,
\]
by using recent results from optimal partial transport literature.
While precision and recall are familiar concepts from supervised learning, let’s do a quick review before we adapt them to the dimensionality reduction context.
In a supervised learning setting, say the classification of cats vs. dogs, first select your favorite neural net classifier $f_W$, then collect 1,000 test images.
\[
\text{Precision} = \frac{1}{2} \frac{\text{How many are cats}}{\text{Among the ones predicted as cats}} + \frac{1}{2} \frac{\text{How many are dogs}}{\text{Among the ones predicted as dogs}}
\]
\[
\text{Recall} = \frac{1}{2} \frac{\text{How many are predicted as cats}}{\text{Among the cats}} + \frac{1}{2} \frac{\text{How many are predicted as dogs}}{\text{Among the dogs}}
\]
Generally speaking, we can average precision and recall over n classes.
The formulation we used for precision and recall was inspired by the supervised learning setting. Since dimensionality reduction can actually happen in an unsupervised setting, we needed to change a few things around. In a typical dimensionality reduction map $f: X \rightarrow Y$, we often care about preserving the local structure post-reduction.
Since the unsupervised setting means there are no more labels, the first thing we did was to replace “label for an input x” by “neighboring points for an input x in high dimension.” We didn’t have the predictions either, but we felt it made sense to replace “prediction for an input x” by “neighboring points for an input y = f(x) in low dimension.”
When computing precision and recall in the supervised cases, we averaged across the labels. So, the second thing we did was to average over each data point.
\[
\text{Precision} = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~x~\text{in high dimension}}{\text{Among the low-dimensional neighbors of}~y = f(x)}
\]
\[
\text{Recall} = \frac{1}{n} \sum_{x \in X_n} \frac{\text{How many are neighbors of}~y = f(x)~\text{in low dimension}}{\text{Among the high-dimensional neighbors of}~x}
\]
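These per-point averages are easy to compute on toy data. The sketch below (hypothetical data and DR map: a linear projection from 3-D to 1-D) fixes the number of neighbors $k$, in which case precision and recall coincide, so a single neighborhood-overlap score is reported.

```python
import numpy as np

# Neighbour-based precision/recall for a toy DR map (hypothetical data:
# a linear projection of 3-D Gaussian points onto their first coordinate).
# With a fixed number of neighbours k, precision and recall coincide.
rng = np.random.default_rng(0)
n, k = 200, 10
X = rng.standard_normal((n, 3))        # high-dimensional data
Y = X[:, :1]                           # DR map: keep the first coordinate

def knn_sets(Z, k):
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

hi_nbrs = knn_sets(X, k)
lo_nbrs = knn_sets(Y, k)
precision = np.mean([len(hi_nbrs[i] & lo_nbrs[i]) / k for i in range(n)])

# Projecting away two of the three dimensions collapses distant points
# together, so the neighbourhood overlap is far from perfect.
assert 0.0 < precision < 0.9
```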
But it was still hard to prove anything, even with the settings detailed above. One difficulty we faced is that $f$ is a map between continuous spaces, but the data points are finite samples. This motivated us to look for continuous analogues of these finite-sample notions.
Finally, we arrived at one of the paper’s key observations: Precision is roughly injectivity; recall is roughly continuity.
Let’s build some intuition with linear maps. In linear algebra, we learn that a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ must have null space of dimension at least $n - m$. One may interpret this as the “how” and “why” of when linear maps lose information: distant points in high-dimension are projected together in low-dimension. This process leads to very poor precision.
In practice, DR maps can be much more flexible than linear maps. So, can this expressivity circumvent the linear-algebraic dimensional mismatch issue? To study dimension reduction under continuous maps, we turned to the corresponding study of topological dimension: the waist inequality from quantitative topology. It turns out that a continuous map from high to low dimension still fails to circumvent the issue that plagues linear maps – many continuous maps collapse points together. For most $x$, we have, for $y = f(x)$: $\text{Vol}_{n-m}\big(f^{-1}(y)\big) \ge C(x)\, \text{Vol}_{n-m}\big(B^{n-m}\big)$.
Roughly speaking, the relevant neighborhood $U$ of $x$ is typically small in all $n$ directions, while the retrieval neighborhood $f^{-1}(V)$ is big in $n-m$ directions. This quantitative mismatch makes it very difficult to achieve high precision for a continuous DR map. It’s this mismatch that leads to the exponential decay of precision:
\[
\text{Precision}^{f}(U, V)
\leq
D(n, m)\,\left(\frac{r_U}{R}\right)^{n-m}\,\frac{r_U^{m}}{p^{m}(r_V/L)}
\]
The above trade-off/information loss phenomenon has been widely observed by experimentalists. Naturally, practitioners have developed various tools to measure the imperfections. What was less clear in this regard is what led to the trade-off, so having clarified this a bit more, we can now design better measurement devices.
When we naively compute sample precision and recall:
\[
\text{Precision} =
\frac{\text{How many points are high-dimensional neighbors of}~x}{\text{Among the low-dimensional neighbors of}~y}
\]
\[
\text{Recall} =
\frac{\text{How many points are low-dimensional neighbors of}~y}{\text{Among the high-dimensional neighbors of}~x}
\]
These two quantities are equal when we fix the number of neighboring points. (Here, the numerators are the same. When we fix the number of neighboring points, the denominators are equal as well). Fixing the number of neighboring points is one of the reasons behind t-SNE’s success, since some data points are quite far away from others and without fixing them some outliers wouldn’t have any neighboring points.
We can alternatively compute them by discretizing the continuous precision and recall shown above.
\[
\text{Precision} =
\frac{\text{How many points are within}~r_U~\text{from}~x~\text{and within}~r_V~\text{from}~y}{\text{How many points are within}~r_V~\text{from}~y}
\]
\[
\text{Recall} =
\frac{\text{How many points are within}~r_U~\text{from}~x~\text{and within}~r_V~\text{from}~y}{\text{How many points are within}~r_U~\text{from}~x}
\]
But not only will this create an unequal number of neighboring points, it will result in quite a few data points ending up with very few neighbors. This is partially caused by high-dimensional geometry. Either way it can appear as though precision and recall are difficult quantities to manage in a practical situation.
Let’s revisit the problem from an alternative perspective. From the proof of precision’s decay rate, it’s clear that the mismatch comes from $f^{-1}(V)$ and $U$, and this corresponds to the imperfection of injectivity. Heuristically, the parallel quantity for the imperfection of continuity is $f(U)$ and $V$.
We therefore proposed the following Wasserstein measures:
\[
W_{2}\big(\mathbb{P}_U, \mathbb{P}_{f^{-1}(V)}\big),
\qquad
W_{2}\big(\mathbb{P}_{f(U)}, \mathbb{P}_V\big)
\]
Like precision and recall, we associated the two Wasserstein measures with each point in the DR visualization map.
i) On a theoretical level, our work sheds light on the inherent trade-offs in any dimensionality reduction mapping model (e.g., visualization embedding).
ii) On a practical level, the implication is that practitioners now have a measurement tool to improve their data exploration practice. To date, people have put too much trust in low-dimensional visualizations. Low-dimensional visualizations can, at best, only partially reflect high-dimensional data structures; at worst, they can produce incorrect representations, which then degrade any subsequent analysis built upon them. We strongly suggest that practitioners improve their practice by incorporating a reliability measure for each data point in all of their data visualizations.
Deep reinforcement learning (DRL) is a recent yet very active area of research that joins forces between deep learning (the use of neural networks) and reinforcement learning (solving sequential decision tasks). In DRL, the goal is to learn an optimal policy (behavior) of an agent acting in an environment, with deep neural networks as the function approximator (for example in the value function). DRL has achieved outstanding results so far in areas that include beating human-level performance in Atari games [11] and in DeepMind’s now famous Go tournament versus Lee Sedol [12]. This has led to a dramatic increase in the number of applications that use this technique, most notably in video games and robotics.
Much of the successful DRL research to date has only considered single-agent environments. For instance, in Atari games, there is only one player to control. However, recent works have started to up the ante beyond single-agent scenarios and have begun exploring multi-agent scenarios, where the environment is populated with several learning agents at once. The presence of multiple agents causes additional dynamicity in the environment and in the agents themselves, which makes learning more complicated.
Recent works have reported successes in multiagent domains, such as DOTA 2 [14] or Capture the flag [21], in which many agents learn to compete and/or cooperate in the same environment. Despite these promising results, however, there are still many open challenges to be addressed. This article aims to (i) provide a clear overview of current multiagent deep reinforcement learning (MDRL) trends and (ii) share both examples and lessons learned on how certain methods and algorithms from DRL and multiagent learning can be used in complementary ways to solve problems in this emerging area.
Almost 20 years ago, Stone and Veloso's seminal survey used very intuitive and practical examples [1] to lay the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of machine learning:
“AI researchers have earned the right to start examining the implications of multiple autonomous agents interacting in the real world. In fact, they have rendered this examination indispensable. If there is one self-steering car, there will surely be more. And although each may be able to drive individually, if several autonomous vehicles meet on the highway, we must know how their behaviors interact”.
Roughly ten years later, Shoham, Powers, and Grenager [2] noted that the literature on learning in multiagent systems, or multiagent learning (MAL), was on the rise and it was no longer possible to enumerate all relevant articles. In the decade since, the number of published MAL works continues to rise, resulting in a series of different surveys (reviews) that showcase everything from analyzing the basics of MAL and their challenges [3] to addressing specific subareas (e.g., cooperative settings and evolutionary dynamics of MAL) [4-10].
Research interest in MAL has been accompanied by a number of successes: first in single-agent Atari games [11], and more recently in two-player games like Go and poker [12-13], and in games involving two competing teams.
Deep reinforcement learning [15] plays a key role in these works and has been successfully integrated with other AI techniques like (Monte Carlo tree) search, planning, and more recently, multiagent systems. The result is the emerging area of multiagent deep reinforcement learning (MDRL).
Learning in multiagent settings is fundamentally more difficult than the single-agent case due to problems [3-10] such as:
• Non-stationarity: If all agents are learning at the same time, the dynamics become more complicated and break many standard RL assumptions.
• Curse of dimensionality: The state–action space grows exponentially when a learning agent keeps track of all agents' actions.
• Multiagent credit assignment: Defining how agents should deduce their contributions when learning in a team; for example, if agents receive a team reward but it’s one agent doing most of the work, others can become “lazy.” It’s just like real life!
Despite these complexities, top AI conferences such as AAAI, AAMAS, ICLR, IJCAI, and NIPS have all published works reporting MDRL successes. The validation of these top-tier conferences convinces us it is valuable to compile an overview of the recent MDRL works and understand how these works relate to the existing literature.
In this context, we have identified four prominent categories in which to group recent works, as shown in the following figure.
(a) Analysis of emergent behaviors: evaluate DRL algorithms in multiagent scenarios.
(b) Learning communication: agents learn with actions and through messages.
(c) Learning cooperation: agents learn to cooperate using only actions and local observations.
(d) Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive).
The next objective of this article is to provide guidelines by showcasing how methods and algorithms from DRL and multiagent learning can complement each other to solve problems in MDRL. This occurs, for example, when:
• Dealing with non-stationarity
• Dealing with multiagent credit assignment
We also present general lessons learned from these works, such as the use of:
• Experience replay buffers in MDRL – a key component in many DRL works. These containers serve as explicit memory, storing interactions through which agents learn to improve their behaviors.
• Recurrent neural networks (e.g., LSTMs). These networks serve as implicit memory that improves performance, particularly for partially-observable environments.
• Centralized learning with decentralized execution: Agents can be trained with a central controller that has access to all agents’ actions and observations, but during deployment each agent will operate based solely on its own observations.
• Parameter sharing: In many tasks, it is useful to share a network’s internal layers, even when there are many outputs.
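As a concrete illustration of the first lesson, a uniform experience replay buffer takes only a few lines. This is a minimal sketch, not any particular paper's implementation; names and the dummy transitions are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement over stored interactions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.add((t, 0, 1.0, t + 1, False))  # dummy transitions for illustration
batch = buffer.sample(8)
```

In MDRL, naive replay like this is complicated by the non-stationarity discussed above: transitions recorded while other agents followed older policies can become stale, which is exactly one of the issues the surveyed works address.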
Towards the end of the article, we reflect on some open questions and challenges:
• On the challenge of sparse and delayed rewards.
Recent MAL competitions and environments (e.g., Pommerman [24], Capture the flag [21], MarLÖ, Starcraft II, and Dota 2) have complex scenarios wherein many actions must be taken before a reward signal becomes available. This is already a challenge for RL [16]; in MDRL it becomes even more problematic since the agents not only need to learn basic behaviors (like in DRL), but also need to learn the strategic element (e.g., competitive/collaborative) embedded within the multiagent setting.
• On the role of self-play.
Self-play (when all the agents use the same learning algorithm) is a MAL cornerstone that achieves impressive results [17-19]. While notable results have also occurred in MDRL, recent works have shown that plain self-play does not yield the best results [20, 21].
• On the challenge of the combinatorial nature of MDRL.
Monte Carlo tree search (MCTS) has been the backbone of major breakthroughs for AlphaGo and AlphaGo Zero, both of which used MCTS along with DRL techniques. However, for multiagent scenarios, there is an additional challenge: the joint action space of all the agents grows exponentially for centralized methods. Given more scalable planners [22, 23], there is room for research in combining MCTS-like planners with DRL techniques in multiagent scenarios.
While there are a number of notable works in DRL and MDRL that represent important milestones for AI, we acknowledge there are also open questions in both single-agent learning and multiagent learning that demonstrate how much more work still needs to be done. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.
Full paper: https://arxiv.org/abs/1810.05587
[1] P. Stone, M. M. Veloso, Multiagent Systems - A Survey from a Machine Learning Perspective, Autonomous Robots 8 (2000) 345–383.
[2] Y. Shoham, R. Powers, T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence 171 (2007) 365–377.
[3] K. Tuyls, G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine 33 (2012) 41–52.
[4] L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 38 (2008) 156–172.
[5] A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning, Springer, 2012, pp. 441–470.
[6] L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, Autonomous Agents and Multi-Agent Systems 11 (2005).
[7] L. Matignon, G. J. Laurent, N. Le Fort-Piat, Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowledge Engineering Review 27 (2012) 1–31.
[8] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary Dynamics of Multi-Agent Learning: A Survey, Journal of Artificial Intelligence Research 53 (2015) 659–697.
[9] P. Hernandez-Leal, M. Kaisers, T. Baarslag, E. Munoz de Cote, A Survey of Learning in Multiagent Environments - Dealing with Non-Stationarity (2017). arXiv:1707.09183.
[10] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A comprehensive survey and open problems, Artificial Intelligence 258 (2018) 66–95.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[13] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356 (2017) 508–513.
[14] Open AI Five, https://blog.openai.com/openai-five, 2018. [Online; accessed 7-September-2018].
[15] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning (2017). arXiv:1708.05866v2.
[16] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, volume 135, MIT press Cambridge, 1998.
[17] J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003) 1039–1069.
[18] M. Bowling, Convergence and no-regret in multiagent learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2004, pp. 209–216.
[19] J. Heinrich, D. Silver, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016). arXiv:1603.01121
[20] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent Complexity via Multi-Agent Competition, in: International Conference on Machine Learning, 2018.
[21] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in first-person multiplayer games with population based deep reinforcement learning (2018).
[22] C. Amato, F. A. Oliehoek, et al., Scalable planning and learning for multiagent POMDPs, in: AAAI, 2015, pp. 1995–2002.
[23] G. Best, O. M. Cliff, T. Patten, R. R. Mettu, R. Fitch, Dec-MCTS: Decentralized planning for multi-robot active perception, The International Journal of Robotics Research (2018).
[24] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, J. Bruna, Pommerman: A Multi-Agent Playground (2018). arXiv:1809.07124.
On October 9, 2018, we threw a party to mark the official launch of our new Borealis AI Montreal research centre and to celebrate our $1M collaboration with the Canadian Institute for Advanced Research (CIFAR) on responsible AI research and initiatives.
Today, the RBC Foundation announced it will donate $1 million over three years to the Canadian Institute for Advanced Research (CIFAR). The gift will support research and initiatives aimed at furthering the study of ethical artificial intelligence (AI) practices.
RBC CEO Dave McKay made the announcement in conjunction with CIFAR president and CEO Dr. Alan Bernstein at the official launch of Borealis AI’s Montreal research centre. CIFAR is currently leading the country’s Pan-Canadian Artificial Intelligence Strategy.
Borealis AI will collaborate closely with CIFAR in an advisory capacity on key aspects of the strategy, with a particular focus on global thought leadership around the ethical implications of AI advancements.
The investment will help fund initiatives like CIFAR’s Catalyst Grants, which award up to $100,000 per year for two years to support collaborations in novel areas of AI exploration between researchers at any Canadian institution. Two of these awards will be explicitly focused on research in areas like privacy, accountability, transparency and bias in machine learning.
The money will also go toward delivering interdisciplinary AI research workshops in fields such as transportation, environmental science, public health, and energy, while the remainder will support ongoing training opportunities in the social implications of AI, including equity, diversity and inclusion.
Ethical AI is no longer a peripheral topic within the community. As scientific successes in fields like deep learning, computer vision and natural language processing continue to grow, the broad-ranging social impact of these technologies cannot be divorced from their applications and needs to be researched with the same academic rigour.
RBC and Borealis AI are committed to the shared sense of responsibility among the diverse research communities working toward responsible technologies. With CIFAR as a partner, we are confident we’ll exceed these goals.
A blank canvas, a stunning city, a top-notch team, and an organization on the move. Put it all together and you get Borealis AI’s striking new Montreal research centre, which officially opened its tunnel, er, doors this week.
RBC CEO Dave McKay kicked off the official launch this morning, reinforcing the bank’s commitment to supporting the AI ecosystem through collaboration with Canada’s leading research institutions. He was joined onstage by Nadine Renaud-Tinker, president of RBC in Quebec, and Borealis AI co-founder and head, Foteini Agrafioti, who shared their excitement about expanding the organization’s network of research centres into a city that has shown such dynamic leadership in the field.
Dr. Alan Bernstein, president and CEO of the Canadian Institute for Advanced Research (CIFAR) followed to announce the RBC Foundation’s $1-million-dollar investment in the organization’s Pan-Canadian Artificial Intelligence Strategy.
Our new 40-person research centre is located at O Mile-Ex, a former textile factory that is becoming the de facto AI industrial research hub of Montreal. We share easy access to some of the neighbourhood’s best coffee with our new neighbours Element AI, MILA, Thales, and IVADO.
With our Toronto research centre scooping up design awards and getting full-length features in Toronto Life, our Montreal design team had a tough act to follow.
Thankfully, at 6,500 square feet and with no building restrictions, there was plenty of space for us to play. That’s why one of the first features we added was a hockey motif in one of our conference rooms. It’s a nod to the city’s deep hockey heritage and a show of good sportsmanship from the Toronto lab in acknowledging the Canadiens.
A mini soccer pitch sits smack in the middle of the hall for team members to let off a little steam.
And a cinema meeting room pays homage to Quebec’s thriving film industry and specifically to the iconic Snowdon Theatre during its glory days.
The primary theme, however, was the Montreal Metro. Visitors walk through a tiled tunnel throughway to reach the front hall.
Hang a left and you arrive at a meeting room with swinging egg chairs. On the right, a chill out room with grass and some cozy bean bag chairs.
Groceries get delivered each Monday to the kitchen with its intricate tiled mosaic floor. This is yet another nod to Montreal – this time to its design community and the city’s overall exquisite attention to detail.
Our living room is like being in a park, albeit a park with a large garage door. The bright, open, window-filled space offers a panorama view for lectures, presentations, and meetings. These are all overseen by a gigantic balloon guard dog who monitors the proceedings.
Exceptional research deserves an exceptional space. And if you think this is great, just wait until you see what we have in store for Waterloo, Vancouver, and Edmonton.
We took Prof. Taylor for coffee near his old east-end Toronto stomping grounds to hear his thoughts on how deep learning has evolved, whether any pre-2012 techniques still yield promising results, where Software 2.0 fits into the future and how much of a deep-domain expert you have to be in order to give truly valuable advice to businesses.
While it is tempting to now adopt a “one-size-fits-all” mantra, this would be a prematurely limiting methodology. Recently, a whole spate of machine learning models has emerged that involve more than a single objective; rather, they involve multiple objectives which all interact with each other during training. The most prominent example, of course, is the Generative Adversarial Network (GAN). However, other examples include synthetic gradients, proximal-gradient TD learning, and intrinsic curiosity. The appropriate way to think about these types of problems is to interpret them as game-theoretic problems, where one aims to find a Nash equilibrium rather than a local minimum of each objective. Intuitively speaking, a Nash equilibrium occurs when each player knows the strategy of all the other players, and no player has anything to gain by changing his or her own strategy.
Unfortunately, finding Nash equilibria in games is notoriously difficult. In fact, theoretical computer scientists have long known that finding Nash equilibria for general games is an intractable problem. Nor is it ideal to naively apply gradient descent to games. Firstly, gradient descent has no convergence guarantees and, even in cases where it does converge, it may be highly unstable and slow. But the most severe drawback is that, unlike the traditional setup in supervised machine learning, there is no single objective involved, which means we have no way of measuring any kind of progress.
We can illustrate the complexity of interacting losses with a very simple two-player game example. Consider Player 1 and Player 2 with the respective loss functions ℓ_{1}(x, y) = xy and ℓ_{2}(x, y) = -xy.
In this game, Player 1 controls the variable x and Player 2 controls the variable y. The dynamics (or simultaneous gradient) is given by ξ(x, y) = (∂ℓ_{1}/∂x, ∂ℓ_{2}/∂y) = (y, -x).
If we plot this in the 2D Cartesian plane, the vector field cycles around the origin and there is no direction which points straight to it. The technical problem here is that the dynamics ξ are not a gradient vector field; in other words, there is no function φ such that ▽φ = ξ.
In the ICML paper, The Mechanics of n-Player Differentiable Games [1], the authors use insights from Hamiltonian mechanics to tackle the problem of finding game equilibria. Hamiltonian mechanics is a reformulation of classical mechanics in the following way. Consider a particle moving in the Euclidean space R^{n}. The state of the system at a given time t is determined by the coordinates of the position (q_{1},…,q_{n}) and the coordinates of the momentum (p_{1},…,p_{n}). The space R^{2n} of positions and momenta is called the “phase space”. The “Hamiltonian” H(q,p) is a function on this phase space and it represents the total energy of the system. Hamilton’s equations (also referred to as “equations of motion”) describe the time evolution of the state of the system. These are given by dq_{i}/dt = ∂H/∂p_{i} and dp_{i}/dt = -∂H/∂q_{i}.
We see how all of these formulations play out in our simple example. If we define the Hamiltonian here to be H(x, y) = ^{1}/_{2}(x^{2} + y^{2}),
the gradient is then ▽H = (x, y). There are two critical observations to be made here: (1) conservation of energy: the level sets of H are conserved by the dynamics ξ = (y, -x) (hence, ξ cycles around the origin); and (2) gradient descent on the Hamiltonian, rather than the simultaneous gradient on the losses, converges to the origin.
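Both observations are easy to verify numerically. The sketch below assumes the standard example losses ℓ_{1} = xy and ℓ_{2} = -xy, which yield ξ = (y, -x); with discrete steps, the simultaneous gradient in fact spirals slowly outward, while descent on H contracts to the origin:

```python
import math

# Player 1 minimizes l1(x, y) = x*y over x; player 2 minimizes l2(x, y) = -x*y over y.
# Simultaneous gradient: xi = (dl1/dx, dl2/dy) = (y, -x).

def step_simultaneous(x, y, lr):
    # Each player takes a gradient step on their own loss at the same time.
    return x - lr * y, y - lr * (-x)

def step_hamiltonian(x, y, lr):
    # Gradient descent on H(x, y) = 0.5 * (x**2 + y**2); grad H = (x, y).
    return x - lr * x, y - lr * y

x, y = 1.0, 1.0
for _ in range(1000):
    x, y = step_simultaneous(x, y, lr=0.01)
radius_sim = math.hypot(x, y)   # does not shrink; discrete steps spiral outward

x, y = 1.0, 1.0
for _ in range(1000):
    x, y = step_hamiltonian(x, y, lr=0.01)
radius_ham = math.hypot(x, y)   # converges toward the origin
```

Starting from (1, 1) with norm √2 ≈ 1.414, the simultaneous updates end up farther from the origin, while the Hamiltonian updates land essentially on it.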
Motivated by this philosophy, the authors in [1] introduce the notion of Hamiltonian games. For an n-player game with parameter w, they define the Hessian of the game to be the Jacobian of the simultaneous gradient, H(w) = ▽ξ(w).
Since this is a matrix, it always admits a decomposition into symmetric and antisymmetric components: H(w) = S(w) + A(w), where S(w) = ^{1}/_{2}(H(w) + H(w)^{T}) and A(w) = ^{1}/_{2}(H(w) - H(w)^{T}).
This leads to a classification: Hamiltonian games are defined as games where the symmetric component is zero, S(w) = 0, while potential games are defined as games where the antisymmetric component is zero, A(w) = 0. Going back to our simple example, the Hessian is the antisymmetric matrix H(w) = ((0, 1), (-1, 0)),
so we have a Hamiltonian game. One of the main theoretical contributions of this paper is that, given an n-player Hamiltonian game with H(w) = ^{1}/_{2}‖ξ(w)‖^{2}, under some conditions gradient descent on H converges to a Nash equilibrium.
Another central contribution made by the authors is the proposal of a new algorithm to find stable fixed points (which under some conditions can be considered Nash equilibria). Their Symplectic Gradient Adjustment (SGA) adjusts the game dynamics to ξ + λ·A^{T}ξ.
For a potential game, where A(w) = 0, SGA performs the usual gradient descent and finds local minima. In contrast, for a Hamiltonian game, where S(w) = 0, SGA finds local Nash equilibria. Readers fluent in differential geometry can immediately see the reasoning behind the terminology “symplectic”. For a Hamiltonian game, H(w) is a Hamiltonian function and the gradient of the Hamiltonian is ▽H = A^{T}ξ. The dynamics ξ form a Hamiltonian vector field, since they conserve the level sets of the Hamiltonian H. In symplectic geometry, the relationship between the symplectic form ω and the Hamiltonian vector field ξ is ω(ξ, ·) = dH.
The right-hand side of this equation is simply the gradient of our Hamiltonian function. In the context of Hamiltonian games, we see that the antisymmetric matrix A is playing the role of the symplectic form ω, which justifies the terminology.
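To make the adjustment concrete, here is a numerical sketch on the simple game from earlier, where ξ = (y, -x). The choice λ = 1 is arbitrary, and the paper also discusses how λ (including its sign) should be chosen; this is an illustration, not the full algorithm:

```python
import numpy as np

# Toy game dynamics xi(w) = (y, -x) for w = (x, y), as in the simple example.
def xi(w):
    x, y = w
    return np.array([y, -x])

def game_jacobian(w):
    # Jacobian of xi; for this game it is constant.
    return np.array([[0.0, 1.0], [-1.0, 0.0]])

def sga_step(w, lr=0.01, lam=1.0):
    J = game_jacobian(w)
    A = 0.5 * (J - J.T)                     # antisymmetric component
    adjusted = xi(w) + lam * A.T @ xi(w)    # symplectic gradient adjustment
    return w - lr * adjusted

w = np.array([1.0, 1.0])
for _ in range(1000):
    w = sga_step(w)
final_radius = np.linalg.norm(w)  # plain descent on xi spirals outward; SGA contracts
```

Here A^{T}ξ = (x, y) points straight at the origin, so the adjusted dynamics mix the cycling field with a contracting one and the iterates converge to the fixed point.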
In the experimental section, the authors compared their SGA method to other recently proposed algorithms for finding stable fixed points in GANs. Moreover, to demonstrate the flexibility of their algorithm, the authors also studied the performance of SGA on general two-player and four-player games. In all cases, SGA was competitive with, if not better than, existing algorithms.
As a summary, this paper provides a glimpse of how a specific class of general game problems may be tackled by borrowing tools from mathematics and physics. Machine learning models featuring multiple interacting losses are becoming increasingly popular. As such, it is necessary for us to come up with new methodologies rather than relying on the crutches of standard gradient descent. Unraveling the mysteries behind these complicated models will have considerable practical impact as it will aid the design of better scalable algorithms in the future.
[1] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
CoT: Cooperative Training for Generative Modeling
For tasks that involve generating natural language, a common practice is to train the model in teacher forcing mode. This means that during training, the model is always asked to predict the next word given the previous ground truth words as its input. However, at test time the model is expected to generate the next word based on its previously generated words. As a result, the mistakes the model has made along the way can quickly accumulate as it has never been exposed to its own errors during training. This phenomenon is known as exposure bias.
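A deliberately crude sketch makes the compounding concrete. The stub "model" below is invented for illustration: it predicts the next token from the previous one and makes exactly one systematic mistake:

```python
# Stub next-token model: maps each token to its successor, but errs after token 2
# (the ground-truth continuation of 2 would be 3).
def predict_next(prev_token):
    table = {0: 1, 1: 2, 2: 8, 3: 4, 4: 5, 8: 8}
    return table.get(prev_token, 8)

ground_truth = [0, 1, 2, 3, 4]

# Teacher forcing: each prediction is conditioned on the *true* previous token,
# so the model pays for its mistake exactly once.
tf_preds = [predict_next(tok) for tok in ground_truth[:-1]]
tf_errors = sum(p != t for p, t in zip(tf_preds, ground_truth[1:]))

# Free-running (test time): each prediction is conditioned on the model's own
# previous output, so the single mistake propagates to every later step.
fr_preds = []
prev = ground_truth[0]
for _ in range(len(ground_truth) - 1):
    prev = predict_next(prev)
    fr_preds.append(prev)
fr_errors = sum(p != t for p, t in zip(fr_preds, ground_truth[1:]))
```

With teacher forcing the model makes one error; running on its own outputs, the same model makes two, because the sequence never recovers after the first wrong token.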
These models are predominantly trained via Maximum Likelihood Estimation (MLE), which may not correspond to the perceived quality of the generated text. Maximizing the likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the real data distribution P and the estimated model distribution G. However, the KL divergence is asymmetric and has well-known limitations when used for training. Thus, this paper proposes to optimize the Jensen-Shannon divergence (JSD) between P and G instead.
The JSD requires an intermediate distribution which is a mixture of the true distribution P and the model distribution G. In this paper, the authors suggest the use of a mediator which approximates this intermediate distribution. The generative model is then trained to minimize the estimated JSD provided by the mediator. This results in an iterative algorithm that alternates between updating the mediator and the generative model. It can also be viewed as a cooperative objective between the mediator and the generator to maximize the expected log likelihood for both P and G, which gives rise to the paper’s name Cooperative Training (CoT).
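The alternating scheme can be sketched in a toy setting with categorical distributions. Everything here is illustrative: P is an invented "data" distribution, the mediator is computed exactly rather than learned by a separate network as in the paper, and the generator's update is a partial gradient of the JSD with the mediator held fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def jsd(p, g):
    # Jensen-Shannon divergence via the mixture m = (p + g) / 2.
    m = 0.5 * (p + g)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(g, m)

P = np.array([0.5, 0.3, 0.2])   # stand-in for the real data distribution
logits = rng.normal(size=3)     # generator parameters

initial_jsd = jsd(P, softmax(logits))
for _ in range(500):
    G = softmax(logits)
    M = 0.5 * (P + G)           # "mediator": exact here, learned in the paper
    # Gradient of the generator's half of the JSD with M fixed,
    # pushed through the softmax Jacobian.
    g = 0.5 * (np.log(G / M) + 1.0)
    logits -= 0.5 * G * (g - G @ g)
final_jsd = jsd(P, softmax(logits))
```

Alternating the (here exact) mediator update with generator gradient steps drives the JSD toward zero, i.e., G toward P, which is the cooperative fixed point CoT aims for.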
Although experiments in the paper are mostly language related, it is worth mentioning that such a strategy can, in principle, be applied to many other types of data. One caveat, however, is that applying CoT requires a factorized version of the density function, which is trivial for natural language since it can be represented using an RNN language model. For images, for instance, one could opt to use a model like PixelRNN, though it may be prohibitively slow.
Paper: “This looks like that: deep learning for interpretable image recognition”
Authors: Chaofan Chen, Oscar Li, Alina Barnett, Jonathan Su, Cynthia Rudin
Presenter: Cynthia Rudin
The aptly titled paper, "This looks like that," involves learning "prototypes" that correspond to common features characterizing object categories. The output of such a classifier is the object category and a set of prototypes that closely matches the sample image. The title-motivating interpretation is that this input image is that particular category because it looks like these other canonical images from that category.
This paper interested me because it was something new; specifically, a new application category that can be demonstrated on a small dataset. Digging deeper, you can see there is some technical machinery involved in figuring out an appropriate cost function for training the parameters of the model. But the key innovation here is the new application category. It’s only recently that we've had access to this set of fully parallel computing architectures, and to me, artificial intelligence is about figuring out what to do with all these parallel computing machines. What we end up doing with them is still totally up in the air and fills me with promise for the field.
One weakness, however, is the authors’ example prototypes. The authors claim their technique mimics the type of analysis performed by a specialist in explaining why certain objects fall within certain categories. In the supplementary materials, for example, we see an automatically classified image of a bird labelled with what is clearly its beak, upper tail, and lower tail. As an analogy, the author shows a human-labelled image of a bird with similarly named components. This is the strongest example in the paper of a set of components that are “interpretable.” Their idea is that the way a human would explain an image of a bird is by pointing to different parts of the object and suggesting “this is an image of a bird because it contains these essential features of a bird.” However, the other examples given, that of a truck placed beside multiple images of vehicles, for instance, don't seem to have the same level of interpretability; they just look like other pictures of vehicles and not “prototypical parts of images from a given class” as suggested in the paper’s introduction. This is, of course, a somewhat subjective criticism: figuring out how to make this more objective seems to be an area of future research.
Overall, I found it notable that this paper only tested the algorithm on CIFAR-10 and 10 classes from CUB-200-2011 (two small datasets), which conceptually may be seen as limiting the scope, but it presented a novel application that was very well received by the audience.
Paper: Robust Physical-World Attacks on Deep Learning Visual Classification
Authors: Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, Dawn Song
Presenter: Dawn Song
In this talk, Dawn Song presented work on what she and her fellow authors call “Robust Physical Perturbations”. The perturbations in this case are part of adversarial attacks whose purpose is to illuminate weaknesses in, or outright break, existing classification techniques. In this case, Dawn showed a video of a stop sign with a type of physical alteration resembling graffiti, then demonstrated how the stop sign was subsequently misinterpreted as a speed limit sign reading "45 mph." Moreover, the stop sign was misclassified as a speed limit sign from many different angles and distances throughout the duration of the video. These errors have obvious safety implications, which I will examine below:
There was a concrete connection to a real engineering problem – the autonomous vehicle safety problem. For instance, a nefarious actor could decide to vandalize stop signs so they’d appear as real stop signs to a human observer, but they’d be classified as something that doesn’t require stopping and could cause an accident.
The authors identified another interesting constraint: that the attack had to look like what we consider “normal” graffiti so the vandalism wouldn’t appear suspicious to a human observer. If the graffiti looks fake, it’s easier for a human to recognize these real-world inconsistencies, identify the situation as an attempted adversarial attack on the classifier, and thus be motivated to fix the sign sooner. To demonstrate this point, the authors show some pictures of their proposed adversarial attacks: one attack applies the words LOVE and HATE to the stop sign and successfully causes a misclassification. To me this example looked like real-life graffiti that is occasionally observed on stop signs.
When you dig more deeply into the paper, it’s clear that the authors address the second point by using a masking technique. In this version, the authors perturb an input image to create an adversarial example, then mask small perturbations to zero so that the end result requires only small portions of the image to be perturbed in order to construct the adversarial example (although in this case, the small portion of the image that gets perturbed is perturbed rather significantly). This method is distinguishable from some of the other adversarial techniques used where each pixel is restricted to be perturbed by only a small amount. In this case, a small fraction of pixels are perturbed by a large amount.
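The masking idea can be sketched roughly as follows. The dense perturbation and the amplification factor below are placeholders (a real attack optimizes both against the classifier); the point is only the shape of the result, where a small fraction of entries carry large perturbations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a dense adversarial perturbation over a 100-"pixel" input.
delta = rng.normal(scale=0.1, size=100)

# Masking step: keep only the k largest-magnitude entries, then amplify the
# survivors so the now-sparse perturbation stays effective.
k = 10
mask = np.zeros_like(delta)
mask[np.argsort(np.abs(delta))[-k:]] = 1.0
sparse_delta = 5.0 * delta * mask

frac_perturbed = np.count_nonzero(sparse_delta) / delta.size  # 10% of entries
```

Contrast this with L∞-style attacks, where every entry moves by at most a small ε: here most entries move not at all, and the few that do move a lot, which is what makes the result printable as sticker-like "graffiti."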
There are several rules of thumb as well as questions I can draw from these adversarial examples:
If an attacker is aware of the internal parameters of the model being attacked, then it is much easier to construct an adversarial example. The authors assume the internal parameters are known, and this is probably reasonable because of the possibility of black box extraction attacks; that is, a set of techniques like those introduced by Ristenpart et al. that allow an adversary to estimate the weights of a neural network by just performing black box queries of the model.
You should ground the analysis in real-world costs and benefits. This way, the constrained problem of stop sign detection makes the task more real, and perhaps more informative.
We lack the language to really discuss "interpretability" and "robustness to adversarial attack" so this seems like an interesting open area for research exploration. However, the papers I've noted here are good examples of how this can be done.
The fact that the “masking technique” to make the adversarial attack resemble typical graffiti worked is notable: it suggests that if we constrain what the attack is to look like, at least somewhat, we can still create successful attacks. It seems like there could be many more ways to make adversarial attacks beyond the existing techniques.
Paper: Intelligence per Kilowatthour
Authors: Max Welling
Presenter: Max Welling
In this talk, Prof. Welling demonstrated a pruning algorithm with well-tested robustness. The key idea behind pruning algorithms is that you can start with a densely connected neural network, then iteratively prune some of the edges and retrain the network. Compression rates of 99.5% can sometimes be obtained.
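A minimal sketch of the iterative magnitude-pruning loop looks like this. The prune fraction, round count, and the placement of the (omitted) retraining step are assumptions for illustration, not the specifics of Prof. Welling's method:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))   # stand-in for one dense layer
mask = np.ones_like(weights)

# Iterative magnitude pruning: repeatedly zero the smallest surviving weights;
# a real pipeline retrains the survivors between rounds.
for _ in range(5):
    alive = weights[mask == 1]
    threshold = np.quantile(np.abs(alive), 0.6)  # prune ~60% of survivors per round
    mask[np.abs(weights) < threshold] = 0.0
    weights *= mask
    # ... retraining of the remaining weights would go here ...

sparsity = 1.0 - np.count_nonzero(weights) / weights.size
```

Five rounds at 60% leave roughly 0.4^5 ≈ 1% of the weights, i.e., a compression rate in the ballpark of the figures quoted above.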
I think the most important point in the talk comes from one of his slides: "AI algorithms will be measured by the amount of intelligence they provide per kilowatthour." As the amount of data balloons, and machine learning algorithms use increasingly large numbers of parameters, we will be forced to measure our AI algorithms per unit of energy. If not, I suspect many tasks on the frontiers of artificial intelligence will simply be infeasible.
Taking inspiration from Max's talk, I noted that one way of optimizing this metric is by decreasing energy consumption while maintaining performance on a previously identified task. But another way of optimizing this is by finding new ways to measure exactly how intelligent these systems are. I suspect that approaches like that in Cynthia Rudin's talk discussing "this looks like that" may be a way of increasing the "intelligence" part of the equation. Finding ways to energy-efficiently create stop sign detectors that are robust to adversarial graffiti-style attacks like those discussed by Dawn Song could be another way to increase the intelligence of these models.
Movies are one of life’s most pleasurable escapes, an art form that has dominated our cultural experience for over a century. But underneath the costumes, emoted lines and beautiful settings are the mechanics of technology, without which there would be no cinema.
Sanja Fidler, assistant professor at University of Toronto and head of NVIDIA’s new Toronto AI research lab, has made a name, in part, by marrying her expertise in computer vision and natural language processing to fascinating – and even funny – machine learning applications. She sat down with Northern Frontier at the Revue Cinema in Toronto’s west end to explore the link between this core science and popular art form through her research, how AI can enhance human creativity, and how this new frontier in tech could influence a new era of moviemaking.
Joint work with Daniel M. Roy based on https://arxiv.org/abs/1703.11008, https://arxiv.org/abs/1712.09376, and https://arxiv.org/abs/1802.09583
One of the ways of constructing these “easy” negative examples is through Noise Contrastive Estimation (NCE), a method that randomly samples configurations for negative examples. But by only sampling easy negative examples, the model fails to learn discriminating features, which leads to poor final performance. In this work, we show the need for hard negative examples and provide a method to generate them. We propose to augment the negative sampler in NCE with an adversarially learned adaptive sampler that finds harder negative examples. Our method, dubbed Adversarial Contrastive Estimation (ACE), leads to improvements over a wide array of embedding models, validating the generality of the approach.
Contrastive learning operates by contrasting losses on positive and negative examples. By seeing both these types of data, the model can try to distinguish between the two. A few popular approaches include the aforementioned Noise Contrastive Estimation (NCE) to train a skip-gram model for word embeddings, and triplet loss for deep metric learning. Again, positive examples are taken from the real data distribution (i.e., the training set), while negative examples are any configurations of data that are not directly observed in real data. In their most general form, contrastive learning problems can be framed as follows:
L(ω) = E_{p(x^{+},y^{+},y^{−})}[l_{ω}(x^{+},y^{+},y^{−})]
Here l_{ω}(x^{+},y^{+},y^{−}) captures both the model with parameters ω and the loss that scores a positive tuple (x^{+},y^{+}) against a negative one (x^{+},y^{−}). E_{p(x+,y+,y−)}[.] denotes expectation with respect to some joint distribution over positive and negative samples. With a bit of algebra, and the fact that given x^{+} the negative sampling does not depend on the positive label, we can simplify the expression to E_{p(x+)}[E_{p(y+|x+)p(y−|x+)}l_{ω}(x^{+},y^{+},y^{−})]. In some cases, the loss can be written as a sum of scores on positive and negative tuples; this is the case with NCE in word embeddings when using cosine similarity as the scoring function. In other cases, the loss is not decomposable, as with a max-margin objective, but that won’t deter our treatment.
NCE can be viewed as a special case of our factorized general loss where p(y^{−}|x^{+}) is taken from a predefined noise distribution. For example, in the case of word embeddings, this could mean picking a centre word as x^{+} and then randomly sampling a word from the entire vocabulary set y^{–} rather than using the context word, which in our notation would be y^{+}. The idea is that the negative sampled word is unlikely to form a word-context pair that would actually be observed in the real data distribution.
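As a toy illustration of this special case (the vocabulary and helper here are made up for the example), NCE’s negative sampler draws from a fixed noise distribution that ignores the centre word entirely:

```python
import random

def nce_negatives(center_word, vocab, k):
    """Draw k negatives for `center_word` from a fixed noise distribution
    (here uniform over the vocabulary), independent of the centre word."""
    return [random.choice(vocab) for _ in range(k)]

vocab = ["cat", "sat", "mat", "dog", "ran", "park"]
# Positive pair observed in data: ("cat", "sat"); the sampler ignores it,
# betting that a random word is unlikely to form a real word-context pair.
negatives = nce_negatives("cat", vocab, k=5)
```

Because the noise distribution is fixed and unconditional, nothing steers these samples toward pairs the model currently finds confusing, which is exactly the weakness discussed next.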
The negatives produced from a fixed distribution are not tailored towards x^{+} so they’re likely to be too easy. The problem with this approach is that the model fails to continuously improve throughout the entirety of training due to the absence of hard negative examples. As training progresses, and more and more negative examples are correctly learned, the probability of drawing a hard negative example (see definition below) further diminishes, thus causing slow convergence.
Before we talk about ACE and its benefits, we need to first clarify what we mean by hard negative examples. For us, hard negative examples are datapoints that are extremely difficult for the training model to distinguish from positive examples. Hard negative examples necessarily result in higher losses during training, which is useful because when you have higher losses, you get more informative gradients that aid in training. We can take inspiration from conditional generative adversarial networks (CGAN), as we intuitively want to sample negative configurations that can “fool” the model into thinking they were sampled from the real data distribution. However, it is important to note that at the beginning of training, the model needs a few easy examples to get off the ground. Thus, we can augment NCE with an adversarial sampler of the following form: λp_{nce}(y^{−}) + (1−λ)g_{θ}(y^{−}|x^{+}), where g_{θ} is a conditional distribution with a learnable parameter θ and λ is a hyperparameter. The overall objective function can then be stated as:
L(ω,θ;x) = λ E_{p(y+|x)p_{nce}(y−)}[l_{ω}(x,y^{+},y^{−})] + (1−λ) E_{p(y+|x)g_{θ}(y−|x)}[l_{ω}(x,y^{+},y^{−})]
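Sampling from this mixture is straightforward: flip a λ-weighted coin between the fixed NCE noise distribution and the adaptive generator. The sketch below uses stand-in samplers (the vocabulary and both sampler functions are hypothetical), not our actual implementation:

```python
import random

def ace_sample(x, nce_sampler, generator_sampler, lam):
    """Sample one negative from the mixture λ·p_nce + (1−λ)·g_θ(·|x):
    with probability lam use the fixed NCE noise distribution,
    otherwise query the adaptive adversarial generator."""
    if random.random() < lam:
        return nce_sampler()
    return generator_sampler(x)

# Toy samplers standing in for p_nce and g_θ.
vocab = ["cat", "sat", "mat", "dog"]
nce = lambda: random.choice(vocab)
gen = lambda x: "dog"  # a generator that has learned a hard negative for x
y_neg = ace_sample("cat", nce, gen, lam=0.5)
```

The λ term preserves the easy random negatives the model needs early in training, while the generator term supplies the hard negatives that keep learning going later.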
Learning can then proceed if we recognize the main model as the discriminator and the adversarial negative sampler as the generator in a GAN-like setup. However, one slight distinction from a regular GAN is that the generator g_{θ}(y^{−}|x^{+}) defines a categorical distribution over possible y^{−} values, and samples are drawn accordingly. As a result, the generator picks an action that is a discrete choice, which prevents gradients from flowing back. The generator’s reward is based on its ability to fool the discriminator into thinking the negative samples produced are positive ones. We appeal to the policy gradient literature to train our generator; specifically, we used the REINFORCE gradient estimator, which is perhaps the simplest one. Another difference is that in a conditional GAN, we would like to find an equilibrium where the generator wins, whereas here, we want the discriminator to win. To ensure that the min-max optimization leads to a saddle point where the discriminator learns good representations of the data, we introduce a number of training techniques.
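For readers unfamiliar with REINFORCE, here is a minimal sketch of the estimator for a categorical generator parametrized by softmax logits (a generic illustration, not our training code). It uses the identity ∇ log π(a) = onehot(a) − π for softmax distributions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(logits, action, reward):
    """REINFORCE estimator for a categorical policy:
    ∇_θ E[R] ≈ R · ∇_θ log π_θ(action).
    For softmax logits, ∇ log π(a) = onehot(a) − π."""
    probs = softmax(logits)
    grad_logp = -probs
    grad_logp[action] += 1.0
    return reward * grad_logp

# Uniform generator over 4 candidate negatives; negative 2 fooled the
# discriminator, earning a positive reward, so its logit is pushed up.
g = reinforce_grad(np.zeros(4), action=2, reward=1.5)
```

The estimator is unbiased but high-variance, which motivates the variance-reduction tricks discussed below.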
GAN training is notorious for being unstable, and when paired with reinforcement learning methods, this phenomenon gets exacerbated. Here are a few tricks we used that helped to effectively train with ACE:
• Entropy Regularizer: We found that the generator often collapsed to its favourite set of negative examples, which prevented the model from improving further. To fix this problem, we added a small term to the generator’s loss to ensure it maintained high entropy. The term R_{ent}(x) = max(0, c − H(g_{θ}(y|x))) uses H, the entropy of the categorical distribution, and c = log(k), the entropy of a uniform distribution over k choices, where k is a hyperparameter.
Observation: This worked most of the time and never hurt the overall performance.
• Handling False Negatives: When the action space is small, the generator can occasionally sample a y^{−} that actually forms a positive pair with x^{+}, i.e., a false negative. In NCE this also occurs, but it is not a severe problem because the sampling is random. In ACE it could be serious, as the generator could collapse to these false negatives. Whenever computationally feasible, we apply an additional two-step technique: first, we maintain a hash map of the training data in memory; then we use it to efficiently detect whether a sampled pair (x^{+},y^{−}) is an actual observation and, if so, filter it out.
Observation: This is helpful for tasks such as knowledge graph and order embeddings, but not feasible for word embeddings, where the vocabulary size is large.
• Variance Reduction: REINFORCE is known to have extremely high variance. To fix this problem, we used the self-critical baseline.
Observation: This only seemed to help with the word embedding task.
• Importance Sampling using NCE: The generator can leverage NCE samples for exploration in an off-policy scheme. One way to do this is to use importance re-weighting to take advantage of the NCE samples that weren’t chosen by the generator. This results in the reward being reweighted by the importance ratio g_{θ}(y^{−}|x)/p_{nce}(y^{−}).
Observation: Surprisingly, this technique rarely worked well and we’re not quite sure why, since theoretically it is supposed to work.
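To make the entropy regularizer from the first trick concrete, here is a small numerical sketch of R_{ent} (a generic illustration; the function name and toy distributions are our own for this example):

```python
import numpy as np

def entropy_regularizer(probs, k):
    """R_ent = max(0, c − H(g)), where c = log(k) is the entropy of a
    uniform distribution over k choices.  Penalizes a generator whose
    categorical distribution collapses below that entropy level."""
    h = -np.sum(probs * np.log(probs + 1e-12))
    c = np.log(k)
    return max(0.0, c - h)

uniform = np.full(4, 0.25)                    # high entropy: no penalty
collapsed = np.array([0.97, 0.01, 0.01, 0.01])  # collapsed: penalized
```

A generator spread over at least k choices pays no penalty, while one that has collapsed onto a favourite negative incurs a positive loss term pushing it back toward diversity.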
We tested ACE on three different embedding tasks: word embeddings, order embeddings, and knowledge graph embeddings. You can check out our paper for a more detailed analysis, but the moral of the story is that whenever learning required a contrastive objective, ACE worked better than NCE; in fact, it usually lifted simple baseline models to be competitive with or superior to the state of the art (SOTA).
As expected when using ACE, we saw that learning was more sample efficient and often converged to higher values than just vanilla NCE. We also validated the hypothesis that ACE samples were harder based on the third plot (see graph below), which shows a higher loss incurred by ACE samples than their NCE counterparts.
We also did an ablation study using the various training tricks listed above on learning knowledge graph embeddings. The punchline is this: ACE always worked better than NCE and entropy regularization was critical for this task.
The generator needed to construct a categorical distribution over all possible actions in order to sample, which can be very expensive if the action space is large. And while ACE ended up being more sample efficient, it may not be faster in terms of wall-clock time. However, most embedding models are geared toward some downstream task (e.g., word embeddings used in language models), so we argue that spending more time on a better embedding model via ACE is worth the effort. Another limitation of ACE is that with the adversarial generator, we lose the guarantee of Maximum Likelihood Estimation (MLE), which NCE provides in an asymptotic limit. This loss is shared by any GAN-based method, and currently, to the best of our knowledge, there aren’t sufficient tools for analyzing the equilibrium of a min-max game whose players are parametrized by deep neural nets.
Regardless of the problem or dataset, if learning involves a contrastive objective, ACE works by sampling Harder negatives, which leads to Better convergence, while being Faster in the number of samples seen, resulting in an overall Stronger model.
The authors would like to thank Prof. Matt Taylor and Prof. Jackie Cheung for helpful feedback, and Jordana Feldman for help with writing. Also, an immense thank you to April Cooper for the gorgeous visuals. Finally, thanks to countless others for the espresso talks; you know who you are.
Although RL has had many successes, often significant amounts of data are needed before achieving high-quality performance. In this paper we consider two distinct settings. The first setting assumes the presence of a human who can provide demonstrations (e.g., play a video game). These demonstrations can be used to change the agent’s action selection method. For example, rather than the normal “explore vs. exploit” tradeoff, we add an additional “execute” choice. Now the agent can choose between taking a random action (explore), taking the action it thinks is best based on its Q-values (exploit), and taking the action it thinks the demonstrator would have taken (execute). Initially, the probability of execute will be high, attempting to mimic the human, but over time this probability will be decreased so that the agent’s learned knowledge can allow it to outperform the human demonstrator. Another way the demonstration can be used is to use inverse reinforcement learning so that the agent can learn a potential-based shaping function. This shaping reward is guaranteed not to change the optimal policy but can allow the agent to learn much faster. In both cases, just a few minutes of human demonstration can have a dramatic effect on the total performance, and the agent can learn to outperform the human.
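The three-way action selection described above can be sketched as follows (a minimal illustration under our own naming, not the paper's implementation; `demo_policy` stands in for the model of the demonstrator):

```python
import random

def select_action(q_values, demo_policy, state, epsilon, p_execute):
    """Explore / exploit / execute action selection.
    With probability p_execute, imitate the demonstrator; otherwise fall
    back to standard epsilon-greedy over the agent's Q-values.
    p_execute is annealed toward 0 over training, so the agent's own
    learned knowledge eventually dominates."""
    n_actions = len(q_values[state])
    r = random.random()
    if r < p_execute:
        return demo_policy(state)                     # execute
    if r < p_execute + epsilon:
        return random.randrange(n_actions)            # explore
    return max(range(n_actions), key=lambda a: q_values[state][a])  # exploit

# Toy usage: one state with two actions, demonstrator always picks action 0.
q = {"s": [0.1, 0.9]}
demo = lambda s: 0
a = select_action(q, demo, "s", epsilon=0.1, p_execute=0.8)
```

Early in training (p_execute near 1) the agent mostly mimics the human; as p_execute decays, it transitions to acting on its own Q-values and can surpass the demonstrator.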
For the second setting, consider the case where no environmental reward signal is defined, but a human trainer can provide positive and negative feedback. If a human can train a dog, why not build algorithms to let them train an agent? Our work presents algorithms that allow non-technical human trainers to teach both simple concepts (e.g., a contextual multi-armed bandit) as well as more complex concepts like “bring the white chair to the purple room.” In particular, we treat human feedback as categorical, rather than numeric. Our Bayesian reasoning algorithm then maximizes the likelihood of the human’s desired policy, based on the history of feedback it has received.
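To give a flavour of Bayesian reasoning over categorical feedback (this is a simplified toy, not the algorithm from the paper; the policy encoding and feedback likelihoods are invented for the example), one can maintain a posterior over candidate target policies and update it after each piece of feedback:

```python
def update_posterior(posterior, likelihoods, state, action, feedback):
    """One Bayesian update over candidate target policies.
    `likelihoods[feedback]` = (P(feedback | action matches the policy),
                               P(feedback | action does not match))."""
    p_match, p_mismatch = likelihoods[feedback]
    new = {}
    for policy, prob in posterior.items():
        lik = p_match if policy[state] == action else p_mismatch
        new[policy] = prob * lik
    z = sum(new.values())
    return {p: v / z for p, v in new.items()}

# Two candidate one-state policies (tuples mapping state index -> action);
# the trainer says "good" after the agent takes action 1 in state 0.
policies = {(0,): 0.5, (1,): 0.5}
likelihoods = {"good": (0.8, 0.2), "bad": (0.2, 0.8)}
post = update_posterior(policies, likelihoods, state=0, action=1, feedback="good")
```

Because feedback is treated as categorical evidence rather than a numeric reward, the same machinery handles trainers with very different feedback styles.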
While encouraging, we believe that there are many additional ways agents can leverage the knowledge and biases of non-technical users. My International Joint Conferences on Artificial Intelligence (IJCAI) Early Career Talk will take place in Stockholm on July 17th and will discuss the work summarized in this post. The paper this blog post was based on, which is also the basis of my talk, can be found at https://www.borealisai.com/en/publications.