Deep reinforcement learning (DRL) has had many successes on complex tasks but is typically considered a black box. Opening this black box would enable better understanding of and trust in the model, helping researchers and end users interact with the learner more effectively. In this paper, we propose a new visualization to better analyze DRL agents and present a case study using the Pommerman benchmark domain. The visualization combines two methods with a proven record of improving human understanding of systems: saliency mapping and immersive visualization.
Located at the heart of the University of Waterloo and housed in the new Evolv1 building, a leader in sustainable design and the first zero-carbon office building in Canada, Borealis AI’s office features a unique design that draws inspiration from campus life, with a teacher’s lounge, a science lab, a track, and a field pitch.
It was a natural fit for Borealis AI to establish its fifth research lab in Waterloo, a city anchored by the University of Waterloo, a world-class institution, and flanked by a number of innovative AI startups and tech companies. Our Waterloo centre strengthened our existing ties to the city and its strong research community, dating back to Borealis AI’s early days in 2016. Borealis AI is a proud supporter of Waterloo.ai, the university’s artificial intelligence institute.
True to our vision of supporting Waterloo’s AI community, we are also pleased to announce Borealis AI’s support for the Leader’s Prize at True North, powered by COMMUNITECH. The Prize is “a national competition that challenges Canadian thinkers to solve a major societal or industry problem of global proportion and consequence.” Teams will compete in employing AI/ML to produce solutions that automate “the fact-checking process and flag whether a claim is true or false.” Professor Pascal Poupart, Principal Researcher at Borealis AI, will be heading up the scientific committee for the competition.
At Borealis AI, we believe that the value of ML research emerges when it is engineered into beautiful products that have a positive impact on the world. Joseph started his career in financial software at Spectra Securities Software (now part of FIS) and then led product groups at Yahoo in the U.S. and 500px in Toronto.
Our teams at Borealis AI are thrilled to be welcoming him on board.
I wanted an opportunity to leverage my experience and passion for tech, and machine learning specifically, to build something big that benefits Canada. I had just spent three years working for Communitech, a not-for-profit, helping startups, scale-ups and large organizations learn how to use data to drive innovation. It was time to get back in the game. Through the vision of Dave McKay and Foteini, and with the hard work of an amazing team, Borealis AI had built a world-class, integrated team of AI researchers and engineers. It had a huge, friendly market in RBC, with over 86,000 employees serving 16 million clients in 35 countries: the scale to drive real impact. It was a once-in-a-lifetime chance.
We’re a distributed organization; our leadership is spread across Vancouver, Edmonton, Waterloo, Toronto, and Montreal. This is critical not only to our ability to recruit, but also to our ability to make an impact across Canada. We try to be very intentional in how we communicate so that we can work well despite the distance.
As we accelerate our work on applications it's critical that we have access to amazing engineering, product and design talent to complement our great research team. Waterloo is an amazing tech ecosystem, with a strong history of revolutionary products based on deep tech: BlackBerry, Open Text, Sandvine, Miovision and so on. Everyone forgets how technically challenging BlackBerry was. Those technical and business skills, and that passion for deep tech, are very real and make Waterloo unique.
And, of course, we train the best software developers in the world, which you need to apply ML at scale. I can see them walking to Davis Centre from my desk. So that's why Borealis AI and RBC continue to bet on Waterloo.
First, the opportunity: the financial system is the plumbing of the global economy. It's this foundational thing that changes very rapidly and has huge impact. Everywhere I look there are problems to solve and opportunities to unlock. I'm convinced that AI will revolutionize large areas of human endeavour very soon, and a bank is a great place to observe and effect that change. So we're in a problem-rich environment with lots of long levers to pull.
And second, the team. With our roots in research, we have incredible capacity in ML. I've had to suspend my disbelief and take a wide view as to what we can and can't do with AI. Dr. Simon Prince, one of our directors of research, put it best in my first week: "you bring us the user problem, the business opportunity, and let us figure out if or how AI can solve it." AI cannot, yet, do anything a human can, but my starting point is to assume that it can, and then to let the research and engineering teams help us understand any constraints, and move beyond them.
This is something that we've been wrestling with. Machine learning is still an immature field; capabilities are changing so fast that it's sometimes hard to separate the impossible from the very hard. User research is difficult to do without real results and real screens. Most times, when you're building an app or website you can get useful user feedback just by walking them through static mockups or a prototype. With a machine learning problem, the performance of the underlying algorithm is hugely important to the overall experience.
It's also hard to turn a business problem into a machine learning problem. Doing so takes skill, and requires that you slice the business problem from multiple directions. You need to understand what training data, simulation environments, or other signal is available. You also need to work with very smart people from very different disciplines with different temperaments and perspectives towards a common goal.
And finally, we're a product organization that's been built around a world-class research team. To earn our keep we need to be finding problems and applications of ML that move us towards unsolved problems. We've got to keep chasing huge business opportunities with an element of scientific risk.
To build a worldclass product culture in Borealis AI, and to ship products that make people stop and say wow. We have the potential to develop and deploy some amazing, transformative technology that creates a big impact and I want to get us there, and quickly.
Today, Borealis AI announced it will collaborate with MILA to support a machine learning research initiative on climate change.
Climate change is indisputably one of the biggest challenges of our time. Global temperature rise, glacial retreat, sea-level rise and extreme weather events are just a few examples of the impact that humans are having on Earth. While modern society is at the centre of this change, there is currently a disconnect between human responsibility and awareness. People have a hard time understanding how climate change affects them personally and what it means for their future.
MILA researchers, led by Prof Yoshua Bengio, have developed computer vision algorithms to personalize the effects of extreme weather events on locations of interest. Given an address, the machine learning model generates a photorealistic image that visualises the impact of extreme weather phenomena in that region, as predicted by a climate model for that geography. Generative models are used to synthesize images showing flooding and other weather effects that are hyper-personalized, depicting your own home or street.
This project falls under MILA’s research portfolio on “AI for Humanity” which involves a number of projects that are socially responsible and beneficial to society.
Humans can recognize new object classes from very few instances. However, most machine learning techniques require thousands of examples to achieve similar performance. The goal of few-shot learning is to classify new data having seen only a few training examples. In the extreme, there might be only a single example of each class (one-shot learning). In practice, few-shot learning is useful when training examples are hard to find (e.g., cases of a rare disease), or where the cost of labelling data is high.
Few-shot learning is usually studied using N-way-K-shot classification. Here, we aim to discriminate between $N$ classes with $K$ examples of each. A typical problem size might be to discriminate between $N=10$ classes with only $K=5$ training examples of each. We cannot train a classifier using conventional methods here; any modern classification algorithm will depend on far more parameters than there are training examples, and will generalize poorly.
If the data is insufficient to constrain the problem, then one possible solution is to gain experience from other, similar problems. To this end, most approaches characterize few-shot learning as a meta-learning problem.
In the classical learning framework, we learn how to classify from training data and evaluate the results using test data. In the meta-learning framework, we learn how to learn to classify given a set of training tasks, and evaluate using a set of test tasks (figure 1). In other words, we use one set of classification problems to help solve other, unrelated ones.
Here, each task mimics the few-shot scenario, so for N-way-K-shot classification, each task includes $N$ classes with $K$ examples of each. These are known as the support set for the task and are used for learning how to solve it. In addition, there are further examples of the same classes, known as the query set, which are used to evaluate performance on this task. The tasks can be completely non-overlapping; we may never see the classes from one task in any of the others. The idea is that the system repeatedly sees instances (tasks) during training that match the structure of the final few-shot task, but contain different classes.
At each step of meta-learning, we update the model parameters based on a randomly selected training task. The loss function is determined by the classification performance on the query set of this training task, based on knowledge gained from its support set. Since the network is presented with a different task at each time step, it must learn how to discriminate between data classes in general, rather than between a particular subset of classes.
To evaluate few-shot performance, we use a set of test tasks. Each contains only unseen classes that did not appear in any of the training tasks. For each test task, we measure performance on the query set based on knowledge of its support set.
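The episodic structure described above can be sketched as a simple sampling routine. This is a minimal illustration, not the implementation from any particular paper; the function and variable names are hypothetical, and data is assumed to be stored as a dict from class label to a list of examples.

```python
import random

def sample_task(data_by_class, n_way=5, k_shot=3, q_queries=2):
    """Sample one N-way-K-shot task (episode).

    data_by_class: dict mapping class label -> list of examples.
    Returns (support, query): lists of (example, label) pairs, with labels
    re-indexed 0..n_way-1 so that the classes differ from task to task.
    """
    classes = random.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        # Draw disjoint support and query examples for this class.
        examples = random.sample(data_by_class[c], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query
```

At each meta-learning step, one such task would be drawn from the training classes; at evaluation time, the same routine would be run over a disjoint pool of test classes.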
Approaches to meta-learning are diverse and there is no consensus on the best approach. However, there are three distinct families, each of which exploits a different type of prior knowledge:
Prior knowledge about similarity: We learn embeddings in training tasks that tend to separate different classes even when they are unseen.
Prior knowledge about learning: We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.
Prior knowledge of data: We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.
An overview of these methods can be seen in figure 2. In this review, we will consider each family of methods in turn.
This family of algorithms aims to learn compact representations (embeddings) in which the data vector is mostly unaffected by intra-class variation but retains information about class membership. Early work focused on pairwise comparators, which judge whether two data examples are from the same or different classes, even though the system may not have seen these classes before. Subsequent research focused on multiclass comparators, which allow assignment of new examples to one of several classes.
Pairwise comparators take two examples and classify them as belonging to either the same or different classes. This differs from the standard N-way-K-shot configuration and does not obviously map onto the above description of meta-learning, although, as we will see later, there is in fact a close relationship.
Koch et al. (2015) trained a model that outputs the probability $Pr(y_a=y_{b})$ that two data examples $\mathbf{x}_{a}$ and $\mathbf{x}_{b}$ belong to the same class (figure 3a). The two examples are passed through identical multilayer neural networks (hence Siamese) to create two embeddings. The component-wise absolute difference between the embeddings is computed and passed to a subsequent comparison network that reduces this distance vector to a single number. This is passed through a sigmoidal output for classification as same or different, with a cross-entropy loss.
During training, each pair of examples is randomly drawn from a superset of training classes. Hence, the system learns to discriminate between classes in general, rather than between two classes in particular. In testing, completely different classes are used. Although this does not have the formal structure of the N-way-K-shot task, the spirit is similar.
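As a rough sketch of the Siamese comparison step, the snippet below computes a same-class probability in the style of Koch et al.: identical embeddings, a component-wise absolute difference, and a learned linear comparison layer with a sigmoid output. Here `embed`, `w`, and `b` are hypothetical stand-ins for trained components (in the trained model, `w` would typically map large distances to low probabilities).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def siamese_same_prob(x_a, x_b, embed, w, b):
    """Probability that x_a and x_b share a class.

    embed: embedding network (shared between both inputs, hence 'Siamese').
    w, b: parameters of the learned linear comparison layer.
    """
    d = np.abs(embed(np.asarray(x_a)) - embed(np.asarray(x_b)))  # |e_a - e_b|
    return float(sigmoid(w @ d + b))  # scalar probability in (0, 1)
```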
Triplet networks (Hoffer & Ailon 2015) consist of three identical networks that are trained with triplets $\{\mathbf{x}_{+},\mathbf{x}_{a},\mathbf{x}_{-}\}$ of the form (positive, anchor, negative). The positive and anchor samples are from the same class, whereas the negative sample is from a different class. The learning criterion is the triplet loss, which encourages the anchor to be closer to the positive example than it is to the negative example in the embedding space (figure 3b). Hence it is based on two pairwise comparisons.
After training, the system can take two examples and establish whether they are from the same or different classes, by thresholding the distance in the learned embedding space. This was employed in the context of face verification by Schroff et al. (2015). This line of work is part of a greater literature on learning distance metrics (see Suarez et al. 2018 for overview).
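The triplet criterion itself is compact. Below is a minimal numpy version using squared Euclidean distances and a margin; this is a common formulation, though not the only one in the literature.

```python
import numpy as np

def triplet_loss(e_pos, e_anchor, e_neg, margin=1.0):
    """Margin-based triplet loss on precomputed embeddings.

    Zero when the anchor is at least `margin` (in squared distance)
    closer to the positive than to the negative; positive otherwise.
    """
    d_pos = np.sum((e_anchor - e_pos) ** 2)   # anchor-to-positive distance
    d_neg = np.sum((e_anchor - e_neg) ** 2)   # anchor-to-negative distance
    return max(d_pos - d_neg + margin, 0.0)
```

Minimizing this over many triplets pulls same-class embeddings together and pushes different-class embeddings apart, which is exactly the property exploited at test time by thresholding distances.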
Pairwise comparators can be adapted to the N-way-K-shot setting by assigning the class of a query example based on its maximum similarity to one of the examples in the support set. However, multiclass comparators attempt to do the same thing in a more principled way; here the representation and final classification are learned in an end-to-end fashion.
In this section, we'll use the notation $\mathbf{x}_{nk}$ to denote the $k$th support example from the $n$th class in the N-way-K-shot classification task, and $y_{nk}$ to denote the corresponding label. For simplicity, we'll assume there is a single query example $\hat{\mathbf{x}}$ and the goal is to predict the associated label $\hat{y}$.
Matching networks (Vinyals et al. 2016) predict the one-hot encoded query-set label $\hat{\mathbf{y}}$ as a weighted sum of all of the one-hot encoded support-set labels $\{\mathbf{y}_{nk}\}_{n,k=1}^{N,K}$. The weight is based on a computed similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$ between the query-set data $\hat{\mathbf{x}}$ and each training example $\{\mathbf{x}_{nk}\}_{n,k=1}^{N,K}$:
\begin{equation}
\hat{\mathbf{y}} = \sum_{n=1}^{N}\sum_{k=1}^{K} a[\mathbf{x}_{nk},\hat{\mathbf{x}}]\mathbf{y}_{nk} \tag{1.1}
\end{equation}
where the similarities have been constrained to be positive and sum to one.
To compute the similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$, they pass each support example $\mathbf{x}_{nk}$ through a network $\mbox{ f}[\bullet]$ to produce an embedding and pass the query example $\hat{\mathbf{x}}$ through a different network $\mbox{ g}[\bullet]$ to produce a different embedding. They then compute the cosine similarity between these embeddings (figure 5a)
\begin{equation}
d[\mathbf{x}_{nk}, \hat{\mathbf{x}}] = \frac{\mbox{ f}[\mathbf{x}_{nk}]^{T}\mbox{ g}[\hat{\mathbf{x}}]} {\|\mbox{ f}[\mathbf{x}_{nk}]\|\cdot\|\mbox{ g}[\hat{\mathbf{x}}]\|}, \tag{1.2}
\end{equation}
and normalise using a softmax function:
\begin{equation}
a[\mathbf{x}_{nk},\hat{\mathbf{x}}] = \frac{\exp[d[\mathbf{x}_{nk},\hat{\mathbf{x}}]]}{\sum_{m=1}^{N}\sum_{l=1}^{K}\exp[d[\mathbf{x}_{ml},\hat{\mathbf{x}}]]}. \tag{1.3}
\end{equation}
to produce positive similarities that sum to one. This system can be trained end to end for the N-way-K-shot learning task.^{1} At each learning iteration, the system is presented with a training task; the predicted labels are computed for the query set (the calculation is based on the support set) and the loss function is the cross-entropy of the ground truth and predicted labels.
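Equations 1.1-1.3 can be traced in a few lines of numpy. This sketch assumes the embeddings have already been produced by the two networks, and the names are illustrative rather than from the original paper.

```python
import numpy as np

def matching_net_predict(f_support, g_query, y_support):
    """Matching-network-style prediction from precomputed embeddings.

    f_support: (NK, D) support embeddings from network f.
    g_query:   (D,)    query embedding from network g.
    y_support: (NK, N) one-hot support labels.
    Returns a length-N vector of class probabilities.
    """
    # Cosine similarities (eq. 1.2).
    norms = np.linalg.norm(f_support, axis=1) * np.linalg.norm(g_query)
    d = f_support @ g_query / norms
    # Softmax attention over all support examples (eq. 1.3).
    a = np.exp(d - d.max())
    a /= a.sum()
    # Attention-weighted sum of one-hot labels (eq. 1.1).
    return a @ y_support
```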
Matching networks compute similarities between the embeddings of each support example and the query example. This has the disadvantage that the algorithm is not robust to data imbalance; if there are more support examples for some classes than others (i.e., we have departed from the N-way-K-shot scenario), classes with more training data may dominate the prediction.
Prototypical networks (Snell et al. 2017) are robust to data imbalance by construction; they average the embeddings $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ of the examples for class $n$ to compute their mean embedding or prototype $\mathbf{p}_{n}$. They then use the similarity between each prototype and the query embedding (figures 4 and 5b) as a basis for classification.
The similarity is computed as the negative squared Euclidean distance (so that larger distances give smaller numbers). They pass these similarities to a softmax function to give a probability over classes. This model effectively learns a metric space in which the average of a few examples of a class is a good representation of that class, and class membership can be assigned based on distance.
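The prediction step of a prototypical network is short enough to sketch directly; again this assumes precomputed embeddings, with illustrative names.

```python
import numpy as np

def proto_net_predict(z_support, z_query):
    """Prototypical-network-style class probabilities.

    z_support: (N, K, D) support embeddings for N classes, K shots each.
    z_query:   (D,)      query embedding.
    """
    prototypes = z_support.mean(axis=1)                    # (N, D) class means
    neg_d = -np.sum((prototypes - z_query) ** 2, axis=1)   # negative squared distances
    p = np.exp(neg_d - neg_d.max())                        # softmax over classes
    return p / p.sum()
```

Because each class contributes exactly one prototype regardless of how many support examples it has, the prediction is insensitive to class imbalance in the support set.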
They noted that (i) the choice of distance function is vital as squared Euclidean distance outperformed cosine distance, (ii) having a higher number of classes in the support set helps to achieve better performance, and that (iii) the system works best when the support size of each class is matched in the training and test tasks.
Ren et al. (2018) extended this system to take advantage of additional unlabeled data, which might come from the test-task classes or from other distractor classes. Oreshkin et al. (2018) extended the approach by learning a task-dependent metric on the feature space, so that the distance metric changes from place to place in the embedding space.
Matching networks and prototypical networks both focus on learning the embedding and compare examples using a predefined metric (cosine and Euclidean distance, respectively). Relation networks (Santoro et al. 2016) also learn the metric used to compare the embeddings (figure 5c). Similarly to prototypical networks, the relation network averages the embeddings of each class in the support set to form a single prototype. Each prototype is then concatenated with the query embedding and passed to a relation module. This is a learnable non-linear operator that produces a similarity score between 0 and 1, where 1 indicates that the query example belongs to this class. This approach is clean and elegant and can be trained end-to-end.
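The scoring step of a relation network can be sketched as follows; `relation_mlp` is a hypothetical stand-in for the trained relation module, which in practice is a small learned neural network.

```python
import numpy as np

def relation_scores(z_support, z_query, relation_mlp):
    """Relation-network-style similarity scores.

    z_support:    (N, K, D) support embeddings.
    z_query:      (D,)      query embedding.
    relation_mlp: callable mapping a concatenated (prototype, query)
                  vector to a similarity score in [0, 1].
    Returns an (N,) array of scores, one per class.
    """
    prototypes = z_support.mean(axis=1)  # (N, D) class prototypes
    pairs = [np.concatenate([p, z_query]) for p in prototypes]
    return np.array([relation_mlp(pair) for pair in pairs])
```

The query is assigned to the class whose prototype receives the highest score; unlike matching and prototypical networks, the comparison itself is learned rather than fixed to cosine or Euclidean distance.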
All of the pairwise and multiclass comparators are closely related to one another. Each learns an embedding space for data examples. In matching networks, there are different embeddings for support and query examples, but in the other models, they are the same. For prototypical networks and relation networks, multiple embeddings from the same class are averaged to form prototypes. Distances between support set embeddings/prototypes and query set embeddings are computed using either predetermined distance functions such as Euclidean or cosine distance (triplet networks, matching networks, prototypical networks) or by learning a distance metric (Siamese networks and relation networks).
The multiclass networks have the advantage that they can be trained end-to-end for the N-way-K-shot classification task. This is not true of the pairwise comparators, which are trained to produce a similarity or distance between pairs of data examples (which could itself subsequently be used to support multiclass classification).
Although it is not obvious how the pairwise comparators map to the meta-learning framework, it is possible to view their data as consisting of minimal training and test tasks. For Siamese networks, each pair of examples is a training task, consisting of one support example and one query example, whose classes may or may not match. For triplet networks, there are two support examples (from different classes) and one query example (from one of the two classes).
In part I of this tutorial we have described the few-shot and meta-learning problems and introduced a taxonomy of methods. We have also discussed methods that use a series of training tasks to learn prior knowledge about the similarity and dissimilarity of classes that can be exploited for future few-shot tasks. This knowledge takes the form of data embeddings that reduce within-class variance relative to between-class variance, and hence make it easier to learn from just a few data points.
In part II of this tutorial, we'll discuss methods that incorporate prior knowledge about how to learn models, and that incorporate prior knowledge about the data itself.
^{1}Vinyals et al. (2016) also introduced a novel context embedding method which took the full context of the support set $\mathcal{S}$ into account so that $\mbox{ g}[\bullet] = \mbox{ g}[\mathbf{x}, \mathcal{S}]$. Here, the support set was considered as a sequence and encoded by a bidirectional LSTM. Snell et al. (2017) later argued that this context embedding was problematic and redundant.
Adversarial training (Madry et al. 2017) directly optimizes for adversarial robustness by (i) minimizing the loss $\mathcal{L}[\bullet]$ over $I$ data/label pairs $\{\mathbf{x}_{i},y_{i}\}$ while simultaneously (ii) maximizing the loss for each example with respect to an adversarial perturbation $\boldsymbol\delta_{i}$:
\begin{equation}
\min_{\boldsymbol\phi}\frac{1}{I} \sum_{i=1}^{I} \max_{\|\boldsymbol\delta_{i}\| \leq \epsilon} \mathcal{L}\left[\mbox{f}[\mathbf{x}_{i} + \boldsymbol\delta_{i},\boldsymbol\phi], y_{i}\right], \tag{1.1}
\end{equation}
where $\boldsymbol\delta_{i}$ is constrained to lie within a specified $\epsilon$-ball and $\mbox{f}[\bullet,\boldsymbol\phi]$ is the network function with parameters $\boldsymbol\phi$.
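In practice, the inner maximization is approximated with iterative gradient methods. Below is a minimal $l_\infty$-constrained projected-gradient-ascent sketch (a momentum-free simplification of the attacks discussed later); `grad_fn` is a hypothetical stand-in for backpropagation through the network.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=16 / 255, steps=10):
    """Approximate the inner max of eq. 1.1 under an l-inf constraint.

    grad_fn(x_adv) must return the gradient of the loss w.r.t. the input.
    """
    alpha = eps / steps                  # step size: reach the ball edge in `steps`
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta += alpha * np.sign(grad_fn(x + delta))  # ascent step on the loss
        delta = np.clip(delta, -eps, eps)             # project onto the eps-ball
    return x + delta
```

A real attack would also clip the result to the valid input range (e.g. [0, 1] for images); that step is omitted here for brevity.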
Unfortunately, generating adversarial examples is a non-convex optimization problem, and so this worst-case objective can only be approximately solved (Kolter & Madry 2018). Finding a lower bound is equivalent to finding an adversarial sample, and empirically it has been observed that search algorithms almost exclusively produce high frequency solutions (Guo et al. 2018). These are samples with small pixel-wise perturbations dispersed across an image. This suggests that defenses designed to counter such perturbations may be vulnerable to low frequency solutions, which is the hypothesis we focused on analyzing in our latest paper (Sharma et al. 2019).
Recent work has shown the effectiveness of low frequency perturbations. Guo et al. (2018) improved the query efficiency of the decision-based, gradient-free boundary attack (Brendel et al. 2017) by constraining the perturbation to lie within a low frequency subspace. Sharma et al. (2018) applied a 2D Gaussian filter on the gradient with respect to the input image during the iterative optimization process to win the CAAD 2018 competition.
However, it remains unclear exactly when and why low frequency perturbations are effective.
To answer these questions, we utilize the discrete cosine transform (DCT) to test the effectiveness of perturbations manipulating specified frequency components. We remove certain frequency components of the perturbation $\boldsymbol\delta$ by applying a mask to its DCT transform $\text{DCT}(\boldsymbol\delta)$. We then reconstruct the perturbation by applying the inverse discrete cosine transform (IDCT) to the masked DCT transform:
\begin{align}
\text{FreqMask}[\boldsymbol\delta]=\text{IDCT}[\text{Mask}[\text{DCT}[\boldsymbol\delta]]]~. \tag{1.2}
\end{align}
Accordingly, our attack uses the following gradient:
\begin{equation}
\nabla_{\boldsymbol\delta} \mathcal{L}[\mathbf{x}+\text{FreqMask}[\boldsymbol\delta],y]. \tag{1.3}
\end{equation}
As can be seen in figure 1, the condition DCT_High preserves only high frequency components; DCT_Low only low frequency components; DCT_Mid only mid frequency components; and DCT_Rand preserves randomly sampled components. For a given reduced dimensionality $n$, we preserve $n \times n$ components. Note that when $n=128$, we preserve only $128^2 / 299^2 \approx 18.3\%$ of the frequency components, a relatively small fraction of the original unconstrained perturbation.
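As an illustration of equation 1.2, the snippet below applies a frequency mask to a single-channel square perturbation using a hand-rolled orthonormal DCT-II (the paper operates on 299x299 colour images; the shapes and function names here are illustrative).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix, so that C @ C.T = I."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= np.sqrt(1 / n)
    C[1:] *= np.sqrt(2 / n)
    return C

def freq_mask(delta, keep, band="low"):
    """Eq. 1.2: DCT the perturbation, zero components outside the kept
    band, then inverse-DCT. `keep` is the side length n of the preserved
    n x n block; low frequencies sit in the top-left corner of the DCT."""
    C = dct_matrix(delta.shape[0])
    D = C @ delta @ C.T                 # 2-D DCT
    mask = np.zeros_like(D)
    if band == "low":
        mask[:keep, :keep] = 1
    elif band == "high":
        mask[-keep:, -keep:] = 1
    return C.T @ (D * mask) @ C         # inverse 2-D DCT
```

With `keep` equal to the full side length and `band="low"`, the mask is all-ones and the perturbation is recovered exactly, since the DCT matrix is orthogonal.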
Though adversarial examples are defined in terms of inducing any decision change, the attack can be restricted further by counting it as successful only if the decision changes to a specific target. We evaluate attacks both with and without specified targets, termed in the literature targeted and non-targeted attacks, respectively. We use the ImageNet dataset, whose 1000 distinct classes make targeted attacks significantly harder.
We use $l_\infty$-constrained projected gradient descent (Kurakin et al. 2016; Madry et al. 2017; Kolter & Madry 2018) with momentum (Dong et al. 2017), referred to as the momentum iterative method or MIM for short. We test $\epsilon=16/255$ and $\text{iterations}=[1,10]$ for the non-targeted case; $\epsilon=32/255$ and $\text{iterations}=10$ for the targeted case. We benchmark the attack with and without frequency constraints. For each mask type, we test $n=[256,128,64,32]$ with $d = 299$. For DCT_Rand, we average results over $3$ random seeds.
Furthermore, we evaluate attacks in the white-box, grey-box, and black-box settings. For each setting, given models $A$ and $B$, where the perturbation is generated on $A$, evaluation is conducted on $A$, "defended" $A$, and a distinct model $B$, respectively. For defenses, we use the top-4 winners of the NeurIPS 2017 competition (Kurakin et al. 2018), all of which were prepended to the strongest adversarially trained model released at the time (Tramer et al. 2017).^{1} As our representative undefended model, we evaluate against the state-of-the-art model found by neural architecture search (Zoph et al. 2017).^{2}
Low frequency perturbations can be generated more efficiently (figure 2) and appear more effective (figures 4 and 5) when evaluated against defended models. However, against undefended models (figure 3), no tangible benefit can be observed.
This can be seen more clearly when tracking each individual source-target pair (figure 6). Specifically, we can see that the NeurIPS 2017 competition winners provide almost no additional robustness to the underlying model when low frequency perturbations are applied.
However, we do observe that low frequency perturbations do not improve black-box transfer between undefended models. Figure 7 presents the normalized difference between the attack success rate (ASR) on each target model and the ASR on the undefended model, showing that defended models are roughly as vulnerable as undefended models when confronted with low frequency perturbations.
Our results demonstrate that given the same search space size, only low frequency perturbations yield performance improvement, namely in generation efficiency and effectiveness when evaluated against defended ImageNet models. When confronted with low frequency perturbations, the top-4 NeurIPS 2017 defenses provide no robustness benefit, and are roughly as vulnerable as undefended models. The question remains though: does adversarially perturbing the low frequency components of the input affect human perception?
Representative examples are shown in figures 8 and 9. Though the perturbations do not significantly change human perceptual judgement (e.g., the top example still appears to be a standing woman), the perturbations with $n\leq 128$ are indeed perceptible. Although it is well-known that $\ell_p$-norms (in input space) are far from aligned with human perception, it is still assumed that with a small enough bound (e.g., $\ell_\infty$ with $\epsilon=16/255$), the resulting ball will constitute a subset of the imperceptible region (Kolter & Madry 2018). The fact that low frequency perturbations are fairly visible challenges this common belief.
In all, we hope our study encourages researchers to not only consider the frequency space, but perceptual priors in general, when bounding perturbations and proposing tractable, reliable defenses.
^{1}https://github.com/tensorflow/models/tree/master/research/adv_imagenet_models
^{2}https://github.com/tensorflow/models/tree/master/research/slim
The unintentional unfairness that occurs when a decision has widely different outcomes for different groups is known as disparate impact. As machine learning algorithms are increasingly used to determine important real-world outcomes such as loan approval, pay rates, and parole decisions, it is incumbent on the AI community to minimize unintentional discrimination.
This tutorial discusses how bias can be introduced into the machine learning pipeline, what it means for a decision to be fair, and methods to remove bias and ensure fairness.
There are many possible causes of bias in machine learning predictions. Here we briefly discuss three: (i) the adequacy of the data to represent different groups, (ii) bias inherent in the data, and (iii) the adequacy of the model to describe each group.
Data adequacy. Infrequent and specific patterns may be downweighted by the model in the name of generalization and so minority records can be unfairly neglected. This lack of data may not just be because group membership is small; data collection methodology can exclude or disadvantage certain groups (e.g., if the data collection process is only in one language). Sometimes records are removed if they contain missing values and these may be more prevalent in some groups than others.
Data bias. Even if the amount of data is sufficient to represent each group, training data may reflect existing prejudices (e.g., that female workers are paid less), and this is hard to remove. Such historical unfairness in data is known as negative legacy. Bias may also be introduced by more subtle means. For example, data from two locations may be collected slightly differently. If group membership varies with location this can induce biases. Finally, the choice of attributes to input into the model may induce prejudice.
Model adequacy. The model architecture may describe some groups better than others. For example, a linear model may be suitable for one group but not for another.
A model is considered fair if errors are distributed similarly across protected groups, although there are many ways to formalize this. Consider taking data $\mathbf{x}$ and using a machine learning model to compute a score $\mbox{f}[\mathbf{x}]$ that will be used to predict a binary outcome $\hat{y}\in\{0,1\}$. Each data example $\mathbf{x}$ is associated with a protected attribute $p$, which in this tutorial we take to be binary, $p\in\{0,1\}$. For example, it might encode subpopulations according to gender or ethnicity.
We will refer to $p=0$ as the deprived population and $p=1$ as the favored population. Similarly we will refer to $\hat{y}=1$ as the favored outcome, assuming it represents the more desirable of the two possible results.
Assume that for some dataset, we know the ground truth outcomes $y\in\{0,1\}$. Note that these outcomes may differ statistically between different populations, either because there are genuine differences between the groups or because the model is somehow biased. According to the situation, we may want our estimate $\hat{y}$ to take account of these differences or to compensate for them.
Most definitions of fairness are based on group fairness, which deals with statistical fairness across the whole population. Complementary to this is individual fairness which mandates that similar individuals should be treated similarly regardless of group membership. In this blog, we'll mainly focus on group fairness, three definitions of which include: (i) demographic parity, (ii) equality of odds, and (iii) equality of opportunity. We now discuss each in turn.
Demographic parity or statistical parity suggests that a predictor is unbiased if the prediction $\hat{y}$ is independent of the protected attribute $p$ so that
\begin{equation}
Pr(\hat{y}\mid p) = Pr(\hat{y}). \tag{2.1}
\end{equation}
Here, the same proportion of each population are classified as positive. However, this may result in different false positive and true positive rates if the true outcome $y$ does actually vary with the protected attribute $p$.
Deviations from statistical parity are sometimes measured by the statistical parity difference
\begin{equation}
\mbox{SPD} = Pr(\hat{y}=1 \mid p=1) - Pr(\hat{y}=1 \mid p=0), \tag{2.2}
\end{equation}
or the disparate impact which replaces the difference in this equation with a ratio. Both of these are measures of discrimination (i.e. deviation from fairness).
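As a concrete sketch of these two discrimination measures (our own illustration, not code from the cited references; the function names, and the convention that the disparate-impact ratio divides the deprived group's rate by the favored group's, are our choices):

```python
import numpy as np

def statistical_parity_difference(y_hat, p):
    """SPD = Pr(y_hat = 1 | p = 1) - Pr(y_hat = 1 | p = 0)."""
    y_hat, p = np.asarray(y_hat), np.asarray(p)
    return y_hat[p == 1].mean() - y_hat[p == 0].mean()

def disparate_impact(y_hat, p):
    """Ratio form: Pr(y_hat = 1 | p = 0) / Pr(y_hat = 1 | p = 1)."""
    y_hat, p = np.asarray(y_hat), np.asarray(p)
    return y_hat[p == 0].mean() / y_hat[p == 1].mean()
```

An SPD near 0 (or a ratio near 1) indicates demographic parity on the dataset in question.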
Equality of odds is satisfied if the prediction $\hat{y}$ is conditionally independent to the protected attribute $p$, given the true value $y$:
\begin{equation}
Pr(\hat{y}\mid y,p) = Pr(\hat{y}\mid y). \tag{2.3}
\end{equation}
This means that the true positive rate and false positive rate will be the same for each population; each error type is matched between each group.
Equality of opportunity has the same mathematical formulation as equality of odds, but is focused on one particular label $y=1$ of the true value so that:
\begin{equation}
Pr(\hat{y}\mid y=1,p) = Pr(\hat{y}\mid y=1). \tag{2.4}
\end{equation}
In this case, we want the true positive rate $Pr(\hat{y}=1\mid y=1)$ to be the same for each population, with no regard for the errors when $y=0$. In effect, it means that among those whose true outcome is the favorable one ($y=1$), the same proportion of each population receives the "good" prediction $\hat{y}=1$.
Deviation from equality of opportunity is measured by the equal opportunity difference:
\begin{equation}
\mbox{EOD} = Pr(\hat{y}=1 \mid y=1, p=1) - Pr(\hat{y}=1 \mid y=1, p=0). \tag{2.5}
\end{equation}
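This difference can be read straight off the per-group true positive rates. A minimal sketch (function name is our own):

```python
import numpy as np

def equal_opportunity_difference(y_hat, y, p):
    """EOD = Pr(y_hat=1 | y=1, p=1) - Pr(y_hat=1 | y=1, p=0):
    the gap between the two groups' true positive rates."""
    y_hat, y, p = map(np.asarray, (y_hat, y, p))
    tpr_favored = y_hat[(y == 1) & (p == 1)].mean()
    tpr_deprived = y_hat[(y == 1) & (p == 0)].mean()
    return tpr_favored - tpr_deprived
```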
To make these ideas concrete, we consider the example of an algorithm that predicts credit rating scores for loan decisions. This scenario follows from the work of Hardt et al. (2016) and the associated blog.
There are two pools of loan applicants $p\in\{0,1\}$ that we'll describe as the blue and yellow populations. We assume that we are given historical data, so we know both the credit rating and whether the applicant actually defaulted on the loan ($y=0$) or repaid it ($y=1$).
We can now think of four groups of data corresponding to (i) the blue and yellow populations and (ii) whether they did or did not repay the loan. For each of these four groups we have a distribution of credit ratings (figure 1). In an ideal world, the two distributions for the yellow population would be exactly the same as those for the blue population. However, as figure 1 shows, this is clearly not the case here.
Why might the distributions for blue and yellow populations be different? It could be that the behaviour of the populations is identical, but the credit rating algorithm is biased; it may favor one population over another or simply be noisier for one group. Alternatively, it could be that the populations genuinely behave differently. In practice, the differences between the blue and yellow distributions are probably attributable to a combination of these factors.
Let's assume that we can't retrain the credit score prediction algorithm; our job is to adjudicate whether each individual is refused the loan ($\hat{y}=0$) or granted it ($\hat{y}=1$). Since we only have the credit score $\mbox{f}[\mathbf{x}]$ to go on, the best we can do is to assign different thresholds $\tau_{0}$ and $\tau_{1}$ for the blue and yellow populations, so that the loan is granted if $\mbox{f}[\mathbf{x}]>\tau_{0}$ for the blue population and $\mbox{f}[\mathbf{x}]>\tau_{1}$ for the yellow population.
We'll now consider different possible ways to set these thresholds that result in different senses of fairness. We emphasize that we are not advocating any particular criterion, but merely exploring the ramifications of different choices.
Blindness to protected attribute: We choose the same threshold for blue and yellow populations. This sounds sensible, but it neither guarantees that the overall frequency of loans, nor the frequency of successful loans, will be the same for the two groups. For the thresholds chosen in figure 2a, many more loans are made to the yellow population than the blue population (figure 2b). Moreover, examination of the receiver operating characteristic (ROC) curve shows that both the rate of true positives $Pr(\hat{y}=1\mid y=1)$ and false alarms $Pr(\hat{y}=1\mid y=0)$ differ for the two groups (figure 2c).
Equality of odds: This definition of fairness proposes that the false positive and true positive rates should be the same for both populations. This also sounds reasonable, but figure 2c shows that it is not possible for this example. There is no combination of thresholds that can achieve this because the ROC curves do not intersect. Even if they did, we would be stuck giving loans based on the particular false positive and true positive rates at the intersection which might not be desirable.
Demographic parity: The threshold could be chosen so that the same proportion of each group are classified as $\hat{y} =1$ and given loans (figure 3). We make an equal number of loans to each group despite the different tendencies of each to repay (figure 3b). This has the disadvantage that the true positive and false positive rates might be completely different in different populations (figure 3c). From the perspective of the lender, it is desirable to give loans in proportion to people's ability to pay them back. From the perspective of an individual in a more reliable group, it may seem unfair that the other group gets offered the same number of loans despite the fact they are less reliable.
Equal opportunity: The thresholds are chosen so that the true positive rate is the same for both populations (figure 4). Of the people who pay back the loan, the same proportion are offered credit in each group. In terms of the two ROC curves, it means choosing thresholds so that the vertical position on each curve is the same, without regard for the horizontal position (figure 2c). However, it means that different proportions of the blue and yellow groups are given loans (figure 4b).
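One way to realize the equal opportunity criterion numerically is to set each group's threshold at a quantile of the scores of that group's known repayers, so both groups end up with approximately the same true positive rate. This is only an illustrative sketch under our own assumptions (a hand-chosen target TPR, scores already computed):

```python
import numpy as np

def equal_opportunity_thresholds(scores, y, p, target_tpr=0.8):
    """Pick a per-group threshold tau_g at the (1 - target_tpr) quantile of
    the scores of that group's true positives (y = 1), so that
    Pr(score > tau_g | y = 1, p = g) is roughly target_tpr for both groups."""
    scores, y, p = map(np.asarray, (scores, y, p))
    taus = {}
    for group in (0, 1):
        positive_scores = scores[(p == group) & (y == 1)]
        taus[group] = np.quantile(positive_scores, 1.0 - target_tpr)
    return taus
```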
We have seen that there is no straightforward way to choose thresholds on an existing classifier for different populations so that all definitions of fairness are satisfied. Now we'll investigate a different approach that aims to make the classification performance more similar for the two populations.
The ROC curves show that accuracy is higher in predicting whether the blue population will repay the loan as opposed to the yellow group (i.e. the blue ROC curve is everywhere higher than the yellow one). What if we try to reduce the accuracy for the blue population so that this more nearly matches? One way to do this is to add noise to the credit score for the blue population (figure 5). As we add increasing amounts of noise the blue ROC curve moves towards the positive diagonal and at some point will cross the yellow ROC curve. Now equality of odds can be achieved.
Unfortunately, this approach has two unattractive features. First, we now make worse decisions for the blue population; it is a general feature of most remedial approaches that there is a trade-off between accuracy and fairness (Kamiran & Calders 2012; Corbett-Davies et al. 2017). Second, adding noise violates individual fairness. Two identical members of the blue population may have different noise values added to their scores, resulting in different decisions on their loans.
The conclusion of the worked loan example is that it is very hard to remove bias once the classifier has already been trained, even for very simple cases. For further information, the reader is invited to consult Kamiran & Calders (2012), Hardt et al. (2016), Menon & Williamson (2017) and Pleiss et al. (2017).
Post-Processing: change thresholds; trade off accuracy for fairness.
In-Processing: adversarial training; regularize for fairness; constrain to be fair.
Pre-Processing: modify labels; modify input data; modify label/data pairs; weight label/data pairs.
Data Collection: identify lack of examples or variates and collect accordingly.
Thankfully, there are approaches to deal with bias at all stages of the data collection, preprocessing, and training pipeline (figure 6). In this section we consider some of these methods. In the ensuing discussion, we'll assume that the true behaviour of the different populations is the same. Hence, we are interested in making sure that the predictions of our system do not differ for each population.
A straightforward approach to eliminating bias from datasets would be to remove the protected attribute and other elements of the data that are suspected to contain related information. Unfortunately, such suppression rarely suffices. There are often subtle correlations in the data that mean that the protected attribute can be reconstructed. For example, we might remove race, but retain information about the subject's address, which could be strongly correlated with the race.
The degree to which there are dependencies between the data $\mathbf{x}$ and the protected attribute $p$ can be measured using the mutual information
\begin{equation}
\mbox{LP} = \sum_{\mathbf{x},p} Pr(\mathbf{x},p) \log\left[\frac{Pr(\mathbf{x},p)}{Pr(\mathbf{x})Pr(p)}\right], \tag{2.6}
\end{equation}
which is known as the latent prejudice (Kamishima et al. 2011). As this measure increases, the protected attribute becomes more predictable from the data. Indeed, Feldman et al. (2015) and Menon & Williamson (2017) have shown that the predictability of the protected attribute puts mathematical bounds on the potential discrimination of a classifier.
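For a discrete feature, the latent prejudice of equation 2.6 is just the empirical mutual information and can be estimated by counting. A minimal sketch (function name is ours):

```python
import numpy as np
from collections import Counter

def latent_prejudice(x, p):
    """Empirical mutual information (in nats) between a discrete
    feature x and the protected attribute p (equation 2.6)."""
    n = len(x)
    joint = Counter(zip(x, p))
    marg_x, marg_p = Counter(x), Counter(p)
    mi = 0.0
    for (xi, pi), count in joint.items():
        pr_joint = count / n
        mi += pr_joint * np.log(pr_joint / ((marg_x[xi] / n) * (marg_p[pi] / n)))
    return mi
```

It is zero when $x$ carries no information about $p$, and grows as $p$ becomes more predictable from $x$.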
We'll now discuss four approaches for removing bias by manipulating the dataset. Respectively, these modify the labels $y$, the observed data $\mathbf{x}$, the data/label pairs $\{\mathbf{x},y\}$, and the weighting of these pairs.
Kamiran & Calders (2012) proposed changing some of the training labels which they term massaging the data. They compute a classifier on the original dataset and find examples close to the decision surface. They then swap the labels in such a way that a positive outcome for the disadvantaged group is more likely and retrain. This is a heuristic approach that empirically improves fairness at the cost of accuracy.
Feldman et al. (2015) proposed manipulating individual data dimensions $x$ in a way that depends on the protected attribute $p$. They align the cumulative distributions $F_{0}[x]$ and $F_{1}[x]$ for feature $x$ when the protected attribute $p$ is 0 and 1 respectively to a median cumulative distribution $F_{m}[x]$. This is similar to standardising test scores across different high schools (figure 7) and is termed disparate impact removal. This approach has the disadvantage that it treats each input variable $x\in\mathbf{x}$ separately and ignores their interactions.
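The quantile-matching idea can be sketched for a single feature as follows; this is our own simplified reading of the method, with the "median" distribution taken as the average of the two groups' quantile functions:

```python
import numpy as np

def remove_disparate_impact(x, p):
    """Map each group's values of one feature onto a common distribution
    by matching ranks (quantiles), in the spirit of Feldman et al. (2015)."""
    x, p = np.asarray(x, dtype=float), np.asarray(p)
    x_repaired = np.empty_like(x)
    for group in (0, 1):
        idx = np.where(p == group)[0]
        ranks = np.argsort(np.argsort(x[idx]))     # rank within the group
        quantiles = (ranks + 0.5) / len(idx)       # mid-rank quantiles
        # "median" distribution: average the two groups' quantile functions
        q0 = np.quantile(x[p == 0], quantiles)
        q1 = np.quantile(x[p == 1], quantiles)
        x_repaired[idx] = 0.5 * (q0 + q1)
    return x_repaired
```

After repair, members of the two groups with the same within-group rank receive the same value, so this feature no longer separates the groups.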
Calmon et al. (2017) learn a randomized transformation $Pr(\mathbf{x}^{\prime}, y^{\prime}\mid\mathbf{x},y,p)$ that transforms data pairs $\{\mathbf{x}, y\}$ to new data values $\{\mathbf{x}^{\prime}, y^{\prime}\}$ in a way that depends explicitly on the protected attribute $p$. They formulate this as an optimization problem in which they minimize the change in data utility, subject to limits on the prejudice and distortion of the original values. They show that this optimization problem may be convex in certain conditions.
Unlike disparate impact removal, this takes into account interactions between all of the data dimensions. However, the randomized transformation is formulated as a probability table, so this is only suitable for datasets with small numbers of discrete input and output variables. The randomized transformation, which must also be applied to test data, also violates individual fairness.
Kamiran & Calders (2012) propose to reweight the $\{\mathbf{x}, \mathbf{y}\}$ tuples in the training dataset so that cases where the protected attribute $p$ predicts that the disadvantaged group will get a positive outcome are more highly weighted. They then train a classifier that makes use of these weights in its cost function. Alternately, they propose resampling the training data according to these weights and using a standard classifier.
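The reweighting scheme can be computed directly from the empirical joint distribution of $y$ and $p$: under the weights $w(p,y) = Pr(p)Pr(y)/Pr(p,y)$, the label and the protected attribute become statistically independent in the weighted data. A minimal sketch (naming is ours):

```python
import numpy as np

def reweighing_weights(y, p):
    """Kamiran & Calders reweighing: weight each (p, y) cell by
    Pr(p)Pr(y) / Pr(p, y), so that y and p look independent
    under the weighted empirical distribution."""
    y, p = np.asarray(y), np.asarray(p)
    w = np.empty(len(y), dtype=float)
    for pv in (0, 1):
        for yv in (0, 1):
            cell = (p == pv) & (y == yv)
            pr_joint = cell.mean()
            if pr_joint > 0:
                w[cell] = (p == pv).mean() * (y == yv).mean() / pr_joint
    return w
```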
In the previous section, we introduced the latent prejudice measure based on the mutual information between the data $\mathbf{x}$ and the protected attribute $p$. Similarly, we can measure the dependence between the labels $y$ and the protected attribute $p$:
\begin{equation}
\mbox{IP} = \sum_{y,p} Pr(y,p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]. \tag{2.7}
\end{equation}
This is known as the indirect prejudice (Kamishima et al. 2011). Intuitively, if there is no way to predict the labels from the protected attribute and vice versa, then there is no scope for bias.
One approach to removing bias during training is to explicitly remove this dependency using adversarial learning. Other approaches include penalizing the mutual information using regularization, or fitting the model under the constraint that it is not biased. We'll briefly discuss each in turn.
Adversarial debiasing (Beutel et al. 2017; Zhang et al. 2018) reduces evidence of protected attributes in predictions by simultaneously trying to fool a second classifier that attempts to guess the protected attribute $p$. Beutel et al. (2017) force both classifiers to use a shared representation, so minimizing the performance of the adversarial classifier means removing all information about the protected attribute from this representation (figure 8).
Zhang et al. (2018) use the adversarial component to predict $p$ from (i) the final classification logits $f[\mathbf{x}]$ (to ensure demographic parity), (ii) the classification logits $f[\mathbf{x}]$ and the true class $y$ (to ensure equality of odds), or (iii) the final classification logits and the true result for just one class (to ensure equality of opportunity).
Kamishima et al. (2011) proposed adding an extra regularization term to the output of a logistic regression classifier that tries to minimize the mutual information between the protected attribute and the prediction $\hat{y}$. They first rearranged the indirect prejudice expression using the definition of conditional probability to get
\begin{eqnarray}
\mbox{PI} &=& \sum_{y,p} Pr(y\mid\mathbf{x},p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]\nonumber\\
&=& \sum_{y,p} Pr(y\mid\mathbf{x},p) \log\left[\frac{Pr(y\mid p)}{Pr(y)}\right]. \tag{2.8}
\end{eqnarray}
Then, they formulate a regularization loss based on the expectation of this over the data set:
\begin{equation}
\mbox{L}_{reg} = \sum_{i}\sum_{\hat{y},p} Pr(\hat{y}_{i}\mid\mathbf{x}_{i},p_{i})\log\left[\frac{Pr(\hat{y}_{i}\mid p_{i})}{Pr(\hat{y}_{i})}\right] \tag{2.9}
\end{equation}
where $i$ indexes the data examples, which they add to the main training loss.
Zafar et al. (2015) formulated unfairness in terms of the covariance between the protected attribute $\{p_{i}\}_{i=1}^{I}$ and the signed distances $\{d[\mathbf{x}_{i},\boldsymbol\theta]\}_{i=1}^{I}$ of the associated feature vectors $\{\mathbf{x}_{i}\}_{i=1}^{I}$ from the decision boundary, where $\boldsymbol\theta$ denotes the model parameters. Let $\overline{p}$ represent the mean value of the protected attribute. They then minimize the main loss function such that the covariance remains within some threshold $t$.
\begin{equation}
\begin{aligned}
& \underset{\boldsymbol\theta}{\text{minimize}}
& & L[\boldsymbol\theta] \\
& \text{subject to}
& & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})d[\mathbf{x}_{i},\boldsymbol\theta] \leq t\\
& & & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})d[\mathbf{x}_{i},\boldsymbol\theta] \geq -t
\end{aligned} \tag{2.10}
\end{equation}
This constrained optimization problem can also be written as a regularized optimization problem in which the fairness constraints are moved to the objective and the corresponding Lagrange multipliers act as regularizers. Zafar et al. (2015) also introduced a second formulation where they maximize fairness under accuracy constraints.
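The quantity being constrained is easy to compute. A sketch of the covariance term follows (function name is ours); in the Lagrangian view this value simply becomes a penalty added to the training loss:

```python
import numpy as np

def boundary_covariance(p, distances):
    """Decision-boundary covariance of Zafar et al. (2015):
    (1/I) * sum_i (p_i - p_bar) * d[x_i, theta]."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(distances, dtype=float)
    return float(np.mean((p - p.mean()) * d))
```

A value near zero means the side of the boundary a point falls on is uncorrelated with its protected attribute.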
Zemel et al. (2013) presented a method that maps data to an intermediate space in a way that depends on the protected attribute and obfuscates information about that attribute. Since this mapping is learnt during training, this method could be considered either a pre-processing approach or an in-processing algorithm.
Chen et al. (2018) argue that a trade-off between fairness and accuracy may not be acceptable and that these challenges should be addressed through data collection. They aim to diagnose unfairness induced by inadequate data and unmeasured predictive variables and prescribe data collection approaches to remedy these problems.
In this tutorial, we've discussed what it means for a classifier to be fair, how to quantify the degree of bias in a dataset, and methods to remedy unfairness at all stages in the pipeline. An empirical analysis of fairness-based interventions is presented in Friedler et al. (2019). There are a large number of toolkits available to help evaluate fairness, the most comprehensive of which is AI Fairness 360.
This tutorial has been limited to a discussion of supervised learning algorithms, but there is also an orthogonal literature on bias in NLP embeddings (e.g. Zhao et al. 2019).
The online course includes a four-week-long introduction to machine learning and deep learning, featuring lectures from Mila professors Yoshua Bengio, Laurent Charlin, Audrey Durand and Aaron Courville. Korbit, an NLP-based technology that has spun out of research at Mila, will use AI to personalize the learning experience by interacting with students live and guiding them through the course material. The course will be offered online for free to anyone interested in learning the introductory concepts of machine learning and deep learning.
“Borealis AI is excited to be collaborating with the Korbit team on this project. Providing AI training at scale is imperative for our communities, and we are proud to be supporting a Canadian startup company with such strong machine learning expertise” says Dr Simon Prince, Research Director at Borealis AI Montreal.
The program is offered in an effort to democratize AI education by a world-class research centre in Québec. AI experts have been in high demand in recent years, and Canada has some of the world's brightest minds in the space. Mila and Korbit aim to reduce the AI talent gap in the industry through a platform that can be accessed and used to educate engineers and researchers worldwide.
To date, over 1,600 students around the world have signed up for Korbit’s machine learning course since May 2019. Preliminary results are promising, showing an increase in student engagement as well as an increase in positive learning outcomes after interacting with the AI tutor.
Click here to enrol in the course or to learn more about AI tutoring. See you in class!
XLNet is the latest and greatest model to emerge from the booming field of Natural Language Processing (NLP). The XLNet paper combines recent advances in NLP with innovative choices in how the language modelling problem is approached. When trained on a very large NLP corpus, the model achieves state-of-the-art performance for the standard NLP tasks that comprise the GLUE benchmark.
XLNet is an autoregressive language model which outputs the joint probability of a sequence of tokens based on the transformer architecture with recurrence. Its training objective calculates the probability of a word token conditioned on all permutations of word tokens in a sentence, as opposed to just those to the left or just those to the right of the target token.
If the above description made perfect sense, then this post is not for you. If it didn't, then read on to find out how XLNet works, and why it is the new standard for many NLP tasks.
In language modelling we calculate the joint probability distribution for sequences of tokens (words), and this is often achieved by factorizing the joint distribution into conditional distributions of one token given other tokens in the sequence. For example, given the sequence of tokens New, York, is, a, city the language model could be asked to calculate the probability $Pr$(New | is, a, city). This is the probability that the token New is in the sequence given that is, a, and city are also in the sequence (figure 1).
For the purpose of this discussion, consider that generally a language model takes a text sequence of $T$ tokens, $\mathbf{x} = [x_1, x_2,\ldots, x_T]$, and computes the probability of some tokens $\mathbf{x}^{\prime}$ being present in the sequence, given some others $\mathbf{x}^{\prime\prime}$ in the sequence: $Pr(\mathbf{x}^{\prime} \mid \mathbf{x}^{\prime\prime})$, where $\mathbf{x}^{\prime}$ and $\mathbf{x}^{\prime\prime}$ are non-overlapping subsets of $\mathbf{x}$.
Why would anyone want a model which can calculate the probability that a word is in a sequence? Actually, no one really cares about that^{1}. However, a model that contains enough information to predict what comes next in a sentence can be applied to other more useful tasks; for example, it might be used to determine who is mentioned in the text, what action is being taken, or whether the text has a positive or negative sentiment. Hence, models are pretrained with the language modeling objective and subsequently fine-tuned to solve more practical tasks.
Let's discuss the architectural foundation of XLNet. The first component of a language model is a word-embedding matrix: a fixed-length vector is assigned to each token in the vocabulary, and so the sequence is converted to a set of vectors.
Next we need to relate the embedded tokens in a sequence. A long-time favorite for this task has been the LSTM architecture, which relates adjacent tokens (e.g. the ELMo model), but recent state-of-the-art results have been achieved with transformers (e.g. the BERT model^{2}). The transformer architecture allows non-adjacent tokens in the sequence to be combined to generate higher-level information using an attention mechanism. This helps the model learn from the long-distance relations that exist in text more easily than LSTM-based approaches.
Transformers have a drawback: they operate on fixed-length sequences. What if knowing that New should occur in the sentence ____ York is a city also requires that the model have read something about the Empire State Building in a previous sentence? Transformer-XL resolves this issue by allowing the current sequence to see information from previous sequences. It is this architecture that XLNet is based on.
XLNet's main contribution is not the architecture^{3}, but a modified language model training objective which learns conditional distributions for all permutations of tokens in a sequence. Before diving into the details of that objective, let's revisit the BERT model to motivate XLNet's choices.
The previous state of the art (BERT) used a training objective that was tasked with recovering words in a sentence which have been masked. For a given sentence, some tokens are replaced with a generic [mask] token, and the model is asked to recover the originals.
The XLNet paper argues that this isn't a great way to train the model. Let's leave the details of this argument to the paper and instead present a less precise argument that captures some of the important concepts.
A language model should encode as much information and nuances from text as possible. The BERT model tries to recover the masked words in the sentence The [mask] was beached on the riverside (figure 2). Words such as boat or canoe are likely here. BERT can know this because a boat can be beached, and is often found on a riverside. But BERT doesn't necessarily need to learn that a boat can be beached, since it can still use riverside as a crutch to infer that boat is the masked token.
Moreover, BERT predicts the masked tokens independently, so it doesn't learn how they influence one another. If the example was The [mask] was [mask] on the riverside, then BERT might correctly assign high probabilities to (boat, beached) and (parade, seen), but might also think (parade, beached) is acceptable.
Approaches such as BERT and ELMo improved on the state of the art by incorporating both left and right contexts into predictions. XLNet took this a step further: the model's contribution is to predict each word in a sequence using any combination of other words in that sequence. XLNet might be asked to calculate what word is likely to follow The. Lots of words are likely, but certainly boat is more likely than they, so it's already learned something about a boat (mainly that it's not a pronoun). Next it might be asked to calculate which is a likely 2^{nd} word given [3] was, [4] beached. And then it might be asked to calculate which is a likely 4^{th} word given [3] was, [5] on, [7] riverside.
In this way XLNet doesn't really have a crutch to lean on. It's being presented with difficult, and at times ambiguous, contexts from which to infer whether or not a word is in a sentence. This is what allows it to squeeze more information out of the training corpus (figure 3).
In practice, XLNet samples from all possible permutations, so it doesn't get to see every single relation. It also doesn't use very small contexts as they are found to hinder training. After applying these practical heuristics, it bears more of a resemblance to BERT.
In the next few sections we'll expand on the more challenging aspects of the paper.
Given a sequence $\mathbf{x}$, an autoregressive (AR) model is one which calculates the probability $Pr(x_i \mid x_{<i})$. In language modelling, this is the probability of a token $x_{i}$ in the sentence, conditioned on the tokens $x_{<i}$ preceding it. These conditioning words are referred to as the context. Such a model is asymmetric and isn't learning from all token relations in the corpus.
Autoregressive models such as ELMo allow a model to also learn from relations between a token and those following it. The AR objective in this case could be seen as $Pr(x_i \mid x_{>i})$; it is autoregressive in the reversed sequence. But why stop there? There could be interesting relations to learn from if we look at just the two nearest tokens, $Pr(x_i \mid x_{i-1}, x_{i+1})$, or really any combination of tokens, $Pr(x_i \mid x_{i-1}, x_{i+2}, x_{i-3})$.
XLNet proposes to use an objective which is an expectation over all such permutations. Consider a sequence x = [This, is, a, sentence] with T=4 tokens. Now consider the set of all 4! permutations $\mathcal{Z}$ = {[1, 2, 3, 4], [1, 2, 4, 3],. . ., [4, 3, 2, 1]}. The XLNet model is autoregressive over all such permutations; it can calculate the probability of token $x_i$ given preceding tokens $x_{<i}$ from any order $\mathbf{z}$ from $\mathcal{Z}$.
For example, it can calculate the probability of the 3^{rd} element given the two preceding ones from any permutation. The three permutations [1, 2, 3, 4], [1, 2, 4, 3] and [4, 3, 2, 1] above would correspond to $Pr$(a | This, is), $Pr$(sentence | This, is) and $Pr$(is | sentence, a). Similarly, the probability of the second element given the first would be $Pr$(is | This), $Pr$(is | This) and $Pr$(a | sentence). Considering all four positions and all 4! permutations, the model takes into consideration all possible dependencies.
These ideas are embodied in equation 3 from the paper:
\begin{equation*}
\hat{\boldsymbol\theta} = \mathop{\rm argmax}_{\boldsymbol\theta}\left[\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}}\left[\sum_{t=1}^{T} \log \left[Pr(x_{z[t]}\mid x_{z[<t]}) \right] \right]\right]
\end{equation*}
This criterion finds model parameters $\boldsymbol\theta$ to maximize the probability of tokens $x_{z[t]}$ in a sequence of length $T$ given preceding tokens $x_{z[<t]}$, where $z[t]$ is the t$^{th}$ element of a permutation $\mathbf{z}$ of the token indices and $z[<t]$ are the previous elements in the permutation. The sum of log probabilities means that for any one permutation the model is properly autoregressive as it is the product of the probability for each element in the sequence. The expectation over all the permutations in $\mathcal{Z}$ shows the model is trained to be equally capable of computing probabilities for any token given any context.
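To make the objective tangible, the sketch below (our own illustration) enumerates every conditional term $Pr(x_{z[t]}\mid x_{z[<t]})$ that the expectation ranges over for a toy sequence; in practice XLNet samples permutations rather than enumerating them:

```python
import itertools

def permutation_lm_terms(tokens):
    """List every (target, context) pair the permutation objective
    averages over: for each order z and each position t, the target
    x_{z[t]} paired with its context x_{z[<t]}."""
    T = len(tokens)
    terms = []
    for z in itertools.permutations(range(T)):
        for t in range(T):
            target = tokens[z[t]]
            context = tuple(tokens[i] for i in z[:t])
            terms.append((target, context))
    return terms
```

For a 4-token sequence there are 4! = 24 orders and 24 × 4 = 96 terms, covering every possible context for every token.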
There is something missing from the way the model has been presented so far: how does the model know about word order? The model can compute $Pr$(This | is) as well as $Pr$(This | a). Ideally it should know something about the relative position of This and is, and also of a. Otherwise it would just think all tokens in the sequence are equally likely to be next to one another. What we want is a model which predicts $Pr$(This | is, 2) and $Pr$(This | a, 3). In other words, it should know the indices of the context tokens.
The transformer architecture addresses this problem by adding positional information to token embeddings. You can think of the training objective terms as $Pr$(This | is+2). But if we really shuffled the sentence tokens, this mechanism would break. This problem is resolved by using an attention mask. When the model computes the context that is the input to the probability calculation, it always does so using the same token order, and simply masks those tokens not in the context under consideration (i.e. those that come later in the shuffled order).
As a concrete example, consider the permutation [3, 2, 4, 1]. When calculating the probability of the 1^{st} element in that order (i.e., token 3), the model has no context as the other tokens have not yet been seen, so the mask would be [0, 0, 0, 0]. For the 2^{nd} element (token 2), the mask is [0, 0, 1, 0] as its only context is token 3. Following that logic, the 3^{rd} and 4^{th} elements (tokens 4 and 1) have masks [0, 1, 1, 0] and [0, 1, 1, 1]. Stacking all of these in the token order gives the matrix (as seen in fig. 2(c) in the paper):
\begin{equation}
\begin{bmatrix}
0& 1& 1& 1 \\
0& 0& 1& 0\\
0& 0& 0& 0 \\
0& 1& 1& 0
\end{bmatrix}
\label{eqn:matrixmask}
\end{equation}
Another way to look at this is that the training objective will contain the following terms where underscores represent what has been masked:
$Pr$(This | ___, is+2, a+3, sentence+4)
$Pr$(is | ___, ___, a+3, ___)
$Pr$(a | ___, ___, ___, ___)
$Pr$(sentence | ___, is+2, a+3, ___)
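The mask construction can be written down mechanically. The sketch below (ours, using 0-indexed tokens) rebuilds the matrix shown earlier from the permutation [3, 2, 4, 1], written 0-indexed as [2, 1, 3, 0]:

```python
import numpy as np

def attention_mask(perm):
    """mask[i, j] = 1 iff token j precedes token i in the permutation,
    i.e. token i may attend to token j. Rows and columns follow the
    original token order, not the permutation order."""
    T = len(perm)
    position = {token: t for t, token in enumerate(perm)}
    mask = np.zeros((T, T), dtype=int)
    for i in range(T):
        for j in range(T):
            if position[j] < position[i]:
                mask[i, j] = 1
    return mask
```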
There remains one oversight to address: we not only want the probability to be conditioned on the context token indices, but also on the index of the token whose probability is being calculated. In other words, we want $Pr$(This | 1, is+2): the probability of This given that it is the 1^{st} token and that is is the 2^{nd} token. But the transformer architecture encodes the positional information 1 and 2 within the embeddings for This and is. So this would look like $Pr$(This | This+1, is+2). Unfortunately, the model now trivially knows that This is part of the sentence and should be likely.
The solution to this problem is a two-stream self-attention mechanism. Each token position $i$ has two associated vectors at each self-attention layer $m$: $\mathbf{h}_i^m$ and $\mathbf{g}_i^m$. The $\mathbf{h}$ vectors belong to the content stream, while the $\mathbf{g}$ vectors belong to the query stream. The content stream vectors are initialized with token embeddings added to positional embeddings. The query stream vectors are initialized with a generic embedding vector $\mathbf{w}$ added to positional embeddings. Note that $\mathbf{w}$ is the same no matter the token, and thus cannot be used to distinguish between tokens.
At each layer, each content vector, $\mathbf{h}_i$, is updated using those $\mathbf{h}$'s that remain unmasked and itself (equivalent to unmasking the diagonal from the matrix shown in the previous section). Thus, $\mathbf{h}_3$ is updated with the mask $[0, 0, 1, 0]$, while $\mathbf{h}_2$ is updated with the mask $[0, 1, 1, 0]$. The update uses the content vectors as the query, key and value.
By contrast, at each layer each query vector $\mathbf{g}_{i}$ is updated using the unmasked content vectors and itself. The update uses $\mathbf{g}_i$ as the query while it uses $\mathbf{h}_j$'s as the keys and values, where $j$ is the index of an unmasked token in the context of $i$.
Figure 4 illustrates how the query $\mathbf{g}_4^m$ for the 4^{th} token at the $m$^{th} layer of self-attention is calculated. It shows that $\mathbf{g}_4^m$ is an aggregation of is+2, a+3 and the position 4, which is precisely the context needed to calculate the probability of the token sentence.
To follow along from the last section, the training objective contains the following terms, where $*$ denotes the token position whose probability is being computed:
$Pr$(This | *, is+2, a+3, sentence+4)
$Pr$(is | ___, *, a+3, ___)
$Pr$(a | ___, ___, *, ___)
$Pr$(sentence | ___, is+2, a+3, *).
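A minimal NumPy sketch of one two-stream layer, assuming a single attention head and omitting the learned projection matrices, residual connections, and layer normalization of the real architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Single-query scaled dot-product attention (no learned projections)."""
    w = softmax(q @ K.T / np.sqrt(K.shape[1]))
    return w @ V

def two_stream_layer(h, g, mask):
    """One simplified layer of two-stream self-attention.

    h: (n, d) content stream; g: (n, d) query stream.
    mask[i, j] = 1 when token j is visible context for token i.
    """
    n = h.shape[0]
    h_new, g_new = h.copy(), g.copy()
    for i in range(n):
        ctx = np.flatnonzero(mask[i])
        # content stream: attends to its context *and* itself
        kv = h[np.append(ctx, i)]
        h_new[i] = attend(h[i], kv, kv)
        # query stream: g_i is the query; only the context *content*
        # vectors serve as keys/values, so g_i never sees its own token
        if len(ctx):
            g_new[i] = attend(g[i], h[ctx], h[ctx])
    return h_new, g_new
```

Note that for the token predicted first in the factorization order (an all-zero mask row), the query vector passes through unchanged, since it has no content to attend to.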
Does it work? The short answer is yes. The long answer is also yes. Perhaps this is not surprising: XLNet builds on previous state-of-the-art methods. It was trained on a corpus of 30 billion words (an order of magnitude greater than that used to train BERT and drawn from more diverse sources) and this training required significantly more hours of compute time than previous models:
ULMFiT: 1 GPU day
ELMo: 40 GPU days
BERT: 450 GPU days
XLNet: 2,000 GPU days
Table 1. Approximate computation time for training recent NLP models^{4}.
Perhaps more interestingly, XLNet's ablation study shows that it also works better than BERT in a fair comparison (figure 5). That is, when trained on the same corpus as BERT, with the same hyperparameters and the same number of layers, XLNet consistently outperforms BERT. Even more interestingly, XLNet also beats Transformer-XL in the fair comparison. Transformer-XL can be viewed as an ablation of XLNet without the permutation AR objective, so the consistent improvement over its score is evidence for the strength of that objective.
What is not resolved by the ablation study is the contribution of the two-stream self-attention mechanism to XLNet's performance gains. It both allows the attention mechanism to explicitly take the target token position into account and introduces additional latent capacity in the form of the query stream vectors. While it is an integral part of the XLNet architecture, it is possible that models such as BERT could also benefit from this mechanism without using the same training objective as XLNet.
^{1 }While the main purpose of pretrained language models is to learn linguistic features which are useful in downstream tasks, the actual language model's calculation of word probabilities can be useful for things like outlier detection and autocorrect.
^{2 }The BERT model is technically a masked language model as it isn't trained to maximize the joint probability of a sequence of tokens.
^{3 }In order to implement the NADE-like training objective, the XLNet paper also introduces some novel architecture choices which are discussed in later sections. However, for the purpose of this post, it is convenient to first discuss XLNet’s goal to create a model which learns from bidirectional context, and then introduce the architectural work needed to achieve this goal.
^{4 }These values were derived using a speculative TPU-to-GPU hour conversion as explained in this post and rounded semi-arbitrarily.
]]>He earned his PhD in psychology with a focus on human stereo vision. He’d later go on to study visual neurons in the mammalian brain. But, as he says, “somewhere along the way I realized I was an engineer and not a scientist. I didn’t want to understand how biological vision worked, I wanted to build my own version.” To that end, computer vision became an obvious direction and University College London his academic stomping grounds.
We are fortunate to welcome Simon as our newest research director. Simon joins Greg Mori (Vancouver) and Marcus Brubaker (Toronto) and will head up our Montreal lab, leading a group of strong researchers in one of Canada’s most dynamic AI ecosystems.
He will round out his team with the addition of Dr. Layla El Asri, a former research manager at the Microsoft Research lab in Montreal, and a well-respected thought leader on multiple AI topics. Along with academic advisor Prof. Jackie Cheung, the lab will continue to focus its output on natural language processing (NLP), with its scope aimed at continuing to improve the ways clients interact with the bank. The next few months will see a swift ramp-up of a number of teams, each focused on building a different NLP-based product.
Simon’s long-term plan is to make Borealis AI Montreal “the most intellectually stimulating and fun place to do NLP research in the city,” which is a tall order in a world-class AI city. But his experience so far bodes well for his ambitions.
“People have been incredibly friendly and welcoming. I think there is a general acceptance that bringing more AI businesses to Montreal is overwhelmingly a good thing, even if it puts short-term pressure on the hiring market. I’m personally impressed by the quantity and scope of AI activity in the Mile-Ex neighborhood in particular: Borealis AI is at ground zero for AI in Montreal.”
Of course, our award-winning office design didn’t hurt. He’s grateful to walk each morning into a bright, open, fun space after many years of “working in basement labs in universities and only rarely seeing sunlight.” And the city is its own draw for the British expat: “it combines the best of North America and the best of Europe and it’s an amazing place to live if, like me, you love food, cycling and winter sports.”
We’ve now come to the part of the blog where Simon, in his self-effacing manner, would say, “enough about me.” He’s adamant that the space he co-creates with his colleagues serves the broader community and surpasses the needs of any top researcher looking for a home in a challenging, fast-moving, resource-rich, and stimulating environment with the potential for massive global impact.
In his own words, he says anyone joining Borealis AI can expect to have “excellent academic colleagues and a minimum of interfering middle management. They will work on challenging long-term goals without being forced to constantly switch between projects. They will have academic freedom to publish and an excellent supportive environment in which they can grow their skills.”
Challenge accepted.
]]>To address these issues, we propose Metatrace, a meta-gradient descent based algorithm to tune the step-size online. Metatrace leverages the structure of eligibility traces, and works both for tuning a scalar step-size and a separate step-size for each parameter. We empirically evaluate Metatrace for actor-critic on the Arcade Learning Environment. Results show Metatrace can speed up learning and improve performance in non-stationary settings.
]]>Given a fixed history of events and their corresponding times – like those shown below in Fig. 1 – multiple actions are possible in the future. In our CVPR paper of the same name, which we will be presenting this week in Long Beach, we propose a powerful generative approach that can effectively model the distribution over future actions.
To date, much of the work in this domain has focused on taking frame-level data of video as input in order to predict the actions or activities that may occur in the immediate future. Time-series data often involves regularly spaced data points with interesting events occurring sparsely across time. We hypothesize that in order to model future events in such a scenario, it is beneficial to consider the history of sparse events (action categories and their temporal occurrence in the above example) alone, instead of regularly spaced frame data. This approach also allows us to model high-level semantic meaning in the time-series data that can be difficult to discern from frame-level data.
More specifically, we are interested in modeling the distribution over future action category and action timing given the past history of sparse events. For action timing, we aim to model the distribution over interarrival time. The interarrival time is the time difference between the starting time of two consecutive actions.
The contributions of this work center around APP-VAE (Action Point Process VAE), a novel generative model for asynchronous-time action sequences. Fig. 2 shows the overall structure of our proposed framework. We formulate our model within the variational autoencoder (VAE) paradigm, a powerful class of probabilistic models that facilitate generation and the ability to model complex distributions. We present a novel form of VAE for action sequences under a point process approach. This approach has a number of advantages, including a probabilistic treatment of action sequences that allows for likelihood evaluation, generation, and anomaly detection.
Fig. 3 shows the architecture of our model. Overall, the input sequence of action categories and inter-arrival times are encoded using a recurrent VAE model. At each step, the model uses the history of actions to produce a distribution over latent codes $z_n$, a sample of which is then decoded into two probability distributions: one over the possible action categories and another over the inter-arrival time for the next action.
Since the true distribution over latent variables $z_n$ is intractable, we rely on a time-dependent posterior network $q_\phi(z_n \mid x_{1:n})$ that approximates it with a conditional Gaussian distribution $N(\mu_{\phi_n}, \sigma^2_{\phi_n})$.
To prevent $z_n$ from simply copying $x_n$, we force $q_\phi(z_n \mid x_{1:n})$ to be close to the prior distribution $p(z_n)$ using a KL-divergence term. In order to take the history of past actions into account during generation, we learn a prior that varies across time and is a function of all past actions except the current one: $p_\psi(z_n \mid x_{1:n-1})$.
The sequence model generates two probability distributions: i) a categorical distribution over the action categories; and ii) a temporal point process distribution over the interarrival times for the next action.
The distribution over action categories $a_n$ is modeled with a multinomial distribution, since $a_n$ can take only a finite number of values: \begin{equation}
p^a_\theta(a_n = k \mid z_n) = p_k(z_n) \quad \text{and} \,\,\,\,
\sum_{k=1}^K{p_k(z_n)} = 1 \label{eq:action}
\end{equation} where $p_k(z_n)$ is the probability of occurrence of action $k$, and $K$ is the total number of action categories.
The interarrival time $\tau_n$ is assumed to follow an exponential distribution parameterized by $\lambda(z_n)$, similar to a standard temporal point process model:
\begin{equation}
\begin{aligned}
p^{\tau}_{\theta}(\tau_n \mid z_n) =
\begin{cases}
\lambda(z_n) e^{-\lambda(z_n)\tau_n} & \text{if}~~ \tau_n \geq 0 \\
0 & \text{if}~~ \tau_n<0
\end{cases}
\end{aligned} \label{eq:time}
\end{equation}
where $p^{\tau}_{\theta}(\tau_n \mid z_n)$ is a probability density function over random variable $\tau_n$ and $\lambda(z_n)$ is the intensity of the process, which depends on the latent variable sample $z_n$. We train the model by maximizing the variational lower bound over the entire sequence comprised of $N$ steps:
\begin{align}
\mathcal{L}_{\theta,\phi}(x_{1:N}) = \sum_{n=1}^N(&{\mathop{\mathbb{E}}}_{q_\phi(z_{n} \mid x_{1:n})}[\log p_\theta{(x_n \mid z_{n})}] \\
& - D_{KL}(q_\phi(z_n \mid x_{1:n}) \,\|\, p_\psi(z_n \mid x_{1:n-1})))
\nonumber
\label{eq:loss}
\end{align}
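To make the pieces concrete, here is a toy numerical sketch of a single summand of this bound, assuming diagonal Gaussians for the posterior and learned prior and a single posterior sample; the helper names are ours, not the paper's:

```python
import numpy as np

def log_exp_density(tau, lam):
    """log p(tau | z) for the exponential inter-arrival distribution."""
    return np.log(lam) - lam * tau if tau >= 0 else -np.inf

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) ), closed form."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2)
                  - 0.5)

def step_elbo(action_probs, a_n, lam, tau_n, mu_q, sig_q, mu_p, sig_p):
    """One summand of the lower bound: log-likelihood of the observed
    (a_n, tau_n) under the decoder minus KL(posterior || learned prior)."""
    recon = np.log(action_probs[a_n]) + log_exp_density(tau_n, lam)
    return recon - kl_diag_gauss(mu_q, sig_q, mu_p, sig_p)
```

When the posterior matches the prior the KL term vanishes and the bound reduces to the reconstruction log-likelihood, which is the copying behaviour the time-varying prior is designed to discourage.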
We empirically validate the efficacy of APP-VAE for modeling action sequences on the MultiTHUMOS and Breakfast datasets. Experiments show that our model is effective in capturing the uncertainty inherent in tasks such as action prediction and anomaly detection.
Fig. 4 shows examples of diverse future action sequences that are generated by APP-VAE given the history. For different provided histories, sampled sequences of actions are shown. We note that the overall duration and sequence of actions on the Breakfast Dataset are reasonable. Variations, e.g. taking the juice squeezer before using it, adding salt and pepper before cooking eggs, are plausible alternatives generated by our model.
Fig. 5 visualizes a traversal on one of the latent dimensions for three different sequences, uniformly sampling one dimension of $z$ over $[\mu - 5\sigma, \mu + 5\sigma]$ while fixing the others to their sampled values. As shown, this dimension correlates closely with the actions add saltnpepper, stirfry egg and fry egg.
We further qualitatively examine the ability of the model to score the likelihood of individual test samples. We sort the test action sequences according to the average per-timestep likelihood, estimated by drawing samples from the approximate posterior distribution following the importance sampling approach. High-scoring sequences should be those that our model deems “normal,” while low-scoring sequences should be unusual ones. Tab. 1 shows some examples of sequences with low and high likelihood on the MultiTHUMOS dataset. We note that a regular, structured sequence of actions, such as jump, body roll, cliff diving for a diving activity, or body contract, squat, clean and jerk for a weightlifting activity, receives high likelihood. However, repeated hammer throws or golf swings with no set-up actions receive low likelihood.
Table 1 (below): Example of test sequences with high and low likelihood according to our learned model:
Test sequences with high likelihood
Test sequences with low likelihood
We presented a novel probabilistic model for point process data – a variational autoencoder that captures uncertainty in action times and category labels. As a generative model, it can produce action sequences by sampling from a prior distribution, the parameters of which are updated based on neural networks that control the distributions over the next action type and its temporal occurrence. Our model can be used to analyze and model asynchronous data in a wide variety of domains, such as social networks, earthquake events, and health informatics.
]]>In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-the-art methods on a standard benchmark dataset from the Wall Street Journal corpus, as well as in multiple new challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
Code and dataset here.
]]>The answer boils down to trust.
Trust in a machine, or an algorithm, is difficult to quantify. It’s more than just performance — most people will not be convinced by being told research cars have driven X miles with Y crashes. You may care about when negative events happen. Were they all in snowy conditions? Did they occur at night? How robust is the system, overall?
In machine learning, we typically have a metric to optimize. This could mean we minimize the time to travel between points, maximize the accuracy of a classifier, or maximize the return on an investment. However, trust is much more subjective, domain dependent, and user dependent. We don’t know how to write down a formula for trust, much less how to optimize it.
This post argues that intelligibility is a key component of trust.^{1} The deep learning explosion has brought us many high-performing algorithms that can tackle complex tasks at superhuman levels (e.g., playing the games of Go and Dota 2, or optimizing data centers). However, a common complaint is that such methods are inscrutable “black boxes.”
If we cannot understand exactly how a trained algorithm works, it is difficult to judge its robustness. For example, one group of researchers trained a deep neural network to detect pneumonia from X-rays. The data was collected from both inpatient wards and an emergency department, which had very different rates of the disease. Upon analysis, the researchers realized that the X-ray machines added different information to the X-rays — the network was focusing on the word “portable,” which was present only in the emergency department X-rays, rather than medical characteristics of the picture itself. This example highlights how understanding a model can identify problems that would otherwise be hidden if one only focuses on the accuracy of the model.
Another reason to focus on intelligibility is in cases where we have properties we want to verify but cannot easily add to the loss function, i.e., the objective we wish to optimize. One may want to respect user preferences, avoid biases, and preserve privacy. An inscrutable algorithm may be difficult to verify, whereas an intelligible algorithm’s output would not be. For example, the black box algorithm COMPAS is being used for assessing the risk of recidivism and has been accused of being racially biased by an influential ProPublica article. In Cynthia Rudin’s article "Please Stop Explaining Black Box Models for High Stakes Decisions", she argues that her model (CORELS) achieves the same accuracy as COMPAS, but is fully understandable, as it consists of only 3 if/then rules, and it does not take race (or variables correlated with race) into account.
if (age = 18 − 20) and (sex = male) then predict yes 
Rule list to predict 2-year recidivism rate found by CORELS.
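Evaluating such a rule list is trivially simulatable by hand or in code. The sketch below encodes only the first rule shown above plus an assumed "else predict no" default; it is an illustration of the rule-list form, not CORELS's full published list:

```python
def predict_recidivism(age, sex, priors):
    """Walk an ordered if/then rule list; the first matching rule wins."""
    rules = [
        # the one rule quoted above; the published list has a few more
        (lambda a, s, p: 18 <= a <= 20 and s == "male", True),
    ]
    for condition, outcome in rules:
        if condition(age, sex, priors):
            return outcome
    return False  # assumed default rule ("else predict no")
```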
As mentioned above, intelligibility is context-dependent. But in general, we want to have some construct that can be understood given a user’s limited working memory (i.e., people can hold roughly 7 ± 2 concepts in mind at once). There are three different ways we can think about intelligibility, which I enumerate next.
A local explanation of a model focuses on a particular region of operation. Continuing our autonomous car example, we could consider a local explanation to be one that explains how a car made a decision in one particular instance. The reasoning for a local explanation may or may not hold in other circumstances. A global explanation, in contrast, has to consider the entire model at once and thus is likely more complicated.
A more technically inclined user or model builder may have different requirements. First, they may think about the properties of the algorithm used. Is it guaranteed to converge? Will it find a nearoptimal solution? Do the hyperparameters of the algorithm make sense? Second, they may think about whether all the inputs (features) to the algorithm seem to be useful and are understandable. Third, is the algorithm “simulatable” (where a person can calculate the outputs from inputs) in a reasonable amount of time?
If a solution is intelligible, a user should be able to generate explanations about how the algorithm works. For instance, there should be a story about how the algorithm gets to a given output or behavior from its inputs. If the algorithm makes a mistake, we should be able to understand what went wrong. Given a particular output, how would the input have to change in order to get a different output?
There are four high-level ways of achieving intelligibility. First, the user can passively observe input/output sequences and formulate their own understanding of the algorithm. Second, a set of post-hoc explanations can be provided to the user that aim to summarize how the system works. Third, the algorithm could be designed with fewer black-box components so that explanations are easier to generate and/or are more accurate. Fourth, the model could be inherently understandable.
Observing the algorithm act may seem to be too simplistic. Where this becomes interesting is when you consider what input/output sequences should be shown to a user. The HIGHLIGHTS algorithm focuses on reinforcement learning settings and works to find interesting examples. For instance, the authors argue that in order to trust an autonomous car, one wouldn’t want to see lots of examples of driving on a highway in light traffic. Instead, it would be better to see a variety of informative examples, such as driving through an intersection, driving at night, driving in heavy traffic, etc. At the core of the HIGHLIGHTS method is the idea of state importance, or the difference between the value of the best and worst actions in the world at a given moment in time:
\begin{equation}
I(s) = \max_{a} Q^\pi(s,a) - \min_{a} Q^\pi(s,a)
\end{equation}
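Computing $I(s)$ from a vector of Q-values is a one-liner:

```python
import numpy as np

def state_importance(q_values):
    """I(s) = max_a Q(s, a) - min_a Q(s, a): the stakes of acting in s."""
    q = np.asarray(q_values, dtype=float)
    return q.max() - q.min()

# A state where the action choice barely matters vs. one where it is critical:
state_importance([1.0, 1.1, 0.9])    # low importance
state_importance([10.0, -5.0, 2.0])  # high importance
```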
In particular, HIGHLIGHTS generates a summary of trajectories that capture important states an agent encountered. To test the quality of the generated examples, a user study was performed where people watched summaries of two agents playing Ms. Pacman and were asked to identify the better agent.
This animation shows the output of the HIGHLIGHTS algorithm in the Ms. Pacman domain https://goo.gl/79dqsd
Once the model learns to perform a task, a second model could be trained to then explain the task. The motivation is that maybe a simpler model can represent most of the true model, while being much more understandable. Explanations could be natural language, visualizations (e.g., saliency maps or t-SNE), rules, or other human-understandable systems. The underlying assumption is that there’s a fidelity/complexity tradeoff: these explanations can help the user understand the model at some level, even if it is not completely faithful to the model.
For example, the LIME algorithm works on supervised learning methods, where it generates a more interpretable model that is locally faithful to a classifier. The optimization problem is set up to minimize the difference between the interpretable model $g$ and the actual function $f$ in some locality $\pi_x$, while also minimizing a measure of the complexity of the model $g$:
\begin{equation}
\xi(x) = \mbox{argmin}_{g \in G} ~~\mathcal{L} (f, g, \pi_x) + \Omega(g)
\end{equation}
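A toy version of this weighted local fit, with an RBF kernel standing in for $\pi_x$ and an unregularized linear surrogate for $g$ (an illustrative sketch, not the LIME library's API or its interpretable-feature machinery):

```python
import numpy as np

def local_surrogate(f, x, n_samples=500, width=0.5, seed=0):
    """Fit a weighted linear model to f around x (a LIME-style sketch).

    f: black-box function R^d -> R; x: the instance to explain.
    Samples perturbations of x, weights them by an RBF kernel (pi_x),
    and solves weighted least squares for local feature importances.
    """
    rng = np.random.default_rng(seed)
    d = x.size
    Z = x + rng.normal(scale=width, size=(n_samples, d))  # perturbations
    y = np.array([f(z) for z in Z])
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * width ** 2))  # pi_x
    A = np.hstack([Z, np.ones((n_samples, 1))])  # add an intercept column
    W = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * W, y * W[:, 0], rcond=None)
    return coef[:-1]  # local weight per feature

# e.g. explaining f(x) = x0**2 near x = (1, 0) recovers a local slope
# close to the gradient (2, 0)
```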
The paper also introduces SP-LIME, an algorithm to select a set of representative instances by exploiting the submodularity principle to greedily add non-overlapping examples that cover the input space while giving examples of different, relevant outputs.
A novel approach to automated rationale generation for reinforcement learning agents is presented by Ehsan et al. Many people were asked to play the game of Frogger. Then, while playing the game, they provided explanations as to why they executed an action in a given state. This large corpus of states/actions/explanations was then fed into a model. The explanation model can then provide so-called rationales for actions from different states, even if the actual agent controlling the game’s avatar uses something like a neural network. The explanations may be plausible, but there’s no guarantee that they match the actual reasons the agent acted the way it did.
The Deep neural network Rule Extraction via Decision tree induction (DeepRED) algorithm is able to extract humanreadable rules that approximate the behavior of multilevel neural networks that perform multiclass classification. The algorithm takes a decompositional approach: starting with the output layer, each layer is explained by the previous layer, and then the rules (produced by the C4.5 algorithm) are merged to produce a rule set for the entire network. One potential drawback of the method is that it is not clear if the resulting rule sets are indeed interpretable, or if the number of terms needed in the rule to reach appropriate fidelity would overwhelm a user.
In order to make a better explanation, or summary of a model’s predictions, the learning algorithm could be modified. That is, rather than having bolt-on explainability, the underlying training algorithm could be enhanced so that it is easier to generate the post-hoc explanation. For example, González et al. build upon the DeepRED algorithm by sparsifying the network and driving hidden units to either maximal or minimal activations. The first goal of the algorithm is to prune connections from the network, without reducing accuracy by much, with the expectation that “rules extracted from minimally connected neurons will be simpler and more accurate.” The second goal of the algorithm is to use a modified loss function that drives activations to maximal or minimal values, attempting to binarize the activation values, again without reducing accuracy by much. Experiments show that models generated in this manner are more compact, both in terms of the number of terms and the number of expressions.
A more radical solution is to focus on models that are easier to understand (e.g., “white box” models like rule lists). Here, the assumption is that there is not a strong performance/complexity tradeoff. Instead the goal is to do (nearly) as well as black box methods while maintaining interpretability. One example of this is the recent work by Okajima and Sadamasa on “Deep Neural Networks Constrained by Decision Rules,” which trains a deep neural network to select humanreadable decision rules. These ruleconstrained networks make decisions by selecting “a decision rule from a given decision rule set so that the observation satisfies the antecedent of the rule and the consequent gives a high probability to the correct class.” Therefore, every decision made is supported by a decision rule, by definition.
The Certifiably Optimal RulE ListS (CORELS) method, as mentioned above, is a way of producing an optimal rule list. In practice, the branch and bound method solves a difficult discrete optimization problem in reasonable amounts of time. One of the takehome arguments of the article is that simple rule lists can perform as well as complex blackbox models — if this is true, shouldn’t whitebox models be preferred?
One current research project we're working on at Borealis AI focuses on making deep reinforcement learning more intelligible. However, the current methods are opaque: it is difficult to explain to clients how the agent works, and it is difficult to be able to explain individual decisions. While we are still early in our research, we are investigating methods in all categories outlined above. The longterm goal of this research is to bring explainability into customerfacing models to ultimately help customers understand, and trust, our algorithms.
^{1}Note that we choose the word “intelligibility” with purpose. The questions discussed in this blog are related to explainability, interpretability, and “XAI,” and more broadly to safety and trust in artificial intelligence. However, we wish to emphasize that it is important for the system to be understood and that this may take some effort on the part of the subject understanding the system. Providing an explanation may or may not lead to this outcome — the example may be unhelpful, inaccurate, or even misleading with respect to the system’s true operation.
]]>By systematically controlling the frequency components of the perturbation, evaluating against the top-placing defense submissions in the NeurIPS 2017 competition, we empirically show that performance improvements in both the white-box and black-box transfer settings are yielded only when low frequency components are preserved. In fact, the defended models based on adversarial training are roughly as vulnerable to low frequency perturbations as undefended models, suggesting that the purported robustness of state-of-the-art ImageNet defenses is reliant upon adversarial perturbations being high frequency in nature. We do find that under ℓ∞ ϵ=16/255, the competition distortion bound, low frequency perturbations are indeed perceptible. This questions the use of the ℓ∞-norm, in particular, as a distortion metric, and, in turn, suggests that explicitly considering the frequency space is promising for learning robust models which better align with human perception.
]]>How do you surpass a bar that’s already high? One way is to build something in the mountains.
With a peek of the Rockies from the window, our Vancouver research centre officially opens its doors today. RBC President and CEO, Dave McKay, will formally inaugurate the space and later join an informal panel moderated by John Stackhouse with Foteini Agrafioti and Vancouver research director, Greg Mori. It's the final ribbon cutting in a jam-packed season that also saw the completion of our new Waterloo and Edmonton locations.
Partnering once again with design firm Lemay, our vision was for each office to have its own identity and proudly represent the city it inhabits. This allowed Lemay full creative freedom to come up with interesting concepts that you wouldn’t normally find in a research centre.
Take a look at what we cooked up in the lab:
The main inspiration behind this research centre came from the institution that helped put the Kitchener-Waterloo corridor on the map: The University of Waterloo. With this theme in mind, we designed rooms to pay homage to campus life. There’s a student lounge, teacher’s lounge, science lab and track and field pitch. And what’s campus life without movies about campus life? When it came down to the decorative details, we honoured retro classics like Grease, The Breakfast Club and Teachers while acknowledging modern masterpieces like The Big Bang Theory.
While our Edmonton team has been established for two years, we wanted to make sure they had a space that reinforced the excellence of their research. In early February, the team moved into a place that drew inspiration from the city's great winter escape: West Edmonton Mall. The mall is legendary not just for its size, but for providing a scope of attractions that make it plausible to spend three entire months indoors. And when “fun” is your theme, the design possibilities create themselves. We set up a bowling alley meeting room (with pins hanging from the ceiling), a pirate ship kitchen (galley? kitchen?), mini-golf, and a water park in the living room. The pièce de résistance, and the feature that makes every other non-Edmonton-based researcher jealous, is the sprawling outdoor terrace where the team can enjoy the Aurora Borealis on nights that don’t freeze water on impact.
Vancouverites are known for being a laid-back crew, so we tried a more mellow approach to our interior design. After all, it’s impossible to improve upon the natural beauty of the West Coast landscape. Our Vancouver space was designed to evoke scenes of cozy ski lodges, wildlife and mountain climbs. There’s even a room modeled after the city’s favourite mode of transportation – the bicycle – which is the embodiment of high and low-tech in perfect tandem.
These spaces are even better in person. Be sure to check our Careers page regularly for job openings, or our Fellowships section for application opportunities.
]]>