The unintentional unfairness that occurs when a decision has widely different outcomes for different groups is known as disparate impact. As machine learning algorithms are increasingly used to determine important real-world outcomes such as loan approval, pay rates, and parole decisions, it is incumbent on the AI community to minimize unintentional discrimination.
This tutorial discusses how bias can be introduced into the machine learning pipeline, what it means for a decision to be fair, and methods to remove bias and ensure fairness.
There are many possible causes of bias in machine learning predictions. Here we briefly discuss three: (i) the adequacy of the data to represent different groups, (ii) bias inherent in the data, and (iii) the adequacy of the model to describe each group.
Data adequacy. Infrequent and specific patterns may be downweighted by the model in the name of generalization and so minority records can be unfairly neglected. This lack of data may not just be because group membership is small; data collection methodology can exclude or disadvantage certain groups (e.g., if the data collection process is only in one language). Sometimes records are removed if they contain missing values and these may be more prevalent in some groups than others.
Data bias. Even if the amount of data is sufficient to represent each group, training data may reflect existing prejudices (e.g., that female workers are paid less), and this is hard to remove. Such historical unfairness in data is known as negative legacy. Bias may also be introduced by more subtle means. For example, data from two locations may be collected slightly differently. If group membership varies with location this can induce biases. Finally, the choice of attributes to input into the model may induce prejudice.
Model adequacy. The model architecture may describe some groups better than others. For example, a linear model may be suitable for one group but not for another.
A model is considered fair if errors are distributed similarly across protected groups, although there are many ways to define this. Consider taking data $\mathbf{x}$ and using a machine learning model to compute a score $\mbox{f}[\mathbf{x}]$ that will be used to predict a binary outcome $\hat{y}\in\{0,1\}$. Each data example $\mathbf{x}$ is associated with a protected attribute $p$. In this tutorial, we consider it to be binary, $p\in\{0,1\}$. For example, it might encode subpopulations according to gender or ethnicity.
We will refer to $p=0$ as the deprived population and $p=1$ as the favored population. Similarly we will refer to $\hat{y}=1$ as the favored outcome, assuming it represents the more desirable of the two possible results.
Assume that for some dataset, we know the ground truth outcomes $y\in\{0,1\}$. Note that these outcomes may differ statistically between different populations, either because there are genuine differences between the groups or because the model is somehow biased. According to the situation, we may want our estimate $\hat{y}$ to take account of these differences or to compensate for them.
Most definitions of fairness are based on group fairness, which deals with statistical fairness across the whole population. Complementary to this is individual fairness which mandates that similar individuals should be treated similarly regardless of group membership. In this blog, we'll mainly focus on group fairness, three definitions of which include: (i) demographic parity, (ii) equality of odds, and (iii) equality of opportunity. We now discuss each in turn.
Demographic parity or statistical parity suggests that a predictor is unbiased if the prediction $\hat{y}$ is independent of the protected attribute $p$ so that
\begin{equation}
Pr(\hat{y}\,|\,p) = Pr(\hat{y}). \tag{2.1}
\end{equation}
Here, the same proportion of each population are classified as positive. However, this may result in different false positive and true positive rates if the true outcome $y$ does actually vary with the protected attribute $p$.
Deviations from statistical parity are sometimes measured by the statistical parity difference
\begin{equation}
\mbox{SPD} = Pr(\hat{y}=1\,|\,p=1) - Pr(\hat{y}=1\,|\,p=0), \tag{2.2}
\end{equation}
or the disparate impact which replaces the difference in this equation with a ratio. Both of these are measures of discrimination (i.e. deviation from fairness).
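To make these measures concrete, here is a minimal numpy sketch; the function names and toy data are our own, not taken from any fairness library:

```python
import numpy as np

def statistical_parity_difference(y_hat, p):
    """SPD (eq. 2.2): positive-prediction rate of the favored group (p=1)
    minus that of the deprived group (p=0)."""
    return y_hat[p == 1].mean() - y_hat[p == 0].mean()

def disparate_impact(y_hat, p):
    """Disparate impact: the same comparison expressed as a ratio
    (deprived rate / favored rate)."""
    return y_hat[p == 0].mean() / y_hat[p == 1].mean()

# Toy predictions in which the favored group is approved three times as often.
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0])
p = np.array([1, 1, 1, 1, 0, 0, 0, 0])
spd = statistical_parity_difference(y_hat, p)  # 0.75 - 0.25 = 0.5
di = disparate_impact(y_hat, p)                # 0.25 / 0.75 ≈ 0.33
```

A perfectly fair predictor under this criterion gives SPD of 0 and disparate impact of 1.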
Equality of odds is satisfied if the prediction $\hat{y}$ is conditionally independent of the protected attribute $p$, given the true value $y$:
\begin{equation}
Pr(\hat{y}\,|\,y,p) = Pr(\hat{y}\,|\,y). \tag{2.3}
\end{equation}
This means that the true positive rate and false positive rate will be the same for each population; each error type is matched between each group.
Equality of opportunity has the same mathematical formulation as equality of odds, but is focused on one particular label $y=1$ of the true value so that:
\begin{equation}
Pr(\hat{y}\,|\,y=1,p) = Pr(\hat{y}\,|\,y=1). \tag{2.4}
\end{equation}
In this case, we want the true positive rate $Pr(\hat{y}=1\,|\,y=1)$ to be the same for each population, with no regard for the errors when $y=0$. In effect, it means that among the people who truly merit the favored outcome ($y=1$), the same proportion of each population receives the favorable prediction $\hat{y}=1$.
Deviation from equality of opportunity is measured by the equal opportunity difference:
\begin{equation}
\mbox{EOD} = Pr(\hat{y}=1\,|\,y=1, p=1) - Pr(\hat{y}=1\,|\,y=1, p=0). \tag{2.5}
\end{equation}
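The group-conditional error rates behind both criteria can be computed directly from predictions; a small sketch with hypothetical helper names:

```python
import numpy as np

def group_rates(y_hat, y, p, group):
    """True positive and false positive rates within one protected group."""
    m = p == group
    tpr = y_hat[m & (y == 1)].mean()
    fpr = y_hat[m & (y == 0)].mean()
    return tpr, fpr

def equal_opportunity_difference(y_hat, y, p):
    """EOD (eq. 2.5): TPR of the favored group minus TPR of the deprived
    group. Equality of odds additionally requires the FPRs to match."""
    return group_rates(y_hat, y, p, 1)[0] - group_rates(y_hat, y, p, 0)[0]

y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0])
p = np.array([1, 1, 1, 1, 0, 0, 0, 0])
eod = equal_opportunity_difference(y_hat, y, p)  # 1.0 - 0.5 = 0.5
```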
To make these ideas concrete, we consider the example of an algorithm that predicts credit rating scores for loan decisions. This scenario follows from the work of Hardt et al. (2016) and the associated blog.
There are two pools of loan applicants $p\in\{0,1\}$ that we'll describe as the blue and yellow populations. We assume that we are given historical data, so we know both the credit rating and whether the applicant actually defaulted on the loan ($y=0$) or repaid it ($y=1$).
We can now think of four groups of data corresponding to (i) the blue and yellow populations and (ii) whether they did or did not repay the loan. For each of these four groups we have a distribution of credit ratings (figure 1). In an ideal world, the two distributions for the yellow population would be exactly the same as those for the blue population. However, as figure 1 shows, this is clearly not the case here.
Why might the distributions for blue and yellow populations be different? It could be that the behaviour of the populations is identical, but the credit rating algorithm is biased; it may favor one population over another or simply be more noisy for one group. Alternatively, it could be that the populations genuinely behave differently. In practice, the differences in blue and yellow distributions are probably attributable to a combination of these factors.
Let's assume that we can't retrain the credit score prediction algorithm; our job is to adjudicate whether each individual is refused the loan ($\hat{y}=0$) or granted it ($\hat{y}=1$). Since we only have the credit score $\mbox{f}[\mathbf{x}]$ to go on, the best we can do is to assign different thresholds $\tau_{0}$ and $\tau_{1}$ for the blue and yellow populations so that the loan is granted if $\mbox{f}[\mathbf{x}]>\tau_{0}$ for the blue population and $\mbox{f}[\mathbf{x}]>\tau_{1}$ for the yellow population.
We'll now consider different possible ways to set these thresholds that result in different senses of fairness. We emphasize that we are not advocating any particular criterion, but merely exploring the ramifications of different choices.
Blindness to protected attribute: We choose the same threshold for blue and yellow populations. This sounds sensible, but it guarantees neither that the overall frequency of loans nor the frequency of successful loans will be the same for the two groups. For the thresholds chosen in figure 2a, many more loans are made to the yellow population than the blue population (figure 2b). Moreover, examination of the receiver operating characteristic (ROC) curve shows that both the rate of true positives $Pr(\hat{y}=1\,|\,y=1)$ and false alarms $Pr(\hat{y}=1\,|\,y=0)$ differ for the two groups (figure 2c).
Equality of odds: This definition of fairness proposes that the false positive and true positive rates should be the same for both populations. This also sounds reasonable, but figure 2c shows that it is not possible for this example. There is no combination of thresholds that can achieve this because the ROC curves do not intersect. Even if they did, we would be stuck giving loans based on the particular false positive and true positive rates at the intersection which might not be desirable.
Demographic parity: The threshold could be chosen so that the same proportion of each group are classified as $\hat{y} =1$ and given loans (figure 3). We make an equal number of loans to each group despite the different tendencies of each to repay (figure 3b). This has the disadvantage that the true positive and false positive rates might be completely different in different populations (figure 3c). From the perspective of the lender, it is desirable to give loans in proportion to people's ability to pay them back. From the perspective of an individual in a more reliable group, it may seem unfair that the other group gets offered the same number of loans despite the fact they are less reliable.
Equal opportunity: The thresholds are chosen so that the true positive rate is the same for both populations (figure 4). Of the people who pay back the loan, the same proportion are offered credit in each group. In terms of the two ROC curves, it means choosing thresholds so that the vertical position on each curve is the same, without regard for the horizontal position (figure 2c). However, it means that different proportions of the blue and yellow groups are given loans (figure 4b).
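Assuming continuous scores, per-group thresholds for demographic parity and equal opportunity can be read off empirical quantiles; a sketch with hypothetical helper names:

```python
import numpy as np

def parity_thresholds(scores, p, accept_rate):
    """Per-group cutoffs so the same fraction of each group is granted
    the loan (demographic parity)."""
    return {g: np.quantile(scores[p == g], 1 - accept_rate) for g in (0, 1)}

def opportunity_thresholds(scores, y, p, tpr):
    """Per-group cutoffs so the same fraction of true repayers (y=1) is
    granted the loan (equal opportunity)."""
    return {g: np.quantile(scores[(p == g) & (y == 1)], 1 - tpr) for g in (0, 1)}

rng = np.random.default_rng(0)
p = rng.integers(0, 2, 4000)
scores = rng.normal(0.4 + 0.2 * p, 0.1)   # favored group scores higher on average
tau = parity_thresholds(scores, p, accept_rate=0.3)
# Both groups now have (approximately) a 30% acceptance rate:
rates = [(scores[p == g] > tau[g]).mean() for g in (0, 1)]
```

As the text notes, satisfying one criterion this way generally breaks the others: matching acceptance rates leaves the per-group TPRs and FPRs unmatched.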
We have seen that there is no straightforward way to choose thresholds on an existing classifier for different populations so that all definitions of fairness are satisfied. Now we'll investigate a different approach that aims to make the classification performance more similar for the two populations.
The ROC curves show that accuracy is higher when predicting whether the blue population will repay the loan than for the yellow group (i.e., the blue ROC curve is everywhere higher than the yellow one). What if we reduce the accuracy for the blue population so that the two more nearly match? One way to do this is to add noise to the credit score for the blue population (figure 5). As we add increasing amounts of noise, the blue ROC curve moves towards the positive diagonal and at some point will cross the yellow ROC curve. Now equality of odds can be achieved.
Unfortunately, this approach has two unattractive features. First, we now make worse decisions for the blue population; it is a general feature of most remedial approaches that there is a trade-off between accuracy and fairness (Kamiran & Calders 2012; Corbett-Davies et al. 2017). Second, adding noise violates individual fairness. Two identical members of the blue population may have different noise values added to their scores, resulting in different decisions on their loans.
The conclusion of the worked loan example is that it is very hard to remove bias once the classifier has already been trained, even for very simple cases. For further information, the reader is invited to consult Kamiran & Calders (2012), Hardt et al. (2016), Menon & Williamson (2017) and Pleiss et al. (2017).
Figure 6 summarizes the options at each stage of the pipeline:

• Post-processing: change thresholds; trade off accuracy for fairness.
• In-processing: adversarial training; regularize for fairness; constrain to be fair.
• Pre-processing: modify labels; modify input data; modify label/data pairs; weight label/data pairs.
• Data collection: identify lack of examples or variates and collect more.
Thankfully, there are approaches to deal with bias at all stages of the data collection, preprocessing, and training pipeline (figure 6). In this section we consider some of these methods. In the ensuing discussion, we'll assume that the true behaviour of the different populations is the same. Hence, we are interested in making sure that the predictions of our system do not differ for each population.
A straightforward approach to eliminating bias from datasets would be to remove the protected attribute and other elements of the data that are suspected to contain related information. Unfortunately, such suppression rarely suffices. There are often subtle correlations in the data that mean that the protected attribute can be reconstructed. For example, we might remove race, but retain information about the subject's address, which could be strongly correlated with race.
The degree to which there are dependencies between the data $\mathbf{x}$ and the protected attribute $p$ can be measured using the mutual information
\begin{equation}
\mbox{LP} = \sum_{\mathbf{x},p} Pr(\mathbf{x},p) \log\left[\frac{Pr(\mathbf{x},p)}{Pr(\mathbf{x})Pr(p)}\right], \tag{2.6}
\end{equation}
which is known as the latent prejudice (Kamishima et al. 2011). As this measure increases, the protected attribute becomes more predictable from the data. Indeed, Feldman et al. (2015) and Menon & Williamson (2017) have shown that the predictability of the protected attribute puts mathematical bounds on the potential discrimination of a classifier.
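For a single discrete feature, equation 2.6 can be estimated directly from empirical counts; a minimal sketch:

```python
import numpy as np

def latent_prejudice(x, p):
    """Empirical mutual information (eq. 2.6) between a discrete feature x
    and the protected attribute p."""
    lp = 0.0
    for xv in np.unique(x):
        for pv in np.unique(p):
            p_joint = np.mean((x == xv) & (p == pv))
            if p_joint > 0:  # skip empty cells (0 log 0 = 0)
                lp += p_joint * np.log(p_joint / (np.mean(x == xv) * np.mean(p == pv)))
    return lp

x_indep = np.array([0, 1, 0, 1])
p = np.array([0, 0, 1, 1])
latent_prejudice(x_indep, p)  # 0: feature carries no information about p
latent_prejudice(p, p)        # log 2: feature reveals p completely
```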
We'll now discuss four approaches for removing bias by manipulating the dataset. Respectively, these modify the labels $y$, the observed data $\mathbf{x}$, the data/label pairs $\{\mathbf{x},y\}$, and the weighting of these pairs.
Kamiran & Calders (2012) proposed changing some of the training labels which they term massaging the data. They compute a classifier on the original dataset and find examples close to the decision surface. They then swap the labels in such a way that a positive outcome for the disadvantaged group is more likely and retrain. This is a heuristic approach that empirically improves fairness at the cost of accuracy.
Feldman et al. (2015) proposed manipulating individual data dimensions $x$ in a way that depends on the protected attribute $p$. They align the cumulative distributions $F_{0}[x]$ and $F_{1}[x]$ for feature $x$ when the protected attribute $p$ is 0 and 1 respectively to a median cumulative distribution $F_{m}[x]$. This is similar to standardising test scores across different high schools (figure 7) and is termed disparate impact removal. This approach has the disadvantage that it treats each input variable $x\in\mathbf{x}$ separately and ignores their interactions.
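A sketch of the quantile-alignment idea for one feature; for simplicity this maps each group onto the pooled distribution rather than the median distribution $F_{m}[x]$ used in the paper:

```python
import numpy as np

def repair_feature(x, p):
    """Quantile-align the two groups' values for one feature (a sketch of
    disparate impact removal; pooled distribution used as the target)."""
    x_rep = x.astype(float)
    for g in (0, 1):
        m = p == g
        # each value's position in its own group's empirical CDF, in [0, 1]
        ranks = x[m].argsort().argsort() / max(m.sum() - 1, 1)
        # read the same positions off the pooled quantile function
        x_rep[m] = np.quantile(x, ranks)
    return x_rep

x = np.array([1, 2, 3, 4, 11, 12, 13, 14])   # group 1 scores much higher
p = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_fair = repair_feature(x, p)  # both groups now have identical distributions
```

Within-group rank order is preserved, which is what lets a downstream classifier remain useful after repair.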
Calmon et al. (2017) learn a randomized transformation $Pr(\mathbf{x}^{\prime}, y^{\prime}\,|\,\mathbf{x},y,p)$ that transforms data pairs $\{\mathbf{x}, y\}$ to new data values $\{\mathbf{x}^{\prime}, y^{\prime}\}$ in a way that depends explicitly on the protected attribute $p$. They formulate this as an optimization problem in which they minimize the change in data utility, subject to limits on the prejudice and distortion of the original values. They show that this optimization problem may be convex in certain conditions.
Unlike disparate impact removal, this takes into account interactions between all of the data dimensions. However, the randomized transformation is formulated as a probability table, so this is only suitable for datasets with small numbers of discrete input and output variables. The randomized transformation, which must also be applied to test data, also violates individual fairness.
Kamiran & Calders (2012) propose to reweight the $\{\mathbf{x}, \mathbf{y}\}$ tuples in the training dataset so that cases where the protected attribute $p$ predicts that the disadvantaged group will get a positive outcome are more highly weighted. They then train a classifier that makes use of these weights in its cost function. Alternately, they propose resampling the training data according to these weights and using a standard classifier.
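A minimal sketch of this reweighting scheme, assuming the weights take the form $w = Pr(y)Pr(p)/Pr(y,p)$ so that label and protected attribute look independent in the weighted data:

```python
import numpy as np

def reweighting(y, p):
    """Instance weights w = Pr(y)Pr(p)/Pr(y,p), estimated from counts.
    Under-represented (y, p) combinations receive weights above 1."""
    w = np.empty(len(y), dtype=float)
    for yv in (0, 1):
        for pv in (0, 1):
            m = (y == yv) & (p == pv)
            if m.any():
                w[m] = (y == yv).mean() * (p == pv).mean() / m.mean()
    return w

y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
p = np.array([1, 1, 1, 1, 0, 0, 0, 0])
w = reweighting(y, p)
# Weighted positive rate is now 0.5 in both groups:
rate = lambda g: np.average(y[p == g], weights=w[p == g])
```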
In the previous section, we introduced the latent prejudice measure based on the mutual information between the data $\mathbf{x}$ and the protected attribute $p$. Similarly, we can measure the dependence between the labels $y$ and the protected attribute $p$:
\begin{equation}
\mbox{IP} = \sum_{y,p} Pr(y,p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]. \tag{2.7}
\end{equation}
This is known as the indirect prejudice (Kamishima et al. 2011). Intuitively, if there is no way to predict the labels from the protected attribute and vice versa, then there is no scope for bias.
One approach to removing bias during training is to explicitly remove this dependency using adversarial learning. Other approaches include penalizing the mutual information using regularization, or fitting the model under the constraint that it is not biased. We'll briefly discuss each in turn.
Adversarial debiasing (Beutel et al. 2017; Zhang et al. 2018) reduces evidence of protected attributes in predictions by simultaneously trying to fool a second classifier that tries to guess the protected attribute $p$. Beutel et al. (2017) force both classifiers to use a shared representation, so minimizing the performance of the adversarial classifier means removing all information about the protected attribute from this representation (figure 8).
Zhang et al. (2018) use the adversarial component to predict $p$ from (i) the final classification logits $f[\mathbf{x}]$ (to ensure demographic parity), (ii) the classification logits $f[\mathbf{x}]$ and the true class $y$ (to ensure equality of odds), or (iii) the final classification logits and the true result for just one class (to ensure equality of opportunity).
Kamishima et al. (2011) proposed adding an extra regularization condition to the output of a logistic regression classifier that tries to minimize the mutual information between the protected attribute and the prediction $\hat{y}$. They first rearranged the indirect prejudice expression using the definition of conditional probability to get
\begin{eqnarray}
\mbox{IP} &=& \sum_{y,p} Pr(y\,|\,\mathbf{x},p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]\nonumber\\
&=& \sum_{y,p} Pr(y\,|\,\mathbf{x},p) \log\left[\frac{Pr(y\,|\,p)}{Pr(y)}\right]. \tag{2.8}
\end{eqnarray}
Then, they formulate a regularization loss based on the expectation of this over the data set:
\begin{equation}
\mbox{L}_{reg} = \sum_{i}\sum_{\hat{y},p} Pr(\hat{y}_{i}\,|\,\mathbf{x}_{i},p_{i})\log\left[\frac{Pr(\hat{y}_{i}\,|\,p_{i})}{Pr(\hat{y}_{i})}\right] \tag{2.9}
\end{equation}
where $i$ indexes the data examples, which they add to the main training loss.
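Estimating the marginals over a mini-batch, the regularizer of equation 2.9 is a few lines of numpy; this is a sketch, and the batch-estimation strategy is our simplification:

```python
import numpy as np

def prejudice_regularizer(q, p):
    """Eq. 2.9 over a batch: q[i] is the model's predicted Pr(y_hat=1) for
    example i; Pr(y_hat) and Pr(y_hat|p) are estimated from the batch."""
    probs = np.stack([1 - q, q], axis=1)             # per-example [Pr(0), Pr(1)]
    pr_y = probs.mean(axis=0)                         # marginal Pr(y_hat)
    pr_y_p = np.stack([probs[p == g].mean(axis=0) for g in (0, 1)])  # Pr(y_hat|p)
    return float((probs * np.log(pr_y_p[p] / pr_y)).sum())

p = np.array([0, 0, 1, 1])
prejudice_regularizer(np.array([0.3, 0.7, 0.3, 0.7]), p)  # 0: groups treated alike
prejudice_regularizer(np.array([0.1, 0.1, 0.9, 0.9]), p)  # > 0: predictions leak p
```

Adding this term to the main training loss pushes the classifier's outputs towards independence from $p$.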
Zafar et al. (2015) formulated unfairness in terms of the covariance between the protected attribute $\{p_{i}\}_{i=1}^{I}$ and the signed distances $\{d[\mathbf{x}_{i},\boldsymbol\theta]\}_{i=1}^{I}$ of the associated feature vectors $\{\mathbf{x}_{i}\}_{i=1}^{I}$ from the decision boundary, where $\boldsymbol\theta$ denotes the model parameters. Let $\overline{p}$ represent the mean value of the protected attribute. They then minimize the main loss function subject to the constraint that the covariance remains within some threshold $t$:
\begin{equation}
\begin{aligned}
& \underset{\boldsymbol\theta}{\text{minimize}}
& & L[\boldsymbol\theta] \\
& \text{subject to}
& & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})\,d[\mathbf{x}_{i},\boldsymbol\theta] \leq t\\
& & & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})\,d[\mathbf{x}_{i},\boldsymbol\theta] \geq -t
\end{gr_replace}
\end{aligned} \tag{2.10}
\end{equation}
This constrained optimization problem can also be written as a regularized optimization problem in which the fairness constraints are moved to the objective and the corresponding Lagrange multipliers act as regularizers. Zafar et al. (2015) also introduced a second formulation where they maximize fairness under accuracy constraints.
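The constrained quantity in equation 2.10 is just an empirical covariance; a minimal sketch with a hypothetical helper name:

```python
import numpy as np

def boundary_covariance(p, d):
    """Zafar et al.'s proxy for unfairness: covariance between the protected
    attribute and the signed distance to the decision boundary."""
    return float(np.mean((p - p.mean()) * d))

# For a linear classifier, d would be the signed margin x @ theta + b.
p = np.array([0, 0, 1, 1])
d_biased = np.array([-1.0, -1.0, 1.0, 1.0])   # boundary separates the groups
d_fair = np.array([1.0, -1.0, 1.0, -1.0])     # margins unrelated to p
boundary_covariance(p, d_biased)  # 0.5
boundary_covariance(p, d_fair)    # 0.0
```

The constraint $|\mbox{cov}| \leq t$ then caps how strongly group membership can predict which side of the boundary an example falls on.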
Zemel et al. (2013) presented a method that maps data to an intermediate space in a way that depends on the protected attribute and obfuscates information about that attribute. Since this mapping is learnt during training, this method could be considered either a pre-processing approach or an in-processing algorithm.
Chen et al. (2018) argue that a trade-off between fairness and accuracy may not be acceptable and that these challenges should be addressed through data collection. They aim to diagnose unfairness induced by inadequate data and unmeasured predictive variables and prescribe data collection approaches to remedy these problems.
In this tutorial, we've discussed what it means for a classifier to be fair, how to quantify the degree of bias in a dataset, and methods to remedy unfairness at all stages in the pipeline. An empirical analysis of fairness-based interventions is presented in Friedler et al. (2019). There are a large number of toolkits available to help evaluate fairness, the most comprehensive of which is AI Fairness 360.
This tutorial has been limited to a discussion of supervised learning algorithms, but there is also an orthogonal literature on bias in NLP embeddings (e.g. Zhao et al. 2019).
The online course includes a four-week-long introduction to machine learning and deep learning, featuring lectures from Mila professors Yoshua Bengio, Laurent Charlin, Audrey Durand and Aaron Courville. Korbi, an NLP-based technology that has spun out of research at Mila, will use AI to personalize the learning experience by interacting with students live and guiding them through the course material. The course will be offered online for free to anyone interested in learning the introductory concepts of machine learning and deep learning.
“Borealis AI is excited to be collaborating with the Korbit team on this project. Providing AI training at scale is imperative for our communities, and we are proud to be supporting a Canadian startup company with such strong machine learning expertise” says Dr Simon Prince, Research Director at Borealis AI Montreal.
The program is offered in an effort to democratize AI education by a world-class research centre in Québec. AI experts have been in high demand in recent years, and Canada has among the world's brightest minds in the space. Mila and Korbit aim to reduce the AI talent gap in the industry through a platform that can be accessed and used to educate engineers and researchers worldwide.
To date, over 1,600 students around the world have signed up for Korbit’s machine learning course since May 2019. Preliminary results are promising, showing an increase in student engagement as well as an increase in positive learning outcomes after interacting with the AI tutor.
Click here to enrol in the course or to learn more about AI tutoring. See you in class!
XLNet is the latest and greatest model to emerge from the booming field of Natural Language Processing (NLP). The XLNet paper combines recent advances in NLP with innovative choices in how the language modelling problem is approached. When trained on a very large NLP corpus, the model achieves state-of-the-art performance for the standard NLP tasks that comprise the GLUE benchmark.
XLNet is an autoregressive language model which outputs the joint probability of a sequence of tokens based on the transformer architecture with recurrence. Its training objective calculates the probability of a word token conditioned on all permutations of word tokens in a sentence, as opposed to just those to the left or just those to the right of the target token.
If the above description made perfect sense, then this post is not for you. If it didn't, then read on to find out how XLNet works, and why it is the new standard for many NLP tasks.
In language modelling we calculate the joint probability distribution for sequences of tokens (words), and this is often achieved by factorizing the joint distribution into conditional distributions of one token given other tokens in the sequence. For example, given the sequence of tokens New, York, is, a, city the language model could be asked to calculate the probability $Pr$(New | is, a, city). This is the probability that the token New is in the sequence given that is, a, and city are also in the sequence (figure 1).
For the purpose of this discussion, consider that generally a language model takes a text sequence of $T$ tokens, $\mathbf{x} = [x_1, x_2,\ldots, x_T]$, and computes the probability of some tokens $\mathbf{x}^{\prime}$ being present in the sequence, given some others $\mathbf{x}^{\prime\prime}$ in the sequence: $Pr(\mathbf{x}^{\prime} \,|\, \mathbf{x}^{\prime\prime})$, where $\mathbf{x}^{\prime}$ and $\mathbf{x}^{\prime\prime}$ are non-overlapping subsets of $\mathbf{x}$.
Why would anyone want a model which can calculate the probability that a word is in a sequence? Actually, no one really cares about that^{1}. However, a model that contains enough information to predict what comes next in a sentence can be applied to other more useful tasks; for example, it might be used to determine who is mentioned in the text, what action is being taken, or if the text has a positive or negative sentiment. Hence, models are pretrained with the language modeling objective and subsequently fine-tuned to solve more practical tasks.
Let's discuss the architectural foundation of XLNet. The first component of a language model is a wordembedding matrix: a fixedlength vector is assigned for each token in the vocabulary and so the sequence is converted to a set of vectors.
Next we need to relate the embedded tokens in a sequence. A longtime favorite for this task has been the LSTM architecture, which relates adjacent tokens (e.g. the ELMo model), but recent state-of-the-art results have been achieved with transformers (e.g. the BERT model^{2}). The transformer architecture allows non-adjacent tokens in the sequence to be combined to generate higher-level information using an attention mechanism. This helps the model learn from the long-distance relations that exist in text more easily than LSTM-based approaches.
Transformers have a drawback: they operate on fixed-length sequences. What if knowing that New should occur in the sentence ____ York is a city also requires that the model have read something about the Empire State Building in a previous sentence? Transformer-XL resolves this issue by allowing the current sequence to see information from the previous sequences. It is this architecture that XLNet is based on.
XLNet's main contribution is not the architecture^{3}, but a modified language model training objective which learns conditional distributions for all permutations of tokens in a sequence. Before diving into the details of that objective, let's revisit the BERT model to motivate XLNet's choices.
The previous state of the art (BERT) used a training objective that was tasked with recovering words in a sentence which have been masked. For a given sentence, some tokens are replaced with a generic [mask] token, and the model is asked to recover the originals.
The XLNet paper argues that this isn't a great way to train the model. Let's leave the details of this argument to the paper and instead present a less precise argument that captures some of the important concepts.
A language model should encode as much information and nuances from text as possible. The BERT model tries to recover the masked words in the sentence The [mask] was beached on the riverside (figure 2). Words such as boat or canoe are likely here. BERT can know this because a boat can be beached, and is often found on a riverside. But BERT doesn't necessarily need to learn that a boat can be beached, since it can still use riverside as a crutch to infer that boat is the masked token.
Moreover, BERT predicts the masked tokens independently, so it doesn't learn how they influence one another. If the example was The [mask] was [mask] on the riverside, then BERT might correctly assign high probabilities to (boat, beached) and (parade, seen) but might also think (parade, beached) is acceptable.
Approaches such as BERT and ELMo improved on the state of the art by incorporating both left and right contexts into predictions. XLNet took this a step further: the model's contribution is to predict each word in a sequence using any combination of other words in that sequence. XLNet might be asked to calculate what word is likely to follow The. Lots of words are likely, but certainly boat is more likely than they, so it's already learned something about a boat (mainly that it's not a pronoun). Next it might be asked to calculate which is a likely 2^{nd} word given [3]was, [4]beached. And then it might be asked to calculate which is a likely 4^{th} word given [3]was, [5]on, [7]riverside.
In this way, XLNet doesn't really have a crutch to lean on. It is presented with difficult, and at times ambiguous, contexts from which to infer whether or not a word is in a sentence. This is what allows it to squeeze more information out of the training corpus (figure 3).
In practice, XLNet samples from all possible permutations, so it doesn't get to see every single relation. It also doesn't use very small contexts as they are found to hinder training. After applying these practical heuristics, it bears more of a resemblance to BERT.
In the next few sections we'll expand on the more challenging aspects of the paper.
Given a sequence $\mathbf{x}$, an autoregressive (AR) model is one which calculates the probability $Pr(x_i \,|\, x_{<i})$. In language modelling, this is the probability of a token $x_{i}$ in the sentence, conditioned on the tokens $x_{<i}$ preceding it. These conditioning words are referred to as the context. Such a model is asymmetric and isn't learning from all token relations in the corpus.
Autoregressive models such as ELMo also learn from relations between a token and those following it by running a second model over the reversed sequence; the AR objective in this case is $Pr(x_i \,|\, x_{>i})$. But why stop there? There could be interesting relations to learn from if we look at just the two nearest tokens, $Pr(x_i \,|\, x_{i-1}, x_{i+1})$, or really any combination of tokens, $Pr(x_i \,|\, x_{i-1}, x_{i+2}, x_{i-3})$.
XLNet proposes to use an objective which is an expectation over all such permutations. Consider a sequence x = [This, is, a, sentence] with T=4 tokens. Now consider the set of all 4! permutations $\mathcal{Z}$ = {[1, 2, 3, 4], [1, 2, 4, 3],. . ., [4, 3, 2, 1]}. The XLNet model is autoregressive over all such permutations; it can calculate the probability of token $x_i$ given preceding tokens $x_{<i}$ from any order $\mathbf{z}$ from $\mathcal{Z}$.
For example, it can calculate the probability of the 3^{rd} element given the two preceding ones from any permutation. The three permutations [1, 2, 3, 4], [1, 2, 4, 3] and [4, 3, 2, 1] above would correspond to $Pr$(a | This, is), $Pr$(sentence | This, is) and $Pr$(is | sentence, a). Similarly, the probability of the second element given the first would be $Pr$(is | This), $Pr$(is | This) and $Pr$(a | sentence). Considering all four positions and all 4! permutations, the model takes into consideration all possible dependencies.
These ideas are embodied in equation 3 from the paper:
\begin{equation*}
\hat{\boldsymbol\theta} = \mathop{\rm argmax}_{\boldsymbol\theta}\left[\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}}\left[\sum_{t=1}^{T} \log \left[Pr(x_{z[t]}\,|\,x_{z[<t]}) \right] \right]\right]
\end{equation*}
This criterion finds model parameters $\boldsymbol\theta$ to maximize the probability of tokens $x_{z[t]}$ in a sequence of length $T$ given preceding tokens $x_{z[<t]}$, where $z[t]$ is the t$^{th}$ element of a permutation $\mathbf{z}$ of the token indices and $z[<t]$ are the previous elements in the permutation. The sum of log probabilities means that for any one permutation the model is properly autoregressive as it is the product of the probability for each element in the sequence. The expectation over all the permutations in $\mathcal{Z}$ shows the model is trained to be equally capable of computing probabilities for any token given any context.
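The set of prediction problems this objective averages over can be enumerated explicitly for a short sequence; a sketch with a hypothetical helper name:

```python
import itertools

def permutation_contexts(tokens):
    """Enumerate every (target, context) prediction the permutation-LM
    objective can pose: predict tokens[z[t]] from the tokens at positions
    z[<t], for every factorization order z."""
    pairs = []
    for z in itertools.permutations(range(len(tokens))):
        for t in range(len(z)):
            context = [tokens[i] for i in sorted(z[:t])]
            pairs.append((tokens[z[t]], context))
    return pairs

pairs = permutation_contexts(["This", "is", "a", "sentence"])
len(pairs)  # 4! permutations x 4 positions = 96 prediction problems
```

Each token therefore appears as a target with every possible subset of the other tokens as context, which is exactly the "all possible dependencies" property described above.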
There is something missing from the way the model has been presented so far: how does the model know about word order? The model can compute $Pr$(This | is) as well as $Pr$(This | a). Ideally it should know something about the relative positions of This, is and a; otherwise it would just think all tokens in the sequence are equally likely to be next to one another. What we want is a model which predicts $Pr$(This | is, 2) and $Pr$(This | a, 3). In other words, it should know the indices of the context tokens.
The transformer architecture addresses this problem by adding positional information to token embeddings. You can think of the training objective terms as $Pr$(This | is+2). But if we really shuffled the sentence tokens, this mechanism would break. This problem is resolved by using an attention mask. When the model computes the context which is the input to the probability calculation, it always does so using the same token order, and simply masks those tokens not in the context under consideration (i.e., those that come later in the shuffled order).
As a concrete example, consider the permutation [3, 2, 4, 1]. When calculating the probability of the 1^{st} element in that order (i.e., token 3), the model has no context as the other tokens have not yet been seen, so the mask is [0, 0, 0, 0]. For the 2^{nd} element (token 2), the mask is [0, 0, 1, 0] as its only context is token 3. Following that logic, the 3^{rd} and 4^{th} elements (tokens 4 and 1) have masks [0, 1, 1, 0] and [0, 1, 1, 1]. Stacking all of these in token order gives the matrix (as seen in fig. 2(c) in the paper):
\begin{equation}
\begin{bmatrix}
0& 1& 1& 1 \\
0& 0& 1& 0\\
0& 0& 0& 0 \\
0& 1& 1& 0
\end{bmatrix}
\label{eqn:matrixmask}
\end{equation}
Another way to look at this is that the training objective will contain the following terms where underscores represent what has been masked:
$Pr$(This | ___, is+2, a+3, sentence+4)
$Pr$(is | ___, ___, a+3, ___)
$Pr$(a | ___, ___, ___, ___)
$Pr$(sentence | ___, is+2, a+3, ___)
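Constructing this mask from a permutation is mechanical: token $j$ is visible to token $i$ exactly when $j$ precedes $i$ in the permutation. A small sketch (tokens 1-indexed, as in the text):

```python
# Build the per-token attention mask for permutation z = [3, 2, 4, 1].
# Row i gives token i's visible context: token j is visible to token i
# iff j comes before i in the permutation.
z = [3, 2, 4, 1]
T = len(z)
pos = {tok: k for k, tok in enumerate(z)}  # position of each token in z

mask = [[1 if pos[j + 1] < pos[i + 1] else 0 for j in range(T)]
        for i in range(T)]

# Matches the matrix above (rows stacked in token order 1..4):
assert mask == [[0, 1, 1, 1],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 1, 1, 0]]
```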
There remains one oversight to address: we not only want the probability to be conditioned on the context token indices, but also on the index of the token whose probability is being calculated. In other words, we want $Pr$(This | 1, is+2): the probability of This given that it is the 1^{st} token and that is is the 2^{nd} token. But the transformer architecture encodes the positional information 1 and 2 within the embeddings for This and is. So this would look like $Pr$(This | This+1, is+2). Unfortunately, the model now trivially knows that This is part of the sentence and should be likely.
The solution to this problem is a two-stream self-attention mechanism. Each token position $i$ has two associated vectors at each self-attention layer $m$: $\mathbf{h}_i^m$ and $\mathbf{g}_i^m$. The $\mathbf{h}$ vectors belong to the content stream, while the $\mathbf{g}$ vectors belong to the query stream. The content stream vectors are initialized with token embeddings added to positional embeddings. The query stream vectors are initialized with a generic embedding vector $\mathbf{w}$ added to positional embeddings. Note that $\mathbf{w}$ is the same no matter the token, and thus cannot be used to distinguish between tokens.
At each layer, each content vector, $\mathbf{h}_i$, is updated using those $\mathbf{h}$'s that remain unmasked and itself (equivalent to unmasking the diagonal from the matrix shown in the previous section). Thus, $\mathbf{h}_3$ is updated with the mask $[0, 0, 1, 0]$, while $\mathbf{h}_2$ is updated with the mask $[0, 1, 1, 0]$. The update uses the content vectors as the query, key and value.
By contrast, at each layer each query vector $\mathbf{g}_{i}$ is updated using the unmasked content vectors and itself. The update uses $\mathbf{g}_i$ as the query while it uses $\mathbf{h}_j$'s as the keys and values, where $j$ is the index of an unmasked token in the context of $i$.
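In other words, the content-stream mask is simply the query-stream mask with the diagonal unmasked, so each $\mathbf{h}_i$ can also attend to itself. A small sketch:

```python
# Query-stream mask for permutation [3, 2, 4, 1] (from the earlier
# example). The content stream uses the same mask with the diagonal
# unmasked, so each h_i can also attend to itself.
query_mask = [[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 1, 1, 0]]

content_mask = [[1 if i == j else m for j, m in enumerate(row)]
                for i, row in enumerate(query_mask)]

# h_3 sees only itself; h_2 sees token 3 and itself, as described above.
assert content_mask[2] == [0, 0, 1, 0]
assert content_mask[1] == [0, 1, 1, 0]
```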
Figure 4 illustrates how the query $\mathbf{g}_4^m$ for the 4^{th} token at the $m$^{th} layer of self-attention is calculated. It shows that $\mathbf{g}_4^m$ is an aggregation of is+2, a+3 and the position 4, which is precisely the context needed to calculate the probability of the token sentence.
Following on from the last section, the training objective contains the following terms, where $*$ denotes the token position whose probability is being computed:
$Pr$(This | *, is+2, a+3, sentence+4)
$Pr$(is | ___, *, a+3, ___)
$Pr$(a | ___, ___, *, ___)
$Pr$(sentence | ___, is+2, a+3, *).
Does it work? The short answer is yes. The long answer is also yes. Perhaps this is not surprising: XLNet builds on previous state-of-the-art methods. It was trained on a corpus of 30 billion words (an order of magnitude more than was used to train BERT, and drawn from more diverse sources), and this training required significantly more hours of compute time than previous models:
ULMFiT: 1 GPU day
ELMo: 40 GPU days
BERT: 450 GPU days
XLNet: 2,000 GPU days
Table 1. Approximate computation time for training recent NLP models^{4}.
Perhaps more interestingly, XLNet's ablation study shows that it also works better than BERT in a fair comparison (figure 5). That is, when the model is trained on the same corpus as BERT, using the same hyperparameters and the same number of layers, it consistently outperforms BERT. Even more interestingly, XLNet also beats Transformer-XL in the fair comparison. Transformer-XL can be considered an ablation of the permutation AR objective, so the consistent improvement over its score is evidence for the strength of that method.
What is not resolved by the ablation study is the contribution of the two-stream self-attention mechanism to XLNet’s performance gains. It both allows the attention mechanism to explicitly take the target token position into account, and introduces additional latent capacity in the form of the query stream vectors. While it is an integral part of the XLNet architecture, it is possible that models such as BERT could also benefit from this mechanism without using the same training objective as XLNet.
^{1 }While the main purpose of pretrained language models is to learn linguistic features which are useful in downstream tasks, the actual language model's calculation of word probabilities can be useful for things like outlier detection and autocorrect.
^{2 }The BERT model is technically a masked language model as it isn't trained to maximize the joint probability of a sequence of tokens.
^{3 }In order to implement the NADE-like training objective, the XLNet paper also introduces some novel architecture choices which are discussed in later sections. However, for the purpose of this post, it is convenient to first discuss XLNet’s goal of creating a model which learns from bidirectional context, and then introduce the architectural work needed to achieve this goal.
^{4 }These values were derived using a speculative TPU-to-GPU hour conversion as explained in this post and rounded semi-arbitrarily.
He earned his PhD in psychology with a focus on human stereo vision. He’d later go on to study visual neurons in the mammalian brain. But, as he says, “somewhere along the way I realized I was an engineer and not a scientist. I didn’t want to understand how biological vision worked, I wanted to build my own version.” To that end, computer vision became an obvious direction and University College London his academic stomping grounds.
We are fortunate to welcome Simon as our newest research director. Simon joins Greg Mori (Vancouver) and Marcus Brubaker (Toronto) and will head up our Montreal lab, leading a group of strong researchers in one of Canada’s most dynamic AI ecosystems.
He will round out his team with the addition of Dr. Layla El Asri, a former research manager at the Microsoft Research lab in Montreal and a well-respected thought leader on multiple AI topics. Along with academic advisor Prof. Jackie Cheung, the lab will continue to focus its output on natural language processing (NLP), with its scope aimed at continuing to improve the ways clients interact with the bank. The next few months will see a swift ramp-up of a number of teams, each focused on building a different NLP-based product.
Simon’s long-term plan is to make Borealis AI Montreal “the most intellectually stimulating and fun place to do NLP research in the city,” which is a tall order in a world-class AI city. But his experience so far bodes well for his ambitions.
“People have been incredibly friendly and welcoming. I think there is a general acceptance that bringing more AI businesses to Montreal is overwhelmingly a good thing, even if it puts short-term pressure on the hiring market. I’m personally impressed by the quantity and scope of AI activity in the Mile-Ex neighborhood in particular: Borealis AI is at ground zero for AI in Montreal.”
Of course, our award-winning office design didn’t hurt. He’s grateful to walk each morning into a bright, open, fun space after many years of “working in basement labs in universities and only rarely seeing sunlight.” And the city is its own draw for the British expat: “it combines the best of North America and the best of Europe and it’s an amazing place to live if, like me, you love food, cycling and winter sports.”
We’ve now come to the part of the blog where Simon, in his self-effacing manner, would say, “enough about me.” He’s adamant that the space he co-creates with his colleagues serves the broader community and surpasses the needs of any top researcher looking for a home in a challenging, fast-moving, resource-rich, and stimulating environment with the potential for massive global impact.
In his own words, he says anyone joining Borealis AI can expect to have “excellent academic colleagues and a minimum of interfering middle management. They will work on challenging long-term goals without being forced to constantly switch between projects. They will have academic freedom to publish and an excellent supportive environment in which they can grow their skills.”
Challenge accepted.
To address these issues, we propose Metatrace, a meta-gradient descent based algorithm to tune the step-size online. Metatrace leverages the structure of eligibility traces, and works both for tuning a scalar step-size and a separate step-size for each parameter. We empirically evaluate Metatrace for actor-critic on the Arcade Learning Environment. Results show Metatrace can speed up learning and improve performance in non-stationary settings.
Given a fixed history of events and their corresponding times – like those shown below in Fig. 1 – multiple actions are possible in the future. In our CVPR paper of the same name, which we will be presenting this week in Long Beach, we propose a powerful generative approach that can effectively model the distribution over future actions.
To date, much of the work in this domain has focused on taking frame-level video data as input in order to predict the actions or activities that may occur in the immediate future. Time-series data often involves regularly spaced data points with interesting events occurring sparsely across time. We hypothesize that in order to model future events in such a scenario, it is beneficial to consider the history of sparse events alone (action categories and their temporal occurrence in the above example), instead of regularly spaced frame data. This approach also allows us to model high-level semantic meaning in the time-series data that can be difficult to discern from frame-level data.
More specifically, we are interested in modeling the distribution over future action category and action timing given the past history of sparse events. For action timing, we aim to model the distribution over inter-arrival time: the time difference between the starting times of two consecutive actions.
The contributions of this work center around the APP-VAE (Action Point Process VAE), a novel generative model for asynchronous action sequences in time. Fig. 2 shows the overall structure of our proposed framework. We formulate our model within the variational autoencoder (VAE) paradigm, a powerful class of probabilistic models that facilitate generation and the ability to model complex distributions. We present a novel form of VAE for action sequences under a point process approach. This approach has a number of advantages, including a probabilistic treatment of action sequences that allows for likelihood evaluation, generation, and anomaly detection.
Fig. 3 shows the architecture of our model. Overall, the input sequence of action categories and inter-arrival times is encoded using a recurrent VAE model. At each step, the model uses the history of actions to produce a distribution over latent codes $z_n$, a sample of which is then decoded into two probability distributions: one over the possible action categories and another over the inter-arrival time for the next action.
Since the true distribution over latent variables $z_n$ is intractable, we rely on a time-dependent posterior network $q_\phi(z_{n} \mid x_{1:n})$ that approximates it with a conditional Gaussian distribution $N(\mu_{\phi_n}, \sigma^2_{\phi_n})$.
To prevent $z_n$ from just copying $x_n$, we force $q_\phi(z_n \mid x_{1:n})$ to be close to the prior distribution $p(z_n)$ using a KL-divergence term. Here, in order to take the history of past actions into account in the generation phase, we learn a prior that varies across time and is a function of all past actions except the current one: $p_\psi(z_n \mid x_{1:n-1})$.
The sequence model generates two probability distributions: i) a categorical distribution over the action categories; and ii) a temporal point process distribution over the inter-arrival times for the next action.
The distribution over action categories $a_n$ is modeled with a multinomial distribution, since $a_n$ can only take a finite number of values: \begin{equation}
p^a_\theta(a_n=k \mid z_n) = p_k(z_n) \quad \text{and} \,\,\,\,
\sum_{k=1}^K{p_k(z_n)} =1 \label{eq:action}
\end{equation} where $p_k(z_n)$ is the probability of occurrence of action $k$, and $K$ is the total number of action categories.
The inter-arrival time $\tau_n$ is assumed to follow an exponential distribution parameterized by $\lambda(z_n)$, similar to a standard temporal point process model:
\begin{equation}
\begin{aligned}
p^{\tau}_{\theta}(\tau_n \mid z_n) =
\begin{cases}
\lambda(z_n)\, e^{-\lambda(z_n)\tau_n} & \text{if}~~ \tau_n \geq 0 \\
0 & \text{if}~~ \tau_n<0
\end{cases}
\end{aligned} \label{eq:time}
\end{equation}
where $p^{\tau}_{\theta}(\tau_n \mid z_n)$ is a probability density function over the random variable $\tau_n$ and $\lambda(z_n)$ is the intensity of the process, which depends on the latent variable sample $z_n$. We train the model by optimizing the variational lower bound over the entire sequence of $N$ steps:
\begin{align}
\mathcal{L}_{\theta,\phi}(x_{1:N}) = \sum_{n=1}^N(&{\mathop{\mathbb{E}}}_{q_\phi(z_{n} \mid x_{1:n})}[\log p_\theta{(x_n \mid z_{n})}] \\
& - D_{KL}(q_\phi(z_n \mid x_{1:n}) \,\|\, p_\psi(z_n \mid x_{1:n-1})))
\nonumber
\label{eq:loss}
\end{align}
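As a sanity check on the pieces of this bound, here is a small standard-library sketch with toy numbers. The univariate Gaussian KL and the single-sample estimate of the expectation are simplifications for illustration; the model itself uses multivariate latents and learned networks:

```python
import math

def log_lik(action_probs, k, lam, tau):
    """log p(x_n | z_n) = log p(a_n = k | z_n) + log p(tau_n | z_n),
    with a categorical action term and an exponential density for the
    inter-arrival time (lam * exp(-lam * tau) for tau >= 0)."""
    log_p_action = math.log(action_probs[k])
    log_p_time = math.log(lam) - lam * tau
    return log_p_action + log_p_time

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL(N(mu_q, sig_q^2) || N(mu_p, sig_p^2)), univariate case."""
    return (math.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

# One step of the bound: E_q[log p(x|z)] - KL(q || p), with a single
# sample from q standing in for the expectation (toy values).
elbo_step = log_lik([0.7, 0.2, 0.1], k=0, lam=2.0, tau=0.5) \
            - kl_gauss(0.3, 1.0, 0.0, 1.0)

# The KL vanishes when posterior and prior coincide.
assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0
```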
We empirically validate the efficacy of APPVAE for modeling action sequences on the MultiTHUMOS and Breakfast datasets. Experiments show that our model is effective in capturing the uncertainty inherent in tasks such as action prediction and anomaly detection.
Fig. 4 shows examples of diverse future action sequences that are generated by APPVAE given the history. For different provided histories, sampled sequences of actions are shown. We note that the overall duration and sequence of actions on the Breakfast Dataset are reasonable. Variations, e.g. taking the juice squeezer before using it, adding salt and pepper before cooking eggs, are plausible alternatives generated by our model.
Fig. 5 visualizes a traversal over one of the latent codes for three different sequences, obtained by uniformly sampling one $z$ dimension over the interval $[\mu - 5\sigma, \mu + 5\sigma]$ while fixing the others to their sampled values. As shown, this dimension correlates closely with the actions add salt-n-pepper, stir-fry egg and fry egg.
We further qualitatively examine the ability of the model to score the likelihood of individual test samples. We sort the test action sequences according to the average per-timestep likelihood, estimated by drawing samples from the approximate posterior distribution following the importance sampling approach. High-scoring sequences should be those that our model deems “normal,” while low-scoring sequences should be unusual ones. Tab. 1 shows some examples of sequences with low and high likelihood on the MultiTHUMOS dataset. We note that a regular, structured sequence of actions, such as jump, body roll, cliff diving for a diving activity or body contract, squat, clean and jerk for a weightlifting activity, receives high likelihood. However, repeated hammer throws or golf swings with no set-up actions receive a low likelihood.
Table 1 (below): Example of test sequences with high and low likelihood according to our learned model:
Test sequences with high likelihood: (sequence visualizations not reproduced here)
Test sequences with low likelihood: (sequence visualizations not reproduced here)
We presented a novel probabilistic model for point process data – a variational autoencoder that captures uncertainty in action times and category labels. As a generative model, it can produce action sequences by sampling from a prior distribution, the parameters of which are updated based on neural networks that control the distributions over the next action type and its temporal occurrence. Our model can be used to analyze and model asynchronous data in a wide variety of domains, such as social networks, earthquake events, and health informatics.
In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-the-art methods on a standard benchmark dataset drawn from the Wall Street Journal corpus, as well as in multiple new challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
Code and dataset here.
The answer boils down to trust.
Trust in a machine, or an algorithm, is difficult to quantify. It’s more than just performance — most people will not be convinced by being told research cars have driven X miles with Y crashes. You may care about when negative events happen. Were they all in snowy conditions? Did they occur at night? How robust is the system, overall?
In machine learning, we typically have a metric to optimize. This could mean we minimize the time to travel between points, maximize the accuracy of a classifier, or maximize the return on an investment. However, trust is much more subjective, domain dependent, and user dependent. We don’t know how to write down a formula for trust, much less how to optimize it.
This post argues that intelligibility is a key component of trust.^{1} The deep learning explosion has brought us many high-performing algorithms that can tackle complex tasks at superhuman levels (e.g., playing the games of Go and Dota 2, or optimizing data centers). However, a common complaint is that such methods are inscrutable “black boxes.”
If we cannot understand exactly how a trained algorithm works, it is difficult to judge its robustness. For example, one group of researchers trained a deep neural network to detect pneumonia from X-rays. The data was collected from both inpatient wards and an emergency department, which had very different rates of the disease. Upon analysis, the researchers realized that the X-ray machines added different information to the images – the network was focusing on the word “portable,” which was present only in the emergency department X-rays, rather than on medical characteristics of the picture itself. This example highlights how understanding a model can identify problems that would remain hidden if one focused only on the accuracy of the model.
Another reason to focus on intelligibility is in cases where we have properties we want to verify but cannot easily add to the loss function, i.e., the objective we wish to optimize. One may want to respect user preferences, avoid biases, and preserve privacy. An inscrutable algorithm may be difficult to verify, whereas an intelligible algorithm can be checked directly. For example, the black-box algorithm COMPAS is being used for assessing the risk of recidivism and has been accused of being racially biased by an influential ProPublica article. In Cynthia Rudin’s article "Please Stop Explaining Black Box Models for High-Stakes Decisions", she argues that her model (CORELS) achieves the same accuracy as COMPAS, but is fully understandable, as it consists of only 3 if/then rules, and it does not take race (or variables correlated with race) into account.
if (age = 18–20) and (sex = male) then predict yes
Rule list to predict 2-year recidivism found by CORELS.
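Rule lists of this kind are evaluated top-down: the first rule whose antecedent matches fires. A sketch, in which only the first rule comes from the text above; the remainder of the CORELS list is not reproduced here, so the default below is purely illustrative:

```python
# Rule list evaluation: the first matching rule determines the outcome.
# Only the first rule is taken from the text; the rest of the CORELS
# list is omitted, and the default is a placeholder.
rules = [
    (lambda x: 18 <= x["age"] <= 20 and x["sex"] == "male", "yes"),
    # ...the remaining rules of the full list would follow here...
]
default = "no"

def predict(x):
    for antecedent, outcome in rules:
        if antecedent(x):
            return outcome
    return default

assert predict({"age": 19, "sex": "male"}) == "yes"
assert predict({"age": 40, "sex": "female"}) == "no"
```

Because every prediction traces back to a single human-readable rule, each decision carries its own explanation.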
As mentioned above, intelligibility is context-dependent. But in general, we want some construct that can be understood given a user’s limited memory (people can hold roughly 7 ± 2 concepts in mind at once). There are three different ways we can think about intelligibility, which I enumerate next.
A local explanation of a model focuses on a particular region of operation. Continuing our autonomous car example, we could consider a local explanation to be one that explains how a car made a decision in one particular instance. The reasoning for a local explanation may or may not hold in other circumstances. A global explanation, in contrast, has to consider the entire model at once and thus is likely more complicated.
A more technically inclined user or model builder may have different requirements. First, they may think about the properties of the algorithm used. Is it guaranteed to converge? Will it find a near-optimal solution? Do the hyperparameters of the algorithm make sense? Second, they may think about whether all the inputs (features) to the algorithm seem useful and understandable. Third, is the algorithm “simulatable” (can a person calculate the outputs from the inputs) in a reasonable amount of time?
If a solution is intelligible, a user should be able to generate explanations about how the algorithm works. For instance, there should be a story about how the algorithm gets to a given output or behavior from its inputs. If the algorithm makes a mistake, we should be able to understand what went wrong. Given a particular output, how would the input have to change in order to get a different output?
There are four high-level ways of achieving intelligibility. First, the user can passively observe input/output sequences and formulate their own understanding of the algorithm. Second, a set of post-hoc explanations can be provided to the user that aim to summarize how the system works. Third, the algorithm could be designed with fewer black-box components so that explanations are easier to generate and/or are more accurate. Fourth, the model could be inherently understandable.
Observing the algorithm act may seem to be too simplistic. Where this becomes interesting is when you consider what input/output sequences should be shown to a user. The HIGHLIGHTS algorithm focuses on reinforcement learning settings and works to find interesting examples. For instance, the authors argue that in order to trust an autonomous car, one wouldn’t want to see lots of examples of driving on a highway in light traffic. Instead, it would be better to see a variety of informative examples, such as driving through an intersection, driving at night, driving in heavy traffic, etc. At the core of the HIGHLIGHTS method is the idea of state importance, or the difference between the value of the best and worst actions in the world at a given moment in time:
\begin{equation}
I(s) = \max_{a} Q^\pi(s,a) - \min_{a} Q^\pi(s,a)
\end{equation}
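A toy sketch of this importance score; the Q-values and state names below are made up for illustration:

```python
# State importance from Q-values: the gap between the best and worst
# action. States with large gaps are "important" and worth showing
# to a user in a summary.
def importance(q_values):
    return max(q_values) - min(q_values)

# Hypothetical Q-values for three driving situations:
q = {"highway_light_traffic": [1.0, 0.9, 0.95],
     "busy_intersection":     [2.0, -5.0, 0.5],
     "night_merge":           [1.5, -1.0, 0.0]}

ranked = sorted(q, key=lambda s: importance(q[s]), reverse=True)
assert ranked[0] == "busy_intersection"  # largest best-worst gap
```

On the highway, every action has roughly the same value, so little hinges on the agent's choice; at the intersection a wrong action is catastrophic, which is exactly what makes it an informative state to show.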
In particular, HIGHLIGHTS generates a summary of trajectories that capture important states an agent encountered. To test the quality of the generated examples, a user study was performed where people watched summaries of two agents playing Ms. Pacman and were asked to identify the better agent.
This animation shows the output of the HIGHLIGHTS algorithm in the Ms. Pacman domain https://goo.gl/79dqsd
Once the model learns to perform a task, a second model could be trained to explain it. The motivation is that a simpler model may represent most of the true model while being much more understandable. Explanations could be natural language, visualizations (e.g., saliency maps or t-SNE), rules, or other human-understandable constructs. The underlying assumption is that there is a fidelity/complexity trade-off: these explanations can help the user understand the model at some level, even if they are not completely faithful to the model.
For example, the LIME algorithm works on supervised learning methods, where it generates a more interpretable model that is locally faithful to a classifier. The optimization problem is set up to minimize the difference between the interpretable model $g$ and the actual function $f$ in some locality $\pi_x$, while also minimizing a measure of the complexity of $g$:
\begin{equation}
\xi(x) = \mbox{argmin}_{g \in G} ~~\mathcal{L} (f, g, \pi_x) + \Omega(g)
\end{equation}
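A minimal one-feature sketch of the idea, assuming a Gaussian proximity kernel for $\pi_x$ and a plain weighted least-squares line as the surrogate $g$ (the real LIME uses sparse linear models over interpretable representations):

```python
import math, random

random.seed(0)

def f(x):
    """Stand-in black-box model (toy, nonlinear)."""
    return x * x

def lime_1d(x0, width=0.5, n=200):
    """Fit a locally weighted linear surrogate g(x) = w*x + b near x0."""
    xs = [x0 + random.gauss(0, 1) for _ in range(n)]
    ws = [math.exp(-(x - x0) ** 2 / width ** 2) for x in xs]  # pi_x
    # Weighted least squares, closed form for a single feature.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * f(x) for w, x in zip(ws, xs)) / sw
    num = sum(w * (x - mx) * (f(x) - my) for w, x in zip(ws, xs))
    den = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = num / den
    return slope, my - slope * mx

slope, _ = lime_1d(x0=1.0)
# Near x0 = 1 the surrogate slope should approximate f'(1) = 2.
assert 1.5 < slope < 2.5
```

The surrogate is only trusted near $x_0$: the same quadratic black box would yield a very different local line around $x_0 = -1$, which is precisely the local (rather than global) nature of the explanation.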
The paper also introduces SP-LIME, an algorithm that selects a set of representative instances by exploiting the submodularity principle to greedily add non-overlapping examples that cover the input space while giving examples of different, relevant outputs.
A novel approach to automated rationale generation for reinforcement learning agents is presented by Ehsan et al. Many people were asked to play the game of Frogger and, while playing, to explain why they executed an action in a given state. This large corpus of states/actions/explanations is then fed into a model. The explanation model can then provide so-called rationales for actions from different states, even if the actual agent controlling the game’s avatar uses something like a neural network. The explanations may be plausible, but there is no guarantee that they match the actual reasons the agent acted the way it did.
The Deep neural network Rule Extraction via Decision tree induction (DeepRED) algorithm is able to extract human-readable rules that approximate the behavior of multi-layer neural networks performing multi-class classification. The algorithm takes a decompositional approach: starting with the output layer, each layer is explained in terms of the previous layer, and then the rules (produced by the C4.5 algorithm) are merged into a rule set for the entire network. One potential drawback of the method is that it is not clear whether the resulting rule sets are indeed interpretable, or whether the number of terms needed to reach appropriate fidelity would overwhelm a user.
In order to make a better explanation, or summary, of a model’s predictions, the learning algorithm itself could be modified. That is, rather than having bolt-on explainability, the underlying training algorithm could be enhanced so that it is easier to generate the post-hoc explanation. For example, González et al. build upon the DeepRED algorithm by sparsifying the network and driving hidden units to either maximal or minimal activations. The first goal of the algorithm is to prune connections from the network without reducing accuracy by much, with the expectation that “rules extracted from minimally connected neurons will be simpler and more accurate.” The second goal is to use a modified loss function that drives activations to maximal or minimal values, attempting to binarize the activation values, again without reducing accuracy by much. Experiments show that models generated in this manner are more compact, both in the number of terms and the number of expressions.
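The paper's exact loss is not reproduced here; one common form of such a binarization term, shown purely for illustration, penalizes each activation in proportion to $a(1-a)$, which is zero at the extremes and largest at $a = 0.5$:

```python
# Illustrative (assumed) activation-binarization penalty: a*(1-a) is
# zero when a is 0 or 1 and peaks at a = 0.5, so adding it to the task
# loss pushes hidden activations toward extreme values.
def binarization_penalty(activations):
    return sum(a * (1.0 - a) for a in activations)

assert binarization_penalty([0.0, 1.0]) == 0.0   # already binary
assert binarization_penalty([0.5]) == 0.25       # maximally ambiguous
```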
A more radical solution is to focus on models that are easier to understand (e.g., “white-box” models like rule lists). Here, the assumption is that there is not a strong performance/complexity trade-off. Instead, the goal is to do (nearly) as well as black-box methods while maintaining interpretability. One example is the recent work by Okajima and Sadamasa on “Deep Neural Networks Constrained by Decision Rules,” which trains a deep neural network to select human-readable decision rules. These rule-constrained networks make decisions by selecting “a decision rule from a given decision rule set so that the observation satisfies the antecedent of the rule and the consequent gives a high probability to the correct class.” Therefore, every decision made is supported by a decision rule, by definition.
The Certifiably Optimal RulE ListS (CORELS) method, as mentioned above, is a way of producing an optimal rule list. In practice, its branch-and-bound method solves a difficult discrete optimization problem in a reasonable amount of time. One of the take-home arguments of the article is that simple rule lists can perform as well as complex black-box models – if this is true, shouldn’t white-box models be preferred?
One current research project we're working on at Borealis AI focuses on making deep reinforcement learning more intelligible. Current deep RL methods are opaque: it is difficult to explain to clients how an agent works, and difficult to explain its individual decisions. While we are still early in our research, we are investigating methods in all of the categories outlined above. The long-term goal of this research is to bring explainability into customer-facing models to ultimately help customers understand, and trust, our algorithms.
^{1}Note that we choose the word “intelligibility” with purpose. The questions discussed in this blog are related to explainability, interpretability, and “XAI,” and more broadly to safety and trust in artificial intelligence. However, we wish to emphasize that it is important for the system to be understood and that this may take some effort on the part of the subject understanding the system. Providing an explanation may or may not lead to this outcome — the example may be unhelpful, inaccurate, or even misleading with respect to the system’s true operation.
By systematically controlling the frequency components of the perturbation, and evaluating against the top-placing defense submissions in the NeurIPS 2017 competition, we empirically show that performance improvements in both the white-box and black-box transfer settings are yielded only when low frequency components are preserved. In fact, the defended models based on adversarial training are roughly as vulnerable to low frequency perturbations as undefended models, suggesting that the purported robustness of state-of-the-art ImageNet defenses relies on adversarial perturbations being high frequency in nature. We do find that under the competition's $\ell_\infty$ distortion bound of $\epsilon = 16/255$, low frequency perturbations are indeed perceptible. This questions the use of the $\ell_\infty$ norm, in particular, as a distortion metric and, in turn, suggests that explicitly considering the frequency space is promising for learning robust models which better align with human perception.
How do you surpass a bar that’s already high? One way is to build something in the mountains.
With a peek of the Rockies from the window, our Vancouver research centre officially opens its doors today. RBC President and CEO Dave McKay will formally inaugurate the space and later join an informal panel moderated by John Stackhouse with Foteini Agrafioti and Vancouver research director Greg Mori. It's the final ribbon-cutting in a jam-packed season that also saw the completion of our new Waterloo and Edmonton locations.
Partnering once again with design firm Lemay, our vision was for each office to have its own identity and proudly represent the city it inhabits. This allowed Lemay full creative freedom to come up with interesting concepts that you wouldn’t normally find in a research centre.
Take a look at what we cooked up in the lab:
The main inspiration behind this research centre came from the institution that helped put the Kitchener-Waterloo corridor on the map: The University of Waterloo. With this theme in mind, we designed rooms to pay homage to campus life. There’s a student lounge, teacher’s lounge, science lab and track and field pitch. And what’s campus life without movies about campus life? When it came down to the decorative details, we honoured retro classics like Grease, The Breakfast Club and Teachers while acknowledging modern masterpieces like The Big Bang Theory.
While our Edmonton team has been established for two years, we wanted to make sure they had a space that reinforced the excellence of their research. In early February, the team moved into a place that drew inspiration from the city's great winter escape: West Edmonton Mall. The mall is legendary not just for its size, but for providing a scope of attractions that make it plausible to spend three entire months indoors. And when “fun” is your theme, the design possibilities create themselves. We set up a bowling alley meeting room (with pins hanging from the ceiling), a pirate ship kitchen (galley? kitchen?), mini-golf, and a water park in the living room. The pièce de résistance, and the feature that makes every other non-Edmonton-based researcher jealous, is the sprawling outdoor terrace where the team can enjoy the Aurora Borealis on nights that don’t freeze water on impact.
Vancouverites are known for being a laid-back crew, so we tried a more mellow approach to our interior design. After all, it’s impossible to improve upon the natural beauty of the West Coast landscape. Our Vancouver space was designed to evoke scenes of cozy ski lodges, wildlife and mountain climbs. There’s even a room modeled after the city’s favourite mode of transportation – the bicycle – which is the embodiment of high- and low-tech in perfect tandem.
These spaces are even better in person. Be sure to check our Careers page regularly for job openings, or our Fellowships section for application opportunities.
When the world lined up to start the AI race, Canada was out of the gate like Andre de Grasse, taking the lead on research and development.
Now, can we lead on tackling its ethical and societal implications?
The news is flooded with examples of AI fails: algorithms that favour male job applicants over women, or image recognition software failing to correctly identify people of colour.
Dr. Foteini Agrafioti, the Head of Borealis AI and one of the country’s strongest voices on ensuring AI is ethical, was announced last week as the co-chair of Canada’s new Advisory Council on AI. She led the latest RBC Disruptors conversation about battling bias in AI with Dr. Elissa Strome, Executive Director, Pan-Canadian AI Strategy at CIFAR, and Dr. Layla El Asri, Research Manager, Microsoft Research Montréal.
Here are their thoughts on what the scientific community, governments and ordinary citizens can do to confront bias in artificial intelligence, and position Canada as a leader in ethical AI.
Bias has long existed in our society – and so it exists in our data. El Asri sees this as an opportunity. Unlike our own unconscious bias, we can at least uncover bias in an algorithm. To do this, companies need to be auditing their AI for bias every step of the way, as the major labs are now doing. El Asri credited Canadian leaders, such as AI pioneer Yoshua Bengio, for developing a will in Canada’s tech community to develop AI in a responsible way.
Right now, artificial intelligence is being developed by a very narrow subset of society: mainly highly educated men who went to the same schools, and now live in the same cities. Only 18% of AI researchers are women, a fact that Strome called “terrible.” Organizations like CIFAR are working to bring more voices into the development of AI, with initiatives such as the AI for Good Summer Lab, a seven-week training program for undergraduate women in AI.
AI is only as good as the data it’s trained on. “If your data is not representative enough, your model is not going to work,” El Asri said. There needs to be more vigilance in ensuring data is representative — an area where Canada has a homegrown advantage. If you’re working with data collected in a multicultural country like ours, you’re likely working with data that represents different ethnic backgrounds. This kind of data will be essential to building technology that works for everyone, especially when it comes to something like health care.
Right now, it’s really just the tech community and policymakers talking about issues that are going to transform our society. We need to broaden that perspective, building in consultation with social scientists as an integral part of the development process. A recent CIFAR initiative brought together computer scientists and social scientists for a day to discuss the social, legal and ethical implications of AI. “The computer scientists were so eager to get their advice and insights,” Strome said. Similarly, at Microsoft, El Asri noted that their AI and ethics committees are made up of people from different disciplines, including anthropologists and historians.
“There’s a lot of fear and misunderstanding and myths about AI,” Strome said. Over the next few years, it’s going to be critical to bring the public into the AI conversation. People need to be aware of the positive implications, as well as the risks, that AI will have on their lives. The better the next generation understands AI and its societal and ethical implications, the better prepared they’ll be to ask tough questions of their leaders. Agrafioti suggested that Canadian culture is particularly attuned to ensuring fairness, casting a critical eye on technology before implementing it. Our balance of technical expertise and social values is exactly what’s needed to make sure the product that gets to market is ethical.
AI has been advancing much faster than any government can regulate it — so it was big news this week when the OECD adopted a set of AI principles, which set values-based standards for developing AI. Our leaders have an incredibly important role to play in developing policy and regulations around the use of AI, both domestically and internationally. Strome noted that Canada’s solid international reputation could go a long way in urging the world to play catch-up. Last summer, Prime Minister Trudeau and President Macron announced a joint Canada-France initiative on an International Panel on AI to support and guide the responsible adoption of AI, grounded in human rights. The first symposium will be in Paris this fall.
Solving bias in machines will take a human touch — and there’s no country better positioned than Canada to take the reins.
Our team is composed of 2 neural networks trained with state-of-the-art deep reinforcement learning algorithms and makes use of concepts like reward shaping, curriculum learning, and an automatic reasoning module for action pruning. Here, we describe these elements and, additionally, we present a collection of open-sourced agents that can be used for training and testing in the Pommerman environment. Code available here.
We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural nets trained on MNIST and CIFAR-10 variants achieve comparable clean accuracies, but they exhibit very different robustness when adversarially trained. This counterintuitive phenomenon suggests that input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss practical implications on evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
Machine learning models have demonstrated a vulnerability to adversarial perturbations. Adversarial perturbations are minor modifications to the input data that can cause machine learning models to output completely different predictions, yet are not perceptible to the human eye.
Here’s an example (with the Jupyter notebook) of what that looks like: in the figure below, after a very small perturbation (middle image) is added to the panda image (right), the neural network recognizes the perturbed image (left) as a bucket. However, to a human observer, the perturbed panda looks exactly the same as the original panda.
Adversarial perturbations pose risks to machine learning applications that can have real-world impact. For example, researchers have shown that by putting black and white patches on a stop sign, state-of-the-art object detection systems can no longer recognize the stop sign correctly.
This problem is not restricted to images: speech recognition systems and malware detection systems have all been shown to be vulnerable to similar attacks. More broadly, realistic adversarial attacks could happen on machine learning systems whenever it might be profitable for adversaries, for instance, on fraud detection systems, identity recognition systems, and decision making systems.
AdverTorch (repo, report) is a tool we built at the Borealis AI research lab that implements a series of attack-and-defense strategies. The idea behind it emerged back in 2017, when my team began to do some focused research around adversarial robustness. At the time, we only had two tools at our disposal: CleverHans and Foolbox.
While these are both good tools, they had their respective limitations. Back then, CleverHans was only set up for TensorFlow, which limited its usage in other deep learning frameworks (in our case, PyTorch). Moreover, the static computational graph nature of TensorFlow makes the implementation of attacks less straightforward. For anyone new to this type of research, it can be hard to understand what’s going on if the attack is written in a static graph language.
Foolbox, on the other hand, contains various types of attack methods, but it only supports running attacks image-by-image, not batch-by-batch. This makes it slow to run and thus only suitable for evaluations. At the time, Foolbox also lacked variety in its attacks, e.g. the projected gradient descent attack (PGD) and the Carlini-Wagner $\ell_2$-norm constrained attack.
In the absence of a toolbox that would serve more of our needs, we decided to implement our own. Creating a proprietary tool would also allow us to use our favorite language – PyTorch – which was not an option with the others.
Our aim was to provide researchers with tools for pursuing different research directions in adversarial robustness. For now, we’ve built AdverTorch primarily for researchers and practitioners who have some algorithmic understanding of the methods.
We had the following design goals in mind:
Resources permitting, we are also working to make it more user-friendly in the future.
For gradient-based attacks, we have the fast gradient (sign) methods (Goodfellow et al., 2014), projected gradient descent methods (Madry et al., 2017), the Carlini-Wagner attack (Carlini and Wagner, 2017), the spatial transformation attack (Xiao et al., 2018) and more. We also implemented a few gradient-free attacks including the single pixel attack, the local search attack (Narodytska and Kasiviswanathan, 2016), and the Jacobian saliency map attack (Papernot et al., 2016).
Besides specific attacks, we also implemented a convenient wrapper for the Backward Pass Differentiable Approximation (Athalye et al., 2018), which is an attack technique that enhances gradient-based attacks when attacking defended models that have non-differentiable or gradient-obfuscating components.
In terms of defenses, we considered two strategies: i) preprocessing-based defenses and ii) robust training. For preprocessing-based defenses, we implemented the JPEG filter, bit squeezing, and different kinds of spatial smoothing filters.
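To give a flavour of the preprocessing style of defense, bit squeezing simply quantises each pixel to a small number of bits, rounding away the low-amplitude changes that many adversarial perturbations rely on. A minimal stand-alone sketch of the idea, not AdverTorch's implementation:

```python
import numpy as np

def bit_squeeze(x, bits):
    # Quantise inputs in [0, 1] to 2**bits levels; perturbations smaller
    # than half a quantisation step are rounded away entirely.
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```

A perturbation of ±0.01 per pixel, for example, survives 8-bit images but vanishes after squeezing to 1 bit.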
For robust training methods, we implemented them as examples in our repo. So far, we have a script for adversarial training on MNIST, which you can access here, and we plan to add more examples with different methods on various datasets.
We use the fast gradient sign attack as an example of how to create an attack in AdverTorch. `GradientSignAttack` can be found at `advertorch.attacks.one_step_gradient`. To create an attack on a classifier, we need `Attack` and `LabelMixin` from `advertorch.attacks.base`:

```python
from advertorch.attacks.base import Attack
from advertorch.attacks.base import LabelMixin
```

`Attack` is the base class of all attacks in AdverTorch. It defines the API of an attack. The core of it looks like this:
```python
class Attack(object):

    def __init__(self, predict, loss_fn, clip_min, clip_max):
        self.predict = predict
        self.loss_fn = loss_fn
        self.clip_min = clip_min
        self.clip_max = clip_max

    def perturb(self, x, **kwargs):
        error = "Subclasses must implement perturb."
        raise NotImplementedError(error)

    def __call__(self, *args, **kwargs):
        return self.perturb(*args, **kwargs)
```
An attack contains three core components:

- `predict`: the function we want to attack;
- `loss_fn`: the loss function we maximize during the attack;
- `perturb`: the method that implements the attack algorithm.

Let’s illustrate these components with `GradientSignAttack` as an example.
```python
class GradientSignAttack(Attack, LabelMixin):
    """
    One step fast gradient sign method (Goodfellow et al, 2014).
    Paper: https://arxiv.org/abs/1412.6572

    :param predict: forward pass function.
    :param loss_fn: loss function.
    :param eps: attack step size.
    :param clip_min: minimum value per input dimension.
    :param clip_max: maximum value per input dimension.
    :param targeted: indicate if this is a targeted attack.
    """

    def __init__(self, predict, loss_fn=None, eps=0.3, clip_min=0.,
                 clip_max=1., targeted=False):
        """
        Create an instance of the GradientSignAttack.
        """
        super(GradientSignAttack, self).__init__(
            predict, loss_fn, clip_min, clip_max)

        self.eps = eps
        self.targeted = targeted
        if self.loss_fn is None:
            self.loss_fn = nn.CrossEntropyLoss(reduction="sum")

    def perturb(self, x, y=None):
        """
        Given examples (x, y), returns their adversarial counterparts with
        an attack length of eps.

        :param x: input tensor.
        :param y: label tensor.
            - if None and self.targeted=False, compute y as predicted
              labels.
            - if self.targeted=True, then y must be the targeted labels.
        :return: tensor containing perturbed inputs.
        """
        x, y = self._verify_and_process_inputs(x, y)

        xadv = x.requires_grad_()

        ###############################
        # start: the attack algorithm #
        outputs = self.predict(xadv)
        loss = self.loss_fn(outputs, y)
        if self.targeted:
            # for targeted attacks we descend, rather than ascend, the loss
            loss = -loss
        loss.backward()
        grad_sign = xadv.grad.detach().sign()

        xadv = xadv + self.eps * grad_sign
        xadv = clamp(xadv, self.clip_min, self.clip_max)
        # end: the attack algorithm #
        ###############################

        return xadv
```
`predict` is the classifier, while `loss_fn` is the loss function for gradient calculation. The `perturb` method takes `x` and `y` as its arguments, where `x` is the input to be attacked and `y` is the true label of `x`. `predict(x)` contains the “logits” of the neural network. The `loss_fn` could be the cross-entropy loss function or another suitable loss function that takes `predict(x)` and `y` as its arguments.
Thanks to the dynamic computation graph nature of PyTorch, the actual attack algorithm can be implemented in a straightforward way with a few lines. For other types of attacks, we just need to replace the algorithm part of the code in `perturb` and change what parameters to pass to `__init__`.
Note that the decoupling of these three core components is flexible enough to allow more versatile attacks. In general, we require `predict` and `loss_fn` to be designed in such a way that `loss_fn` always takes `predict(x)` and `y` as its inputs. As such, no knowledge about `predict` and `loss_fn` is required by the `perturb` method. For example, `FastFeatureAttack` and `PGDAttack` share the same underlying `perturb_iterative` function, but differ in their `predict` and `loss_fn`. In `FastFeatureAttack`, `predict(x)` outputs the feature representation from a specific layer, `y` is the guide feature representation that we want `predict(x)` to match, and `loss_fn` becomes the mean squared error.
More generally, `y` can be any target of the adversarial perturbation, while `predict(x)` can output more complex data structures as long as `loss_fn` can take them as its inputs. For example, we might want to generate one perturbation that fools both model A’s classification result and model B’s feature representation at the same time. In this case, we just need to make `y` and `predict(x)` tuples of labels and features, and modify `loss_fn` accordingly. There is no need to modify the original perturbation implementation.
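To make this decoupling concrete outside of PyTorch, here is a framework-free sketch in NumPy (hypothetical helper names; the gradient is taken numerically purely for illustration). The attack step touches `predict` and `loss_fn` only through their outputs, so swapping either one changes the attack without altering the perturbation logic:

```python
import numpy as np

def fgsm_step(predict, loss_fn, x, y, eps):
    # One sign-gradient step. Only predict(x) and loss_fn(predict(x), y)
    # are used, mirroring the Attack API's separation of concerns.
    base = loss_fn(predict(x), y)
    grad = np.zeros_like(x)
    flat = grad.ravel()  # view into grad
    h = 1e-5
    for i in range(x.size):
        xp = x.ravel().copy()
        xp[i] += h
        # forward-difference estimate of d loss / d x_i
        flat[i] = (loss_fn(predict(xp.reshape(x.shape)), y) - base) / h
    return x + eps * np.sign(grad)
```

Replacing `predict` with a feature extractor and `loss_fn` with a mean squared error turns the same step into a feature-matching attack, exactly as in the PyTorch version.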
As mentioned above, AdverTorch provides modules for preprocessing-based defenses and examples for robust training.
We use `MedianSmoothing2D` as an example to illustrate how to define a preprocessing-based defense.
```python
class MedianSmoothing2D(Processor):
    """
    Median Smoothing 2D.

    :param kernel_size: aperture linear size; must be odd and greater than 1.
    :param stride: stride of the convolution.
    """

    def __init__(self, kernel_size=3, stride=1):
        super(MedianSmoothing2D, self).__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        padding = int(kernel_size) // 2
        if _is_even(kernel_size):
            # both ways of padding should be fine here
            # self.padding = (padding, 0, padding, 0)
            self.padding = (0, padding, 0, padding)
        else:
            self.padding = _quadruple(padding)

    def forward(self, x):
        x = F.pad(x, pad=self.padding, mode="reflect")
        x = x.unfold(2, self.kernel_size, self.stride)
        x = x.unfold(3, self.kernel_size, self.stride)
        x = x.contiguous().view(x.shape[:4] + (-1, )).median(dim=-1)[0]
        return x
```
The preprocessor is simply a `torch.nn.Module`. Its `__init__` function takes the necessary parameters, and the `forward` function implements the actual preprocessing algorithm. When using `MedianSmoothing2D`, it can be composed with the original model to become a new model:
```python
median_filter = MedianSmoothing2D()
new_model = torch.nn.Sequential(median_filter, model)
y = new_model(x)
```
or it can be called sequentially:
```python
processed_x = median_filter(x)
y = model(processed_x)
```
We provide an example of how to use AdverTorch to do adversarial training (Madry et al. 2018) in `tutorial_train_mnist.py`. Compared to regular training, we only need two changes. The first is to initialize an adversary before training starts.
```python
if flag_advtrain:
    from advertorch.attacks import LinfPGDAttack
    adversary = LinfPGDAttack(
        model, loss_fn=nn.CrossEntropyLoss(reduction="sum"), eps=0.3,
        nb_iter=40, eps_iter=0.01, rand_init=True, clip_min=0.0,
        clip_max=1.0, targeted=False)
```
The second is to generate the “adversarial minibatch” during training, and use it to train the model instead of the original minibatch.
```python
if flag_advtrain:
    advdata = adversary.perturb(clndata, target)
    with torch.no_grad():
        output = model(advdata)
    test_advloss += F.cross_entropy(
        output, target, reduction='sum').item()
    pred = output.max(1, keepdim=True)[1]
    advcorrect += pred.eq(target.view_as(pred)).sum().item()
```
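The same two changes can be seen end-to-end in a framework-free caricature of adversarial training (a toy sketch with hypothetical names, not the MNIST tutorial code): generate the adversarial minibatch, then take the gradient step on the perturbed inputs instead of the clean ones. For a linear model the FGSM input gradient has a closed form, so no autograd is needed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_train_step(w, x, y, eps, lr):
    """One adversarial-training step for binary logistic regression.

    For a linear model the input gradient of the loss is (p - y) * w,
    so the FGSM adversarial minibatch is exact."""
    # 1) build the adversarial minibatch
    p = sigmoid(x @ w)
    grad_x = (p - y)[:, None] * w[None, :]
    x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)
    # 2) train on the perturbed inputs instead of the clean ones
    p_adv = sigmoid(x_adv @ w)
    grad_w = x_adv.T @ (p_adv - y) / len(y)
    return w - lr * grad_w
```

The structure mirrors the tutorial: an inner attack that maximizes the loss, wrapped by an outer update that minimizes it on the attacked batch.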
Since building the toolkit, we’ve already used it for two papers: i) On the Sensitivity of Adversarial Robustness to Input Data Distributions; and ii) MMA Training: Direct Input Space Margin Maximization through Adversarial Training. It’s our sincere hope that AdverTorch helps you in your research and that you find its components useful. Of course, we welcome any contributions from the community and would love to hear your feedback. You can open an issue or a pull request for AdverTorch, or email me at gavin.ding@borealisai.com.
While deep learning applications have successfully integrated into multiple product categories, RL has had a slower initiation. The recent momentum behind RL-based commercialization has been propelled by research advancements that have naturally lent themselves to product ideas in specific sectors, like financial markets, health care and marketing. Once competition ignites, this early trickle is predicted to burst into a gushing pipeline.
But RL algorithms are not your standard, run-of-the-mill solutions and it’s unwise to treat them as such. Most pressingly, they’re continual learning algorithms, which means the type of data they require, combined with their potential industry disruption, demands that privacy techniques catch up with the privacy challenges these algorithms pose. One such technique is differential privacy.
The notion of privacy is intuitively difficult to translate into a technical definition. One standard definition around which the academic community has coalesced comes from a framework called differential privacy. Differential privacy centers on whether an individual’s participation in a dataset is discernible: an algorithm that acts on a dataset is called differentially private if an individual’s presence in, or removal from, that dataset has minimal impact on the algorithm’s output. Differential privacy is achieved by adding perturbation – or “noise” – during the algorithmic training process. The level and location of the noise are finely calibrated according to the degree of privacy and accuracy desired and the properties of the dataset and algorithm.
Standard differential privacy techniques work on fixed data sets that are already known to researchers. This prior knowledge allows researchers to decide how much noise to add to the data set in order to protect individual privacy. A standard example of how this works is to compile an aggregate statistic on how many people have done activity ‘x’, then set the parameters so that we end up with essentially the same statistical result whether we keep or remove any individual from the data set. But what happens when the data come from a continuous state space, are dynamic and constantly changing, and we are continuously learning? For that, we need a new approach.
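For the fixed-dataset setting just described, the classic recipe is the Laplace mechanism: a counting query changes by at most 1 when any individual is added or removed (sensitivity 1), so Laplace noise with scale 1/ε gives ε-differential privacy. A minimal sketch with illustrative names, not code from the paper:

```python
import numpy as np

def private_count(records, predicate, epsilon, rng):
    # Sensitivity of a count is 1: adding or removing one person changes
    # it by at most 1, so Laplace noise with scale 1/epsilon suffices.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
```

Smaller ε means stronger privacy and noisier answers; the calibration step is exactly the "how much noise" decision mentioned above.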
In our paper, Private Q-Learning with Functional Noise in Continuous Spaces, we focus on finding new avenues to address this complexity. We do this by approaching general concepts of differential privacy, then abstracting them and applying them to a different space. So, instead of adding scalar noise to a vector, we focus on protecting the reward function, adding perturbations as it gets updated by the algorithm.
This step is important because a reward function reveals the value of actions and, therefore, the latent preferences of users. So, for example, when you click the thumbs-up button on a social media app, this action gets codified as a “reward” that informs the “policy” for what the algorithm should do next time it identifies a similar user in a similar state. Our approach protects the “why” – the motivation or intent – of the individual’s decision. It blocks individual preferences from being identified while still allowing for the abstraction of the policy. This protects the motivation for the reward instead of the outcome. We want to protect the fact that the system has learned about your diehard fandom for indie music, while enabling the algorithm to build intelligence so it can personalize recommendations to different users.
We applied privacy to a setting that can be generalized to a variety of learning tasks – the Q-learning framework of reinforcement learning – where the objective was to maximize the action-value function. We used function approximators (i.e., a neural network) parameterized by θ to learn the optimal action-value function. In particular, we considered the continuous state space setting, where the action-value function Q(s, a) was assumed to be a set of m functions defined on the interval [0, 1] and, similarly, the reward was a set of m functions each defined on the interval [0, 1].
Standard perturbation methods for ML models achieve DP by adding noise to vectors – the input to the algorithm, the output of the algorithm, or gradient vectors within the ML model. In our case, we aimed to protect the reward function, which can depend on a high-dimensional context. Using standard methods to add perturbation would mean that the amount of noise to be added would grow quickly to infinity if the continuous state space were discretized. Since we wanted to perturb the action-value functions, we added functional noise, rather than vector-valued noise as in standard methods. This functional noise was a sample path of an appropriately parametrized Gaussian process and was added to the function released by our Q-learning algorithm. As in the standard methods, the noise was parametrized by the sensitivity of the query, which, for vector-valued noise, is the ℓ2 norm of the difference in output between two datasets that differ in one individual’s value. Here, since we considered reward functions whose values change with the (random) state of the environment, we used the notion of Mahalanobis distance for sensitivity, which captures the idea of a distance from one point to a set of sampled points.
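A sketch of the functional-noise idea on a discretised [0, 1] grid: draw a sample path from a Gaussian process with an RBF covariance and add it to the released action-value function. The parameter values here are illustrative; the paper calibrates the kernel to the sensitivity described above:

```python
import numpy as np

def gp_sample_path(grid, length_scale, sigma, rng):
    # RBF covariance over the grid; one draw is a smooth random function,
    # i.e. a sample path of the Gaussian process evaluated on the grid.
    diff = grid[:, None] - grid[None, :]
    cov = sigma ** 2 * np.exp(-0.5 * (diff / length_scale) ** 2)
    cov += 1e-8 * np.eye(len(grid))  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(grid)), cov)

def release_noisy_q(q_values, grid, length_scale, sigma, rng):
    # Perturb the whole function, not a single vector of outputs.
    return q_values + gp_sample_path(grid, length_scale, sigma, rng)
```

Because the added noise is itself a function on [0, 1], it perturbs the released Q-function at every state at once, rather than at a fixed set of query points.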
Say Patient Zero is exhibiting medical symptoms and goes to see the doctor. The doctor gives Patient Zero Drug A (first state). Drug A doesn’t alleviate the symptoms, so now the doctor tries Drug B (second state). Patient Zero then moves into third state, and so on, until the problem (illness) gets solved. Here, the agent is programmed to be able to take a limited number of actions, then the system observes the state of the agent (symptoms relieved? Not relieved?), and based on the observations of that state, the agent must make a decision about what to do. The algorithm observes the outcome and, depending on the results, the agent gets a reward or punishment. The quality of the reward will depend on the longterm goals it’s trying to achieve. Are you getting closer to the goal (symptom alleviation) or moving away from it (even sicker than before the drugs)?
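The doctor's trial-and-error loop is exactly the tabular Q-learning update: nudge the value of the action just taken toward the reward plus the discounted value of the best action in the resulting state. A toy version with made-up states, actions and rewards:

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# States are symptom profiles; actions are drugs.
Q = {"sick": {"drug_a": 0.0, "drug_b": 0.0},
     "cured": {"stop": 0.0}}
# Drug A fails (reward -1, still sick); Drug B works (reward +1, cured).
Q = q_update(Q, "sick", "drug_a", -1.0, "sick")
Q = q_update(Q, "sick", "drug_b", +1.0, "cured")
```

After these two updates the agent already prefers Drug B in the "sick" state, which is the behaviour the narrative describes.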
The privacy measures applied to RL in the past have mostly centered around protecting an individual’s movement (or itinerary) within a particular state. The policy, then, would be defined around why a user took a specific action in this state. This approach works well for the above scenario, where we’re protecting the user’s movement from state to state but not protecting a policy that can be extrapolated to many other users. It falls short, however, when applied to areas like marketing, with far more dynamic data sets and continual learning.
Differential privacy in deep RL is a more general and scalable technique, as it protects a higher-level model that captures behaviors rather than just limiting itself to a particular data point. This approach is important for the future as we move to continuous, online learning systems: by blocking individual preferences from being identified while allowing for the policy to be abstracted, we protect the motivation for the reward instead of the outcome. These kinds of safety guarantees are vital in order to make RL practical.