
Reinforcement learning (RL) can now produce super-human performance on a variety of tasks, including board games such as chess and Go, video games, and multi-player games such as poker. However, current algorithms require enormous quantities of data to learn these tasks. For example, OpenAI Five generates 180 years of gameplay data per day, and AlphaStar used 200 years of StarCraft II gameplay data.

It follows that one of the biggest challenges for RL is *sample efficiency*. In many realistic scenarios, the reward signals are sparse, delayed or noisy which makes learning particularly inefficient; most of the collected experiences do not produce a learning signal. This problem is exacerbated because RL simultaneously learns both the policy (i.e., to make decisions) and the representation on which these decisions are based. Until the representation is reasonable, the system will be unable to develop a sensible policy.

This article focuses on the use of *auxiliary tasks* to improve the speed of learning. These are additional tasks that are learned simultaneously with the main RL goal and that generate a more consistent learning signal. The system uses these signals to learn a shared representation and hence speed up the progress on the main RL task.

An auxiliary task is an additional cost function whose targets an RL agent can predict and observe from the environment in a self-supervised fashion. This means that losses are defined via surrogate annotations synthesized from unlabeled inputs, even in the absence of a strong reward signal.

Auxiliary tasks usually consist of estimating quantities that are relevant to solving the main RL problem. For example, we might estimate depth in a navigation task. However, in other cases, they can be more general. For example, we might try to predict how close the agent is to a terminal state. Accordingly, they may take the form of classification and regression algorithms, or may alternatively maximize reinforcement learning objectives.

We note that auxiliary tasks are different from model-based RL. Here, a model of how the environment transitions between states given the actions is used to support planning (Oh et al. 2015; Leibfried et al. 2016) and hence ultimately to directly improve the main RL objective. In contrast, auxiliary tasks do not directly improve the main RL objective, but are used to facilitate the representation learning process (Bellemare et al. 2019) and improve learning stability (Jaderberg et al. 2017).

Auxiliary tasks were originally developed for neural networks and referred to as *hints*. Suddarth & Kergosien (1990) argued that for the hint to be effective, it needs to have a "special relationship with the original input-output being learned." They demonstrated that adding auxiliary tasks to a minimal neural network effectively removed local minima.

The idea of adding supplementary cost functions was first used in reinforcement learning by Sutton et al. (2011) in the form of *general value functions* (GVFs). As the name suggests, GVFs are similar to the well-known value functions of reinforcement learning. However, instead of caring about environmental rewards, they consider other signals. They differ from auxiliary tasks in that they usually predict long term features. Hence, they employ summation across multiple time-steps similar to the state-value computation from rewards in standard RL.

Auxiliary tasks are naturally and succinctly implemented by splitting the last layer of the network into multiple parts (heads), each working on a different task. The multiple heads propagate errors back to the shared part of the network, which forms the representations that support all the tasks (Sutton & Barto 2018).

To see how this works in practice, we'll consider Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. 2016), which is a popular and representative actor-critic algorithm. The loss function of A3C is composed of two terms: the policy loss (actor), $\mathcal{L}_{\boldsymbol\pi}$, and the value loss (critic), $\mathcal{L}_{v}$. An entropy loss $H[\boldsymbol\pi]$ for the policy $\boldsymbol\pi$ is also commonly added. This helps discourage premature convergence to sub-optimal deterministic policies (Mnih et al. 2016). The complete loss function is given by:

\begin{equation}

\mathcal{L}_{\text{A3C}} = \lambda_v \mathcal{L}_{v} + \lambda_{\boldsymbol\pi} \mathcal{L}_{\boldsymbol\pi} - \lambda_{H} \mathbb{E}_{s \sim \boldsymbol\pi} \left[H[\boldsymbol\pi[s]]\right]\tag{1}

\end{equation}

where $s$ is the state and scalars $\lambda_{v},\lambda_{\boldsymbol\pi}$, and $\lambda_{H}$ weight the component losses.

Auxiliary tasks are introduced to A3C via the Unsupervised Reinforcement and Auxiliary Learning (UNREAL) framework (Jaderberg et al. 2017). UNREAL optimizes the loss function:

\begin{equation}

\mathcal{L}_{\text{UNREAL}} = \mathcal{L}_{\text{A3C}} + \sum_i \lambda_{AT_i} \mathcal{L}_{AT_i}\tag{2}

\end{equation}

that combines the A3C loss, $\mathcal{L}_{\text{A3C}}$, together with auxiliary task losses $\mathcal{L}_{AT_i}$, where $\lambda_{AT_i}$ are weight terms (Figure 1). For a single auxiliary task, the loss computation code might look like:

```python
def loss_func_a3c(self, s, a, v_t):
    self.train()
    # Compute logits from the policy head and values from the value head
    logits, values = self.forward(s)
    # Critic loss: squared TD error
    td = v_t - values
    c_loss = td.pow(2)
    # Actor loss: log-probability of the taken action, weighted by the
    # detached TD error (advantage estimate)
    probs = F.softmax(logits, dim=1)
    m = self.distribution(probs)
    exp_v = m.log_prob(a) * td.detach()
    a_loss = -exp_v
    # Entropy loss, used to discourage premature convergence
    ent_loss = -(F.log_softmax(logits, dim=1) * F.softmax(logits, dim=1)).sum(1)
    # A3C loss (Equation 1)
    a3c_loss = (CRITIC_LOSS_W * c_loss + ACTOR_LOSS_W * a_loss - ENT_LOSS_W * ent_loss).mean()
    # Auxiliary task loss, weighted and added to form the total loss (Equation 2)
    aux_task_loss = self.aux_task_computation()
    total_loss = a3c_loss + AUX_TASK_WEIGHT_LOSS * aux_task_loss
    return total_loss
```

The use of auxiliary tasks is not limited to actor-critic algorithms; they have also been implemented on top of Q-learning algorithms such as DRQN (Hausknecht & Stone 2015). For example, Lample & Chaplot (2017) extended the DRQN architecture with another head used to predict game features. In this case, the total loss is the sum of the standard DRQN loss and the cross-entropy loss of the auxiliary task.

We now consider five different auxiliary tasks that have obtained good results in various RL domains. We provide insights as to the applicability of these tasks.

Sutton et al. (2011) speculated:

"

Suppose we are playing a game for which base terminal rewards are +1 for winning and -1 for losing. In addition to this, we might pose an independent question about how many more moves the game will last. This could be posed as a general value function."

The first part of the quote refers to the standard RL problem where we learn to maximize rewards (winning the game). The second part describes an auxiliary task in which we predict how many moves remain before termination.

Kartal et al. (2019) investigated this idea of *terminal prediction*. The agent predicts how close it is to a terminal state while learning the standard policy, with the goal of facilitating representation learning. Kartal et al. (2019) added this auxiliary task to A3C and named this A3C-TP. The architecture was identical to A3C, except for the addition of the terminal state prediction head.

The loss $\mathcal{L}_{TP}$ for the terminal state prediction is the mean squared error between the estimated closeness $\hat{y}$ to a terminal state of any given state and target values $y$ approximately computed from completed episodes:

\begin{equation}

\mathcal{L}_{TP}= \frac{1}{N} \sum_{i=0}^{N}(y_{i} - \hat{y}_{i})^2\tag{3}

\end{equation}

where $N$ represents the episode length during training. The target for the $i^{th}$ state is approximated with $y_{i} = i/N $ implying $y_{N}=1$ for the actual terminal state and $y_{0}=0$ for the initial state for each episode.

Kartal et al. (2019) initially used the actual current episode length for $N$ to compute the targets $y_{i}$. However, this delayed access to the labels until the episode was over and did not provide significant benefit in practice. As an alternative, they approximate the current episode length by the *running average* of episode lengths computed from the most recent $100$ episodes, which provides a dense signal.^{1} This improves learning performance, and is memory-efficient for distributed on-policy deep RL, as CPU workers do not have to retain the computation graph until episode termination to compute the terminal prediction loss.
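As a rough sketch of this bookkeeping (class and function names here are illustrative, not from Kartal et al.), the running-average targets and the loss of equation 3 might look like:

```python
from collections import deque

class TerminalPredictionTargets:
    """Approximate terminal-prediction targets y_i = i / N, where N is a
    running average of recent episode lengths (a dense, self-supervised label)."""

    def __init__(self, window=100, initial_length=100.0):
        self.recent_lengths = deque(maxlen=window)
        self.initial_length = initial_length

    def avg_length(self):
        if not self.recent_lengths:
            return self.initial_length
        return sum(self.recent_lengths) / len(self.recent_lengths)

    def target(self, step):
        # Label available at every time step, clipped to [0, 1]
        return min(step / self.avg_length(), 1.0)

    def end_episode(self, length):
        self.recent_lengths.append(length)

def terminal_prediction_loss(y_hat, y):
    # Mean squared error between predicted and approximate closeness (Eq. 3)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y)
```

Because the running average replaces the true episode length, labels are available immediately rather than at episode termination.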

Since terminal prediction targets are computed in a self-supervised fashion, they have the advantage that they are independent of reward sparsity or any other domain dynamics that might render representation learning challenging (such as drastic changes in domain visuals, which happen in some Atari games). However, terminal prediction is applicable only for episodic environments.

*Agent modeling* (Hernandez-Leal et al. 2019) is an auxiliary task that is designed to work in a multi-agent setting. It takes ideas from game theory and in particular from the concept of *best response*: the strategy that produces the most favorable outcome for a player, taking other players' strategies as given.

The goal of agent modeling is to learn other agents' policies while itself learning a policy. For example, consider a game in which you face an opponent. Here, learning the opponent's behavior is useful to develop a strategy against it. However, agent modeling is not limited to only opponents; it can also model teammates, and can be applied to an arbitrary number of them.

There are two main approaches to implementing the agent modeling task. The first uses the conventional approach of adding new heads for the auxiliary task to a shared network base, as discussed in previous sections. The second uses a more sophisticated architecture in which latent features from the auxiliary network are used as inputs to the main value/policy prediction stream. We consider each in turn.

In this scheme, agents share the same network base, but separate heads output the different agents' actions (Foerster et al. 2017). The goal is to predict the opponents' policies as well as the standard actor and critic outputs, with the key characteristic that the earlier layers share parameters (Figure 2a).

This architecture builds on the concept of *parameter sharing*, where the idea is to perform centralized learning.

The agent modeling by parameter sharing (AMS) architecture uses the loss function:

\begin{equation}

\mathcal{L}_{\text{AMS}}= \mathcal{L}_{\text{A3C}} + \frac{1}{\mathcal{N}} \sum_i^{\mathcal{N}} \lambda_{AM_i} \mathcal{L}_{AM_i}\tag{4}

\end{equation}

where $\lambda_{AM_i}$ is a weight term and $\mathcal{L}_{AM_i}$ is the auxiliary loss for opponent $i$:

\begin{equation}

\mathcal{L}_{AM_i}= -\frac{1}{M} \sum_j^M \sum_{k}^{K} a^j_{ik} \log [\hat{a}^j_{ik}]\tag{5}

\end{equation}

which is the cross entropy loss between the observed one-hot encoded opponent action, $\mathbf{a}^j_{i}$, and the prediction over opponent actions, $\hat{\mathbf{a}}^j_{i}$. Here $i$ indexes the opponents, $j$ indexes time for a trajectory of length $M$, and $k$ indexes the $K$ possible actions.
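A minimal sketch of the per-opponent loss in equation 5 (function name and the `eps` smoothing term are my own additions for numerical safety):

```python
import math

def agent_modeling_loss(actions_onehot, predictions, eps=1e-8):
    """Cross-entropy between observed one-hot opponent actions and predicted
    action distributions for one opponent (Eq. 5).

    actions_onehot: list of M one-hot vectors of length K
    predictions:    list of M probability vectors of length K
    """
    M = len(actions_onehot)
    total = 0.0
    for a, a_hat in zip(actions_onehot, predictions):
        total += sum(a_k * math.log(ah_k + eps) for a_k, ah_k in zip(a, a_hat))
    return -total / M
```

The AMS loss of equation 4 is then the weighted average of this quantity over opponents, added to the A3C loss.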

*Policy features* (Hong et al. 2018) are intermediate *features* from the latent space that are used to predict the opponent policy. The agent modeling by policy features (AMF) framework exploits these features to improve the main reward prediction.

In this architecture, convolutional layers are shared, but the fully connected layers are divided in two sections (Figure 2b). The first is specialized for the actor and critic of the learning agent and the second for the opponent policies. The intermediate opponent policy features, $\mathbf{h}_{i}$ from the second path are used to condition (via an element-wise multiplication) the computation of the actor and critic. The loss function is similar to that for AMS.
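The two-stream forward pass might be sketched as follows (all weight names and the single-layer streams are simplifications of my own, not the actual AMF architecture):

```python
import numpy as np

def amf_forward(conv_features, W_main, W_opp, W_pi, W_v, W_opp_pi):
    """Sketch of an AMF-style forward pass: opponent-policy features h
    condition the actor-critic stream via element-wise multiplication."""
    h_main = np.tanh(conv_features @ W_main)   # actor-critic stream
    h_opp = np.tanh(conv_features @ W_opp)     # opponent-modeling stream
    conditioned = h_main * h_opp               # element-wise conditioning
    policy_logits = conditioned @ W_pi         # actor head
    value = conditioned @ W_v                  # critic head
    opp_policy_logits = h_opp @ W_opp_pi       # auxiliary prediction head
    return policy_logits, value, opp_policy_logits
```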

Note that both AMS and AMF need to observe the opponent's actions to generate ground truth and for the auxiliary loss function. This is a limitation, and further research is required to handle partially observable environments.

In the previous two sections, we considered auxiliary tasks that related to the structure of the learning (terminal prediction) and to other agents' actions (agent modeling). In this section, we consider predicting the reward received at the next time-step, an idea that seems quite natural in the context of RL. More precisely, given the state sequence $\{\mathbf{s}_{t-3}, \mathbf{s}_{t-2}, \mathbf{s}_{t-1}\}$, we aim to predict the reward $r_t$. Note that this is similar to value learning with $\gamma=0$, so that the agent only cares about the immediate reward.

Jaderberg et al. (2017) formulated this task as multi-class classification with three classes: positive reward, negative reward, or zero. To mitigate data imbalance problems, the same number of samples with zero and non-zero rewards were provided during training.
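The balanced sampling can be sketched as follows (function names are illustrative; this assumes a replay buffer that contains both zero- and non-zero-reward transitions):

```python
import random

def reward_class(r):
    # Three-way classification target: zero, positive, or negative reward
    return 0 if r == 0 else (1 if r > 0 else 2)

def sample_reward_prediction_batch(replay, batch_size=32, p_nonzero=0.5):
    """Sample transitions so that zero- and non-zero-reward examples
    appear in roughly equal proportion, mitigating class imbalance."""
    zero = [t for t in replay if t["reward"] == 0]
    nonzero = [t for t in replay if t["reward"] != 0]
    batch = []
    for _ in range(batch_size):
        pool = nonzero if (nonzero and random.random() < p_nonzero) else zero
        batch.append(random.choice(pool))
    return batch
```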

In general, data imbalance is a disadvantage of reward prediction. This is particularly troublesome for hard-exploration problems with sparse rewards. For example, in the Pommerman game, an episode can last up to 800 timesteps, and the only non-zero reward is obtained at episode termination. Here, class-balancing would require many episodes, and this is in contradiction with the stated goal of using auxiliary tasks (i.e., to speed up learning).

Mirowski et al. (2016) studied auxiliary tasks in a navigation problem in which the agent needs to reach a goal in first-person 3D mazes from a random starting location. If the goal is reached, the agent is re-spawned to a new start location and must return to the goal. The 8 discrete actions permitted rotation and acceleration.

The agent sees RGB images as input. However, the authors speculated that depth information might supply valuable information about how to navigate the 3D environment. Thus, one of the auxiliary tasks is to predict depth, which can be cast as a regression or as a classification problem.

Mirowski et al. (2016) performed different experiments and we highlight two of these. In the first, they considered using depth directly as an input to the network, rather than predicting it as an auxiliary output. In the second, they considered where to add the auxiliary task within the network. For example, the auxiliary task module can be set just after the convolutional layers, or after the convolutional and recurrent layers (Figure 3).

The results showed that:

- Using depth as input to the CNN (not shown in above Figure) resulted in worse performance than when predicting the depth.
- Treating depth estimation as classification (discretizing over 8 regions) outperformed casting it as regression.
- Placing the auxiliary task after the convolutional and recurrent networks obtained better results than moving it before the recurrent layers.

The auxiliary tasks discussed so far have involved estimating various quantities. A *control task* actually tries to manipulate the environment in some way. Jaderberg et al. (2017) proposed *pixel control* auxiliary tasks. An auxiliary control task $c$ is defined by a reward function $r^{c}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ where $\mathcal{S}$ is the space of possible states and $\mathcal{A}$ is the space of available actions. Given a base policy $\pi$, and auxiliary control task with policy $\pi^c$, the learning objective becomes:

\begin{equation}

\mathcal{L}_{pc}= \mathbb{E}_{\pi}[R] + \lambda_{c} \mathbb{E}_{\pi^c}[R^{c}],\tag{6}

\end{equation}

where $R^c$ is the return obtained from the auxiliary control task and $\lambda_{c}$ is a weight. As for previous auxiliary tasks, some parameters are shared with the main task.

The system used off-policy learning; the data was replayed from an experience buffer and the system was optimized using an n-step Q-learning loss:

\begin{equation}

\mathcal{L}^c=\left(R_{t:t+n} + \gamma^n \max_{a'} Q^c(s',a',\boldsymbol\theta^{-})- Q^c(s,a,\boldsymbol\theta)\right)^2\tag{7}

\end{equation}

where $\boldsymbol\theta$ and $\boldsymbol\theta^{-}$ are the current and previous parameters, respectively.

Jaderberg et al. (2017) noted that changes in the perceptual stream are important since they generally correspond to important events in an environment. Thus, the agent learns a policy for maximally changing the pixels in each cell of an $n \times n$ non-overlapping grid superimposed on the input image. The immediate reward in each cell was defined as the average absolute difference from the previous frame. Results show that these types of auxiliary task can significantly improve learning.
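The per-cell pixel-control reward can be computed as below (a minimal sketch; the function name and grayscale averaging over colour channels are my own simplifications):

```python
import numpy as np

def pixel_control_rewards(frame, prev_frame, n=4):
    """Average absolute pixel change in each cell of an n x n
    non-overlapping grid: the immediate reward for pixel control."""
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    if diff.ndim == 3:                      # average over colour channels
        diff = diff.mean(axis=2)
    h, w = diff.shape
    ch, cw = h // n, w // n
    # Crop to a multiple of the cell size, then average within each cell
    rewards = diff[: n * ch, : n * cw].reshape(n, ch, n, cw).mean(axis=(1, 3))
    return rewards                          # shape (n, n)
```

Each cell's reward then drives its own Q-learning head, encouraging the agent to learn policies that maximally change that region of the input.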

In this section we consider some unresolved challenges associated with using auxiliary tasks in deep RL.

We have seen that adding an auxiliary task is relatively simple. However, the first and most important challenge is to define a *good* auxiliary task. Exactly how to do this remains an open question.

As a step towards this, Du et al. (2018) devised a method to detect *when* a given auxiliary task might be useful. Their approach uses the intuition that an algorithm should take advantage of the auxiliary task when it is helpful for the main task and block it otherwise.

Their proposal has two parts. First, they determine whether the auxiliary task and the main task are related. Second, they modulate how useful the auxiliary task is with respect to the main task using a weight. In particular, they propose to detect when an auxiliary loss $\mathcal{L}_{aux}$ is helpful to the main loss $\mathcal{L}_{main}$ by using the cosine similarity between gradients of the two losses:

**Algorithm 1:** Use of auxiliary task by gradient similarity

**if **$\cos[\nabla_{\theta}\mathcal{L}_{main},\nabla_{\theta} \mathcal{L}_{aux}]\ge 0$ **then**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main} +\nabla_{\theta} \mathcal{L}_{aux}$

**else**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main}$

**end**

The goal of this algorithm is to avoid adding an auxiliary loss that impedes learning progress for the main task. This is sensible, but would ideally be augmented by a better theoretical understanding of the benefits (Bellemare et al. 2019).
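Algorithm 1 can be sketched in a few lines (the function name is my own; note that the cosine similarity shares the sign of the dot product, so checking the dot product suffices):

```python
import numpy as np

def combined_update_direction(grad_main, grad_aux):
    """Algorithm 1: include the auxiliary gradient only when its cosine
    similarity with the main-task gradient is non-negative."""
    dot = float(np.dot(grad_main, grad_aux))
    if dot >= 0:
        return grad_main + grad_aux
    return grad_main
```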

The auxiliary task is always accompanied by a weight, but it is not obvious how to choose its value. Ideally, this weight should give the auxiliary task enough importance to drive representation learning, while not allowing it to overwhelm the main task.

Weights need not be fixed, and can vary over time. Hernandez-Leal et al. (2019) compared five different weight parameterizations for the agent modeling task. In some the weight was fixed and in others it decayed. Figure 4 shows that, in the same environment, this choice directly affects the performance.

It also appears that the optimal auxiliary weight is dependent on the domain. For the tasks discussed in this post:

- Terminal prediction used a weight of 0.5 (Kartal et al. 2019).
- Reward prediction used a weight of 1.0 (Jaderberg et al. 2017).
- Auxiliary control varied between 0.0001 and 0.01 (Jaderberg et al. 2017).
- Depth prediction chose from the set {1, 3.33, 10} (Mirowski et al. 2016).

In the last part of this article we discuss the two major benefits of using auxiliary tasks: improving performance and increasing robustness.

The main benefit of auxiliary tasks is to drive representation learning and hence improve the agent's performance; the system learns faster and achieves better performance in terms of rewards with an appropriate auxiliary task.

For example, when auxiliary tasks were added to A3C agents, scores improved in domains such as Pommerman (Figure 5), Q*bert (Figure 6a) and the Bipedal walker (Figure 6b). Similar benefits of auxiliary tasks have been shown in Q-learning style algorithms (Lample & Chaplot 2017; Fedus et al. 2019).

The second benefit is related to robustness, and we believe this has been somewhat under-appreciated. One problem with deep RL is the high variance over different runs. This can even happen in the same experiment while just varying the random seed (Henderson et al. 2018). This is a major complication because algorithms sometimes diverge (i.e., they fail to learn) and thus we prefer robust algorithms that can learn under a variety of different values for their parameters.

Auxiliary tasks have been shown to improve robustness of the learning process. In the UNREAL work (Jaderberg et al. 2017), the authors varied two hyperparameters, entropy cost and learning rate, and kept track of the final performance while adding different auxiliary tasks. The results showed that adding auxiliary tasks increased performance over a variety of hyperparameter values (Figure 7).

In this tutorial we reviewed auxiliary tasks in the context of deep reinforcement learning and we presented examples from a variety of domains. Auxiliary tasks have been used to accelerate and provide robustness to the learning process. However, there are still open questions and challenges, such as defining what constitutes a good auxiliary task and forming a better theoretical understanding of how they contribute.

^{1}*Note that the RL agent does not have access to the time stamp or a memory, so it must predict its time relative to the terminal state afresh at each time step.*


In part I of this tutorial we argued that few-shot learning can be made tractable by incorporating prior knowledge, and that this prior knowledge can be divided into three groups:

**Prior knowledge about class similarity:** We learn embeddings from training tasks that allow us to easily separate unseen classes with few examples.

**Prior knowledge about learning:** We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

**Prior knowledge of data:** We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.

We also discussed the family of methods that exploit prior knowledge of class similarity. In part II we will discuss the remaining two families that exploit prior knowledge about learning, and prior knowledge about the data respectively.

Perhaps the most obvious approach to few-shot learning would be transfer learning; we first find a similar task for which there is plentiful data and train a network for this. Then we adapt this network for the few-shot task. We might either (i) fine-tune this network using the few-shot data, or (ii) use the hidden layers as input for a new classifier trained with the few-shot data. Unfortunately, when training data is really sparse, the resulting classifier typically fails to generalize well.

In this section we'll discuss three related methods that are superior for the few-shot scenario. In the first approach ("learning to initialize"), we explicitly learn networks with parameters that can be fine-tuned with a few examples and still generalize well. In the second approach ("learning to optimize"), the optimization scheme becomes the focus of learning; we constrain the optimization algorithm to produce only models that generalize well from small datasets. Finally, the third approach ("sequence methods") learns models that treat the data/label pairs as a temporal sequence, and learns an algorithm that takes this sequence and predicts missing labels for new data.

Algorithms in this class aim to choose a set of parameters that can be fine-tuned very easily to another task via one or more gradient learning steps. This criterion encourages the network to learn a stable feature set that is applicable to many different domains, with a set of parameters on top of these that can be easily modified to exploit this representation.

*Model agnostic meta-learning* or *MAML* (Finn et al. 2017) is a meta-learning framework that can be applied to any model that is trained with a gradient descent procedure. The aim is to learn a general model that can easily be fine-tuned for many different tasks, even when the training data is scarce.

The parameters $\boldsymbol\phi$ of this general model can be adapted to the $j^{th}$ task $\mathcal{T}_{j}$ by taking a single gradient step

\begin{equation}\label{eq:MAML_obj1}

\boldsymbol\phi_{j} = \boldsymbol\phi - \alpha \frac{\partial}{\partial \boldsymbol\phi} \mathcal{L}\left[\mathbf{f}[\boldsymbol\phi],\mathcal{T}_{j}\right], \tag{1}

\end{equation}

to create a task-specific set of parameters $\boldsymbol\phi_{j}$. Here, the network is denoted by $\mathbf{f}[\bullet]$ with parameters $\boldsymbol\phi$. The loss $\mathcal{L}[\bullet, \bullet]$ takes the model $\mathbf{f}[\bullet]$ and the task data $\mathcal{T}_{j}$ as parameters. The parameter $\alpha$ represents the size of the gradient step.^{1}

Our goal is that on average for a number of different tasks, the loss will be small with these parameters. The *meta-cost function* $\mathcal{M}[\bullet]$ encompasses this idea

\begin{equation}

\mathcal{M}[\boldsymbol\phi] = \sum_{j=1}^{J} \mathcal{L}\left[\mathbf{f}[\boldsymbol\phi_{j}],\mathcal{T}_{j}\right], \tag{2}

\end{equation}

where each set of parameters $\boldsymbol\phi_{j}$ is itself a function of $\boldsymbol\phi$ as given by equation 1. We wish to minimize this cost, which we can do by taking gradient descent steps

\begin{equation}\label{eq:MAML_obj2}

\boldsymbol\phi \leftarrow \boldsymbol\phi - \beta \frac{\partial}{\partial \boldsymbol\phi} \mathcal{M}[\boldsymbol\phi], \tag{3}

\end{equation}

where $\beta$ is the step size. This would typically be done in a stochastic fashion, updating the meta-cost function with respect to a few tasks at a time (figure 1a-b). In this way, MAML gradually learns parameters $\boldsymbol\phi$ which can be adapted to many tasks by fine-tuning.

MAML has the disadvantage that both the meta-learning objective (equation 3) and the task learning objective within it (equation 1) contain gradients, and so we have to take gradients of gradients (via a Hessian matrix) to perform each update.

To improve the efficiency of learning, Finn et al. (2017) introduced *first order model agnostic meta learning* or *FOMAML*, which simply omits the second derivatives. Surprisingly, this did not impact performance. They speculate that this might be because ReLU networks are almost linear, so the second derivatives are close to zero.
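To make the structure concrete, here is a toy first-order sketch (my own construction, not the paper's code) on quadratic tasks $\mathcal{L}_j[\boldsymbol\phi] = \|\boldsymbol\phi - \mathbf{c}_j\|^2$, whose gradient is simply $2(\boldsymbol\phi - \mathbf{c}_j)$:

```python
import numpy as np

def fomaml(tasks, phi, alpha=0.1, beta=0.05, meta_steps=200):
    """First-order MAML on toy quadratic tasks L_j(phi) = ||phi - c_j||^2.
    The second-derivative (Hessian) term of full MAML is dropped."""
    for _ in range(meta_steps):
        meta_grad = np.zeros_like(phi)
        for c in tasks:
            grad = 2.0 * (phi - c)           # inner gradient at phi
            phi_j = phi - alpha * grad       # task-specific parameters (Eq. 1)
            meta_grad += 2.0 * (phi_j - c)   # outer gradient evaluated at phi_j
        phi = phi - beta * meta_grad         # meta-update (Eq. 3), first order
    return phi
```

For these symmetric quadratic tasks the meta-parameters converge to the mean of the task optima, the point from which a single gradient step adapts best on average.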

Nichol et al. (2018) introduced a simpler variant of first order MAML which they called *Reptile*. As for MAML, the algorithm repeatedly samples tasks $\mathcal{T}_{j}$, and optimizes the global parameters $\boldsymbol\phi$ to create task-specific parameters $\boldsymbol\phi_{j}$. Then it updates the global parameters using the rule:

\begin{equation}

\boldsymbol\phi \longleftarrow \boldsymbol\phi + \alpha (\boldsymbol\phi_{j}-\boldsymbol\phi). \tag{4}

\end{equation}

This is illustrated in figure 1c. One interpretation of this is that we are performing stochastic gradient descent on a task level rather than a data level.
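A toy sketch of the Reptile loop on the same quadratic tasks as above (my own construction; the hyper-parameter values are arbitrary):

```python
import numpy as np

def reptile(tasks, phi, alpha=0.5, inner_lr=0.1, inner_steps=10, meta_steps=300):
    """Reptile on toy quadratic tasks L_j(phi) = ||phi - c_j||^2: run SGD
    within a sampled task, then move phi toward the adapted parameters (Eq. 4)."""
    rng = np.random.default_rng(0)
    for _ in range(meta_steps):
        c = tasks[rng.integers(len(tasks))]           # sample a task
        phi_j = phi.copy()
        for _ in range(inner_steps):
            phi_j = phi_j - inner_lr * 2.0 * (phi_j - c)  # inner SGD
        phi = phi + alpha * (phi_j - phi)             # Reptile meta-update
    return phi
```

Note that no gradients of gradients appear anywhere: the meta-update is just a step toward the task-adapted parameters.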

Jayathilaka (2019) improved this method by adding two thresholds $\beta$ and $\gamma$ as hyper-parameters. After $\beta$ steps, the gradient is pruned: if the change in parameters satisfies $\|\boldsymbol\phi_{j}-\boldsymbol\phi\| < \gamma$, then no change is made. The logic of this approach is that toward the end of meta-training the procedure over-learns the training tasks, and hence this part of the regime is damped.

In the previous section we discussed algorithms that learn good starting positions for optimization. In this section we consider methods that learn the optimization algorithm itself. The idea is that these new optimization schemes will be constrained to produce models that generalize well when trained with few examples.

We will discuss two approaches. In the first we consider the learning rule as the cell-state update in a long short-term memory (LSTM) network; the LSTM is a model used to analyse sequences of data, and here these are sequences of optimization steps. In the second approach we frame the optimization updates in terms of reinforcement learning.

Ravi & Larochelle (2016) propose training an LSTM-based meta-learner, where the cell state represents the model parameters. This was inspired by their realization that the standard gradient descent update rule has a very similar form to the cell-state update in an LSTM. We'll review each in turn to make this connection explicit.

The gradient descent update rule for a model with parameters $\boldsymbol\phi$ is given by:

\begin{equation}\label{eq:ravi_gradient}

\boldsymbol\phi_{t} = \boldsymbol\phi_{t-1}-\alpha\cdot\mathbf{g}_{t-1}, \tag{5}

\end{equation}

where $t$ represents the time step, $\alpha$ is the learning rate and $\mathbf{g}_{t}$ is the gradient vector. The cell state update rule in an LSTM is given by:

\begin{equation}\label{eq:ravi_optimize}

\mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \mathbf{i}_{t}\odot \tilde{\mathbf{c}}_{t-1}, \tag{6}

\end{equation}

where the cell state $\mathbf{c}_{t}$ at time $t$ is updated based on i) previous cell state $\mathbf{c}_{t-1}$ moderated by the forget gate $\mathbf{f}_{t}$ and ii) the candidate value for the cell state $\tilde{\mathbf{c}}_{t-1}$ moderated by the input gate $\mathbf{i}_{t}$.

The mapping from equation 5 to 6 is now clear. The cell state of the LSTM takes the place of the parameters $\boldsymbol\phi$ and the candidate value for the cell state $\tilde{\mathbf{c}}_{t-1}$ takes the place of the gradient $\mathbf{g}_{t-1}$. For the gradient descent case, the forget gate $\mathbf{f}_{t} = \mathbf{1}$, and the input gate $\mathbf{i}_{t} = -\alpha\mathbf{1}$.
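This equivalence is easy to verify numerically (a small sketch of my own; the gates here are constant vectors, whereas the meta-learner makes them learned functions):

```python
import numpy as np

def lstm_cell_update(c_prev, c_tilde, f_gate, i_gate):
    # Generic LSTM cell-state update (Eq. 6)
    return f_gate * c_prev + i_gate * c_tilde

# Gradient descent (Eq. 5) is the special case f_t = 1, i_t = -alpha,
# with the candidate cell state playing the role of the gradient.
alpha = 0.1
phi = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
sgd_step = phi - alpha * grad
lstm_step = lstm_cell_update(phi, grad,
                             f_gate=np.ones(2),
                             i_gate=-alpha * np.ones(2))
```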

Hence, Ravi & Larochelle (2016) propose representing the parameters of the models by the cell state of an LSTM and learning more general functions for the forget and input gates (figure 3). Each gate is a two-layer neural network that takes a vector containing the previous gradient, the previous loss, the previous parameters, and the previous value of the gate.

At each step of the training, the LSTM sees a sequence that corresponds to iterative optimization of the parameters $\boldsymbol\phi$ for the $j^{th}$ task. The LSTM learns the update rule from these sequences by updating the parameters in the forget and input gates; the parameters of these networks are manipulated to select an update rule that tends to produce good generalization.

In practice, each parameter is updated separately, so that there is a different input and forget gate for each. As in Adam, each parameter effectively has its own learning rate, but here the learning rate is a complex function of the history of the optimization.

For the test task, the LSTM is run to provide gradient updates that incorporate prior knowledge from all of the training tasks and converge quickly to a meaningful set of parameters without over-learning. Andrychowicz et al. (2016) present a similar scheme, although it is not explicitly aimed at the few-shot learning situation.

Li & Malik (2016) observed that an optimization algorithm can be viewed as a Markov decision process. The state consists of the set of relevant quantities for optimization (current parameters, objective values, gradients, etc.). The action is the parameter update $\delta\boldsymbol\phi$ and so the policy is a probabilistic parameter update formula (figure 3).

Inspired by this observation, Li & Malik (2016) model the mean of the policy as a recurrent neural network that takes features relevant to the optimization (iterates, gradients and objective values from recent iterations) together with the previous memory state, and outputs the action (the parameter update).

As with the LSTM system above, this system learns how best to update the model parameters in unseen test tasks, based on experience gained from a diverse collection of training tasks.

Bello *et al.* (2017) developed a reinforcement learning system in which the action consists of an optimization update rule expressed in a domain-specific language. Each rule consists of two operands, two unary functions applied to the first and second operand respectively, and a binary function that combines their outputs:

\begin{equation}

\boldsymbol\phi \leftarrow \boldsymbol\phi + \alpha \cdot \mbox{b}\left[\mbox{u}_{1}[o_{1}], \mbox{u}_{2}[o_{2}]\right], \tag{7}

\end{equation}

where $\mbox{u}_{1}[\bullet]$ and $\mbox{u}_{2}[\bullet]$ are the unary functions, $o_{1}$ and $o_{2}$ are the operands and $\mbox{b}[\bullet]$ is the binary function. The term $\alpha$ represents the learning rate.

Examples of operands include the gradient, sign of gradient, random noise, and moving average of gradient. Unary functions include clipping, square rooting, exponentiating and identity. Binary functions include addition, subtraction, multiplication, and division. Many existing optimization schemes can be expressed in this language, including stochastic gradient descent, RMSProp and Adam.

The controller consists of a recurrent neural network which samples strings of length $5$, each of which represents a different rule. A child classification network is trained with this rule and the accuracy is fed back to change the parameters of the RNN so that it is more likely to output better rules.

Perhaps surprisingly, the system finds interpretable optimization rules; for example, the *PowerSign* rule compares the sign of the gradient with that of its running average and adjusts the step size according to whether those values agree.
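A toy version of this domain-specific language can be sketched in a few lines; the tiny operator vocabulary and function names below are illustrative, not those used by Bello et al. The sketch expresses stochastic gradient descent in the form of equation 7, and shows the PowerSign-style step-size scaling:

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

# A rule is (operand1, unary1, operand2, unary2, binary), applied as
#   phi <- phi + alpha * b[u1[o1], u2[o2]]        (equation 7)
UNARY = {"identity": lambda x: x, "negate": lambda x: -x, "sign": sign}
BINARY = {"multiply": lambda a, b: a * b, "add": lambda a, b: a + b}

def apply_rule(phi, alpha, o1, u1, o2, u2, b):
    return phi + alpha * BINARY[b](UNARY[u1](o1), UNARY[u2](o2))

grad, avg_grad = 0.4, 0.25   # toy gradient and its running average
phi, alpha = 1.0, 0.1

# Stochastic gradient descent expressed in the language:
# o1 = gradient (negated), o2 = 1 (identity), combined by multiplication.
sgd = apply_rule(phi, alpha, grad, "negate", 1.0, "identity", "multiply")
assert abs(sgd - (phi - alpha * grad)) < 1e-12

# PowerSign-style behaviour: the step is scaled up when the gradient and
# its running average agree in sign, and scaled down when they disagree.
scale = math.exp(sign(grad) * sign(avg_grad))
powersign = phi - alpha * scale * grad
```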

In the previous two sections, we considered methods that find good initial parameters for fine-tuning networks and methods that learn optimization rules that tend to produce good generalization. Both of these approaches are directly connected to the optimization process.

In this section, we introduce *sequence methods* for meta-learning. Sequence methods ingest the entire support set as a sequence of tuples $\{\mathbf{x}, y\}$ each containing a data example $\mathbf{x}$ and a label $y$. The last tuple consists of just a data example from the query set and the system is trained to predict the missing label. The parameters of the sequence model are updated so this prediction is consistently accurate over different tasks.

At first sight, this might seem unrelated to the previous methods, but consider the situation when we have already passed the support set into the system. From this perspective the situation is very similar to a standard network. The query example will be passed in, and the query label returned. Working backwards, we can think of passing each support set sequence as optimizing the network for this task, and the training of the sequence model itself (across many different tasks) as meta-learning of how to optimize the model for different tasks.

We'll consider two sequence methods. The first is based on a recurrent neural network (RNN) and the second uses an attention mechanism.

Santoro *et al.* (2016) introduced *memory-augmented neural networks*. Their system is trained one task at a time, with each task presented as a sequence of data $\mathbf{x}$ and label $y$ pairs (figure 4). However, the label $y_{t}$ for the data example $\mathbf{x}_{t}$ at time $t$ is not provided until time $t+1$. Hence, the system must learn to classify the current example based on past information. The data is shuffled every time a task is presented, so that the network learns the relation between the data and the labels rather than memorizing the sequence.

The network consists of a controller that stores memories in an external memory bank and retrieves them for classification. In practice, the controller is an LSTM or a feed-forward network. Memories are retrieved based on a key computed from the data, which is compared to every stored memory by cosine similarity; the retrieved memory is a weighted sum of all of the stored memories, weighted by the softmax-transformed cosine similarities.
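The read operation can be sketched as follows (plain Python with hand-picked toy memories; in the real system the key computation and memory contents are learned):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(key, memory):
    # Compare the key to every stored memory by cosine similarity,
    # then return the softmax-weighted sum of the memory rows.
    weights = softmax([cosine(key, row) for row in memory])
    dim = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]

memory = [[1.0, 0.0], [0.0, 1.0]]       # two stored memories
out = read_memory([1.0, 0.1], memory)   # key close to the first memory
assert out[0] > out[1]                  # retrieval favours the similar memory
```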

As the sequence for a new task is passed in, the system stores memories from the support set sequence and uses these to predict the subsequent labels. Over time, the memory content becomes more suited to the current task and classification improves. During meta-training, we learn the parameters of the controller so that this process works well on average over many tasks; it learns the algorithm for storing and retrieving memories.

Mishra *et al.* (2017) also described a sequence method, in which the system is trained to take a set of (data, label) tuples $\{\mathbf{x}, y\}$ and then predict the missing label for the last example. Their system is not recurrent and takes the entire sequence of support data at once. The architecture is based on alternating causal temporal convolutions and soft-attention layers (figure 5). This allows the decision for the query example to depend on the previously observed pairs. They term this architecture the *simple neural attentive meta-learner* or SNAIL.

The final family of few-shot learning methods exploits prior knowledge about the process that generates the classes and their examples. There are two main approaches here. First, we can try to characterize a *generative model* of all possible classes with a small number of parameters. These parameters can then be learned from just a few examples and be used as the basis for discriminating new classes. Second, we can exploit knowledge of the data creation process to synthesize new examples and hence *augment* the dataset. We can then use this expanded dataset to train a conventional model. We consider each approach in turn.

We will describe two generative models. First, we consider a model that is specialized to recognizing new classes of handwritten characters. The structure of the model contains significant information about how images of characters are created, and this is exploited to understand new types of character. Second, we consider a more generic generative model that learns how to generate families of data classes.

Lake *et al.* (2015) construct a hierarchical non-parametric generative model for handwritten characters using a probabilistic programming approach. At the first level, a set of primitives is combined into parts, each of which is a single pen-stroke. Multiple parts are then connected to create characters. This process is illustrated in figure 6a. At the second level, a realisation of the character is created by simulating the pen-strokes for that character under noisy conditions.

During the meta-learning process, the set of primitives are learned such that they can be combined to describe sets of unseen characters. The system has access to the actual pen-strokes for training which makes this learning easier. The support set of the test task is used to describe the new classes in terms of this fixed set of primitives. For a query in the test task, the posterior probability that it was generated from each of the character classes is computed.

The structure of the model (primitive pen strokes, and likely combinations) is hence prior information learned from previous sets of characters that can be exploited to discriminate unseen classes.

Edwards & Storkey (2016) presented a more generic model, which they termed the *neural statistician*, as it learns the statistics of both classes and examples in a dataset. This model can generate new classes and new examples of any kind of data, and contains no prior information about the generation process (e.g., about pen-strokes or image transformations).

The model has a similar generative structure to the pen stroke model, but is based on the variational auto-encoder (figure 6b). At the top level is a context vector $\mathbf{c}$. A probability distribution over the context vector describes the statistics of a single class. In the context of few-shot learning, we might have $N$ context vectors indexed as $\mathbf{c}_{n}$ representing the $N$ classes. Each generates $K$ hidden variables $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ and each of these generates a data example $\mathbf{x}_{nk}$. In this way, a single task for the N-way-K-shot problem is generated.^{2}
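The dependency structure (context vector, then hidden variable, then data example) can be illustrated with a toy Gaussian sampler; the distributions and noise levels below are invented for illustration and are not those of the actual model:

```python
import random

def sample_task(n_classes, k_shot, dim=2, seed=0):
    """Toy sampler mirroring the context -> z -> x hierarchy."""
    rng = random.Random(seed)
    task = []
    for n in range(n_classes):
        # Context vector c_n describes the statistics of one class.
        c = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(k_shot):
            # A hidden variable z_nk is generated from the context...
            z = [ci + rng.gauss(0.0, 0.1) for ci in c]
            # ...and each data example x_nk is generated from z_nk.
            x = [zi + rng.gauss(0.0, 0.1) for zi in z]
            task.append((x, n))
    return task

task = sample_task(n_classes=3, k_shot=5)
assert len(task) == 15   # one N-way-K-shot task: N*K labelled examples
```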

The support sets from the training tasks are used to learn the parameters of this model using a modification of the variational auto-encoder that allows inference of both context variables and hidden variables. For the test task, the support set is used to infer the context vectors and hidden variables that explain this dataset. Classification of the query set can be done by evaluating the probability that each data example was generated by the context vector for each class. As for the pen-stroke model, prior knowledge is accumulated in building the structure of the model during meta-learning, which means that unseen classes can be modelled effectively from only a few data examples.

Rezende *et al.* (2016) presented a related model that was also based on the VAE but differed in that (i) it was specialized to images and contained prior knowledge about image transformations; and (ii) generation was conditioned explicitly on a new class example, as opposed to inferring a hidden variable representing the class.

Hariharan & Girshick (2017) proposed a method for hallucinating new examples to augment datasets where few data examples are available. Their approach is based on the intuition that learned intra-class variations are transferable and generalize well to novel classes.

They assume that they have a large body of data with many examples per class from which they can learn about intra-class variation. They then exploit this knowledge to create extra examples in the few-shot test scenario. Their learning approach is based on analogy; if in the training data we observe embeddings $\mathbf{z}_{11}$ and $\mathbf{z}_{12}$ for class 1, then perhaps we can use the embedding $\mathbf{z}_{21}$ from class 2 to predict a new variation $\mathbf{z}_{22}$. In other words, we aim to answer the question "if $\mathbf{z}_{11}$ is to $\mathbf{z}_{12}$ then $\mathbf{z}_{21}$ is to what?" (figure 7).

This analogy task is performed using a multi-layer perceptron that takes $\mathbf{z}_{11}$, $\mathbf{z}_{12}$ and $\mathbf{z}_{21}$ and predicts $\mathbf{z}_{22}$. They learn this network from quadruplets of features taken from training tasks with plentiful data.^{3} The loss function encourages both accurate prediction of the missing feature vector and also correct classification of the synthesized example. For few-shot test tasks, the data is augmented using this generator by analogy with the plentiful training classes and it is shown that this significantly improves performance.
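The simplest possible analogy predictor is additive; Hariharan & Girshick replace it with a learned multi-layer perceptron, but the additive rule below conveys the idea of completing the quadruplet:

```python
# The simplest analogy completion applies the observed within-class
# change to the new class: z22 = z21 + (z12 - z11).  In the actual
# method this additive rule is replaced by a learned MLP.
def complete_analogy(z11, z12, z21):
    return [c + (b - a) for a, b, c in zip(z11, z12, z21)]

z11, z12 = [1.0, 1.0], [1.0, 2.0]   # class-1 pair: the variation adds (0, 1)
z21 = [3.0, 0.0]                    # class-2 example
assert complete_analogy(z11, z12, z21) == [3.0, 1.0]
```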

Subsequently, Wang *et al.* (2018) proposed an end-to-end framework that jointly optimizes a meta-learner (such as a prototypical network) and a hallucinator (which generates additional training examples). Samples are hallucinated by a multi-layer perceptron that takes a training example and a noise vector and generates a new example. The new samples are added to the original training set, and this augmented set is used to learn the parameters of the classifier. The loss is back-propagated to update the parameters of both the classifier and the hallucinator. A key notion here is that it is not the plausibility of the new examples that matters, but rather their ability to improve classification performance in a few-shot setting.

There is enormous diversity in approaches to meta-learning and few-shot learning, but there is currently no consensus on the best approach. Probably the most thorough empirical comparison is that of Chen *et al.* (2019) (figure 8), although it mainly focuses on approaches that learn embeddings. It should also be noted that many of the approaches are complementary to one another, and a practical solution may be to combine them.

Few-shot learning and meta-learning are likely to gain in importance as AI penetrates more specific problem domains where the cost of gathering data is too great to justify a brute force approach. They remain interesting open problems in artificial intelligence.

^{1 }*This update can be iterated, but we will describe a single update for simplicity of notation.*

^{2 }*In practice, the model is somewhat more complicated than this with a sequence of dependent hidden variables describing each data example.*

^{3 }*The analogies are actually learned from cluster centers of the data and are chosen so that the cosine similarity between $\mathbf{z}_{11}-\mathbf{z}_{12}$ and $\mathbf{z}_{21}-\mathbf{z}_{22}$ is greater than zero.*


Located at the heart of the University of Waterloo and housed in the new Evolv1 building, a leading space in sustainable design and the first zero-carbon office building in Canada, Borealis AI’s office features a unique design that draws inspiration from campus life, with a teacher’s lounge, a science lab, a track, and a field pitch.

It was a natural fit for Borealis AI to establish its fifth research lab in Waterloo, a city anchored by the University of Waterloo, a world-class institution, and flanked by a number of innovative AI start-ups and tech companies. Our Waterloo centre strengthened our existing ties to the city and its strong research community, dating back to Borealis AI’s early days in 2016. Borealis AI is a proud supporter of Waterloo.ai, the university’s artificial intelligence institute.

True to our vision of supporting Waterloo’s AI community, we are also pleased to announce Borealis AI’s support for the Leader’s Prize at True North, powered by COMMUNITECH. The Prize is “a national competition that challenges Canadian thinkers to solve a major societal or industry problem of global proportion and consequence.” Teams will compete in employing AI/ML to produce solutions that automate “the fact-checking process and flag whether a claim is true or false.” Professor Pascal Poupart, Principal Researcher at Borealis AI, will be heading up the scientific committee for the competition.


At Borealis AI, we believe that the value of ML research emerges when it is engineered into beautiful products that have a positive impact on the world. Joseph started his career in financial software at Spectra Securities Software (now part of FIS) and then led product groups at Yahoo in the U.S. and 500px in Toronto.

Our teams at Borealis AI are thrilled to be welcoming him on board.

I wanted an opportunity to leverage my experience and passion for tech and machine learning specifically to build something big that benefits Canada. I had just spent three years working for Communitech, a not-for-profit, helping startups, scale-ups and large organizations learn how to use data to drive innovation. It was time to get back in the game. Through the vision of Dave McKay and Foteini, and with the hard work of an amazing team, Borealis AI had built a world-class, integrated team of AI researchers and engineers. It had a huge, friendly market in RBC with over 86,000 employees serving 16 million clients in 35 countries - the scale to drive real impact. It was a once-in-a-lifetime chance.

We’re a distributed organization; our leadership is spread across Vancouver, Edmonton, Waterloo, Toronto, and Montreal. This is critical not only to our ability to recruit, but also to our ability to make an impact across Canada. We try to be very intentional in how we communicate so that we can work well despite the distance.

As we accelerate our work on applications it's critical that we have access to amazing engineering, product and design talent to complement our great research team. Waterloo is an amazing tech ecosystem, with a strong history of revolutionary products based on deep tech — BlackBerry, Open Text, Sandvine, Miovision and so on. Everyone forgets how technically challenging BlackBerry was. Those technical and business skills, that passion for deep tech, is very real and makes Waterloo unique.

And, of course, we train the best software developers in the world, which you need to apply ML at scale. I can see them walking to Davis Centre from my desk. So that's why Borealis AI and RBC continue to bet on Waterloo.

First, the opportunity: the financial system is the plumbing of the global economy. It's this foundational thing that changes very rapidly and has huge impact. Everywhere I look there are problems to solve and opportunities to unlock. I'm convinced that AI will revolutionize large areas of human endeavour very soon and a bank is a great place to be able to observe and effect that. So we're in a problem-rich environment with lots of long levers to pull.

And second, the team. With our roots in research, we have incredible capacity in ML. I've had to suspend my disbelief and take a wide view as to what we can/can't do with AI. Dr. Simon Prince, one of our directors of research, put it best in my first week: "you bring us the user problem, the business opportunity, and let us figure out if or how AI can solve it." AI cannot, yet, do anything a human can, but my starting point is to assume that it can, and then to let the research and engineering teams help us understand any constraints, and move beyond them.

This is something that we've been wrestling with. Machine learning is still an immature field; capabilities are changing so fast that it's sometimes hard to separate the impossible from the very hard. User research is difficult to do without real results and real screens. Most times, when you're building an app or website you can get useful user feedback just by walking them through static mockups or a prototype. With a machine learning problem, the performance of the underlying algorithm is hugely important to the overall experience.

It's also hard to turn a business problem into a machine learning problem. Doing so takes skill, and requires that you slice the business problem from multiple directions. You need to understand what training data, simulation environments, or other signal is available. You also need to work with very smart people from very different disciplines with different temperaments and perspectives towards a common goal.

And finally, we're a product organization that's been built around a world-class research team. To earn our keep we need to be finding problems and applications of ML that move us towards unsolved problems. We've got to keep chasing huge business opportunities with an element of scientific risk.

To build a world-class product culture in Borealis AI, and to ship products that make people stop and say wow. We have the potential to develop and deploy some amazing, transformative technology that creates a big impact and I want to get us there, and quickly.

Today, Borealis AI announced it will collaborate with MILA to support a machine learning research initiative on climate change.

Climate change is indisputably one of the biggest challenges of our time. Global temperature rise, glacial retreat, sea-level rise and extreme weather events are just a few examples of the impact that humans are having on Earth. While modern society is at the centre of this change, there is currently a disconnect between human responsibility and awareness. People have a hard time understanding how climate change affects them personally and what it means for their future.

MILA researchers, led by Prof. Yoshua Bengio, have developed computer vision algorithms to personalize the effect of extreme weather events on locations of interest. Given an address, this machine learning model can generate a photo-realistic image that visualises the impact of extreme weather phenomena in that region, as predicted by a climate model associated with that geography. Generative models are used to synthesize images showing flooding and other weather effects that are hyper-personalized, depicting one’s own home or street.

This project falls under MILA’s research portfolio on “AI for Humanity” which involves a number of projects that are socially responsible and beneficial to society.

Humans can recognize new object classes from very few instances. However, most machine learning techniques require thousands of examples to achieve similar performance. The goal of *few-shot learning* is to classify new data having seen only a few training examples. In the extreme, there might be only a single example of each class (*one-shot learning*). In practice, few-shot learning is useful when training examples are hard to find (e.g., cases of a rare disease), or where the cost of labelling data is high.

Few-shot learning is usually studied using *N-way-K-shot classification*. Here, we aim to discriminate between $N$ classes with $K$ examples of each. A typical problem size might be to discriminate between $N=10$ classes with only $K=5$ examples of each from which to train. We cannot train a classifier using conventional methods here; any modern classification algorithm will depend on far more parameters than there are training examples and will generalize poorly.

If the data is insufficient to constrain the problem, then one possible solution is to gain experience from other similar problems. To this end, most approaches characterize few-shot learning as a *meta-learning* problem.

In the classical learning framework, we learn how to classify from training data and evaluate the results using test data. In the meta-learning framework, we *learn how to learn* to classify given a set of *training tasks* and evaluate using a set of *test tasks* (figure 1). In other words, we use one set of classification problems to help solve other unrelated sets.

Here, each task mimics the few-shot scenario, so for N-way-K-shot classification, each task includes $N$ classes with $K$ examples of each. These are known as the *support set* for the task and are used for learning how to solve it. In addition, there are further examples of the same classes, known as the *query set*, which are used to evaluate the performance on this task. The tasks can be completely non-overlapping; we may never see the classes from one task in any of the others. The idea is that the system repeatedly sees instances (tasks) during training that match the structure of the final few-shot task, but contain different classes.
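Sampling one such task (often called an episode) from a pool of labelled classes can be sketched as follows; the helper name and toy data are hypothetical:

```python
import random

def sample_episode(dataset, n_way, k_shot, q_queries, rng):
    """Sample one N-way-K-shot task from {class_name: [examples]}."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, name in enumerate(classes):
        examples = rng.sample(dataset[name], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]   # support set
        query += [(x, label) for x in examples[k_shot:]]     # query set
    return support, query

# Toy pool: 6 classes with 10 examples each.
pool = {f"class{i}": [f"c{i}_ex{j}" for j in range(10)] for i in range(6)}
support, query = sample_episode(pool, n_way=5, k_shot=1, q_queries=3,
                                rng=random.Random(0))
assert len(support) == 5 and len(query) == 15
```

At each meta-training step, a fresh episode like this would be drawn and the loss computed on its query set.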

At each step of meta-learning, we update the model parameters based on a randomly selected training task. The loss function is determined by the classification performance on the query set of this training task, based on knowledge gained from its support set. Since the network is presented with a different task at each time step, it must learn how to discriminate data classes in general, rather than a particular subset of classes.

To evaluate few-shot performance, we use a set of test tasks. Each contains only unseen classes that were not in any of the training tasks. For each, we measure performance on the query set based on knowledge of their support set.

Approaches to meta-learning are diverse and there is no consensus on the best approach. However, there are three distinct families, each of which exploits a different type of prior knowledge:

**Prior knowledge about similarity:** We learn embeddings in training tasks that tend to separate different classes even when they are unseen.

**Prior knowledge about learning:** We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

**Prior knowledge of data:** We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.

An overview of these methods can be seen in figure 2. In this review, we will consider each family of methods in turn.

This family of algorithms aims to learn compact representations (embeddings) in which the data vector is mostly unaffected by intra-class variations but retains information about class membership. Early work focused on pairwise comparators which aim to judge whether two data examples are from the same or different classes, even though the system may not have seen these classes before. Subsequent research focused on multi-class comparators which allow assignment of new examples to one of several classes.

Pairwise comparators take two examples and classify them as belonging to the same or different classes. This differs from the standard N-way-K-shot configuration and does not obviously map onto the above description of meta-learning, although, as we will see later, there is in fact a close relationship.

Koch *et al.* (2015) trained a model that outputs the probability $Pr(y_a=y_{b})$ that two data examples $\mathbf{x}_{a}$ and $\mathbf{x}_{b}$ belong to the same class (figure 3a). The two examples are passed through identical multi-layer neural networks (hence Siamese) to create two embeddings. The component-wise absolute distance between the embeddings is computed and passed to a subsequent comparison network that reduces this distance vector to a single number. This is passed through a sigmoidal output to classify the pair as the same or different, trained with a cross-entropy loss.
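A toy version of this pipeline is sketched below; the linear "embedding network" and hand-picked weights are illustrative stand-ins for the learned Siamese branches and comparison network:

```python
import math

def embed(x, w):
    # Stand-in for the shared embedding network (identical for both inputs).
    return [wi * xi for wi, xi in zip(w, x)]

def same_class_probability(xa, xb, w, v, bias):
    za, zb = embed(xa, w), embed(xb, w)        # identical (Siamese) branches
    d = [abs(a - b) for a, b in zip(za, zb)]   # component-wise absolute distance
    score = sum(vi * di for vi, di in zip(v, d)) + bias   # comparison "network"
    return 1.0 / (1.0 + math.exp(-score))      # sigmoid -> Pr(y_a = y_b)

w, v, bias = [1.0, 1.0], [-2.0, -2.0], 1.0    # large distance -> low probability
p_same = same_class_probability([0.5, 0.5], [0.5, 0.5], w, v, bias)
p_diff = same_class_probability([0.5, 0.5], [3.0, -1.0], w, v, bias)
assert p_same > p_diff
```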

During training, each pair of examples is randomly drawn from a super-set of training classes. Hence, the system learns to discriminate between classes in general, rather than between two classes in particular. In testing, completely different classes are used. Although this does not have the formal structure of the N-way-K-shot task, the spirit is similar.

Triplet networks (Hoffer & Ailon 2015) consist of three identical networks that are trained by triplets $\{\mathbf{x}_{+},\mathbf{x}_{a},\mathbf{x}_{-}\}$ of the form (positive, anchor, negative). The positive and anchor samples are from the same class, whereas the negative sample is from a different class. The learning criterion is *triplet loss* which encourages the anchor to be closer to the positive example than it is to the negative example in the embedding space (figure 3b). Hence it is based on two pairwise comparisons.
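The triplet loss itself is simple to state in code; the sketch below uses squared Euclidean distance and a unit margin (both common choices, though implementations vary):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Encourage the anchor to be closer to the positive than to the
    # negative by at least `margin` in the embedding space.
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [2.0, 0.0]
assert triplet_loss(anchor, positive, negative) == 0.0   # already well separated
assert triplet_loss(anchor, negative, positive) > 0.0    # violated ordering is penalized
```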

After training, the system can take two examples and establish whether they are from the same or different classes, by thresholding the distance in the learned embedding space. This was employed in the context of face verification by Schroff *et al.* (2015). This line of work is part of a greater literature on learning distance metrics (see Suarez *et al.* 2018 for overview).

Pairwise comparators can be adapted to the N-way-K-shot setting by assigning the class for an example in the query set based on its maximum similarity to one of the examples in the support set. However, multi-class comparators attempt to do the same thing in a more principled way; here the representation and final classification are learned in an end-to-end fashion.

In this section, we'll use the notation $\mathbf{x}_{nk}$ to denote the $k$th support example from the $n$th class in the N-Way-K-Shot classification task, and $y_{nk}$ to denote the corresponding label. For simplicity, we'll assume there is a single query example $\hat{\mathbf{x}}$ and the goal is to predict the associated label $\hat{y}$.

Matching networks (Vinyals *et al.* 2016) predict the one-hot encoded query-set label $\hat{\mathbf{y}}$ as a weighted sum of all of the one-hot encoded support-set labels $\{\mathbf{y}_{nk}\}_{n,k=1}^{N,K}$. The weight is based on a computed similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$ between the query-set data $\hat{\mathbf{x}}$ and each training example $\{\mathbf{x}_{nk}\}_{n,k=1}^{N,K}$:

\begin{equation}

\hat{\mathbf{y}} = \sum_{n=1}^{N}\sum_{k=1}^{K} a[\mathbf{x}_{nk},\hat{\mathbf{x}}]\,\mathbf{y}_{nk}, \tag{1.1}

\end{equation}

where the similarities have been constrained to be positive and sum to one.

To compute the similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$, they pass each support example $\mathbf{x}_{nk}$ through a network $\mbox{ f}[\bullet]$ to produce an embedding and pass the query example $\hat{\mathbf{x}}$ through a different network $\mbox{ g}[\bullet]$ to produce a different embedding. They then compute the cosine similarity between these embeddings (figure 5a)

\begin{equation}

d[\mathbf{x}_{nk}, \hat{\mathbf{x}}] = \frac{\mbox{ f}[\mathbf{x}_{nk}]^{T}\mbox{ g}[\hat{\mathbf{x}}]} {||\mbox{ f}[\mathbf{x}_{nk}]||\cdot||\mbox{ g}[\hat{\mathbf{x}}]||}, \tag{1.2}

\end{equation}

and normalise using a softmax function:

\begin{equation}

a[\mathbf{x}_{nk},\hat{\mathbf{x}}] = \frac{\exp[d[\mathbf{x}_{nk},\hat{\mathbf{x}}]]}{\sum_{m=1}^{N}\sum_{l=1}^{K}\exp[d[\mathbf{x}_{ml},\hat{\mathbf{x}}]]}, \tag{1.3}

\end{equation}

to produce positive similarities that sum to one. This system can be trained end-to-end for the N-way-K-shot learning task.^{1} At each learning iteration, the system is presented with a training task; the predicted labels are computed for the query set (the calculation is based on the support set) and the loss function is the cross-entropy between the ground truth and predicted labels.
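Putting equations 1.1 to 1.3 together, a matching-network prediction can be sketched as follows, where toy embedding vectors stand in for the outputs of the learned networks $\mbox{ f}[\bullet]$ and $\mbox{ g}[\bullet]$:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def matching_predict(support_embeddings, support_labels, query_embedding, n_classes):
    # Softmax-normalized cosine similarities (equations 1.2 and 1.3)...
    sims = [cosine(z, query_embedding) for z in support_embeddings]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # ...weight the one-hot support labels (equation 1.1).
    y_hat = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        y_hat[label] += w
    return y_hat

# Toy 2-way-2-shot task; embeddings stand in for f[x] and g[x_hat].
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
y_hat = matching_predict(support, labels, query_embedding=[1.0, 0.05], n_classes=2)
assert y_hat[0] > y_hat[1]   # query is closest to the class-0 support examples
```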

Matching networks compute similarities between the embeddings of each support example and the query example. This has the disadvantage that the algorithm is not robust to data imbalance; if there are more support examples for some classes than others (i.e., we have departed from the N-way-K-shot scenario), the classes with more training data may dominate.

Prototypical networks (Snell *et al.* 2017) are robust to data imbalance by construction; they average the embeddings $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ of the examples for class $n$ to compute a mean embedding or *prototype* $\mathbf{p}_{n}$. They then use the similarity between each prototype and the query embedding (figures 4 and 5b) as the basis for classification.

The similarity is computed as a negative multiple of the Euclidean distance (so that larger distances now give smaller numbers). They pass these similarities to a softmax function to give a probability over classes. This model effectively learns a metric space where the average of a few examples of a class is a good representation of that class and class membership can be assigned based on distance.
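This classification rule can be sketched directly (toy embeddings; in the real system the embedding network is learned end-to-end):

```python
import math

def prototype(embeddings):
    # Mean embedding of a class's support examples.
    k, dim = len(embeddings), len(embeddings[0])
    return [sum(z[d] for z in embeddings) / k for d in range(dim)]

def proto_classify(class_embeddings, query):
    """class_embeddings: {class_index: [embedding, ...]} -> class probabilities."""
    protos = {n: prototype(zs) for n, zs in class_embeddings.items()}
    # Similarity is the negative squared Euclidean distance to each prototype...
    sims = {n: -sum((q - p) ** 2 for q, p in zip(query, p_n))
            for n, p_n in protos.items()}
    # ...passed through a softmax to give a distribution over classes.
    m = max(sims.values())
    exps = {n: math.exp(s - m) for n, s in sims.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

support = {0: [[0.0, 0.0], [0.2, 0.0]], 1: [[2.0, 2.0], [1.8, 2.0]]}
probs = proto_classify(support, query=[0.1, 0.1])
assert probs[0] > probs[1]   # query is nearer the class-0 prototype
```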

They noted that (i) the choice of distance function is vital, as squared Euclidean distance outperformed cosine distance; (ii) having a higher number of classes in the support set helps to achieve better performance; and (iii) the system works best when the support size of each class is matched in the training and test tasks.

Ren *et al.* (2018) extended this system to take advantage of additional unlabeled data, which might be from the test task classes or from other distractor classes. Oreshkin *et al.* (2018) extended this approach by learning a task-dependent metric on the feature space, so that the distance metric changes from place to place in the embedding space.

Matching networks and prototypical networks both focus on learning the embedding and compare examples using a pre-defined metric (cosine similarity and Euclidean distance, respectively). Relation networks (Sung *et al.* 2018) also learn the metric used to compare the embeddings (figure 5c). Similarly to prototypical networks, the relation network averages the embeddings of each class in the support set to form a single prototype. Each prototype is then concatenated with the query embedding and passed to a *relation module*. This is a learnable non-linear operator that produces a similarity score between 0 and 1, where 1 indicates that the query example belongs to this class. This approach is clean and elegant and can be trained end-to-end.
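The scoring step can be sketched as below; the hand-crafted relation module here is only a stand-in for the learned non-linear operator, chosen so that agreement between the two halves of the concatenated vector yields a high score:

```python
import math

def relation_module(concat):
    # Toy stand-in for the learned non-linear operator: it scores the
    # agreement between the two halves of the concatenated vector.
    half = len(concat) // 2
    p, q = concat[:half], concat[half:]
    score = -sum((a - b) ** 2 for a, b in zip(p, q))
    return 1.0 / (1.0 + math.exp(-score - 1.0))   # squash to (0, 1)

def relation_score(prototype, query_embedding):
    return relation_module(prototype + query_embedding)   # concatenate, then score

proto_a, proto_b, query = [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]
assert relation_score(proto_a, query) > relation_score(proto_b, query)
```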

All of the pairwise and multi-class comparators are closely related to one another. Each learns an embedding space for data examples. In matching networks, there are different embeddings for support and query examples, but in the other models, they are the same. For prototypical networks and relation networks, multiple embeddings from the same class are averaged to form prototypes. Distances between support set embeddings/prototypes and query set embeddings are computed using either pre-determined distance functions such as Euclidean or cosine distance (triplet networks, matching networks, prototypical networks) or by learning a distance metric (Siamese networks and relation networks).

The multi-class networks have the advantage that they can be trained end-to-end for the N-way-K-shot classification task. This is not true for the pairwise comparators which are trained to produce a similarity or distance between pairs of data examples (which could itself subsequently be used to support multi-class classification).

Although it is not obvious how the pairwise comparators map to the meta-learning framework, it is possible to consider their data as consisting of minimal training and test tasks. For Siamese networks, each pair of examples is a training task, consisting of one support example and one query example, where their classes may not necessarily match. For triplet networks, there are two support examples (from different classes) and one query example (from one of the classes).

In part I of this tutorial we have described the few-shot and meta-learning problems and introduced a taxonomy of methods. We have also discussed methods that use a series of training tasks to learn prior knowledge about the similarity and dissimilarity of classes that can be exploited for future few-shot tasks. This knowledge takes the form of data embeddings that reduce within-class variance relative to between-class variance, and hence make it easier to learn from just a few data points.

In part II of this tutorial, we'll discuss methods that incorporate prior knowledge about how to learn models, and that incorporate prior knowledge about the data itself.

^{1}Vinyals et al. (2016) also introduced a novel *context embedding* method which took the full context of the support set $\mathcal{S}$ into account so that $\mbox{g}[\bullet] = \mbox{g}[\mathbf{x}, \mathcal{S}]$. Here, the support set was considered as a sequence and encoded by a bi-directional LSTM. Snell et al. (2017) later argued that this context embedding was problematic and redundant.

Adversarial training (Madry et al. 2017) directly optimizes for adversarial robustness by (i) minimizing the loss $\mathcal{L}[\bullet]$ on $I$ data/label pairs $\{\mathbf{x}_{i},y_{i}\}$ while simultaneously (ii) maximizing the loss for each example with respect to an adversarial change $\boldsymbol\delta_{i}$:

\begin{equation}

\min_{\boldsymbol\phi}\frac{1}{I} \sum_{i=1}^{I} \max_{\|\boldsymbol\delta_{i}\| \leq \epsilon} \mathcal{L}\left[\mbox{f}[\mathbf{x}_{i} + \boldsymbol\delta_{i},\boldsymbol\phi], y_{i}\right], \tag{1.1}

\end{equation}

where $\boldsymbol\delta_{i}$ is constrained to lie within a specified $\epsilon$-ball and $\mbox{f}[\bullet,\boldsymbol\phi]$ is the network function with parameters $\boldsymbol\phi$.
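The inner maximization of equation 1.1 is typically approximated by projected gradient ascent. Below is a minimal sketch for the $\ell_\infty$ ball, with a toy quadratic loss standing in for the network loss; all names and hyper-parameters are illustrative:

```python
import numpy as np

def pgd_linf(grad_fn, x, y, eps=0.1, alpha=0.02, steps=10):
    """Approximate the inner maximization of equation 1.1 with projected
    gradient ascent inside an l_inf ball of radius eps.

    grad_fn(x, y) returns the gradient of the loss w.r.t. the input."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta, y)
        delta = delta + alpha * np.sign(g)   # ascend the loss
        delta = np.clip(delta, -eps, eps)    # project back into the eps-ball
    return delta

# Toy quadratic loss (w.x - y)^2 stands in for the network loss.
w = np.array([1.0, -1.0])
grad_fn = lambda x, y: 2.0 * (w @ x - y) * w
x0 = np.array([0.5, 0.2])
delta = pgd_linf(grad_fn, x0, y=0.0)
```

In the outer minimization, the network parameters are then updated on the perturbed examples $\mathbf{x}_i + \boldsymbol\delta_i$.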

Unfortunately, generating adversarial examples is a non-convex optimization problem, and so this worst-case objective can only be approximately solved (Kolter & Madry 2018). Finding a lower bound is equivalent to finding an adversarial sample, and empirically it has been observed that search algorithms almost exclusively produce high frequency solutions (Guo et al. 2018). These are samples with small pixel-wise perturbations dispersed across an image. This suggests that defenses designed to counter such perturbations may be vulnerable to low frequency solutions, which is the hypothesis we focused on analyzing in our latest paper (Sharma et al. 2019).

Recent work has shown the effectiveness of low frequency perturbations. Guo *et al.* (2018) improved the query efficiency of the decision-based gradient-free boundary attack (Brendel *et al.* 2017) by constraining the perturbation to lie within a low frequency subspace. Sharma *et al.* (2018) applied a 2D Gaussian filter to the gradient with respect to the input image during the iterative optimization process to win the CAAD 2018 competition.

However, two questions still remain unanswered:

- Are the results seen in recent work simply due to the *reduced search space* or specifically due to the use of *low frequency components*?
- Under what conditions are low frequency perturbations more effective than unconstrained perturbations?

To answer these questions, we utilize the discrete cosine transform (DCT) to test the effectiveness of perturbations manipulating specified frequency components. We remove certain frequency components of the perturbation $\boldsymbol\delta$ by applying a mask to its DCT transform $\text{DCT}(\boldsymbol\delta)$. We then reconstruct the perturbation by applying the inverse discrete cosine transform (IDCT) to the masked DCT transform:

\begin{align}

\text{FreqMask}[\boldsymbol\delta]=\text{IDCT}[\text{Mask}[\text{DCT}[\boldsymbol\delta]]]~. \tag{1.2}

\end{align}
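Equation 1.2 can be sketched with SciPy's DCT routines; the mask below implements the `DCT_Low` condition, with an illustrative $n=128$ on a $299\times299$ perturbation:

```python
import numpy as np
from scipy.fft import dctn, idctn  # type-II DCT and its inverse

def freq_mask(delta, n):
    """Keep only the n x n lowest-frequency DCT components of a 2D
    perturbation (the `DCT_Low` condition); zero out the rest."""
    coeffs = dctn(delta, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:n, :n] = 1.0   # low frequencies live in the top-left corner
    return idctn(coeffs * mask, norm="ortho")

delta = np.random.default_rng(0).standard_normal((299, 299))
low = freq_mask(delta, 128)
```

The other conditions (`DCT_High`, `DCT_Mid`, `DCT_Rand`) differ only in which entries of the mask are set to one.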

Accordingly in our attack, we use the following gradient:

\begin{equation}

\nabla_{\boldsymbol\delta} \mathcal{L}[\mathbf{x}+\text{FreqMask}[\boldsymbol\delta],y]. \tag{1.3}

\end{equation}

As can be seen in Figure 1, the condition `DCT_High` only preserves high frequency components; `DCT_Low` only preserves low frequency components; `DCT_Mid` only preserves mid frequency components; and `DCT_Rand` preserves randomly sampled components. For a given reduced dimensionality $n$, we preserve $n \times n$ components. Note that when $n=128$, we only preserve $128^2 / 299^2 \approx 18.3\%$ of the frequency components, which is a relatively small fraction of the original unconstrained perturbation.

Though adversarial examples are defined with regard to generally inducing decision change, one can restrict the attack further by only counting success if the decision is changed to a specific target. We evaluate attacks both with and without specified targets, termed in the literature as *targeted* and *non-targeted* attacks, respectively. We use the ImageNet dataset, where the 1000 distinct classes make targeted attacks significantly harder.

We use $\ell_\infty$-constrained projected gradient descent (Kurakin *et al.* 2016; Madry *et al.* 2017; Kolter & Madry 2018) with momentum (Dong *et al.* 2017), which is referred to as the momentum iterative method or MIM for short. We test $\epsilon=16/255$ and $\text{iterations}=[1,10]$ for the non-targeted case; $\epsilon=32/255$ and $\text{iterations}=10$ for the targeted case. We benchmark the attack with and without frequency constraints. For each mask type, we test $n=[256,128,64,32]$ with $d = 299$. For `DCT_Rand`, we average results over $3$ random seeds.
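The resulting attack loop can be sketched as an $\ell_\infty$-constrained iterative ascent with the accumulated-gradient momentum of MIM, where an optional `mask_fn` applies the frequency mask of equation 1.3. The toy loss and all names below are illustrative:

```python
import numpy as np

def mim_attack(grad_fn, x, y, eps=16/255, steps=10, mu=1.0, mask_fn=None):
    """Sketch of the momentum iterative method (MIM): l_inf-constrained
    iterative ascent on the loss, accumulating an l1-normalized gradient
    with decay factor mu. If mask_fn is given, the perturbation is
    projected onto the chosen frequency subspace before each gradient
    evaluation (cf. equation 1.3)."""
    alpha = eps / steps                   # spread the budget over the steps
    delta = np.zeros_like(x)
    g_acc = np.zeros_like(x)
    for _ in range(steps):
        d = mask_fn(delta) if mask_fn else delta
        g = grad_fn(x + d, y)
        g_acc = mu * g_acc + g / (np.abs(g).sum() + 1e-12)
        delta = np.clip(delta + alpha * np.sign(g_acc), -eps, eps)
    return mask_fn(delta) if mask_fn else delta

# Toy quadratic loss stands in for the network loss.
w = np.array([1.0, -1.0])
grad_fn = lambda x, y: 2.0 * (w @ x - y) * w
x0 = np.array([0.5, 0.2])
delta = mim_attack(grad_fn, x0, y=0.0)
```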

Furthermore, we evaluate attacks in the *white-box*, the *grey-box*, and the *black-box* settings. For each setting, given models $A$ and $B$, where the perturbation is generated on $A$, evaluation is conducted on $A$, "defended" $A$, and distinct $B$, respectively. For defenses, we use the top-4 winners of the NeurIPS 2017 competition (Kurakin *et al.* 2018), which were all prepended to the strongest released adversarially trained model at the time (Tramer *et al.* 2017).^{1} For our representative undefended model, we evaluate against the state-of-the-art found by neural architecture search (Zoph *et al.* 2017).^{2}

Low frequency perturbations can be generated more efficiently (figure 2) and appear more effective (figures 4 and 5) when evaluated against defended models. However, against undefended models (figure 3), no tangible benefit can be observed.

This can be seen more clearly when tracking each individual source-target pair (figure 6). Specifically, we can see that the NeurIPS 2017 competition winners provide almost no additional robustness to the underlying model when low frequency perturbations are applied.

We also observe that low frequency perturbations do not improve black-box transfer between undefended models. Figure 7 presents the normalized difference between the attack success rate (ASR) on each of the target models and the ASR on the undefended model, showing that defended models are roughly as vulnerable as undefended models when confronted with low frequency perturbations.

Our results demonstrate that given the same search space size, only low frequency perturbations yield performance improvement, namely in generation efficiency and effectiveness when evaluated against defended ImageNet models. When confronted with low frequency perturbations, the top-4 NeurIPS 2017 defenses provide no robustness benefit, and are roughly as vulnerable as undefended models. The question remains though: does adversarially perturbing the low frequency components of the input affect human perception?

Representative examples are shown in figures 8 and 9. Though the perturbations do not significantly change human perceptual judgement (*e.g.*, the top example still appears to be a standing woman), the perturbations with $n\leq 128$ are indeed perceptible. Although it is well-known that $\ell_p$-norms (in input space) are far from metrics aligned with human perception, it is still assumed that with a small enough bound (e.g. $\ell_\infty$ $\epsilon=16/255$), the resulting ball will constitute a subset of the imperceptible region (Kolter & Madry 2018). The fact that low frequency perturbations are fairly visible challenges this common belief.

In all, we hope our study encourages researchers to not only consider the frequency space, but perceptual priors in general, when bounding perturbations and proposing tractable, reliable defenses.

^{1}*https://github.com/tensorflow/models/tree/master/research/adv_imagenet_models*

^{2}*https://github.com/tensorflow/models/tree/master/research/slim*

*Discrimination* is the unequal treatment of individuals of certain groups, resulting in members of one group being deprived of benefits or opportunities. Common groups that suffer discrimination include those based on age, gender, skin colour, religion, race, language, culture, marital status, or economic condition.

The unintentional unfairness that occurs when a decision has widely different outcomes for different groups is known as *disparate impact*. As machine learning algorithms are increasingly used to determine important real-world outcomes such as loan approval, pay rates, and parole decisions, it is incumbent on the AI community to minimize unintentional discrimination.

This tutorial discusses how *bias* can be introduced into the machine learning pipeline, what it means for a decision to be *fair*, and methods to remove bias and ensure fairness.

There are many possible causes of bias in machine learning predictions. Here we briefly discuss three: (i) the adequacy of the data to represent different groups, (ii) bias inherent in the data, and (iii) the adequacy of the model to describe each group.

**Data adequacy.** Infrequent and specific patterns may be down-weighted by the model in the name of generalization and so minority records can be unfairly neglected. This lack of data may not just be because group membership is small; data collection methodology can exclude or disadvantage certain groups (e.g., if the data collection process is only in one language). Sometimes records are removed if they contain missing values and these may be more prevalent in some groups than others.

**Data bias.** Even if the amount of data is sufficient to represent each group, training data may reflect existing prejudices (e.g., that female workers are paid less), and this is hard to remove. Such historical unfairness in data is known as *negative legacy*. Bias may also be introduced by more subtle means. For example, data from two locations may be collected slightly differently. If group membership varies with location this can induce biases. Finally, the choice of attributes to input into the model may induce prejudice.

**Model adequacy.** The model architecture may describe some groups better than others. For example, a linear model may be suitable for one group but not for another.

A model is considered *fair* if errors are distributed similarly across protected groups, although there are many ways to define this. Consider taking data $\mathbf{x}$ and using a machine learning model to compute a score $\mbox{f}[\mathbf{x}]$ that will be used to predict a binary outcome $\hat{y}\in\{0,1\}$. Each data example $\mathbf{x}$ is associated with a *protected attribute* $p$. In this tutorial, we consider it to be binary $p\in\{0,1\}$. For example, it might encode sub-populations according to gender or ethnicity.

We will refer to $p=0$ as the *deprived population* and $p=1$ as the *favored population*. Similarly we will refer to $\hat{y}=1$ as the *favored outcome*, assuming it represents the more desirable of the two possible results.

Assume that for some dataset, we know the ground truth outcomes $y\in\{0,1\}$. Note that these outcomes may differ statistically between different populations, either because there are genuine differences between the groups or because the model is somehow biased. According to the situation, we may want our estimate $\hat{y}$ to take account of these differences or to compensate for them.

Most definitions of fairness are based on *group fairness*, which deals with statistical fairness across the whole population. Complementary to this is *individual fairness* which mandates that similar individuals should be treated similarly regardless of group membership. In this blog, we'll mainly focus on group fairness, three definitions of which include: (i) demographic parity, (ii) equality of odds, and (iii) equality of opportunity. We now discuss each in turn.

*Demographic parity* or *statistical parity* suggests that a predictor is unbiased if the prediction $\hat{y}$ is independent of the protected attribute $p$ so that

\begin{equation}

Pr(\hat{y}|p) = Pr(\hat{y}). \tag{2.1}

\end{equation}

Here, the same proportion of each population are classified as positive. However, this may result in different false positive and true positive rates if the true outcome $y$ does actually vary with the protected attribute $p$.

Deviations from statistical parity are sometimes measured by the *statistical parity difference*

\begin{equation}

\mbox{SPD} = Pr(\hat{y}=1|p=1) - Pr(\hat{y}=1|p=0), \tag{2.2}

\end{equation}

or the *disparate impact* which replaces the difference in this equation with a ratio. Both of these are measures of *discrimination* (i.e. deviation from fairness).
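Both discrimination measures are straightforward to compute from predictions; the sketch below uses the conditional positive rate of each group, with toy data and illustrative names:

```python
import numpy as np

def spd_and_di(y_hat, p):
    """Statistical parity difference and disparate impact for binary
    predictions y_hat and a binary protected attribute p, using the
    conditional rates Pr(y_hat = 1 | p)."""
    y_hat, p = np.asarray(y_hat), np.asarray(p)
    rate1 = y_hat[p == 1].mean()   # favored population
    rate0 = y_hat[p == 0].mean()   # deprived population
    return rate1 - rate0, rate0 / rate1

y_hat = [1, 1, 1, 0, 1, 0, 0, 0]
p     = [1, 1, 1, 1, 0, 0, 0, 0]
spd, di = spd_and_di(y_hat, p)   # positive rates 0.75 vs 0.25
```

A perfectly fair predictor in this sense gives SPD of 0 and disparate impact of 1.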

*Equality of odds* is satisfied if the prediction $\hat{y}$ is conditionally independent of the protected attribute $p$, given the true value $y$:

\begin{equation}

Pr(\hat{y}|y,p) = Pr(\hat{y}| y). \tag{2.3}

\end{equation}

This means that the true positive rate and false positive rate will be the same for each population; each error type is matched between each group.

*Equality of opportunity* has the same mathematical formulation as equality of odds, but is focused on one particular label $y=1$ of the true value so that:

\begin{equation}

Pr(\hat{y}|y=1,p) = Pr(\hat{y}| y=1). \tag{2.4}

\end{equation}

In this case, we want the true positive rate $Pr(\hat{y}=1|y=1)$ to be the same for each population, with no regard for the errors when $y=0$. In effect, the same proportion of the individuals in each population who merit the favored outcome ($y=1$) actually receive it.

Deviation from equality of opportunity is measured by the *equal opportunity difference*:

\begin{equation}

\mbox{EOD} = Pr(\hat{y}=1|y=1, p=1) - Pr(\hat{y}=1|y=1, p=0). \tag{2.5}

\end{equation}
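Measured as a difference in per-group true positive rates, this is again only a few lines; the toy data below is illustrative:

```python
import numpy as np

def equal_opportunity_difference(y_hat, y, p):
    """Difference in true positive rates Pr(y_hat=1 | y=1, p) between the
    favored (p=1) and deprived (p=0) populations."""
    y_hat, y, p = map(np.asarray, (y_hat, y, p))
    tpr1 = y_hat[(y == 1) & (p == 1)].mean()
    tpr0 = y_hat[(y == 1) & (p == 0)].mean()
    return tpr1 - tpr0

# All six individuals merit the favored outcome (y = 1).
eod = equal_opportunity_difference(y_hat=[1, 1, 0, 1, 0, 0],
                                   y=[1, 1, 1, 1, 1, 1],
                                   p=[1, 1, 1, 0, 0, 0])
```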

To make these ideas concrete, we consider the example of an algorithm that predicts credit rating scores for loan decisions. This scenario follows from the work of Hardt *et al.* (2016) and the associated blog.

There are two pools of loan applicants $p\in\{0,1\}$ that we'll describe as the blue and yellow populations. We assume that we are given historical data, so we know both the credit rating and whether the applicant actually defaulted on the loan ($y=0$) or repaid it ($y=1$).

We can now think of four groups of data corresponding to (i) the blue and yellow populations and (ii) whether they did or did not repay the loan. For each of these four groups we have a distribution of credit ratings (figure 1). In an ideal world, the two distributions for the yellow population would be exactly the same as those for the blue population. However, as figure 1 shows, this is clearly not the case here.

Why might the distributions for blue and yellow populations be different? It could be that the behaviour of the populations is identical, but the credit rating algorithm is biased; it may favor one population over another or simply be more noisy for one group. Alternatively, it could be that the populations genuinely behave differently. In practice, the differences in blue and yellow distributions are probably attributable to a combination of these factors.

Let's assume that we can't retrain the credit score prediction algorithm; our job is to adjudicate whether each individual is refused the loan ($\hat{y}=0)$ or granted it ($\hat{y}=1$). Since we only have the credit score $\mbox{f}[\mathbf{x}]$ to go on, the best we can do is to assign different thresholds $\tau_{0}$ and $\tau_{1}$ for the blue and yellow populations so that the loan is granted if $\mbox{f}[\mathbf{x}]>\tau_{0}$ for the blue population and $\mbox{f}[\mathbf{x}]>\tau_{1}$ for the yellow population.

We'll now consider different possible ways to set these thresholds that result in different senses of fairness. We emphasize that we are not advocating any particular criterion, but merely exploring the ramifications of different choices.

**Blindness to protected attribute:** We choose the same threshold for blue and yellow populations. This sounds sensible, but it neither guarantees that the overall frequency of loans, nor the frequency of successful loans will be the same for the two groups. For the thresholds chosen in figure 2a, many more loans are made to the yellow population than the blue population (figure 2b). Moreover, examination of the receiver operating characteristic (ROC) curve shows that both the rate of true positives $Pr(\hat{y}=1|y=1)$ and false alarms $Pr(\hat{y}=1|y=0)$ differ for the two groups (figure 2c).

**Equality of odds:** This definition of fairness proposes that the false positive and true positive rates should be the same for both populations. This also sounds reasonable, but figure 2c shows that it is not possible for this example. There is no combination of thresholds that can achieve this because the ROC curves do not intersect. Even if they did, we would be stuck giving loans based on the particular false positive and true positive rates at the intersection which might not be desirable.

**Demographic parity:** The threshold could be chosen so that the same proportion of each group are classified as $\hat{y} =1$ and given loans (figure 3). We make an equal number of loans to each group despite the different tendencies of each to repay (figure 3b). This has the disadvantage that the true positive and false positive rates might be completely different in different populations (figure 3c). From the perspective of the lender, it is desirable to give loans in proportion to people's ability to pay them back. From the perspective of an individual in a more reliable group, it may seem unfair that the other group gets offered the same number of loans despite the fact they are less reliable.

**Equal opportunity:** The thresholds are chosen so that the true positive rate is the same for both populations (figure 4). Of the people who pay back the loan, the same proportion are offered credit in each group. In terms of the two ROC curves, it means choosing thresholds so that the vertical position on each curve is the same, without regard for the horizontal position (figure 2c). However, it means that different proportions of the blue and yellow groups are given loans (figure 4b).
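Choosing such thresholds amounts to matching a quantile of the repayers' score distribution in each group. A minimal sketch, with synthetic scores and an illustrative target rate:

```python
import numpy as np

def equal_opportunity_thresholds(repay_scores_0, repay_scores_1, target_tpr=0.8):
    """Pick per-group thresholds so the same fraction (target_tpr) of the
    true repayers in each group is granted the loan."""
    # A threshold achieving TPR t on the repayers' scores is their (1 - t) quantile.
    tau0 = np.quantile(repay_scores_0, 1 - target_tpr)
    tau1 = np.quantile(repay_scores_1, 1 - target_tpr)
    return tau0, tau1

# Synthetic credit scores for the repayers in each population.
rng = np.random.default_rng(1)
repay0 = rng.normal(0.65, 0.10, 1000)
repay1 = rng.normal(0.55, 0.15, 1000)
tau0, tau1 = equal_opportunity_thresholds(repay0, repay1)
```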

We have seen that there is no straightforward way to choose thresholds on an existing classifier for different populations so that all definitions of fairness are satisfied. Now we'll investigate a different approach that aims to make the classification performance more similar for the two populations.

The ROC curves show that accuracy is higher in predicting whether the blue population will repay the loan than the yellow population (i.e. the blue ROC curve is everywhere higher than the yellow one). What if we try to reduce the accuracy for the blue population so that it more nearly matches that of the yellow population? One way to do this is to add noise to the credit score for the blue population (figure 5). As we add increasing amounts of noise, the blue ROC curve moves towards the positive diagonal and at some point will cross the yellow ROC curve. Now equality of odds can be achieved.

Unfortunately, this approach has two unattractive features. First, we now make worse decisions for the blue population; it is a general feature of most remedial approaches that there is a trade off between accuracy and fairness (Kamiran & Calders 2012; Corbett-Davies *et al.* 2017). Second, adding noise violates individual fairness. Two identical members of the blue population may have different noise values added to the scores, resulting in different decisions on their loans.

The conclusion of the worked loan example is that it is very hard to remove bias once the classifier has already been trained, even for very simple cases. For further information, the reader is invited to consult Kamiran & Calders (2012), Hardt *et al.* (2016), Menon & Williamson (2017) and Pleiss *et al.* (2017).

| Post-Processing | In-Processing | Pre-Processing | Data Collection |
|---|---|---|---|
| • Change thresholds<br>• Trade off accuracy for fairness | • Adversarial training<br>• Regularize for fairness<br>• Constrain to be fair | • Modify labels<br>• Modify input data<br>• Modify label/data pairs<br>• Weight label/data pairs | • Identify lack of examples or variates and collect |

Thankfully, there are approaches to deal with bias at all stages of the data collection, preprocessing, and training pipeline (figure 6). In this section we consider some of these methods. In the ensuing discussion, we'll assume that the true behaviour of the different populations is the same. Hence, we are interested in making sure that the predictions of our system do not differ for each population.

A straightforward approach to eliminating bias from datasets would be to remove the protected attribute and other elements of the data that are suspected to contain related information. Unfortunately, such suppression rarely suffices. There are often subtle correlations in the data that mean that the protected attribute can be reconstructed. For example, we might remove race, but retain information about the subject's address, which could be strongly correlated with the race.

The degree to which there are dependencies between the data $\mathbf{x}$ and the protected attribute $p$ can be measured using the mutual information

\begin{equation}

\mbox{LP} = \sum_{\mathbf{x},p} Pr(\mathbf{x},p) \log\left[\frac{Pr(\mathbf{x},p)}{Pr(\mathbf{x})Pr(p)}\right], \tag{2.6}

\end{equation}

which is known as the *latent prejudice* (Kamishima *et al.* 2011). As this measure increases, the protected attribute becomes more predictable from the data. Indeed, Feldman *et al.* (2015) and Menon & Williamson (2017) have shown that the predictability of the protected attribute puts mathematical bounds on the potential discrimination of a classifier.
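For a discrete feature, the latent prejudice of equation 2.6 can be estimated directly from empirical counts. A sketch with toy data:

```python
import numpy as np

def latent_prejudice(x, p):
    """Mutual information between a discrete feature x and a binary
    protected attribute p, estimated from empirical counts."""
    x, p = np.asarray(x), np.asarray(p)
    mi = 0.0
    for xv in np.unique(x):
        for pv in np.unique(p):
            pxp = np.mean((x == xv) & (p == pv))          # joint Pr(x, p)
            if pxp > 0:
                px, pp = np.mean(x == xv), np.mean(p == pv)
                mi += pxp * np.log(pxp / (px * pp))
    return mi

mi_dependent = latent_prejudice([0, 1, 0, 1], [0, 1, 0, 1])    # p fully predictable
mi_independent = latent_prejudice([0, 0, 1, 1], [0, 1, 0, 1])  # no dependence
```

When $p$ is fully determined by $x$, the measure reaches its maximum (here $\log 2$ nats for a balanced binary attribute); when they are independent it is zero.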

We'll now discuss four approaches for removing bias by manipulating the dataset. Respectively, these modify the labels $y$, the observed data $\mathbf{x}$, the data/label pairs $\{\mathbf{x},y\}$, and the weighting of these pairs.

Kamiran & Calders (2012) proposed changing some of the training labels which they term *massaging* the data. They compute a classifier on the original dataset and find examples close to the decision surface. They then swap the labels in such a way that a positive outcome for the disadvantaged group is more likely and re-train. This is a heuristic approach that empirically improves fairness at the cost of accuracy.

Feldman *et al.* (2015) proposed manipulating individual data dimensions $x$ in a way that depends on the protected attribute $p$. They align the cumulative distributions $F_{0}[x]$ and $F_{1}[x]$ for feature $x$ when the protected attribute $p$ is 0 and 1 respectively to a median cumulative distribution $F_{m}[x]$. This is similar to standardising test scores across different high schools (figure 7) and is termed *disparate impact removal*. This approach has the disadvantage that it treats each input variable $x\in\mathbf{x}$ separately and ignores their interactions.
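One way to implement this alignment is quantile matching: each value is mapped to the corresponding quantile of a "median" distribution, here taken as the average of the two groups' quantile functions. The sketch below is simplified and omits the partial-repair parameter of the original method:

```python
import numpy as np

def repair_feature(x, p):
    """Map each group's values of a single feature onto a 'median'
    distribution (here: the average of the two groups' quantile
    functions) by quantile matching."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p)
    out = np.empty_like(x)
    for g in (0, 1):
        vals = x[p == g]
        # Rank of each value within its own group, as a quantile in [0, 1].
        ranks = vals.argsort().argsort() / max(len(vals) - 1, 1)
        q0 = np.quantile(x[p == 0], ranks)   # group-0 quantile function
        q1 = np.quantile(x[p == 1], ranks)   # group-1 quantile function
        out[p == g] = (q0 + q1) / 2          # map onto the median distribution
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0])
p = np.array([0, 0, 0, 0, 1, 1, 1, 1])
out = repair_feature(x, p)   # both groups mapped to the same distribution
```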

Calmon *et al.* (2017) learn a randomized transformation $Pr(\mathbf{x}^{\prime}, y^{\prime}|\mathbf{x},y,p)$ that transforms data pairs $\{\mathbf{x}, y\}$ to new data values $\{\mathbf{x}^{\prime}, y^{\prime}\}$ in a way that depends explicitly on the protected attribute $p$. They formulate this as an optimization problem in which they minimize the change in data utility, subject to limits on the prejudice and distortion of the original values. They show that this optimization problem may be convex in certain conditions.

Unlike disparate impact removal, this takes into account interactions between all of the data dimensions. However, the randomized transformation is formulated as a probability table, so this is only suitable for datasets with small numbers of discrete input and output variables. The randomized transformation, which must also be applied to test data, also violates individual fairness.

Kamiran & Calders (2012) propose to *re-weight* the $\{\mathbf{x}, y\}$ tuples in the training dataset so that cases where the protected attribute $p$ predicts that the disadvantaged group will get a positive outcome are more highly weighted. They then train a classifier that makes use of these weights in its cost function. Alternately, they propose *re-sampling* the training data according to these weights and using a standard classifier.
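This style of re-weighting can be computed in closed form from the empirical marginals: each $(p, y)$ cell receives weight $Pr(p)Pr(y)/Pr(p,y)$, which makes the label and the protected attribute statistically independent under the weighted distribution. A sketch:

```python
import numpy as np

def reweigh(y, p):
    """Kamiran & Calders-style weights w(p, y) = Pr(p) Pr(y) / Pr(p, y):
    under the weighted distribution, the label and protected attribute
    are statistically independent."""
    y, p = np.asarray(y), np.asarray(p)
    w = np.empty(len(y), dtype=float)
    for pv in (0, 1):
        for yv in (0, 1):
            cell = (p == pv) & (y == yv)
            if cell.any():
                w[cell] = np.mean(p == pv) * np.mean(y == yv) / np.mean(cell)
    return w

p = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # favored group gets y=1 more often
w = reweigh(y, p)
```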

In the previous section, we introduced the latent prejudice measure based on the mutual information between the data $\mathbf{x}$ and the protected attribute $p$. Similarly, we can measure the dependence between the labels $y$ and the protected attribute $p$:

\begin{equation}

\mbox{IP} = \sum_{y,p} Pr(y,p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]. \tag{2.7}

\end{equation}

This is known as the *indirect prejudice* (Kamishima *et al.* 2011). Intuitively, if there is no way to predict the labels from the protected attribute and vice-versa then there is no scope for bias.

One approach to removing bias during training is to explicitly remove this dependency using adversarial learning. Other approaches include penalizing the mutual information with a regularization term, or fitting the model under the constraint that it is not biased. We'll briefly discuss each in turn.

Adversarial debiasing (Beutel *et al.* 2017; Zhang *et al.* 2018) reduces evidence of protected attributes in predictions by simultaneously trying to fool a second classifier that tries to guess the protected attribute $p$. Beutel *et al.* (2017) force both classifiers to use a shared representation, and so minimizing the performance of the adversarial classifier means removing all information about the protected attribute from this representation (figure 8).

Zhang *et al.* (2018) use the adversarial component to predict $p$ from (i) the final classification logits $f[\mathbf{x}]$ (to ensure demographic parity), (ii) the classification logits $f[\mathbf{x}]$ and the true class $y$ (to ensure equality of odds), or (iii) the final classification logits and the true result for just one class (to ensure equality of opportunity).

Kamishima *et al.* (2011) proposed adding an extra regularization term to a logistic regression classifier that penalizes the mutual information between the protected attribute and the prediction $\hat{y}$. They first re-arranged the indirect prejudice expression using the definition of conditional probability to get

\begin{eqnarray}

\mbox{IP} &=& \sum_{y,p} Pr(y|\mathbf{x},p) \log\left[\frac{Pr(y,p)}{Pr(y)Pr(p)}\right]\nonumber\\

&=& \sum_{y,p} Pr(y|\mathbf{x},p) \log\left[\frac{Pr(y|p)}{Pr(y)}\right]. \tag{2.8}

\end{eqnarray}

Then, they formulate a regularization loss based on the expectation of this over the data set:

\begin{equation}

\mbox{L}_{reg} = \sum_{i}\sum_{\hat{y},p} Pr(\hat{y}_{i}|\mathbf{x}_{i},p_{i})\log\left[\frac{Pr(\hat{y}_{i}|p_{i})}{Pr(\hat{y}_{i})}\right] \tag{2.9}

\end{equation}

where $i$ indexes the data examples, which they add to the main training loss.

Zafar *et al.* (2015) formulated unfairness in terms of the covariance between the protected attribute $\{p_{i}\}_{i=1}^{I}$ and the signed distances $\{d[\mathbf{x}_{i},\boldsymbol\theta]\}_{i=1}^{I}$ of the associated feature vectors $\{\mathbf{x}_{i}\}_{i=1}^{I}$ from the decision boundary, where $\boldsymbol\theta$ denotes the model parameters. Let $\overline{p}$ represent the mean value of the protected attribute. They then minimize the main loss function such that the covariance remains within some threshold $t$.

\begin{equation}

\begin{aligned}

& \underset{\boldsymbol\theta}{\text{minimize}}

& & L[\boldsymbol\theta] \\

& \text{subject to}

& & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})d[\mathbf{x}_{i},\boldsymbol\theta] \leq t\\

& & & \frac{1}{I}\sum_{i=1}^{I}(p_{i}-\overline{p})d[\mathbf{x}_{i},\boldsymbol\theta] \geq -t

\end{aligned} \tag{2.10}

\end{equation}

This constrained optimization problem can also be written as a regularized optimization problem in which the fairness constraints are moved to the objective and the corresponding Lagrange multipliers act as regularizers. Zafar *et al.* (2015) also introduced a second formulation where they maximize fairness under accuracy constraints.
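The quantity constrained in equation 2.10 is just an empirical covariance between the protected attribute and the signed boundary distances. A sketch of computing it, which could equally be added to the loss as a penalty (the toy numbers are illustrative):

```python
import numpy as np

def fairness_covariance(p, distances):
    """Empirical covariance between the protected attribute and the signed
    distances to the decision boundary (the constrained quantity in
    equation 2.10)."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(distances, dtype=float)
    return np.mean((p - p.mean()) * d)

# Boundary distances perfectly aligned with group membership vs. uncorrelated.
cov_biased = fairness_covariance([0, 0, 1, 1], [-1.0, -1.0, 1.0, 1.0])
cov_fair = fairness_covariance([0, 1, 0, 1], [1.0, -1.0, -1.0, 1.0])
```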

Zemel *et al.* (2013) presented a method that maps data to an intermediate space in a way that depends on the protected attribute and obfuscates information about that attribute. Since this mapping is learnt during training, this method could be considered either a pre-processing approach or an in-processing algorithm.

Chen *et al.* (2018) argue that a trade off between fairness and accuracy may not be acceptable and that these challenges should be addressed through data collection. They aim to diagnose unfairness induced by inadequate data and unmeasured predictive variables and prescribe data collection approaches to remedy these problems.

In this tutorial, we've discussed what it means for a classifier to be fair, how to quantify the degree of bias in a dataset and methods to remedy unfairness at all stages in the pipeline. An empirical analysis of fairness-based interventions is presented in Friedler *et al.* (2019). There are a large number of toolkits available to help evaluate fairness, the most comprehensive of which is AI Fairness 360.

This tutorial has been limited to a discussion of supervised learning algorithms, but there is also an orthogonal literature on bias in NLP embeddings (e.g. Zhao *et al.* 2019).

By systematically controlling the frequency components of the perturbation and evaluating against the top-placing defense submissions in the NeurIPS 2017 competition, we empirically show that performance improvements in both the white-box and black-box transfer settings are yielded only when low frequency components are preserved. In fact, the defended models based on adversarial training are roughly as vulnerable to low frequency perturbations as undefended models, suggesting that the purported robustness of state-of-the-art ImageNet defenses relies on adversarial perturbations being high frequency in nature. We do find that under the competition distortion bound ($\ell_\infty$, $\epsilon=16/255$), low frequency perturbations are indeed perceptible. This questions the use of the $\ell_\infty$-norm, in particular, as a distortion metric, and, in turn, suggests that explicitly considering the frequency space is promising for learning robust models which better align with human perception.
