Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the "shortest successful perturbation", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness.

Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the "correct" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training from the perspective of margin maximization, and provide an alternative interpretation for adversarial training: it maximizes either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness.

]]>We will not focus on the input type; we assume that the input has been processed by a suitable *encoder* to create an embedding in a latent space. Instead, we concentrate on the *decoder* which takes this embedding and generates sequences of natural language tokens.

We will use the running example of *neural machine translation*: given a sentence in language A, we aim to generate a translation in language B. The input sentence is processed by the encoder to create a fixed-length embedding. The decoder then uses this embedding to output the translation word-by-word.

We'll now describe the *encoder-decoder architecture* for this translation model in more detail (figure 1). The encoder takes the sequence $\mathbf{x}$ which consists of $K$ words $\{x_t\}_{t=1}^{K}$ and outputs the latent space embedding $\mathbf{h}$. The *decoder* takes this latent representation $\mathbf{h}$ and outputs a sequence of $L$ word tokens $\{\hat{y}_t\}_{t=1}^{L}$ one by one.

We consider an encoder which converts each word token to a fixed length word embedding using a method such as the SkipGram algorithm (Mikolov *et* al. 2013). Then these word embeddings are combined by a neural architecture such as a recurrent neural network (Sundermeyer *et* al. 2012), self-attention network (Vaswani *et* al. 2017), or convolutional neural network (Dauphin *et* al. 2017) to create a fixed-length hidden state $\mathbf{h}$ that describes the whole input sequence.

At each step $t$, the decoder takes this hidden state and an input token, and outputs a probability distribution over the vocabulary $\mathcal{V}$ from which the output word $\hat{y}_{t}$ can be chosen. The hidden state itself is also updated, so that it knows about the history of the generated words. The first input token is always a special *start of sentence* (SOS) token. Subsequent tokens correspond to the predicted output word from the previous time-step during inference, or the ground truth word during training. There is a special *end of sentence* (EOS) token. When this is generated, it signifies that no more tokens will be produced.

This tutorial is divided into two parts. In this first part, we assume that the system has been trained with a maximum likelihood criterion and discuss algorithms for the decoder. We will see that maximum likelihood training has some inherent problems related to the fact that the cost function does not consider the whole output sequence at once and we'll consider some possible solutions.

In the second part of the tutorial we change our focus to consider alternative training methods. We consider fine-tuning the system using reinforcement learning or minimum risk training which use sequence-level cost functions. Finally, we review a series of methods that frame the problem as structured prediction. A summary of these methods is given in table 1.

| Type of method | Training | Inference |
|---|---|---|
| Decoding algorithms | Maximum likelihood | greedy search |
| | Maximum likelihood | beam search |
| | Maximum likelihood | diverse beam search |
| | Maximum likelihood | iterative beam search |
| | Maximum likelihood | top-k sampling |
| | Maximum likelihood | nucleus sampling |
| Sequence-level fine-tuning | Fine-tune with reinforcement learning | greedy search / beam search |
| | Fine-tune with minimum risk training | beam search |
| Sequence-level training | Scheduled sampling | greedy search / beam search |
| | Beam search optimization | beam search |
| | SeaRNN | greedy search / beam search |
| | Reward augmented max likelihood | beam search |

In this section, we describe the standard approach to train encoder-decoder architectures, which uses the maximum likelihood criterion. Specifically, the model is trained to maximize the conditional log-likelihood for each of $I$ input-output sequence pairs $\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^I$ in the training corpus so that:

\begin{equation}\label{eq:local_max_like}

L = \sum_{i=1}^{I} \log\left[ Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)\right] \tag{1}

\end{equation}

where $\boldsymbol\phi$ are the weights of the model.

The probabilities of the output words are evaluated sequentially and each depends on the previously generated tokens, so the probability term in equation 1 takes the form of an auto-regressive model and can be decomposed as:

\begin{equation}

Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)= \prod_{t=1}^{L_i} Pr(y_{i,t} |\mathbf{y}_{i,<t}, \mathbf{x}_i, \boldsymbol\phi). \tag{2}

\end{equation}

Here, the probability of token $y_{i,t}$ from the $i^{th}$ sequence at time $t$ depends on the tokens $\mathbf{y}_{i,<t} = \{y_{i,0}, y_{i,1},\ldots y_{i,t-1}\}$ seen up to this point as well as the input sentence $\mathbf{x}_{i}$ (via the latent embedding from the encoder).
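The decomposition in equation 2 can be illustrated with a few lines of Python. This is a minimal sketch, not a training loop: the per-token probabilities in `step_probs` are made-up numbers standing in for the model outputs $Pr(y_{i,t} |\mathbf{y}_{i,<t}, \mathbf{x}_i, \boldsymbol\phi)$, and the function name is hypothetical.

```python
import math

def sequence_log_prob(step_probs):
    """Log of the product in equation 2, computed as a sum of per-token logs."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities that a model assigns to the ground-truth tokens
# y_1, y_2, y_3, each conditioned on the preceding ground-truth tokens.
step_probs = [0.5, 0.25, 0.8]

log_prob = sequence_log_prob(step_probs)  # one term of the sum in equation 1
prob = math.exp(log_prob)                 # equals 0.5 * 0.25 * 0.8 up to rounding
```

Working with the sum of logs rather than the raw product is the usual choice in practice, since products of many small probabilities underflow quickly.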

This training criterion seems straightforward, but there is a subtle problem. In training, the previously seen ground truth tokens $\mathbf{y}_{i,<t}$ are used to compute the probability of the current token $y_{i,t}$. However, when we perform inference with the model and generate text, we no longer have access to the ground truth tokens; we only know the actual tokens $\hat{\mathbf{y}}_{i,<t}$ that we have generated so far (figure 2).

The approach of using the ground truth tokens for training is known as *teacher forcing*. In a sense, it means that the training scenario is unrealistic and does not map to the real situation when we perform inference. In training, the model is only exposed to sequences of ground truth tokens, but sees its own output when deployed. As we shall see in the following discussion, this *exposure bias* may result in some problems in the decoding process.

We now return to the decoding (inference) process. At each time step, the system predicts the probability $Pr(\hat{y}_t |\hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)$ over items in the vocabulary, and we have to select a particular word from this distribution to feed into the next decoding step. The goal is to pick the sequence with the highest overall probability:

\begin{equation}

Pr(\hat{\mathbf{y}} | \mathbf{x}, \boldsymbol\phi) = \prod_t Pr(\hat{y}_t | \hat{\mathbf{y}}_{<t},\mathbf{x}, \boldsymbol\phi). \tag{3}

\end{equation}

In principle, we can simply compute the probability of every possible sequence by brute force. However, there are as many choices as there are words $|\mathcal{V}|$ in the vocabulary for each position and so a sentence of length $L$ would have $|\mathcal{V}|^{L}$ possible sequences. The vocabulary size $|\mathcal{V}|$ might be as large as 50,000 words and so this might not be practical.

We can improve the situation by re-using partial computations; there are many sequences which start in the same way and so there is no need to re-compute these partial likelihoods. A dynamic programming approach can exploit this structure to produce an algorithm with complexity $\mathcal{O}[L|\mathcal{V}|^2]$ but this is still very expensive.
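To get a feel for these numbers, the short calculation below evaluates both counts. The sentence length of 20 is an assumed value chosen only for illustration; the vocabulary size and the two formulas are the ones quoted above.

```python
vocab_size = 50_000   # |V|, the vocabulary size quoted above
length = 20           # L, an assumed sentence length for illustration

# Brute force: one candidate for every way of filling L positions.
num_sequences = vocab_size ** length
num_digits = len(str(num_sequences))

# The dynamic programming figure quoted above: L * |V|^2 operations.
dp_operations = length * vocab_size ** 2
```

With these values the brute-force count is a 94-digit number, while the dynamic programming count is $5 \times 10^{10}$, which is far smaller but still a heavy cost to pay for every sentence.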

Since we cannot find the sequence with the maximum probability, we must resort to tractable search strategies that produce a reasonable approximation of this maximum. Two common approximations are *greedy search* and *beam search* which we discuss respectively in the next two sections.

The simplest strategy is *greedy search*. It consists of picking the most likely token according to the model at each decoding time step $t$ (figure 3a).

\begin{equation}

\hat{y}_t =\underset{w \in \mathcal{V}}{\mathrm{argmax}}\left[ Pr(y_{t} = w | \hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)\right] \tag{4}

\end{equation}

Note that this does not guarantee that the complete output $\hat{\mathbf{y}}$ will have high overall probability relative to others. For example, having selected the most likely first token $y_{0}$, it may transpire that there is no token $y_{1}$ for which the probability $Pr(y_{1}|y_{0},\mathbf{x},\boldsymbol\phi)$ is high. It might have been overall better to choose a less probable first token, which is more compatible with a second token.
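The following sketch makes this failure mode concrete. It assumes a toy "model" given as a lookup table from prefixes to next-token distributions; the table `TOY`, the token names, and the function names are all hypothetical and stand in for a real decoder.

```python
def greedy_decode(next_token_probs, max_len, eos="<eos>"):
    """Pick the most probable token at every step (equation 4)."""
    prefix, prob = (), 1.0
    while len(prefix) < max_len:
        dist = next_token_probs(prefix)
        token = max(dist, key=dist.get)   # argmax over the vocabulary
        prob *= dist[token]
        prefix += (token,)
        if token == eos:
            break
    return prefix, prob

# Toy next-token distributions, indexed by the tokens generated so far.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_model(prefix):
    return TOY[prefix]

greedy_seq, greedy_prob = greedy_decode(toy_model, max_len=2)
# Greedy search commits to "a" first and ends with probability 0.6 * 0.55,
# although the sequence ("b", "x") has the higher probability 0.4 * 0.9.
best_prob = 0.4 * 0.9
```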

We have seen that searching over all possible sequences is intractable, and that greedy search does not necessarily produce a good solution. Beam search seeks a compromise between these extremes by performing a restricted search over possible sequences. In this regard, it produces a solution that is both tractable and superior in quality to greedy search (figure 3b).

At each step of decoding $t$, the $B$ most probable sequences $\mathcal{B}^{t-1} =\{\hat{\mathbf{y}}_{<t,b}\}_{b=1}^{B}$ are stored as candidate outputs. For each of these hypotheses the log probability is computed for each possible next token $w$ in the vocabulary $\mathcal{V}$, so $B|\mathcal{V}|$ probabilities are computed in all. From these, the new $B$ most probable sequences $\mathcal{B}^t =\{\hat{\mathbf{y}}_{<t+1,b}\}_{b=1}^{B}$ are retained. By analogy with the formula for greedy search, we have:

\begin{equation}\label{eq:beam_search}

\mathcal{B}^t =\underset{w \in \mathcal{V},b\in1\ldots B}{\mathrm{argtopk}}\left[B, Pr(y_{t} = w | \hat{\mathbf{y}}_{<t,b}, \mathbf{x}, \boldsymbol\phi)\right] \tag{5}

\end{equation}

where the function $\mathrm{argtopk}[K,\bullet]$ returns the set of the top $K$ items that maximize the second argument. The process is repeated until EOS tokens are produced or the maximum decoding length is reached. Finally, we return the most likely overall result.

The integer $B$ is known as the *beam width*. As this increases, the search becomes more thorough but also more computationally expensive. In practice, it is common to see values of $B$ in the range 5 to 200. When the beam width is 1, the method becomes equivalent to greedy search.
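A minimal sketch of this procedure is given below. It makes two assumptions that are worth stating: hypotheses are ranked by their cumulative log-probability (the sum of the per-token terms), and the "model" is a hypothetical lookup table `TOY` rather than a trained decoder.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width best partial sequences at every step."""
    beam = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished: carry over
                continue
            for token, p in next_token_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
        if all(prefix[-1] == eos for prefix, _ in beam):
            break
    return beam  # sorted from most to least probable

# Toy next-token distributions, indexed by the tokens generated so far.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_model(prefix):
    return TOY[prefix]

wide = beam_search(toy_model, beam_width=2, max_len=2)
narrow = beam_search(toy_model, beam_width=1, max_len=2)
```

On this toy table a beam width of 2 recovers the sequence `("b", "x")`, whereas a beam width of 1 reduces to greedy search and returns `("a", "x")`.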

When we train a decoder with a maximum-likelihood criterion, the resulting sentences can exhibit a lack of diversity. This happens at both (i) the beam level (many sentences in the same beam may be very similar) and (ii) the decoding level (words are repeated during one iteration of decoding). In the next two sections we look at methods that have been proposed to ameliorate these issues.

While beam search is superior to greedy search, it often produces sentences that have the same or a very similar start (figure 4a). Decoding is done in a left-to-right fashion and the probability weights are often concentrated at the beginning; even if search is performed over $B$ sentences, the weight of the first few words will mean that most of these sentences start with these words and there is little diversity.

4a) Beam Search

- A steam engine train travelling down train tracks.
- A steam engine train travelling down tracks.
- A steam engine train travelling through a forest.
- A steam engine train travelling through a lush green forest.
- A steam engine train travelling through a lush green countryside.
- A train on a train track with a sky background.

4b) Diverse Beam Search

- A steam engine train travelling down train tracks.
- A steam engine train travelling through a forest.
- An old steam engine train travelling down train tracks.
- An old steam engine train travelling through a forest.
- A black train is on the tracks in a wooded area.
- A black train is on the tracks in a rural area.

This raises the question of whether the likelihood objective correlates with our end-goals. We might care about other criteria such as diversity, which is important for chatbots: if a chatbot always said the same thing in response to a generic input such as "how are you today?", it would quickly become dull. Hence, we might want to factor in other criteria of quality when decoding.

To counter beam-level repetition, Vijayakumar *et* al. (2018) proposed a variant of beam search, called *diverse beam search*, which encourages more variation in the generated sentences than pure beam search (figure 4b). The beam is divided into $G$ groups. Regular beam search is performed in the first group to generate $B'=\frac{B}{G}$ sentences. For the second group, at step $t$ of decoding, the beam search criterion is augmented with a factor that penalizes token sequences that are similar to the first $t$ words of the hypotheses in the first group. For the third group, sequences that are similar to those in either of the first two groups are penalized, and so on (figure 5).

Vijayakumar *et* al. (2018) investigate several similarity metrics including *Hamming diversity* which penalizes tokens based on their number of occurrences in the previous groups.
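One way to picture a Hamming-style penalty is sketched below. This is an illustration under assumptions rather than the exact objective from the paper: the penalty weight `lam`, the function name, and the example scores are all hypothetical. The idea is only that a token's score is reduced in proportion to how often earlier groups chose it at the same time step.

```python
from collections import Counter

def hamming_penalized_scores(log_probs, previous_group_tokens, lam=0.5):
    """Subtract lam for every earlier group that chose the same token."""
    counts = Counter(previous_group_tokens)
    return {tok: lp - lam * counts[tok] for tok, lp in log_probs.items()}

# Hypothetical log-probabilities for the next token in the current group.
log_probs = {"down": -0.5, "through": -0.9}

# Tokens that hypotheses in earlier groups selected at this time step.
previous_group_tokens = ["down", "down"]

scores = hamming_penalized_scores(log_probs, previous_group_tokens)
best = max(scores, key=scores.get)
```

Without the penalty the current group would also pick "down"; with it, the score of "down" drops to -1.5 and "through" is preferred.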

Diverse beam search has the disadvantage that it only discourages sequences that are close to the final sequences found in previous beams. However, there may be significant portions of the space that were searched to find these hypotheses and since we didn't store the intermediate results there is nothing to stop us from redundantly considering the same part of the search space again (figure 6).

Kulikov *et* al. (2018) introduced *iterative beam search* which aims to solve this problem. It resembles diverse beam search in that beams (groups of hypotheses) are computed and recorded. These beams are ordered and each is affected by the previous beams. However, unlike diverse beam search we do not wait for a beam search to complete before computing the others. Instead, they are computed concurrently.

Consider the situation where at output time $t-1$ we have $G$ groups of beams, each of which contains $B^{\prime}$ hypotheses. We extend the first beam to length $t$ in the usual way; we consider concatenating every possible vocabulary word with each of the $B^{\prime}$ hypotheses, evaluate the probabilities and retain the best overall $B^{\prime}$ solutions of length $t$.

When we extend the second beam to length $t$, we follow the same procedure. However, we now set any hypotheses that are too close in Hamming distance to those in the first beam to have zero probability. Likewise, when we extend the third beam, we discount any hypotheses that are close to those in the first two beams. The result is that each beam is forced to explore a different part of the search space and the final results have increased diversity (figure 7).
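The pruning step for the later beams can be sketched as follows. The threshold `min_distance` that defines "too close" is an assumption made for illustration, as are the function names and the example hypotheses; removing a candidate from the list plays the role of setting its probability to zero.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def filter_candidates(candidates, earlier_beams, min_distance):
    """Drop candidates that lie within min_distance of any earlier hypothesis."""
    kept = []
    for seq, score in candidates:
        too_close = any(
            hamming_distance(seq, other) < min_distance
            for beam in earlier_beams
            for other in beam
        )
        if not too_close:
            kept.append((seq, score))
    return kept

# Hypotheses of length 3 already held by the first beam.
first_beam = [("a", "train", "travelling")]

# Candidate extensions (sequence, log-probability) for the second beam.
candidates = [
    (("a", "train", "on"), -1.0),      # differs in one position: pruned
    (("an", "old", "steam"), -1.5),    # differs in three positions: kept
]

kept = filter_candidates(candidates, [first_beam], min_distance=2)
```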

Until this point, we have assumed that the best way to decode is by maximizing the probability of the output words, using either greedy search, beam search, or a variation on these techniques. However, Holtzman *et* al. (2019) demonstrate that human speech does not stay in high probability zones and is often much more surprising than the text generated by these methods.

This raises the question of whether we should sample randomly from the output probability distribution rather than search for likely decodings. Unfortunately, this can also lead to degenerate cases. Holtzman *et* al. (2019) conjecture that at some point during decoding, the model is likely to sample from the tail of the distribution (i.e., from the set of tokens which are much less probable than the gold token). Once the model samples from this tail, it might not know how to recover. In the next two sections, we consider two methods that aim to repress such behavior.

Fan *et* al. (2018) proposed *top-$k$ sampling* as a possible remedy. Consider one iteration of decoding at the $t$-th time step. Let us define $\mathcal{V}_K^t$ as the set of the $K$ most probable next tokens according to $Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi)$. Let us also define the sum $Z$ of these $K$ probabilities:

\begin{equation}

Z = \sum_{w \in \mathcal{V}_K^t} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi). \tag{6}

\end{equation}

Top-$k$ sampling proposes to re-scale these probabilities and ignore the other possible tokens:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_K^t \\

0 & \text{otherwise}. \tag{7}

\end{cases}

\end{equation}

With this strategy, we only sample from the $K$ most likely tokens and thus avoid tokens from the tail of the distribution (their probability is set to zero).
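Equations 6 and 7 translate into a short filtering function. The sketch below uses a hypothetical four-word distribution and hypothetical function names; it is meant to show the re-scaling, not to reproduce any particular implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and re-scale them to sum to one."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    z = sum(probs[w] for w in top)                      # equation 6
    return {w: (probs[w] / z if w in top else 0.0)      # equation 7
            for w in probs}

def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[w] for w in tokens], k=1)[0]

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}

filtered = top_k_filter(probs, k=2)
token = sample(filtered, random.Random(0))
```

With $K=2$ the two surviving tokens receive probabilities $0.5/0.8$ and $0.3/0.8$, and the remaining tokens can no longer be drawn.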

Unfortunately, it can be difficult to fix $K$ in practice. There exist extreme cases where the distribution is very peaked and so the top-$K$ tokens include tokens from the tail. Similarly, there may be cases where the distribution is very flat and valid tokens are excluded from the top-$K$ list.

Nucleus sampling (Holtzman *et* al. 2019) aims to solve this problem by retaining a fixed proportion of the probability mass. They define $\mathcal{V}_\tau^t$ as the smallest set such that:

\begin{equation*}

Z = \sum_{w \in \mathcal{V}^t_\tau} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t}, \boldsymbol\phi) \geq \tau,

\end{equation*}

where $\tau$ is a fixed threshold. Then, as in top-$k$ sampling, probabilities are re-scaled to:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_\tau^t \\

0 & \text{otherwise}. \tag{8}

\end{cases}

\end{equation}

Since the set $\mathcal{V}_\tau^t$ is chosen so that the cumulative probability mass is at least $\tau$, nucleus sampling does not suffer in the case of a flat or a peaked distribution.
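The adaptive behavior can be seen in a small sketch. The two distributions below are hypothetical, as is the function name; the point is that the same threshold $\tau$ keeps one token when the distribution is peaked and many tokens when it is flat.

```python
def nucleus_filter(probs, tau):
    """Keep the smallest set of top tokens whose mass reaches tau, then re-scale."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    nucleus, z = [], 0.0
    for w in ranked:
        nucleus.append(w)
        z += probs[w]
        if z >= tau:
            break
    keep = set(nucleus)
    return {w: (probs[w] / z if w in keep else 0.0) for w in probs}

# A peaked distribution: a single token already carries mass tau.
peaked = {"the": 0.9, "a": 0.05, "cat": 0.03, "zebra": 0.02}

# A flat distribution: all four tokens are needed to reach mass tau.
flat = {"w1": 0.25, "w2": 0.25, "w3": 0.25, "w4": 0.25}

peaked_out = nucleus_filter(peaked, tau=0.9)
flat_out = nucleus_filter(flat, tau=0.9)
num_kept_flat = sum(p > 0.0 for p in flat_out.values())
```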

Part I of this tutorial has highlighted some of the problems that occur during decoding. We can classify these problems into two categories.

First, maximum likelihood training teaches the model to stay in high probability regions of the token space. As shown by Holtzman *et* al. (2019), this differs significantly from human speech. If we want to take into account other criteria of quality, such as diversity, search strategies must be put in place to explore the space of likely outputs; greedy search and vanilla beam search are not enough.

Second, the exposure bias introduced by teacher forcing makes neural natural language generation (NNLG) models myopic: they only look for the next most likely token given a ground truth prefix. As a consequence, if the model samples a token from the long tail, it might enter a "degenerate" case. When this happens, the model "makes an error" by sampling a low-probability token and does not know how to recover from this error.

We have presented some sampling strategies that alleviate these issues at inference time. However, since these problems are both side-effects of the maximum likelihood teacher-forcing methodology for training, another way to approach this is to modify the training method. In part II of this tutorial, we describe how Reinforcement Learning (RL) and structured prediction help with both maximum-likelihood and teacher-forcing induced issues.

]]>The structure of this post is as follows. First, we briefly review knowledge graphs and knowledge graph completion in static graphs. Second, we discuss the extension to temporal knowledge graphs. Third, we present our new method for knowledge completion in temporal knowledge graphs and demonstrate the efficacy of this method in a series of experiments. Finally, we draw attention to some possible future directions for work in this area.

Knowledge graphs are knowledge bases of facts where each fact is of the form $(Alice, Likes, Dogs)$. Here $Alice$ and $Dogs$ are called the head and tail entities respectively and $Likes$ is a relation. An example knowledge graph is depicted in figure 1.

KG completion is the problem of inferring new facts from a KG given the existing ones. This may be possible because the new fact is logically implied as in:

\begin{equation*}

(Alice, BornIn, London) \land (London, CityIn, England) \implies (Alice, BornIn, England)

\end{equation*}

or it may just be based on observed correlations. If $(Alice, Likes, Dogs)$ and $(Alice, Likes, Cats)$ then there's a high probability that $(Alice, Likes, Rabbits)$.

For a single relation-type, the problem of knowledge graph completion can be visualised in terms of completing a binary matrix. To see this, consider the simpler knowledge graph depicted in figure 2a, where there are only two types of entities and one type of relation. We can define a binary matrix with the head entities in the rows and the tail entities in the columns (figure 2b). Each known positive relation corresponds to an entry of '1' in this matrix. We do not typically know negative relations. However, we can generate putative negative relations, by randomly sampling combinations of head entity, tail entity and relation. This is reasonable for large graphs where almost all combinations are false. This process is known as negative sampling and these negatives correspond to entries of '0' in the matrix. The remaining missing values in the matrix are the relations that we wish to infer in the KG completion process.
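Negative sampling for the single-relation case can be sketched as below. The entity names and the function name are hypothetical, and for clarity the sketch enumerates every unobserved pair before sampling, which is only sensible for a toy graph; on a large graph one would draw random pairs directly.

```python
import random

def sample_negatives(positives, heads, tails, n, rng):
    """Draw up to n (head, tail) pairs that are not known positives."""
    known = set(positives)
    unobserved = [(h, t) for h in heads for t in tails if (h, t) not in known]
    return rng.sample(unobserved, min(n, len(unobserved)))

heads = ["Alice", "Bob", "Carol"]
tails = ["Dogs", "Cats", "Fish"]

# Known positive facts for the single relation "Likes".
positives = [("Alice", "Dogs"), ("Alice", "Cats"), ("Bob", "Fish")]

negatives = sample_negatives(positives, heads, tails, n=3, rng=random.Random(0))
```

The sampled pairs become the '0' entries of the matrix, and every pair that is neither a known positive nor a sampled negative remains a missing value to be inferred.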

This matrix representation of the single-relation knowledge graph completion problem suggests a way forward. We can consider factoring the binary matrix $\mathbf{M} = \mathbf{A}^{T}\mathbf{B}$ into the product of a portrait matrix $\mathbf{A}^{T}$ in which each row corresponds to a head entity and a landscape matrix $\mathbf{B}$ in which each column corresponds to a tail entity. This is illustrated in figure 3. Now the binary value representing whether a given fact is true or not is approximated by the dot product of the vector (embedding) corresponding to the head entity and the vector corresponding to the tail entity. Hence, the problem of knowledge graph completion becomes equivalent to learning these embeddings.

More formally, we might define the likelihood of a relation being true as:

\begin{eqnarray}

Pr(a_{i}, Likes, b_{j}) &=& \mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right]\nonumber \\

Pr(a_{i}, \lnot Likes, b_{j}) &=& 1-\mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right] \tag{1}

\end{eqnarray}

$\mbox{sig}[\bullet]$ is a sigmoid function. The term $\mathbf{a}_{i}$ is the embedding for the $i^{th}$ head entity (from the $i^{th}$ row of the portrait matrix $\mathbf{A}^{T}$) and $\mathbf{b}_{j}$ is the embedding for the $j^{th}$ tail entity $b$ (from the $j^{th}$ column of the landscape matrix $\mathbf{B}$). We can hence learn the embedding matrices $\mathbf{A}$ and $\mathbf{B}$ by maximizing the log likelihood of all of the known relations.
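The log likelihood that is maximized can be written out in a few lines. In this sketch the embeddings and the list of labelled facts are hypothetical toy values, and the function names are chosen for illustration; a label of 1 marks a known positive and a label of 0 a sampled negative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_likelihood(head_emb, tail_emb, facts):
    """Sum of log-probabilities from equation 1 over labelled (i, j) pairs."""
    total = 0.0
    for i, j, label in facts:
        p = sigmoid(dot(head_emb[i], tail_emb[j]))
        total += math.log(p if label == 1 else 1.0 - p)
    return total

# Toy embeddings: head_emb[i] plays the role of a_i, tail_emb[j] of b_j.
head_emb = [[0.0, 0.0], [1.0, 2.0]]
tail_emb = [[0.0, 0.0], [0.5, -0.25]]

# (head index, tail index, label) triples.
facts = [(0, 0, 1), (1, 1, 0)]

ll = log_likelihood(head_emb, tail_emb, facts)
```

Both dot products here are zero, so each fact contributes $\log 0.5$; training would adjust the embeddings to increase this quantity.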

The above discussion considered only the simplified case where there is a single type of relation between entities. However, this general strategy can be extended to the case of multiple relations by considering a three dimensional binary tensor in which the third dimension represents the type of relation (figure 4). During the factorization process, we now also generate a matrix containing embeddings for each type of relation.

In the previous section, we considered KG completion in terms of factorizing a matrix or tensor into matrices of embeddings for the head entity, tail entity and relation.

We can generalize this idea, by retaining the notion of embeddings, but use more general *score functions* than the one implied by factorization to provide scores for each tuple. For example, TransE (Bordes *et* al. 2013) maps each entity and each relation to a vector of size $d$ and defines the score for a tuple $(Alice, Likes, Fish)$ as:

\[-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||\]

where ${z}_{Alice},{z}_{Likes},{z}_{Fish}\in\mathbb{R}^d$, corresponding to the embeddings for $Alice$, $Likes$, and $Fish$, are vectors with learnable parameters. To train, we define a likelihood such that $-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||$ becomes large if $(Alice, Likes, Fish)$ is in the KG and small if $(Alice, Likes, Fish)$ is likely to be false.
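The TransE score is easy to compute directly. The sketch below assumes the Euclidean norm, which is one common choice and is not specified above, and it uses hypothetical two-dimensional embeddings in place of learned ones.

```python
import math

def transe_score(z_head, z_rel, z_tail):
    """Negative Euclidean norm of head + relation - tail."""
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(z_head, z_rel, z_tail)))

# Hypothetical embeddings with d = 2.
z_alice = [1.0, 0.0]
z_likes = [0.0, 1.0]
z_fish = [1.0, 1.0]   # exactly z_alice + z_likes, so the score is maximal
z_dogs = [4.0, 5.0]   # far from z_alice + z_likes, so the score is low

score_fish = transe_score(z_alice, z_likes, z_fish)
score_dogs = transe_score(z_alice, z_likes, z_dogs)
```

The score can never exceed zero, and it reaches zero only when the tail embedding equals the head embedding translated by the relation embedding.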

Other models map entities and relations to different spaces and/or use different score functions. For a comprehensive list of existing approaches and their advantages and disadvantages, see Nguyen (2017).

Temporal KGs are KGs where each fact can have a timestamp associated with it. An example of a fact in a temporal KG is $(Alice, Liked, Fish, 1995)$. Temporal KG completion (TKGC) is the problem of inferring new temporal facts from a KG based on the existing ones.

Existing approaches for TKGC usually extend (static) KG embedding models by mapping the timestamps to latent representations and updating the score function to take into account the timestamps as well. As an example, TTransE extends TransE by mapping each entity, relation, and timestamp to a vector in $\mathbb{R}^d$ and defining the score function for a tuple $(Alice, Liked, Fish, 1995)$ as:

\[-|| z_{Alice} + z_{Liked} + z_{1995} - z_{Fish}||\]

For a comprehensive list of existing approaches for TKGC and their advantages and disadvantages, see Kazemi *et* al. (2019).

We develop models for TKGC based on an intuitive assumption: to provide a score for $(Alice, Liked, Fish, 1995)$, one needs to know $Alice$'s and $Fish$'s features in $1995$; providing a score based on their current features or an aggregation of their features over time may be misleading. That is because $Alice$'s personality and the sentiment towards $Fish$ may have been quite different in 1995 as compared to now (figure 5). Consequently, learning a static embedding for each entity, as is done by existing approaches, may be sub-optimal as such a representation only captures an aggregation of entity features over time.

To provide entity features at any given time, we define the entity embedding as a function which takes an entity and a timestamp as input and provides a hidden representation for the entity at that time. Inspired by diachronic word embeddings, we call our proposed embedding a *diachronic embedding (DE)*. In particular, we define the diachronic embedding for an entity $E$ using vector(s) defined as follows:

\begin{equation}

\label{eq:demb}

z^t_E[n]=\begin{cases}

a_E[n] \sigma(w_E[n] t + b_E[n]), & \text{if $1 \leq n\leq \gamma d$}. \\

a_E[n], & \text{if $\gamma d < n \leq d$}. \tag{2}

\end{cases}

\end{equation}

where $a_E\in\mathbb{R}^{d}$ and $w_E,b_E\in\mathbb{R}^{\gamma d}$ are (entity-specific) vectors with learnable parameters, $z^t_E[n]$ indicates the $n^{th}$ element of $z^t_E$ (similarly for $a_E$, $w_E$ and $b_E$), and $\sigma$ is an activation function.

Intuitively, entities may have some features that change over time and some features that remain fixed (figure 6). The first $\gamma d$ elements of $z^t_E$ in Equation (2) capture temporal features and the other $(1-\gamma)d$ elements capture static features. The hyperparameter $\gamma\in[0,1]$ controls the percentage of temporal features. In principle, static features could be obtained from the temporal ones if the optimizer set some elements of $w_E$ in Equation (2) to zero. However, explicitly modeling static features helps reduce the number of learnable parameters and avoid overfitting to temporal signals.

Intuitively, by learning $w_E$s and $b_E$s, the model learns how to turn entity features on and off at different points in time so accurate temporal predictions can be made about them at any time. The terms $a_E$s control the importance of the features. We mainly use $\sin[\bullet]$ as the activation function for Equation (2) because one sine function can model several on and off states (figure 7). Our experiments explore other activation functions as well and provide more intuition.
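Equation (2) with a sine activation can be sketched in plain Python. The parameter values below are hypothetical and chosen so the results are easy to check by hand; the number of temporal features $\gamma d$ is taken from the length of `w`, and the time stamps are small numbers rather than calendar years.

```python
import math

def diachronic_embedding(a, w, b, t, activation=math.sin):
    """Equation (2): the first len(w) features vary with t, the rest are static."""
    n_temporal = len(w)  # this is gamma * d
    return [
        a[n] * activation(w[n] * t + b[n]) if n < n_temporal else a[n]
        for n in range(len(a))
    ]

# Hypothetical parameters with d = 4 and gamma = 0.5.
a = [2.0, 1.0, 0.5, 3.0]
w = [math.pi / 2, 0.0]
b = [0.0, math.pi / 2]

z_t1 = diachronic_embedding(a, w, b, t=1.0)  # first feature is "on"
z_t2 = diachronic_embedding(a, w, b, t=2.0)  # first feature is "off"
```

Between the two time stamps only the first feature changes (from 2 to roughly 0). The second feature has $w_E[n]=0$ and so behaves like a static feature, which illustrates the remark above that static features could in principle be recovered from temporal ones.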

It is possible to take any static KG embedding model and make it temporal by replacing the entity embeddings with diachronic entity embeddings as in Equation (2). For instance, TransE can be extended to TKGC by changing the score function for a tuple $(Alice, Liked, Fish, 1995)$ to:

\[-|| z^{1995}_{Alice} + z_{Liked} - z^{1995}_{Fish}||\]

where $z^{1995}_{Alice}$ and $z^{1995}_{Fish}$ are defined as in Equation (2). We call the above model DE-TransE where $DE$ stands for diachronic embedding. Besides TransE, we also test extensions of DistMult and SimplE, two effective models for static KG completion. We name the extensions DE-DistMult and DE-SimplE respectively.
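Combining the two ideas gives a compact sketch of a DE-TransE style score. It assumes the same head + relation - tail form as the TransE score given earlier, a Euclidean norm, and a sine activation; the parameters, the small time stamps, and the function names are hypothetical.

```python
import math

def diachronic(a, w, b, t):
    """Equation (2) with a sine activation; the first len(w) features are temporal."""
    return [a[n] * math.sin(w[n] * t + b[n]) if n < len(w) else a[n]
            for n in range(len(a))]

def de_transe_score(head, z_rel, tail, t):
    """Negative norm of z_head(t) + z_rel - z_tail(t); entities are (a, w, b) triples."""
    z_h = diachronic(*head, t)
    z_t = diachronic(*tail, t)
    return -math.sqrt(sum((h + r - u) ** 2 for h, r, u in zip(z_h, z_rel, z_t)))

# Hypothetical entity parameters (a, w, b) with d = 2 and one temporal feature.
alice = ([1.0, 1.0], [math.pi / 2], [0.0])
fish = ([1.0, 2.0], [math.pi], [-math.pi / 2])
z_liked = [0.0, 1.0]  # relation embeddings stay static in this sketch

score_t1 = de_transe_score(alice, z_liked, fish, t=1.0)
score_t2 = de_transe_score(alice, z_liked, fish, t=2.0)
```

The same triple receives a score of about 0 at $t=1$ and about $-1$ at $t=2$, because the temporal features of the two entities drift apart; a static model would have to assign one score for all times.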

**Table 1** Results on ICEWS14, ICEWS05-15, and GDELT. Best results are in bold blue.

**Datasets:** Our datasets are subsets of two temporal KGs that have become standard benchmarks for TKGC: ICEWS and GDELT. For ICEWS, we use the two subsets generated by García-Durán *et* al. (2018): (1) *ICEWS14*, corresponding to the facts in 2014, and (2) *ICEWS05-15*, corresponding to the facts between 2005 and 2015. For GDELT, we use the subset extracted by Trivedi *et* al. (2017) corresponding to the facts from April 1, 2015 to March 31, 2016. We changed the train/validation/test sets following a similar procedure as in Bordes *et* al. (2013) to make the problem into a TKGC rather than an extrapolation problem.

**Baselines:** Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE, DistMult and SimplE where the timestamps are ignored. From the temporal KG embedding models, we compare to TTransE, HyTE, ConT, and TA-DistMult.

**Metrics:** We report filtered MRR and filtered hit@k measures. These essentially create queries such as $(v, r, ?)$ and measure how well the model predicts the correct answer among possible entities $u'$. See Bordes *et* al. (2013) for details.
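A minimal sketch of these metrics follows, under the usual reading of "filtered": other entities that are already known to be correct answers to the query are removed before the rank of the target is computed. The scores, entity names, and function names are hypothetical.

```python
def filtered_rank(scores, correct, known_true):
    """Rank of the correct entity, ignoring other known-true answers."""
    target = scores[correct]
    better = sum(1 for e, s in scores.items()
                 if e != correct and e not in known_true and s > target)
    return 1 + better

def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Model scores for the query (Alice, Likes, ?), with Rabbits as the test answer.
scores = {"Fish": 0.2, "Dogs": 0.9, "Cats": 0.7, "Rabbits": 0.4}
known_true = {"Dogs"}  # (Alice, Likes, Dogs) is already in the KG

rank = filtered_rank(scores, "Rabbits", known_true)

# Hypothetical ranks from three test queries.
ranks = [1, 2, 4]
mean_rr = mrr(ranks)
h1, h3 = hits_at_k(ranks, 1), hits_at_k(ranks, 3)
```

In the example the unfiltered rank of Rabbits would be 3, but filtering out the known answer Dogs gives a rank of 2, so the model is not penalized for scoring another true fact highly.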

Table 1 and figure 8 show the performance of our models compared to several baselines. According to the results, the temporal versions of different models outperform the static counterparts in most cases, thus providing evidence for the merit of capturing temporal information.

DE-TransE outperforms the other TransE-based baselines (TTransE and HyTE) on ICEWS14 and GDELT and gives on-par results with HyTE on ICEWS05-15. This result shows the superiority of our diachronic embeddings compared to TTransE and HyTE. DE-DistMult outperforms TA-DistMult, the only DistMult-based baseline, showing the superiority of our diachronic embedding compared to TA-DistMult. Moreover, DE-DistMult outperforms all TransE-based baselines. Finally, just as SimplE beats TransE and DistMult due to its higher expressivity, our results show that DE-SimplE beats DE-TransE, DE-DistMult, and the other baselines due to its higher expressivity.

We perform several studies to provide a better understanding of our models. Our ablation studies include i) different choices of activation function, ii) using diachronic embeddings for both entities and relations as opposed to using them only for entities, iii) testing the ability of our models to generalize to timestamps unseen during training, iv) the importance of model parameters in Equation (2), v) balancing the number of static and temporal features in Equation (2), and vi) examining training complications due to the use of sine functions in the model. We refer the readers to the full paper for these experiments.

Our work opens several avenues for future research including:

- We proposed diachronic embeddings for KGs having timestamped facts. Future work may consider extending diachronic embeddings to KGs having facts with time intervals.
- We considered the ideal scenario where every fact in the KG is timestamped. Future work can propose ways of dealing with missing timestamps, or ways of dealing with a combination of static and temporal facts.
- We proposed a specific diachronic embedding in Equation (2). Future work can explore other possible functions.
- An interesting avenue for future research is to use Equation (2) to learn diachronic word embeddings and see if it can perform well in the context of word embeddings as well.

View the code here.

It is common to talk about the variational autoencoder as if it *is* the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$. The final model contains neither the 'variational' nor the 'autoencoder' parts and is better described as a *non-linear latent variable model*.

We'll start this tutorial by discussing latent variable models in general and then the specific case of the non-linear latent variable model. We'll see that maximum likelihood learning of this model is not straightforward, but we can define a lower bound on the likelihood. We then show how the autoencoder architecture can approximate this bound using a Monte Carlo (sampling) method. To maximize the bound, we need to compute derivatives, but unfortunately, it's not possible to compute the derivative of the sampling component. We'll show how to side-step this problem using the reparameterization trick. Finally, we'll discuss extensions of the VAE and some of its drawbacks.

Latent variable models take an indirect approach to describing a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. Instead of directly writing the expression for $Pr(\mathbf{x})$ they model a joint distribution $Pr(\mathbf{x}, \mathbf{h})$ of the data $\mathbf{x}$ and an unobserved latent (or hidden) variable $\mathbf{h}$. They then describe the probability of $Pr(\mathbf{x})$ as a marginalization of this joint probability so that

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}, \mathbf{h}) d\mathbf{h}.\tag{1}

\end{equation}

Typically we describe the joint probability $Pr(\mathbf{x}, \mathbf{h})$ as the product of the *likelihood* $Pr(\mathbf{x}|\mathbf{h})$ and the *prior* $Pr(\mathbf{h})$, so that the model becomes

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}| \mathbf{h}) Pr(\mathbf{h}) d\mathbf{h}.\tag{2}

\end{equation}

It is reasonable to question why we should take this indirect approach to describing $Pr(\mathbf{x})$. The answer is that relatively simple expressions for $Pr(\mathbf{x}| \mathbf{h})$ and $Pr(\mathbf{h})$ can describe a very complex distribution for $Pr(\mathbf{x})$.

A well known latent variable model is the mixture of Gaussians. Here the latent variable $h$ is discrete and the prior $Pr(h)$ is a discrete distribution with one probability $\lambda_{k}$ for each of the $K$ component Gaussians. The likelihood $Pr(\mathbf{x}|h)$ is a Gaussian with a mean $\boldsymbol\mu_{k}$ and covariance $\boldsymbol\Sigma_{k}$ that depends on the value $k$ of the latent variable $h$:

\begin{eqnarray}

Pr(h=k) &=& \lambda_{k}\nonumber \\

Pr(\mathbf{x} |h = k) &=& \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}].\label{eq:mog_like_prior}\tag{3}

\end{eqnarray}

where $\mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu, \boldsymbol\Sigma]$ represents a multivariate normal distribution over $\mathbf{x}$ with mean $\boldsymbol\mu$ and covariance $\boldsymbol\Sigma$.

As in equation 2, the likelihood $Pr(\mathbf{x})$ is given by the marginalization over the latent variable $h$. In this case, this is a sum as the latent variable is discrete:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \sum_{k=1}^{K} Pr(\mathbf{x}, h=k) \nonumber \\

&=& \sum_{k=1}^{K} Pr(\mathbf{x}| h=k) Pr(h=k)\nonumber \\

&=& \sum_{k=1}^{K} \lambda_{k} \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}]. \tag{4}

\end{eqnarray}

This is illustrated in figure 1. From the simple expressions for the likelihood and prior in equation 3, we can describe a complex multi-modal probability distribution.
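To make equation 4 concrete, here is a minimal sketch of a one-dimensional mixture of Gaussians. The component weights, means and variances are hypothetical values chosen only for illustration, and the Riemann sum is just a sanity check that the marginal integrates to roughly one.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mog_pdf(x, weights, means, variances):
    """Equation 4 in 1D: sum over k of lambda_k * Norm_x[mu_k, sigma_k^2]."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

# Hypothetical parameters for K = 3 components (the weights sum to one).
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.5, 3.0]
variances = [0.4, 0.2, 1.0]

# Riemann-sum check that the marginal Pr(x) integrates to approximately one.
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
total_mass = sum(mog_pdf(x, weights, means, variances) for x in grid) * step
```

Each component is a simple bell curve, but the weighted sum has three modes of different heights and widths.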

Now let's consider the non-linear latent variable model, which is what the VAE actually learns. This differs from the mixture of Gaussians in two main ways. First, the latent variable $\mathbf{h}$ is continuous rather than discrete and has a standard normal prior (i.e., one with mean zero and identity covariance). Second, the likelihood is a normal distribution as before, but the variance is constant and spherical. The mean is a non-linear function $\mathbf{f}[\mathbf{h},\bullet]$ of the hidden variable $\mathbf{h}$ and this gives rise to the name. The prior and likelihood terms are:

\begin{eqnarray}

Pr(\mathbf{h}) &=& \mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]\nonumber \\

Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi) &=& \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}], \tag{5}

\end{eqnarray}

where the function $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$ is a deep neural network with parameters $\boldsymbol\phi$. This model is illustrated in figure 2.

The model can be viewed as an infinite mixture of spherical Gaussians with different means; as before, we build a complex distribution by weighting and summing these Gaussians in the marginalization process. In the next three sections we consider three operations that we might want to perform with this model: computing the posterior distribution, sampling, and evaluating the likelihood.

The likelihood $Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi)$ tells us how to compute the distribution over the observed data $\mathbf{x}$ given hidden variable $\mathbf{h}$. We might however want to move in the other direction; given an observed data example $\mathbf{x}$ we might wish to understand what possible values of the hidden variable $\mathbf{h}$ were responsible for it (figure 3). This information is encompassed in the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$. In principle, we can compute this using Bayes's rule

\begin{eqnarray}

Pr(\mathbf{h}|\mathbf{x}) = \frac{Pr(\mathbf{x}|\mathbf{h})Pr(\mathbf{h})}{Pr(\mathbf{x})}. \tag{6}

\end{eqnarray}

However, in practice, there is no closed form expression for the left hand side of this equation. In fact, as we shall see shortly, we cannot evaluate the denominator $Pr(\mathbf{x})$ and so we can't even compute the numerical value of the posterior for a given pair $\mathbf{h}$ and $\mathbf{x}$.

Although computing the posterior is intractable, it is easy to generate a new sample $\mathbf{x}^{*}$ from this model using ancestral sampling; we draw $\mathbf{h}^{*}$ from the prior $Pr(\mathbf{h})$, pass this through the network $\mathbf{f}[\mathbf{h}^{*},\boldsymbol\phi]$ to compute the mean of the likelihood $Pr(\mathbf{x}|\mathbf{h}^{*})$ and then draw $\mathbf{x}^{*}$ from this distribution. Both the prior and the likelihood are normal distributions and so sampling from them in each step is easy. This process is illustrated in figure 4.
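The two sampling steps can be sketched in a few lines. This is an illustration under stated assumptions: the function `f` below is a hypothetical stand-in for the deep network (a fixed non-linear map from a 1D latent variable to a 2D mean), and the values of `phi` and `sigma` are arbitrary.

```python
import math
import random

def f(h, phi):
    """Hypothetical stand-in for the network f[h, phi]: maps a 1D latent to a 2D mean."""
    a, b = phi
    return [math.sin(a * h), b * h * h]

def ancestral_sample(phi, sigma, rng):
    """Draw h* from the prior, then x* from the likelihood Norm_x[f[h*, phi], sigma^2 I]."""
    h_star = rng.gauss(0.0, 1.0)                               # h* ~ Norm[0, 1]
    mean = f(h_star, phi)                                      # mean of Pr(x | h*)
    x_star = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]   # x* ~ Norm[mean, sigma^2 I]
    return h_star, x_star

rng = random.Random(0)
samples = [ancestral_sample((2.0, 0.5), 0.1, rng) for _ in range(5)]
```

Both steps only require draws from normal distributions, which is why sampling is cheap even though the posterior is intractable.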

Finally, let's consider evaluating the likelihood of a data example $\mathbf{x}$ under the model. As before, the likelihood is given by:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \int Pr(\mathbf{x}, \mathbf{h}|\boldsymbol\phi) d\mathbf{h} \nonumber \\

&=& \int Pr(\mathbf{x}| \mathbf{h},\boldsymbol\phi) Pr(\mathbf{h})d\mathbf{h}\nonumber \\

&=& \int \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}]\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]d\mathbf{h}. \tag{7}

\end{eqnarray}

Unfortunately, there is no closed form for this integral, so we cannot easily compute the probability for a given example $\mathbf{x}$. This is a major problem for two reasons. First, evaluating the probability $Pr(\mathbf{x})$ was one of the main reasons for modelling the probability distribution in the first place. Second, to learn the model, we maximize the log likelihood, which is obviously going to be hard if we cannot compute it. In the next section we'll introduce a lower bound on the log likelihood which can be computed and which we can use to learn the model.

During learning we are given training data $\{\mathbf{x}_{i}\}_{i=1}^{I}$ and want to maximize the log likelihood with respect to the parameters $\boldsymbol\phi$ of the model. For simplicity we'll assume that the variance term $\sigma^2$ in the likelihood expression is known and just concentrate on learning $\boldsymbol\phi$:

\begin{eqnarray}

\hat{\boldsymbol\phi} &=& \underset{\boldsymbol\phi}{\operatorname{argmax}} \left[\sum_{i=1}^{I}\log\left[Pr(\mathbf{x}_{i}|\boldsymbol\phi) \right]\right] \nonumber \\

&=& \underset{\boldsymbol\phi}{\operatorname{argmax}} \left[\sum_{i=1}^{I}\log\left[\int Pr(\mathbf{x}_{i}, \mathbf{h}_{i}|\boldsymbol\phi) d\mathbf{h}_{i}\right]\right].\label{eq:log_like} \tag{8}

\end{eqnarray}

As we noted above, we cannot write a closed form expression for the integral and so we can't just build a network to compute this and let Tensorflow or PyTorch optimize it.

To make some progress we define a lower bound on the log likelihood. This is a function that is always less than or equal to the log likelihood for a given value of $\boldsymbol\phi$ and will also depend on some other parameters $\boldsymbol\theta$. Eventually we will build a network to compute this lower bound and optimize it. To define this lower bound, we need to use Jensen's inequality which we quickly review in the next section.

Jensen’s inequality concerns what happens when we pass values through a concave function $g[\bullet]$. Specifically, it says that if we compute the expectation (mean) of these values and pass this mean through the function, the result will be greater than or equal to the one we get if we pass the values themselves through the function and then compute the expectation of the results. In mathematical terms:

\begin{equation}

g[\mathbf{E}[y]] \geq \mathbf{E}[g[y]], \tag{9}

\end{equation}

for any concave function $g[\bullet]$. Some intuition as to why this is true is given in figure 5. In our case, the concave function in question is the logarithm so we have:

\begin{equation}

\log[\mathbf{E}[y]]\geq\mathbf{E}[\log[y]], \tag{10}

\end{equation}

or writing out the expression for expectation in full we have:

\begin{equation}

\log\left[\int Pr(y) y dy\right]\geq \int Pr(y)\log[y]dy. \tag{11}

\end{equation}
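A quick numerical check of equation 10 may help build intuition. In this sketch the values $y$ are drawn from a uniform distribution on a hypothetical interval; any set of positive values would behave the same way.

```python
import math
import random

rng = random.Random(0)
# Hypothetical positive samples y; any positive distribution would do.
ys = [rng.uniform(0.1, 5.0) for _ in range(10000)]

log_of_mean = math.log(sum(ys) / len(ys))              # log[E[y]]
mean_of_log = sum(math.log(y) for y in ys) / len(ys)   # E[log[y]]
gap = log_of_mean - mean_of_log                        # non-negative by Jensen's inequality
```

The gap shrinks towards zero as the values of $y$ become more concentrated, and it vanishes when they are all identical.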

We will now use Jensen's inequality to derive the lower bound for the log likelihood. We start by multiplying and dividing the log likelihood by an arbitrary probability distribution $q(\mathbf{h})$ over the hidden variables

\begin{eqnarray}

\log[Pr(\mathbf{x}|\boldsymbol\phi)] &=& \log\left[\int Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)d\mathbf{h} \right] \tag{12} \\

&=& \log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \right]. \tag{13}

\end{eqnarray}

We then use Jensen's inequality for the logarithm (equation 11) to find a lower bound:

\begin{eqnarray}

\log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \right]

&\geq& \int q(\mathbf{h}) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})} \right]d\mathbf{h}, \tag{14}

\end{eqnarray}

where the term on the right hand side is known as the *evidence lower bound* or *ELBO*. It gets this name because the term $Pr(\mathbf{x}|\boldsymbol\phi)$ is known as the evidence when viewed in the context of Bayes' rule (equation 6).

In practice, the distribution $q(\mathbf{h})$ will have some parameters $\boldsymbol\theta$ as well and so the ELBO can be written as:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] = \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}. \tag{15}

\end{equation}

To learn the non-linear latent variable model, we'll maximize this quantity as a function of both $\boldsymbol\phi$ and $\boldsymbol\theta$. The neural architecture that computes this quantity (and hence is used to optimize it) is the variational autoencoder. Before we introduce that, we first consider some of the properties of the ELBO.

When first encountered, the ELBO can be a somewhat mysterious object. In this section we'll provide some intuition about its properties. Consider that the original log likelihood of the data is a function of the parameters $\boldsymbol\phi$ and we want to find its maximum. For any fixed $\boldsymbol\theta$, the ELBO is still a function of the parameters $\boldsymbol\phi$, but one that must lie on or below the original log likelihood function. When we change $\boldsymbol\theta$ we modify this function and depending on our choice, it may move closer to or further from the log likelihood. When we change $\boldsymbol\phi$ we move along this function. These perturbations are illustrated in figure 6.

The ELBO is described as being *tight* when for a fixed value of $\boldsymbol\phi$ we choose parameters $\boldsymbol\theta$ so that the ELBO and the log likelihood function coincide. We can show that this happens when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ is equal to the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$ over the hidden variables. We start by expanding out the joint probability numerator of the fraction in the ELBO using the definition of conditional probability:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)Pr(\mathbf{x}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta)

\log\left[Pr(\mathbf{x}|\boldsymbol\phi)\right]d\mathbf{h} +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] -\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right].\label{eq:ELBOEvidenceKL} \tag{16}

\end{eqnarray}

This equation shows that the ELBO is the original log likelihood minus the Kullback-Leibler divergence $\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right]$ which will be zero when these distributions are the same. Hence the bound is tight when $q(\mathbf{h}|\boldsymbol\theta) =Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. Since the KL divergence can only take non-negative values it is easy to see that the ELBO is a lower bound on $\log[Pr(\mathbf{x} |\boldsymbol\phi)]$ from this formulation.
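The decomposition in equation 16 can be verified numerically in a case where everything is computable. The sketch below assumes a discrete latent variable (a two-component mixture of Gaussians with hypothetical parameters), so the integral over $\mathbf{h}$ becomes a sum over $k$; this is a swapped-in toy setting, not the non-linear latent variable model itself.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical two-component mixture and a single observation x.
weights, means, variances = [0.6, 0.4], [-1.0, 2.0], [1.0, 0.5]
x = 0.3

joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]  # Pr(x, h=k)
evidence = sum(joint)                                                            # Pr(x)
posterior = [j / evidence for j in joint]                                        # Pr(h=k | x)

def elbo(q):
    """Equation 15 with the integral over h replaced by a sum over k."""
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))

q = [0.5, 0.5]                                 # an arbitrary choice of q
lhs = elbo(q)                                  # ELBO computed directly
rhs = math.log(evidence) - kl(q, posterior)    # right-hand side of equation 16
tight = elbo(posterior)                        # equals log Pr(x) when q is the posterior
```

For the arbitrary `q` the ELBO sits strictly below the log evidence, and setting `q` to the posterior closes the gap.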

In the previous section we saw that the bound is tight when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. This observation is the basis of the *expectation maximization* (*EM*) algorithm. Here, we alternately (i) choose $\boldsymbol\theta$ so that $q(\mathbf{h}|\boldsymbol\theta)$ equals the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ and (ii) change $\boldsymbol\phi$ to maximize the lower bound (figure 7a). This is viable for models like the mixture of Gaussians where we can compute the posterior distribution in closed form. Unfortunately, for the non-linear latent variable model there is no closed form expression for the posterior distribution and so this method is inapplicable.

We've already seen two different ways to write the ELBO (equations 15 and 16). In fact, there are several more ways to re-express this function (see Hoffman & Johnson 2016). The one that is important for the VAE is:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

+ \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)}\right]d\mathbf{h}

\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\boldsymbol\theta), Pr(\mathbf{h})] \tag{17}

\end{eqnarray}

In this formulation, the first term measures the average agreement $Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)$ of the hidden variable and the data (reconstruction loss) and the second one measures the degree to which the auxiliary distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the prior. This formulation is the one that will be used in the variational autoencoder.

We have seen that the ELBO is tight when we choose the distribution $q(\mathbf{h}|\boldsymbol\theta)$ to be the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ but for the non-linear latent variable model, we cannot write an expression for this posterior.

The solution to the problem is to make a variational approximation: we just choose a simple parametric form for $q(\mathbf{h}|\boldsymbol\theta)$ and use this as an approximation to the true posterior. In this case we'll choose a normal distribution with parameters $\boldsymbol\mu$ and $\boldsymbol\Sigma$. This distribution is not always going to be a great match to the posterior, but will be better for some values of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ than others. When we optimize this model, we will be finding the normal distribution that is "closest" to the true posterior $Pr(\mathbf{h}|\mathbf{x})$ (figure 8). This corresponds to minimizing the KL divergence in equation 16.

Since the optimal choice for $q(\mathbf{h}|\boldsymbol\theta)$ was the posterior $Pr(\mathbf{h}|\mathbf{x})$ and this depended on the data example $\mathbf{x}$, it makes sense that our variational approximation should do the same and so we choose

\begin{equation}\label{eq:posterior_pred}

q(\mathbf{h}|\boldsymbol\theta,\mathbf{x}) = \mbox{Norm}_{\mathbf{h}}[g_{\mu}[\mathbf{x}|\boldsymbol\theta], g_{\sigma}[\mathbf{x}|\boldsymbol\theta]], \tag{18}

\end{equation}

where $\mbox{g}[\mathbf{x},\boldsymbol\theta]$ is a neural network with parameters $\boldsymbol\theta$ whose two outputs $g_{\mu}$ and $g_{\sigma}$ predict the mean and covariance of the normal variational approximation.

Finally, we are in a position to describe the variational autoencoder. We will build a network that computes the ELBO:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi]

= \int q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] \tag{19}

\end{equation}

where the distribution $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ is the approximation from equation 18.

The first term in equation 19 still involves an integral that we cannot compute, but since it represents an expectation, we can approximate it with a set of samples:

\begin{equation}

E[f[\mathbf{h}]] \approx \frac{1}{N}\sum_{n=1}^{N}f[\mathbf{h}^{*}_n],\tag{20}

\end{equation}

where $\mathbf{h}^{*}_{n}$ is the $n^{th}$ sample. In the limit, we might only use a single sample $\mathbf{h}^{*}$ from $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ as a very approximate estimate of the expectation and here the ELBO will look like:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &\approx& \log\left[ Pr(\mathbf{x}|\mathbf{h}^{*},\boldsymbol\phi) \right]- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})]. \tag{21}

\end{eqnarray}

The second term is just the KL divergence between the variational Gaussian $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) = \mbox{Norm}_{\mathbf{h}}[\boldsymbol\mu,\boldsymbol\Sigma]$ and the prior $Pr(\mathbf{h}) =\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]$. The KL divergence between two Gaussians can be calculated in closed form and for this case is given by:

\begin{equation}

\mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] = \frac{1}{2}\left(\mbox{Tr}[\boldsymbol\Sigma] + \boldsymbol\mu^T\boldsymbol\mu - D - \log\left[\mbox{det}[\boldsymbol\Sigma]\right]\right). \tag{22}

\end{equation}

where $D$ is the dimensionality of the hidden space.
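As a small worked instance of equation 22, the sketch below assumes a diagonal covariance, so the trace is a sum of variances and the log determinant is a sum of log variances. The diagonal restriction and the example values are assumptions made for illustration.

```python
import math

def kl_to_standard_normal(mu, var):
    """Equation 22 for a diagonal covariance whose diagonal entries are var[d]."""
    D = len(mu)
    trace = sum(var)                          # Tr[Sigma]
    mu_sq = sum(m * m for m in mu)            # mu^T mu
    log_det = sum(math.log(v) for v in var)   # log det[Sigma]
    return 0.5 * (trace + mu_sq - D - log_det)

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])      # q equals the prior
kl_example = kl_to_standard_normal([1.0, -0.5], [0.5, 2.0])  # q differs from the prior
```

The divergence is zero when the variational distribution coincides with the prior and grows as the mean moves away from zero or the variances move away from one.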

So, to compute the ELBO for a point $\mathbf{x}$ we first estimate the mean $\boldsymbol\mu$ and variance $\boldsymbol\Sigma$ of the posterior distribution $q(\mathbf{h}|\boldsymbol\theta,\mathbf{x})$ for this data point $\mathbf{x}$ using the network $\mbox{g}[\mathbf{x},\boldsymbol\theta]$. Then we draw a sample $\mathbf{h}^{*}$ from this distribution. Finally, we compute the ELBO using equation 21.
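The three steps above can be sketched for a one-dimensional toy problem. The `encoder` and `decoder` functions are hypothetical stand-ins for the networks $\mbox{g}[\mathbf{x},\boldsymbol\theta]$ and $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$; a real VAE would use deep networks and vector-valued variables.

```python
import math
import random

def encoder(x, theta):
    """Hypothetical stand-in for g[x, theta]: mean and variance of q(h | x, theta)."""
    w_mu, w_logvar = theta
    return w_mu * x, math.exp(w_logvar * x)   # exp keeps the variance positive

def decoder(h, phi):
    """Hypothetical stand-in for f[h, phi]: mean of the likelihood Pr(x | h, phi)."""
    return phi * h

def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_single_sample(x, theta, phi, sigma_sq, rng):
    mu, var = encoder(x, theta)                                  # step 1: predict mu and Sigma
    h_star = rng.gauss(mu, math.sqrt(var))                       # step 2: draw h* from q
    recon = log_normal_pdf(x, decoder(h_star, phi), sigma_sq)    # first term of equation 21
    kl = 0.5 * (var + mu * mu - 1.0 - math.log(var))             # equation 22 with D = 1
    return recon - kl                                            # step 3: equation 21

rng = random.Random(0)
estimate = elbo_single_sample(x=1.2, theta=(0.8, -0.5), phi=1.5, sigma_sq=0.25, rng=rng)
```

Because only one sample is used, repeated calls give noisy estimates; averaging over a mini-batch reduces this noise in practice.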

The architecture to compute this is shown in figure 9. Now it's clear why it is called a variational autoencoder. It is an autoencoder because it starts with a data point $\mathbf{x}$, computes a lower dimensional latent vector $\mathbf{h}$ from this and then uses this to recreate the original vector $\mathbf{x}$ as closely as possible. It is variational because it computes a Gaussian approximation to the posterior distribution along the way.

The VAE computes the ELBO as a function of the parameters $\boldsymbol\phi$ and $\boldsymbol\theta$. When we maximize this bound as a function of both of these parameters, we gradually move the parameters $\boldsymbol\phi$ to values that give the data a higher likelihood under the non-linear latent variable model (figure 7b).

In this section, we've described how to compute the ELBO for a single point, but actually we want to maximize its sum over all of the data examples. As in most deep learning methods, we accomplish this with stochastic gradient descent, by running mini-batches of points through our network.

You might think that we are done; we set up this architecture, then we allow PyTorch / Tensorflow to perform automatic differentiation via the backpropagation algorithm and hence optimize the cost function. However, there's a problem. The network involves a sampling step and there is no way to differentiate through this. Consequently, it's impossible to make updates to the parameters $\boldsymbol\theta$ that occur earlier in the network than this.

Fortunately, there is a simple solution; we can move the stochastic part into a branch of the network which draws a sample from $\mbox{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]$ and then use the relation

\begin{equation}

\mathbf{h}^{*} = \boldsymbol\mu + \boldsymbol\Sigma^{1/2}\epsilon, \tag{23}

\end{equation}

to draw from the intended Gaussian. Now we can compute the derivatives as usual because there is no need for the backpropagation algorithm to pass down the stochastic branch. This is known as the reparameterization trick and is illustrated in figure 10.
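A one-dimensional sketch of equation 23 is given below, where $\boldsymbol\Sigma^{1/2}$ reduces to a standard deviation `sigma`. The values of `mu` and `sigma` are hypothetical, and the derivatives are written out by hand rather than obtained from an automatic differentiation library.

```python
import random

def reparameterized_sample(mu, sigma, rng):
    """Equation 23 in 1D: h* = mu + sigma * eps with eps ~ Norm[0, 1]."""
    eps = rng.gauss(0.0, 1.0)       # the stochastic branch; does not depend on mu or sigma
    h_star = mu + sigma * eps       # deterministic in mu and sigma once eps is fixed
    dh_dmu, dh_dsigma = 1.0, eps    # derivatives exist because eps is treated as a constant
    return h_star, dh_dmu, dh_dsigma

rng = random.Random(0)
mu, sigma = 2.0, 0.5
draws = [reparameterized_sample(mu, sigma, rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

The draws have approximately the intended mean and variance, yet each one is a differentiable function of `mu` and `sigma`, which is what allows gradients to reach the encoder parameters.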

Variational autoencoders were first introduced by Kingma & Welling (2013). Since then, they have been extended in several ways. First, they have been adapted to other data types including discrete data (van den Oord *et* al. 2017, Razavi *et* al. 2019), word sequences (Bowman *et* al. 2015), and temporal data (Gregor & Besse 2018). Second, researchers have experimented with different forms for the variational distribution, most notably using normalizing flows which can approximate the true posterior much more closely than a Gaussian (Rezende & Mohamed 2015). Third, there is a strand of work investigating more complex likelihood models $Pr(\mathbf{x}|\mathbf{h})$. For example, Gulrajani *et* al. (2016) used an auto-regressive relation between output variables and Dorta *et* al. (2018) modeled the covariance as well as the mean.

Finally, there is a large body of work that attempts to improve the properties of the latent space. Here, one popular goal is to learn a *disentangled* representation in which each dimension of the latent space represents an independent real world factor. For example, when modeling face images, we might hope to uncover head pose or hair color as independent factors. These methods generally add regularization terms to either the posterior $q(\mathbf{h}|\mathbf{x})$ or the aggregated posterior $q(\mathbf{h}) = \frac{1}{I}\sum_{i=1}^{I}q(\mathbf{h}|\mathbf{x}_{i})$ so that the new loss function is

\begin{equation}

L_{new} = \mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] - \lambda_{1} \mathbb{E}_{Pr(\mathbf{x})}\left[\mbox{R}_{1}\left[q(\mathbf{h}|\mathbf{x}) \right]\right] - \lambda_{2} \mbox{R}_{2}[q(\mathbf{h})], \tag{24}

\end{equation}

where $\lambda_{1}$ and $\lambda_{2}$ are weights and $\mbox{R}_{1}[\bullet]$ and $\mbox{R}_{2}[\bullet]$ are functions of the posterior and aggregated posterior respectively. This class of methods includes the BetaVAE (Higgins *et* al. 2017), InfoVAE (Zhao *et* al. 2017) and many others (*e.g.*, Kim & Mnih 2018, Kumar *et* al. 2017, Chen *et* al. 2018).

VAEs have several drawbacks. First, we cannot compute the likelihood of a new point $\mathbf{x}$ under the probability distribution efficiently, because this involves integrating over the hidden variable $\mathbf{h}$. We can approximate this integral using a Markov chain Monte Carlo method, but this is very inefficient. Second, samples from VAEs are generally not perfect (figure 11). The naive spherical Gaussian noise model which is independent for each variable generally produces noisy samples (or overly smooth ones if we do not add in the noise).

In practice, training VAEs (particularly sequence VAEs) can be brittle. It's possible that the system converges to a local minimum in which the latent variable is completely ignored and the encoder always predicts the prior. This phenomenon is known as *posterior collapse* (Bowman *et* al. 2015). One way to avoid this problem is to only gradually introduce the second term in the cost function (equation 19) using an annealing schedule.

The VAE is an architecture to learn a probability model $Pr(\mathbf{x})$. This model can generate samples, and interpolate between them (by manipulating the latent variable $\mathbf{h}$) but it is not easy to compute the likelihood for a new example. There are two main alternatives to the VAE. The first is generative adversarial networks. These are good for sampling but their samples have quite different properties from those produced by the VAE. Similarly to the VAE they cannot evaluate the likelihood of new data points. The second alternative is normalizing flows, for which both sampling and likelihood evaluation are tractable.
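One simple form such a schedule could take is a linear warm-up of the weight on the KL term. The function names, the linear shape and the warm-up length below are all assumptions for illustration; other schedules are equally plausible.

```python
def kl_weight(step, warmup_steps):
    """Linear annealing: the weight rises from 0 to 1 over warmup_steps, then stays at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_objective(reconstruction_term, kl_term, step, warmup_steps):
    """First term of equation 19 minus the down-weighted KL term."""
    return reconstruction_term - kl_weight(step, warmup_steps) * kl_term

# Weight on the KL term at a few hypothetical training steps.
weights = [kl_weight(s, 1000) for s in (0, 250, 1000, 5000)]
```

Early in training the objective is dominated by reconstruction, which encourages the decoder to make use of the latent variable before the pull towards the prior is applied at full strength.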

In December 2019, many Borealis employees travelled to Vancouver to attend NeurIPS 2019. With almost 1500 accepted papers, there’s a lot of great work to sift through. In this post, some of our researchers describe the papers that they thought were especially important.

by Alex Radovic

Related Papers:

- Neural ordinary differential equations
- GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series
- Legendre memory units: continuous-time representation in recurrent neural networks

**What problem does it solve?** It is a new neural network architecture optimized for irregularly-sampled time series.

**Why is this important?** Real world data is often sparse and/or irregular in time. For example, sometimes data is recorded when a sensor updates in response to some external stimulus, or an agent decides to make a measurement. The timing of these data points can itself be a powerful predictor, so we would like a neural network architecture that is designed to extract that signal.

Zero padding and other approaches are sometimes used so that a simple LSTM or GRU can be applied. However, this makes for a less efficient and sparser data representation that those recurrent networks are not well equipped to deal with.

**The approach taken and how it relates to previous work:** Neural ODEs are only a year old but are paving the way for a number of fascinating applications. Neural ODEs reinterpret repeating neural network layers as approximations to a differential equation expressed as a function of depth. As pointed out in the original paper, a Neural ODE can also be a function of some variable of interest (e.g., time).

The original Neural ODE paper does touch on potential uses for time series data, and uses a Neural ODE to generate time series data. This paper describes a modified RNN that has a hidden state that changes both when a new data point comes in and as a function of time between observations. The architecture is a development of a well explored idea where the hidden state decays as some function of time between observations (Cao *et* al., 2018; Mozer *et* al., 2017; Rajkomar *et* al., 2018; Che *et* al., 2018). Now instead of a preset function, the hidden state between observations is the solution to a Neural ODE. Figure 1 shows how information is updated in an ODE-RNN in contrast to other common approaches.
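The alternation between continuous evolution and discrete updates can be sketched with a scalar hidden state. Everything concrete here is an assumption made for illustration: the dynamics function, the update cell, the fixed-step Euler solver and all parameter values are hypothetical stand-ins, not the components used in the paper.

```python
import math

def ode_dynamics(h, params):
    """Hypothetical stand-in for the learned dynamics network dh/dt = f(h)."""
    w, b = params
    return math.tanh(w * h + b)

def evolve(h, dt, params, n_steps=20):
    """Fixed-step Euler solve of dh/dt = f(h) across a gap of length dt."""
    step = dt / n_steps
    for _ in range(n_steps):
        h = h + step * ode_dynamics(h, params)
    return h

def rnn_update(h, x, params):
    """Hypothetical stand-in for the RNN cell applied when an observation arrives."""
    w_h, w_x = params
    return math.tanh(w_h * h + w_x * x)

def ode_rnn(times, values, ode_params, rnn_params):
    """Run over irregularly spaced observations and return the hidden state after each."""
    h, t_prev, states = 0.0, times[0], []
    for t, x in zip(times, values):
        h = evolve(h, t - t_prev, ode_params)   # state drifts between observations
        h = rnn_update(h, x, rnn_params)        # state jumps when an observation arrives
        states.append(h)
        t_prev = t
    return states

states = ode_rnn([0.0, 0.3, 1.7, 1.9], [1.0, 0.5, -0.2, 0.4], (-0.5, 0.1), (0.9, 1.2))
```

The long gap between the second and third observations changes the hidden state more than the short gap that follows, so the spacing of the data influences the representation directly.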

**Results:** They show state of the art performance at both interpolation and extrapolation on the MuJoCo simulation and PhysioNet datasets. The PhysioNet dataset is particularly exciting as it represents an important real world scenario consisting of patients' intensive care unit data. The extrapolation is particularly impressive as successful interpolation methods often don’t extrapolate well. Figure 2 shows on a toy dataset how using an ODE-RNN encoder rather than an RNN encoder leads to much better extrapolation in the ODE decoder.

Related Papers:

- On spectral clustering: analysis and an algorithm
- Sparse subspace clustering: algorithm, theory, and applications

**What problem does it solve?** High-dimensional data (e.g., videos, text) often lies on a low dimensional manifold. Subspace clustering algorithms such as the Sparse Subspace Clustering (SSC) algorithm assume that this manifold can be approximated by a union of lower dimensional subspaces. Such algorithms try to identify these subspaces and associate them with individual data points. This paper improves the speed and memory cost of the SSC algorithm while retaining theoretical guarantees by introducing an algorithm called Selective Sampling-based Scalable Sparse Subspace Clustering (S$^5$C).

**Why is this important?** The method scales to large datasets, which was a big practical limitation of the SSC algorithm. It also comes with theoretical guarantees and these empirically translate to improved performance.

**Previous Work:** The original SSC algorithm consisted of two steps: representation learning and spectral clustering. The first step learns an affinity matrix $\mathbf{W}$. Intuitively the $ij$-component of $\mathbf{W}$ encodes the "similarity" between points $i$ and $j$ (but with $W_{ii} = 0$ for all $i$). The matrix $\mathbf{W} = |\mathbf{C}| + |\mathbf{C}|^T$ is made sparse by imposing an $\ell_1$ norm regularizer on the objective functions:

\begin{equation}\label{eq:sccobjective}

\underset{\left(C_{j i}\right)_{j \in[N]} \in \mathbb{R}^{N}}{\operatorname{minimize}} \frac{1}{2}\left\|\mathbf{x}_{i}-\sum_{j \in[N]} C_{j i} \mathbf{x}_{j}\right\|_{2}^{2}+\lambda \sum_{j \in[N]}\left|C_{j i}\right|, \text { subject to } C_{i i}=0. \tag{1}

\end{equation}

where $\mathbf{x}_i \in \mathbb{R}^M$ is a data point in the dataset and the $N$ different objective functions (one for each $i$) determine the $N$ columns of $\mathbf{C}$.
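To see how the objective in equation 1 rewards reconstructing a point from others in the same subspace, the sketch below evaluates it for two candidate coefficient vectors. The three data points, the value of $\lambda$ and the candidate coefficients are hypothetical, and the code only evaluates the objective; it does not perform the minimization.

```python
def ssc_objective(X, i, c, lam):
    """Equation 1 for point i: 0.5 * ||x_i - sum_j c[j] x_j||^2 + lam * sum_j |c[j]|."""
    assert c[i] == 0.0, "the constraint C_ii = 0 must hold"
    M = len(X[i])
    recon = [sum(c[j] * X[j][m] for j in range(len(X))) for m in range(M)]
    residual = sum((X[i][m] - recon[m]) ** 2 for m in range(M))
    return 0.5 * residual + lam * sum(abs(cj) for cj in c)

# Hypothetical data: points 0 and 1 lie on the same line through the origin, point 2 does not.
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
same_subspace = ssc_objective(X, 0, [0.0, 0.5, 0.0], lam=0.1)   # reconstruct x_0 from x_1
other_subspace = ssc_objective(X, 0, [0.0, 0.0, 1.0], lam=0.1)  # reconstruct x_0 from x_2
```

Using the point from the same subspace gives zero reconstruction error and a small penalty, whereas the point from the other subspace cannot reduce the error at all.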

This $\ell_1$ regularizer should not affect the ability of $\mathbf{W}$ to minimize the unregularized objective if the data points lie in subspaces that are of lower dimensions than the original embedding space. The regularization will produce a $\mathbf{W}$ whose non-zero elements suggest that the linked data points are within the same subspace.

The second step of the SSC algorithm applies spectral clustering to $\mathbf{W}$. The eigenvectors of the Laplacian $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ are computed, where $\mathbf{D}$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. For $N_{c}$ clusters, the eigenvectors associated with the $N_{c}$ smallest non-zero eigenvalues are chosen, normalised, and stacked into a matrix $\mathbf{X} \in \mathbb{R}^{N\times N_c }$. Each row represents a data point and their cluster memberships are determined by further clustering in this $\mathbb{R}^{N_c}$ space using k-means.
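The construction of the Laplacian can be sketched for a tiny affinity matrix. The matrix below is a hypothetical example with two obvious groups, and the code stops at building $\mathbf{L}$; the eigenvector computation and k-means step are omitted.

```python
import math

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}, with D_ii equal to the sum of row i of W."""
    n = len(W)
    degree = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]   # assumes every point has a neighbour
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

# Hypothetical affinity matrix: points 0 and 1 are linked, as are points 2 and 3.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 2.0],
     [0.0, 0.0, 2.0, 0.0]]
L = normalized_laplacian(W)
```

The resulting matrix is block diagonal, with one block per group of linked points, and this block structure is what the eigenvectors expose to the subsequent clustering step.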

Many other approaches aim to improve SSC. Some algorithms like OMP (Dyer *et* al., 2013; You and Vidal 2015), PIC (Lin and Cohen, 2010), and nearest neighbor SSC (Park *et* al., 2014) retain theoretical guarantees but still suffer from scalability issues. Other fast methods such as EnSC-ORGEN (You *et* al., 2013), and SSSC (Peng *et* al., 2013) exist at the expense of theoretical guarantees and justification.

**Approach taken:** Each of the $N$ components of the objective function in equation 1 has $O(N)$ parameters. The minimization procedure can be done in a time and space complexity of $O(N^2)$, so this does not scale too well in $N$. Instead, S$^5$C aims to solve $N$ problems having $T$ parameters by selecting $T$ vectors that can be used to represent the rest of the dataset. The intuition is that data points from each subspace can be reconstructed from the span of a few selected vectors denoted by the set $\mathcal{S}$. The vectors in $\mathcal{S}$ are selected incrementally by stochastic approximation of a sub-gradient; a subsample $I\subset [N]$ (with $|I| \ll N$) of vectors from the full dataset is selected and used to estimate which data point in $[N]$ best improves the data representation spanned by $\mathcal{S}$. This step has $\mathcal{O}(\left|I\right|N)$ complexity. It is repeated $T$ times to build $\mathcal{S}$, giving a complexity of $\mathcal{O}(\left|I\right|TN)$ instead of $\mathcal{O}(N^3)$ to build the affinity matrix $\mathbf{W}$. This construction ensures that $\mathbf{W}$ only has $\mathcal{O}(N)$ non-zero elements and hence the eigenvector decomposition can also be done in $\mathcal{O}(N)$ time using orthogonal iteration.

**Results:** Figure 3 shows the linear increase in time as a function of dataset size. Figure 4 compares the clustering error on many datasets against other fast algorithms that lack theoretical guarantees.

Dataset | Nyström | dKK | SSC | SSC-OMP | SSC-ORGEN | SSSC | S$^5$C |
---|---|---|---|---|---|---|---|
Yale B | 76.8 | 85.7 | 33.8 | 35.9 | 37.4 | 59.6 | 39.3 (1.8) |
Hopkins 155 | 21.8 | 20.6 | 4.1 | 23.0 | 20.5 | 21.1 | 14.6 (0.4) |
COIL-100 | 54.5 | 53.1 | 42.5 | 57.9 | 89.7 | 67.8 | 45.9 (0.5) |
Letter-rec | 73.3 | 71.7 | / | 95.2 | 68.6 | 68.4 | 67.7 (1.3) |
CIFAR-10 | 76.6 | 75.6 | / | / | 82.4 | 82.4 | 75.1 (0.8) |
MNIST | 45.7 | 44.6 | / | / | 28.7 | 48.7 | 40.4 (2.3) |
Devanagari | 73.5 | 72.8 | / | / | 58.6 | 84.9 | 67.2 (1.3) |

^{Figure 4. Clustering error in $\%$. Error bars (if available) are in parentheses. Experiments where a time limit of 24 hours or memory limit of 16 GB was exceeded are denoted by $/$. }

Related Papers:

- Variational Inference with normalizing flows
- Density estimation using Real NVP
- The graph neural network model
- Graph attention networks
- GraphRNN: Generating realistic graphs with deep auto-regressive models

**What problem does it solve?** It introduces a new invertible graph neural network that can be used for supervised tasks such as node classification and unsupervised tasks such as graph generation.

**Why this is important?** Graph representation learning has diverse applications from bioinformatics to social networks and transportation. It is challenging due to diverse possible representations and complex structural dependencies among nodes. In recent years, graph neural networks have been the state-of-the-art model for graph representation learning. This paper proposes a new graph neural model with a smaller memory footprint, better scalability, and room for parallel computation. Additionally, this model can be used for graph generation.

**The approach taken and how it relates to previous work:** The model builds on both message passing in graph neural networks and normalizing flows. The idea is to adapt normalizing flows for node feature transformation and generation.

Given a graph with $N$ nodes and node features $\mathbf{H} \in \mathbb{R}^{N \times d_n}$, graph neural networks transform the raw features $\mathbf{H}$ to embedded features that capture the contextual and structural information of the graph. This transformation consists of a series of message passing steps, where step $t$ consists of i) message generation using function $\mathbf{M}_t[\bullet]$ and ii) updating the node features with the aggregated messages of the neighboring nodes using function $\mathbf{U}_t[\bullet]$:

\begin{align}\label{message-passing}

&\mathbf{m}_{t+1}^{(v)} = \mathbf{Agg}\left[\{\mathbf{M}_t[\mathbf{h}_{t}^{(v)}, \mathbf{h}_{t}^{(u)}]\}_{u \in \mathcal{N}_v}\right] \nonumber\\

&\mathbf{h}_{t+1}^{(v)} = \mathbf{U}_t[\mathbf{h}_{t}^{(v)}, \mathbf{m}_{t+1}^{(v)}]. \tag{2}

\end{align}
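To make equation 2 concrete, here is a minimal NumPy sketch of one message passing step. The specific choices are illustrative assumptions, not taken from the paper: $\mathbf{M}_t$ and $\mathbf{U}_t$ are single `tanh` layers on concatenated features, $\mathbf{Agg}$ is a sum, and the names `message_passing_step`, `W_msg` and `W_upd` are hypothetical.

```python
import numpy as np

def message_passing_step(H, adjacency, W_msg, W_upd):
    """One step of equation 2 with sum aggregation.

    H:         (N, d) node features h_t
    adjacency: (N, N) binary matrix; adjacency[v, u] != 0 means u is a neighbour of v
    W_msg:     (2d, d_m) weights of the message function M_t
    W_upd:     (d + d_m, d_out) weights of the update function U_t
    """
    N = H.shape[0]
    H_next = np.zeros((N, W_upd.shape[1]))
    for v in range(N):
        neighbours = np.flatnonzero(adjacency[v])
        # M_t: one message per neighbour u, computed from the pair (h_v, h_u)
        msgs = [np.tanh(np.concatenate([H[v], H[u]]) @ W_msg) for u in neighbours]
        # Agg: sum over the neighbourhood (zero message for isolated nodes)
        m_v = np.sum(msgs, axis=0) if msgs else np.zeros(W_msg.shape[1])
        # U_t: update node v from its own features and the aggregated message
        H_next[v] = np.tanh(np.concatenate([H[v], m_v]) @ W_upd)
    return H_next
```

The explicit loop over nodes keeps the correspondence with the equation visible; a practical implementation would vectorise it with sparse matrix products.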

Normalizing flows are generative models which find an invertible mapping $\mathbf{z}=\mathbf{f}[\mathbf{x}]$ to transform the data $\mathbf{x}$ to a latent variable $\mathbf{z}$ with a simple prior distribution (e.g., $\mbox{Norm}_{\mathbf{z}}[\mathbf{0}, \mathbf{I}]$). By sampling $\mathbf{z}$ and applying the inverse function $\mathbf{f}^{-1}[\bullet]$, we can generate data from the target distribution. This paper is based on the RealNVP which uses the mapping:

\begin{align}\label{realnvp_map}

\mathbf{z}^{(0)}&= \mathbf{x}^{(0)} \exp{\left[\mathbf{f}_1\left[\mathbf{x}^{(1)}\right]\right]} + \mathbf{f}_2\left[\mathbf{x}^{(1)}\right] \nonumber\\

\mathbf{z}^{(1)}&=\mathbf{x}^{(1)},\nonumber

\end{align}

where $\mathbf{x}^{(0)}$ and $\mathbf{x}^{(1)}$ are partitions of the input $\mathbf{x}$, and the output $\mathbf{z}$ is obtained by concatenating $\mathbf{z}^{(0)}$ and $\mathbf{z}^{(1)}$. The functions $\mathbf{f}_1[\bullet]$ and $\mathbf{f}_2[\bullet]$ are neural networks. RealNVP cascades these mapping functions with random partitioning at each step.
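The appeal of this mapping is that it can be inverted in closed form without inverting $\mathbf{f}_1$ or $\mathbf{f}_2$. The sketch below illustrates this with NumPy; the function names are hypothetical, and `f1` and `f2` stand for any functions that map $\mathbf{x}^{(1)}$ to the shape of $\mathbf{x}^{(0)}$.

```python
import numpy as np

def coupling_forward(x0, x1, f1, f2):
    """z0 = x0 * exp(f1(x1)) + f2(x1),  z1 = x1."""
    z0 = x0 * np.exp(f1(x1)) + f2(x1)
    return z0, x1

def coupling_inverse(z0, z1, f1, f2):
    """Recover (x0, x1) exactly; f1 and f2 are only ever evaluated, never inverted."""
    x1 = z1
    x0 = (z0 - f2(x1)) * np.exp(-f1(x1))
    return x0, x1

def coupling_log_det(x1, f1):
    """Log-determinant of the Jacobian: the sum of the log-scales f1(x1)."""
    return f1(x1).sum(axis=-1)
```

For example, with `f1 = lambda x1: np.tanh(x1 @ A1)` and `f2 = lambda x1: x1 @ A2` for arbitrary weight matrices `A1` and `A2` of matching shape, applying `coupling_inverse` to the output of `coupling_forward` returns the original inputs up to floating-point error.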

This paper extends normalizing flows to graphs by applying the RealNVP mapping functions to the node feature matrix $\mathbf{H}$ (figure 5a):

\begin{align}\label{realnvp_graph}

\mathbf{H}_{t+1}^{(0)}&= \mathbf{H}_{t}^{(0)} \exp{\left[\mathbf{F}_1\left[\mathbf{H}_{t}^{(1)}\right]\right]} + \mathbf{F}_2\left[\mathbf{H}_{t}^{(1)}\right] \nonumber\\

\mathbf{H}_{t+1}^{(1)}&=\mathbf{H}_{t}^{(1)},\nonumber

\end{align}

where the functions $\mathbf{F}_1[\bullet]$ and $\mathbf{F}_2[\bullet]$ can be any message-based transformation and are chosen here to be graph attention layers. In practice, an alternating pattern is used for consecutive operations.

In the supervised setting, the raw features are transformed through these layers to perform downstream tasks such as node classification. The resulting network is called *GRevNet*. Ordinary graph neural networks need to store the hidden states after each message passing step to do backpropagation. However, the reversible functions of GRevNet can save memory by reconstructing the hidden states in the backpropagation phase.

In addition, the graph normalizing flow can generate graphs via a two-step process. First, a permutation-invariant graph autoencoder is trained to encode the graph to continuous node embeddings $\mathbf{X}$ and use these to reconstruct the adjacency matrix (figure 5b). Here, the encoder is a graph neural network and the decoder is a fixed function that makes nodes adjacent if their respective rows of $\mathbf{X}$ are similar. Second, a graph normalizing flow is trained to map from $\mathbf{z} \sim \mbox{Norm}_{\mathbf{z}}[\mathbf{0},\mathbf{I}]$ to a target distribution of node embeddings $\mathbf{X} \in \mathbb{R}^{N \times d_e}$. We generate from this distribution and use the decoder to generate new adjacency matrices (figure 5c).

**Results:** In a supervised context, GRevNet is used to classify documents (on the Cora and Pubmed datasets) and protein-protein interactions (on the PPI dataset), and compares favorably with other approaches. In the unsupervised context, the GNF model is used for graph generation on two datasets, COMMUNITY-SMALL and EGO-SMALL, and is competitive with the popular GraphRNN.

by Jimmy Chen

Related Papers:

- Distinctive image features from scale-invariant keypoints
- L2-net: Deep learning of discriminative patch descriptor in Euclidean space

**What problem does it solve?** Keypoints are pixel locations where local image patches are quasi-invariant to camera viewpoint changes, photometric transformations, and partial occlusion. The goal of this paper is to detect keypoints and extract visual feature vectors from the surrounding image patches.

**Why this is important?** Keypoint detection and local feature description are the foundation of many applications such as image matching and 3D reconstruction.

**The approach taken and how it relates to previous work:** R2D2 proposes a three-branch network to predict keypoint reliability, repeatability and image patch descriptors simultaneously (figure 6). Repeatability is a measure of the degree to which a keypoint can be detected at different scales, under different illuminations, and with different camera angles. Reliability is a measure of how easily the feature descriptor can be distinguished from others. R2D2 proposes a learning process that improves both repeatability and reliability.

Figure 7 shows a toy example of repeatability and reliability that are predicted by R2D2 in two images. The corners of the triangle in the first image are both repeatable and reliable. The grid corners in the second image are repeatable but not reliable as there are many similar corners nearby.

**Results:** R2D2 is tested on the HPatches dataset for image matching. Performance is measured by mean matching accuracy. Figure 8 shows that R2D2 significantly outperforms previous work at nearly all error thresholds. R2D2 is also tested on the Aachen Day-Night dataset for camera relocalization. R2D2 achieves state-of-the-art accuracy with a smaller model size (figure 9). The paper also provides qualitative results and an ablation study.

Although the paper demonstrated impressive results, the audience raised concerns about the keypoint sub-pixel accuracy and computation cost for large images.

Method | #kpts | dim | #weights | 0.5m, 2° | 1m, 5° | 5m, 10° |
---|---|---|---|---|---|---|
RootSIFT [24] | 11K | 128 | - | 33.7 | 52.0 | 65.3 |
HAN+HN [30] | 11K | 128 | 2 M | 37.8 | 54.1 | 75.5 |
SuperPoint [9] | 7K | 256 | 1.3 M | 42.8 | 57.1 | 75.5 |
DELF (new) [32] | 11K | 1024 | 9 M | 39.8 | 61.2 | 85.7 |
D2-Net [11] | 19K | 512 | 15 M | 44.9 | 66.3 | 88.8 |
R2D2, $N$ = 16 | 5K | 128 | 0.5 M | 45.9 | 65.3 | 86.7 |
R2D2, $N$ = 8 | 10K | 128 | 1.0 M | 45.9 | 66.3 | 88.8 |

^{Figure 9. Results for Aachen Day-Night visual localization task. }


**What problem do they solve?** Both papers incorporate unlabeled data into adversarial training to improve adversarial robustness of neural networks.

**Why this is important?** We would like to be able to train neural networks in such a way that they are robust to adversarial attack. This is difficult, but we do not fully understand why. It could be that we need to use significantly larger models than we can currently train. Alternatively, it might be that we need a new training algorithm that has not yet been discovered. Another possibility is that adversarially robust networks have a very high sample complexity and so we just don't use enough data to train a robust model.

These two papers pertain to the latter sample complexity issue. They ask whether we can exploit additional unlabeled data to boost the adversarial robustness of a neural network. Since unlabeled data is relatively abundant this potentially provides a practical way to train adversarially robust models.

**The approach taken and how it relates to previous work:** We assume that each data-label pair $(\mathbf{x}, y)$ is sampled from distribution $\mathcal{D}$, and we are learning a model $Pr(y|\mathbf{x},\boldsymbol\theta)$ that predicts the probability of the label from the data and has parameters $\boldsymbol\theta$. The standard training objective is

\begin{equation}

\min_{\boldsymbol\theta}\left[ \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x},\boldsymbol\theta)\right]\right]\right] \tag{3}

\label{eq:clean_training}

\end{equation}

where $\mbox{xent}[\bullet, \bullet]$ is the cross-entropy loss.

When we train for adversarial robustness, we want the model to make the same prediction within a neighborhood of each input, and we train the model using the min-max formulation (Madry *et* al., 2017):

\begin{equation}

\min_{\boldsymbol\theta}\left[ \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\max_{\boldsymbol\delta\in B_\epsilon}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x}+\boldsymbol\delta,\boldsymbol\theta)\right]\right]\right]\right] \tag{4}

\label{eq:minmax_training}

\end{equation}

where $B_\epsilon$ is a ball with radius $\epsilon$. In other words, we minimize the maximum cross-entropy loss within a small neighborhood to achieve robustness.

TRADES improved adversarial training by separating the inner maximization term into a classification loss and a regularization loss:

\begin{equation}

\min_{\boldsymbol\theta} \left[\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x},\boldsymbol\theta)\right] + \frac{1}{\lambda}\max_{\boldsymbol\delta\in B_\epsilon}\left[\mbox{D}_{KL}\left[Pr(y|\mathbf{x},\boldsymbol\theta)|| Pr(y|\mathbf{x}+\boldsymbol\delta,\boldsymbol\theta)\right]\right]\right]\right] \tag{5}

\label{eq:trades_training}

\end{equation}

where $\mbox{D}_{KL}[\bullet||\bullet]$ is the Kullback--Leibler divergence, and $\lambda$ is a scalar weight.

Both Carmon *et* al. (2019) and Uesato *et* al. (2019) exploit the observation that the regularization term in TRADES doesn't need the true label $y$; it tries to make the label prediction similar, before and after the perturbation. This makes incorporating unlabeled data very easy: for unlabeled data, we only train on the regularization loss, whereas for labeled data, we train on both the classification loss and the regularization loss.
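The way labeled and unlabeled examples enter the TRADES objective can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not either paper's code: the inner maximization over $\boldsymbol\delta$ is assumed to have been carried out elsewhere (so `adv_logits` are the model outputs at $\mathbf{x}+\boldsymbol\delta$), unlabeled examples are marked with the hypothetical label `-1`, and the loss is averaged uniformly over all examples.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def semi_supervised_trades_loss(clean_logits, adv_logits, labels, lam):
    """Classification loss on labeled data plus KL regularization on all data.

    clean_logits, adv_logits: (N, K) model outputs at x and at x + delta
    labels: (N,) integer class labels, with -1 marking unlabeled examples
    lam: the scalar weight lambda of equation 5
    """
    log_p = log_softmax(clean_logits)   # log Pr(y | x)
    log_q = log_softmax(adv_logits)     # log Pr(y | x + delta)
    # Regularization term D_KL[p || q]; it needs no label
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    # Cross-entropy term, only where a true label is available
    xent = np.zeros(len(labels))
    idx = np.flatnonzero(labels >= 0)
    xent[idx] = -log_p[idx, labels[idx]]
    return float((xent + kl / lam).mean())
```

When the perturbed and clean predictions agree, the regularization term vanishes and only the labeled examples contribute to the loss.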

In addition to this formulation, Uesato *et* al. (2019) propose an alternative way of using unlabeled data. They first perform natural training on labeled data, and then use the trained model to generate labels $\hat y(\mathbf{x})$ for unlabeled data. Then they combine all the labeled and unlabeled data to perform min-max adversarial training as in equation 4.

Carmon *et* al. (2019) also propose a method to replace the maximization over a small neighborhood $B_\epsilon(\mathbf{x})$ with a larger additive noise sampled from $\mbox{Norm}_{\mathbf{x}}[\mathbf{0}, \sigma^2\mathbf{I}]$. This alternative is specifically designed for a certified $\ell_2$ defense via randomized smoothing (Cohen *et* al., 2019).

**Results:** After adding unlabeled data into adversarial training, robustness was improved by around 4%. As adversarial robustness needs to be evaluated systematically under different types of attacks and settings, we refer the reader to the original papers for details.

by Leo Long


**What problem does it solve?** Program synthesis aims to generate source code from a natural language description of a task. This paper presents a program synthesis approach ('Patois') that operates at different levels of abstraction and explicitly interleaves high-level and low-level reasoning at each generation step.

**Why is this important?** Many existing systems operate only at the low level of abstraction, generating one token of the target program at a time. On the other hand, humans constantly switch between high-level reasoning (e.g. list comprehension) and token-level reasoning when writing a program.

The system, called Patois, achieves this high/low-level separation by automatically mining common code idioms from a corpus of source code and incorporating them into the model used for synthesizing programs.

Moreover, we can use the mined code idioms as a way to leverage other unlabelled source code corpora since the amount of supervised data for program synthesis (i.e., paired source code and descriptions) is often limited and it is very time-consuming to obtain additional data.

**The approach taken and how it relates to previous work:** The system consists of two steps (figure 10). The goal of the first step is to obtain a set of frequent and useful program fragments. These are referred to as code idioms. Programs can be equivalently expressed as abstract syntax trees (ASTs). Hence, mining code idioms is treated as a non-parametric problem (Allamanis *et* al., 2018) and represented as inference over a probabilistic tree substitution grammar (pTSG).

The second step exploits these code idioms to augment the synthesis model. This model consists of a natural language encoder and an AST decoder (Yin and Neubig, 2017). At each step, the AST decoder has three possible actions. The first is to expand production rules defined in the original CFG of the source code language, which expands the sub-trees of the program AST. The second is to generate terminal nodes in the AST, such as reserved keywords and variable names. The third type of action is to expand the commonly used code idioms. They are hence effectively added to the output action space of the decoder at each step of generation. The resulting synthesis model is trained to maximize the log-likelihood of the action sequences that construct the right ASTs of the target programs given their natural language specifications.

**Results:** The paper presents experimental results on the Hearthstone and Spider datasets (figure 11). The results show a noticeable improvement of the Patois system over the baseline model, which does not take advantage of the mined code idioms. For a more qualitative analysis, figure 12 presents a few examples of the mined code idioms from the Hearthstone and Spider datasets, which correspond nicely to some of the high-level programming patterns for each language.

Model | Exact match | Sentence BLEU | Corpus BLEU |
---|---|---|---|
Baseline | 0.152 | 0.743 | 0.723 |
PATOIS | 0.197 | 0.780 | 0.766 |

^{Figure 11. Results on the Hearthstone dataset. }

def __init__(self) : super().__init__($\ell_0$ : str, $\ell_1$ : int , CHARACTER_CLASS.$\ell_3$ : id, CARD_RARITY.$\ell_4$ : id, $\ell_5^?$ ) |
$\ell_0$ : id = copy.copy($\ell_1$ : expr) class $\ell_0$ : id ($\ell_1$ : id) : def __init__(self): |
SELECT COUNT ( $\ell_0$ : col ), $\ell_1^*$ WHERE $\ell_2^*$ INTERSECT $\ell_4^?$ : sql EXPECT $\ell_5^?$ : sql WHERE $\ell_0$ : col = $terminal |

^{Figure 12. Examples of code idioms mined from the Hearthstone and Spider datasets. Adapted from Shin et al. (2019).}

Reinforcement learning (RL) can now produce super-human performance on a variety of tasks, including board games such as chess and Go, video games, and multi-player games such as poker. However, current algorithms require enormous quantities of data to learn these tasks. For example, OpenAI Five generates 180 years of gameplay data per day, and AlphaStar used 200 years of StarCraft II gameplay data.

It follows that one of the biggest challenges for RL is *sample efficiency*. In many realistic scenarios, the reward signals are sparse, delayed or noisy which makes learning particularly inefficient; most of the collected experiences do not produce a learning signal. This problem is exacerbated because RL simultaneously learns both the policy (i.e., to make decisions) and the representation on which these decisions are based. Until the representation is reasonable, the system will be unable to develop a sensible policy.

This article focuses on the use of *auxiliary tasks* to improve the speed of learning. These are additional tasks that are learned simultaneously with the main RL goal and that generate a more consistent learning signal. The system uses these signals to learn a shared representation and hence speed up the progress on the main RL task.

An auxiliary task is an additional cost-function that an RL agent can predict and observe from the environment in a self-supervised fashion. This means that losses are defined via surrogate annotations that are synthesized from unlabeled inputs, even in the absence of a strong reward signal.

Auxiliary tasks usually consist of estimating quantities that are relevant to solving the main RL problem. For example, we might estimate depth in a navigation task. However, in other cases, they are more general. For example, we might try to predict how close the agent is to a terminal state. Accordingly, they may take the form of classification and regression algorithms or alternatively may maximize reinforcement learning objectives.

We note that auxiliary tasks are different from model-based RL. Here, a model of how the environment transitions between states given the actions is used to support planning (Oh et al. 2015; Leibfried et al. 2016) and hence ultimately to directly improve the main RL objective. In contrast, auxiliary tasks do not directly improve the main RL objective, but are used to facilitate the representation learning process (Bellemare et al. 2019) and improve learning stability (Jaderberg et al. 2017).

Auxiliary tasks were originally developed for neural networks and referred to as *hints*. Suddarth & Kergosien (1990) argued that for the hint to be effective, it needs to have a "special relationship with the original input-output being learned." They demonstrated that adding auxiliary tasks to a minimal neural network effectively removed local minima.

The idea of adding supplementary cost functions was first used in reinforcement learning by Sutton et al. (2011) in the form of *general value functions* (GVFs). As the name suggests, GVFs are similar to the well-known value functions of reinforcement learning. However, instead of caring about environmental rewards, they consider other signals. They differ from auxiliary tasks in that they usually predict long term features. Hence, they employ summation across multiple time-steps similar to the state-value computation from rewards in standard RL.

Auxiliary tasks are naturally and succinctly implemented by splitting the last layer of the network into multiple parts (heads), each working on a different task. The multiple heads propagate errors back to the shared part of the network, which forms the representations that support all the tasks (Sutton & Barto 2018).

To see how this works in practice, we'll consider Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. 2016), which is a popular and representative actor-critic algorithm. The loss function of A3C is composed of two terms: the policy loss (actor), $\mathcal{L}_{\boldsymbol\pi}$, and the value loss (critic), $\mathcal{L}_{v}$. An entropy term $H[\boldsymbol\pi]$ for the policy $\boldsymbol\pi$ is also commonly added. This helps to discourage premature convergence to sub-optimal deterministic policies (Mnih et al. 2016). The complete loss function is given by:

\begin{equation}

\mathcal{L}_{\text{A3C}} = \lambda_v \mathcal{L}_{v} + \lambda_{\boldsymbol\pi} \mathcal{L}_{\boldsymbol\pi} - \lambda_{H} \mathbb{E}_{s \sim \boldsymbol\pi} \left[H[\boldsymbol\pi[s]]\right]\tag{1}

\end{equation}

where $s$ is the state and scalars $\lambda_{v},\lambda_{\boldsymbol\pi}$, and $\lambda_{H}$ weight the component losses.

Auxiliary tasks are introduced to A3C via the Unsupervised Reinforcement and Auxiliary Learning (UNREAL) framework (Jaderberg et al. 2017). UNREAL optimizes the loss function:

\begin{equation}

\mathcal{L}_{\text{UNREAL}} = \mathcal{L}_{\text{A3C}} + \sum_i \lambda_{AT_i} \mathcal{L}_{AT_i}\tag{2}

\end{equation}

that combines the A3C loss, $\mathcal{L}_{\text{A3C}}$, together with auxiliary task losses $\mathcal{L}_{AT_i}$, where $\lambda_{AT_i}$ are weight terms (Figure 1). For a single auxiliary task, the loss computation code might look like:

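```python
import torch
import torch.nn.functional as F

# Illustrative loss weights; tune these for the task at hand
CRITIC_LOSS_W, ACTOR_LOSS_W, ENT_LOSS_W, AUX_TASK_WEIGHT_LOSS = 0.5, 1.0, 0.01, 1.0

def loss_func_unreal(model, s, a, v_t, aux_task_loss):
    """A3C loss (equation 1) plus one weighted auxiliary task loss (equation 2).

    Assumes `values` and `v_t` share the shape (batch,) and that
    `aux_task_loss` is a scalar tensor computed from the auxiliary head.
    """
    # Compute logits from the policy head and values from the value head
    model.train()
    logits, values = model(s)
    # Critic loss computation
    td = 0.5 * (v_t - values)
    c_loss = td.pow(2)
    # Actor loss: log-probability of the taken action weighted by the detached TD error
    probs = F.softmax(logits, dim=1)
    m = torch.distributions.Categorical(probs)
    a_loss = -m.log_prob(a) * td.detach()
    # Policy entropy, which enters the loss with a negative sign
    entropy = -(F.log_softmax(logits, dim=1) * probs).sum(1)
    # A3C loss
    a3c_loss = (CRITIC_LOSS_W * c_loss + ACTOR_LOSS_W * a_loss - ENT_LOSS_W * entropy).mean()
    # Total loss including the auxiliary task
    return a3c_loss + AUX_TASK_WEIGHT_LOSS * aux_task_loss
```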

The use of auxiliary tasks is not limited to actor-critic algorithms; they have also been implemented on top of Q-learning algorithms such as DRQN (Hausknecht & Stone 2015). For example, Lample & Chaplot (2017) extended the DRQN architecture with another head used to predict game features. In this case, the loss is the sum of the standard DRQN loss and the cross-entropy loss of the auxiliary task.

We now consider five different auxiliary tasks that have obtained good results in various RL domains. We provide insights as to the applicability of these tasks.

Sutton et al. (2011) speculated:

"Suppose we are playing a game for which base terminal rewards are +1 for winning and -1 for losing. In addition to this, we might pose an independent question about how many more moves the game will last. This could be posed as a general value function."

The first part of the quote refers to the standard RL problem where we learn to maximize rewards (winning the game). The second part describes an auxiliary task in which we predict how many moves remain before termination.

Kartal et al. (2019) investigated this idea of *terminal prediction*. The agent predicts how close it is to a terminal state while learning the standard policy, with the goal of facilitating representation learning. Kartal et al. (2019) added this auxiliary task to A3C and named this A3C-TP. The architecture was identical to A3C, except for the addition of the terminal state prediction head.

The loss $\mathcal{L}_{TP}$ for the terminal state prediction is the mean squared error between the estimated closeness $\hat{y}$ to a terminal state of any given state and target values $y$ approximately computed from completed episodes:

\begin{equation}

\mathcal{L}_{TP}= \frac{1}{N} \sum_{i=0}^{N}(y_{i} - \hat{y}_{i})^2\tag{3}

\end{equation}

where $N$ represents the episode length during training. The target for the $i^{th}$ state is approximated with $y_{i} = i/N $ implying $y_{N}=1$ for the actual terminal state and $y_{0}=0$ for the initial state for each episode.

Kartal et al. (2019) initially used the actual current episode length for $N$ to compute the targets $y_{i}$. However, this delays access to the labels until the episode is over and did not provide significant benefit in practice. As an alternative, they approximate the current episode length by the *running average* of episode lengths computed from the most recent $100$ episodes, which provides a dense signal.*^{1}* This improves learning performance and is memory efficient for distributed on-policy deep RL, as CPU workers do not have to retain the computation graph until episode termination to compute the terminal prediction loss.
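The running-average targets and the loss of equation 3 might be computed as in the sketch below. The class and function names are hypothetical, and two details are assumptions made for illustration rather than choices taken from the paper: the `initial_length` guess used before any episode has finished, and clipping the target at 1 when an episode outlasts the running average.

```python
from collections import deque
import numpy as np

class TerminalPredictionTargets:
    """Targets y_i = i / N, with N approximated by a running average of episode lengths."""

    def __init__(self, window=100, initial_length=100.0):
        self.lengths = deque(maxlen=window)   # keeps only the most recent episodes
        self.initial_length = initial_length  # assumption: used before the first episode ends

    def expected_length(self):
        return float(np.mean(self.lengths)) if self.lengths else self.initial_length

    def target(self, step):
        # Available at every step, without waiting for the episode to finish
        return min(step / self.expected_length(), 1.0)

    def end_episode(self, length):
        self.lengths.append(length)

def terminal_prediction_loss(targets, predictions):
    """Mean squared error between targets y and predictions y_hat."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean((targets - predictions) ** 2))
```

Because the target depends only on the step index and past episode lengths, it can be produced inside each worker's rollout loop alongside the usual A3C quantities.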

Since terminal prediction targets are computed in a self-supervised fashion, they have the advantage that they are independent of reward sparsity or any other domain dynamics that might render representation learning challenging (such as drastic changes in domain visuals, which happen in some Atari games). However, terminal prediction is applicable only for episodic environments.

*Agent modeling* (Hernandez-Leal et al. 2019) is an auxiliary task that is designed to work in a multi-agent setting. It takes ideas from game theory and in particular from the concept of *best response*: the strategy that produces the most favorable outcome for a player, taking other players' strategies as given.

The goal of agent modeling is to learn other agents' policies while the learning agent simultaneously learns its own policy. For example, consider a game in which you face an opponent. Here, learning the opponent's behavior is useful to develop a strategy against it. However, agent modeling is not limited to opponents; it can also model teammates, and can be applied to an arbitrary number of them.

There are two main approaches to implementing the agent modeling task. The first uses the conventional approach of adding new heads for the auxiliary task to a shared network base, as discussed in previous sections. The second uses a more sophisticated architecture in which latent features from the auxiliary network are used as inputs to the main value/policy prediction stream. We consider each in turn.

In the first scheme, agent modeling by parameter sharing (AMS), agents share the same network base, but the outputs represent different agent actions (Foerster et al. 2017). The goal is to predict opponent policies as well as the standard actor and critic, with the key characteristic that the previous layers share parameters (Figure 2a).

This architecture builds on the concept of *parameter sharing* where the idea is to perform centralized learning.

The AMS architecture uses the loss function:

\begin{equation}

\mathcal{L}_{\text{AMS}}= \mathcal{L}_{\text{A3C}} + \frac{1}{\mathcal{N}} \sum_i^{\mathcal{N}} \lambda_{AM_i} \mathcal{L}_{AM_i}\tag{4}

\end{equation}

where $\lambda_{AM_i}$ is a weight term and $\mathcal{L}_{AM_i}$ is the auxiliary loss for opponent $i$:

\begin{equation}

\mathcal{L}_{AM_i}= -\frac{1}{M} \sum_j^M \sum_{k}^{K} a^j_{ik} \log [\hat{a}^j_{ik}]\tag{5}

\end{equation}

which is the cross entropy loss between the observed one-hot encoded opponent action, $\mathbf{a}^j_{i}$, and the prediction over opponent actions, $\hat{\mathbf{a}}^j_{i}$. Here $i$ indexes the opponents, $j$ indexes time for a trajectory of length $M$, and $k$ indexes the $K$ possible actions.
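Equations 4 and 5 can be sketched in NumPy as follows. The function names are hypothetical, opponent actions are passed as integer indices rather than one-hot vectors (which selects the same term of the inner sum), and the small `eps` inside the logarithm is an added numerical safeguard rather than part of the original formulation.

```python
import numpy as np

def agent_modeling_loss(observed_actions, predicted_probs, eps=1e-12):
    """Equation 5 for one opponent.

    observed_actions: (M,) integer action indices along a trajectory of length M
    predicted_probs:  (M, K) predicted distribution over the K opponent actions
    """
    M = len(observed_actions)
    # With one-hot targets, only the probability of the observed action survives
    picked = predicted_probs[np.arange(M), observed_actions]
    return float(-np.mean(np.log(picked + eps)))

def ams_loss(a3c_loss, opponent_losses, weights):
    """Equation 4: A3C loss plus the weighted average of the per-opponent losses."""
    weighted = [w * l for w, l in zip(weights, opponent_losses)]
    return a3c_loss + float(np.mean(weighted))
```

A uniform prediction over $K$ actions gives a loss of $\log K$ for that opponent, which is a convenient sanity check during training.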

*Policy features* (Hong et al. 2018) are intermediate *features* from the latent space that are used to predict the opponent policy. The second scheme, agent modeling by policy features (AMF), exploits these features to improve the main reward prediction.

In this architecture, convolutional layers are shared, but the fully connected layers are divided into two sections (Figure 2b). The first is specialized for the actor and critic of the learning agent and the second for the opponent policies. The intermediate opponent policy features, $\mathbf{h}_{i}$, from the second path are used to condition (via an element-wise multiplication) the computation of the actor and critic. The loss function is similar to that for AMS.

Note that both AMS and AMF need to observe the opponent's actions to generate the ground truth for the auxiliary loss function. This is a limitation, and further research is required to handle partially observable environments.

In the previous two sections, we considered auxiliary tasks that related to the structure of the learning (terminal prediction) and to other agents' actions (agent modeling). In this section, we consider predicting the reward received at the next time-step, an idea that seems quite natural in the context of RL. More precisely, given state sequence $\{\mathbf{s}_{t-3}, \mathbf{s}_{t-2}, \mathbf{s}_{t-1}\}$, we aim to predict the reward $r_t$. Note that this is similar to value learning with $\gamma=0$, so that the agent only cares about the immediate reward.

Jaderberg et al. (2017) formulated this task as multi-class classification with three classes: positive reward, negative reward, or zero. To mitigate data imbalance problems, the same number of samples with zero and non-zero rewards were provided during training.
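A minimal sketch of the three-way labeling and the balanced sampling is given below. The function names and the class numbering are hypothetical, and sampling with replacement from a buffer of stored rewards is an assumption made here so that rare non-zero rewards can still fill half of each batch.

```python
import numpy as np

def reward_class(r):
    """Map a scalar reward to a class: 0 = zero, 1 = positive, 2 = negative."""
    return 0 if r == 0 else (1 if r > 0 else 2)

def sample_balanced_indices(rewards, batch_size, rng):
    """Draw indices so that zero and non-zero rewards are equally represented.

    Assumes the buffer contains at least one zero and one non-zero reward.
    """
    rewards = np.asarray(rewards)
    zero_idx = np.flatnonzero(rewards == 0)
    nonzero_idx = np.flatnonzero(rewards != 0)
    half = batch_size // 2
    # Sampling with replacement lets a handful of non-zero rewards fill their share
    return np.concatenate([
        rng.choice(zero_idx, half),
        rng.choice(nonzero_idx, batch_size - half),
    ])
```

Here `rng` is a NumPy generator such as `np.random.default_rng(0)`. Each sampled index would identify the short state sequence that precedes the reward, and `reward_class` supplies the classification target.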

In general, data imbalance is a disadvantage of reward prediction. This is particularly troublesome for hard-exploration problems with sparse rewards. For example, in the Pommerman game, an episode can last up to 800 timesteps, and the only non-zero reward is obtained at episode termination. Here, class-balancing would require many episodes, and this is in contradiction with the stated goal of using auxiliary tasks (i.e., to speed up learning).

Mirowski et al. (2016) studied auxiliary tasks in a navigation problem in which the agent needs to reach a goal in first-person 3D mazes from a random starting location. If the goal is reached, the agent is re-spawned to a new start location and must return to the goal. The 8 discrete actions permitted rotation and acceleration.

The agent sees RGB images as input. However, the authors speculated that depth information might supply valuable information about how to navigate the 3D environment. Thus, one of the auxiliary tasks is to predict depth, which can be cast as a regression or as a classification problem.

Mirowski et al. (2016) performed different experiments and we highlight two of these. In the first, they considered using the auxiliary task directly as input to the network, instead of just using it for computing the loss. In the second, they considered where to add the auxiliary task within the network. For example, the auxiliary task module can be set just after the convolutional layers, or after the convolutional and recurrent layers (Figure 3).

The results showed that:

- Using depth as input to the CNN (not shown in above Figure) resulted in worse performance than when predicting the depth.
- Treating depth estimation as classification (discretizing over 8 regions) outperformed casting it as regression.
- Placing the auxiliary task after the convolutional and recurrent networks obtained better results than moving it before the recurrent layers.

The auxiliary tasks discussed so far have involved estimating various quantities. A *control task* actually tries to manipulate the environment in some way. Jaderberg et al. (2017) proposed *pixel control* auxiliary tasks. An auxiliary control task $c$ is defined by a reward function $r^{c}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ where $\mathcal{S}$ is the space of possible states and $\mathcal{A}$ is the space of available actions. Given a base policy $\pi$, and auxiliary control task with policy $\pi^c$, the learning objective becomes:

\begin{equation}

\mathcal{L}_{pc}= \mathbb{E}_{\pi}[R] + \lambda_{c} \mathbb{E}_{\pi^c}[R^{c}],\tag{6}

\end{equation}

where $R^c$ is the return obtained from the auxiliary control task and $\lambda_{c}$ is a weight. As for previous auxiliary tasks, some parameters are shared with the main task.

The system used off-policy learning; the data was replayed from an experience buffer and the system was optimized using an n-step Q-learning loss:

\begin{equation}

\mathcal{L}^c=\left(R_{t:t+n} + \gamma^n \max_{a'} Q^c(s',a',\boldsymbol\theta^{-})- Q^c(s,a,\boldsymbol\theta)\right)^2,\tag{7}

\end{equation}

where $\boldsymbol\theta$ and $\boldsymbol\theta^{-}$ are the current and previous parameters, respectively.
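As a small worked sketch, the loss for a single replayed transition can be computed as follows. We assume here that $R_{t:t+n}$ denotes the discounted sum of the next $n$ rewards, and that the Q-values are supplied as plain numbers rather than computed by a network; the function names are illustrative.

```python
def n_step_return(rewards, gamma):
    """Discounted sum of the n observed rewards (assumed meaning of R_{t:t+n})."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_q_loss(rewards, gamma, q_next_target, q_current):
    """Squared n-step TD error for one transition.

    rewards       : the n rewards observed after taking action a in state s
    q_next_target : Q^c(s', a') for every action a', from the previous parameters
    q_current     : Q^c(s, a) for the action taken, from the current parameters
    """
    n = len(rewards)
    target = n_step_return(rewards, gamma) + (gamma ** n) * max(q_next_target)
    return (target - q_current) ** 2

loss = n_step_q_loss(
    rewards=[0.0, 1.0], gamma=0.5, q_next_target=[2.0, 4.0], q_current=1.0
)
```

With these numbers the two-step return is 0.5, the bootstrapped term is 0.25 times 4, the target is 1.5 and the loss is 0.25.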

Jaderberg et al. (2017) noted that changes in the perceptual stream are important since they generally correspond to important events in an environment. Thus, the agent learns a policy for maximally changing the pixels in each cell of an $n \times n$ non-overlapping grid superimposed on the input image. The immediate reward in each cell was defined as the average absolute difference from the previous frame. Results show that these types of auxiliary task can significantly improve learning.
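The per-cell reward can be sketched as below. For brevity we assume grayscale frames stored as nested lists and parameterize the grid by the cell side length in pixels; both are choices made for this illustration rather than details taken from the paper.

```python
def pixel_control_rewards(prev_frame, frame, cell):
    """Average absolute pixel change in each cell of a non-overlapping grid.

    Frames are lists of rows of intensities; `cell` is the side length of
    each square cell in pixels.
    """
    height, width = len(frame), len(frame[0])
    rewards = []
    for top in range(0, height - cell + 1, cell):
        row = []
        for left in range(0, width - cell + 1, cell):
            total = 0.0
            for y in range(top, top + cell):
                for x in range(left, left + cell):
                    total += abs(frame[y][x] - prev_frame[y][x])
            row.append(total / (cell * cell))
        rewards.append(row)
    return rewards

prev_frame = [[0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]]
frame = [[4, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 2, 2],
         [0, 0, 2, 2]]
rewards = pixel_control_rewards(prev_frame, frame, cell=2)
```

Here the top-left cell receives a reward of 1.0 (one pixel changed by 4, averaged over four pixels) and the bottom-right cell receives 2.0; the other two cells receive zero.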

In this section we consider some unresolved challenges associated with using auxiliary tasks in deep RL.

We have seen that adding an auxiliary task is relatively simple. However, the first and most important challenge is to define a *good* auxiliary task. Exactly how to do this remains an open question.

As a step towards this, Du et al. (2018) devised a method to detect *when* a given auxiliary task might be useful. Their approach uses the intuition that an algorithm should take advantage of the auxiliary task when it is helpful for the main task and block it otherwise.

Their proposal has two parts. First, they determine whether the auxiliary task and the main task are related. Second, they modulate how useful the auxiliary task is with respect to the main task using a weight. In particular, they propose to detect when an auxiliary loss $\mathcal{L}_{aux}$ is helpful to the main loss $\mathcal{L}_{main}$ by using the cosine similarity between gradients of the two losses:

**Algorithm 1:** Use of auxiliary task by gradient similarity

**if** $\cos[\nabla_{\theta}\mathcal{L}_{main},\nabla_{\theta} \mathcal{L}_{aux}]\ge 0$ **then**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main} +\nabla_{\theta} \mathcal{L}_{aux}$

**else**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main}$

**end**

The goal of this algorithm is to avoid adding an auxiliary loss that impedes learning progress for the main task. This is sensible, but would ideally be augmented by a better theoretical understanding of the benefits (Bellemare et al. 2019).
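Algorithm 1 can be sketched in a few lines. This is an illustration under simplifying assumptions: the gradients are given as flat lists of numbers, the update is plain gradient descent with a hypothetical learning rate `lr`, and a zero gradient is treated as having zero similarity.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch only
    return dot / (norm_u * norm_v)

def gated_update(theta, grad_main, grad_aux, lr):
    """Add the auxiliary gradient only when it does not oppose the main one."""
    if cosine_similarity(grad_main, grad_aux) >= 0.0:
        step = [gm + ga for gm, ga in zip(grad_main, grad_aux)]
    else:
        step = list(grad_main)
    return [t - lr * s for t, s in zip(theta, step)]

theta = [1.0, 1.0]
# The auxiliary gradient agrees with the main gradient, so both are used.
helpful = gated_update(theta, grad_main=[1.0, 0.0], grad_aux=[0.5, 0.5], lr=0.1)
# The auxiliary gradient opposes the main gradient, so it is ignored.
harmful = gated_update(theta, grad_main=[1.0, 0.0], grad_aux=[-1.0, 0.5], lr=0.1)
```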

The auxiliary task is always accompanied by a weight, but it is not obvious how to choose its value. Ideally, the weight should give the auxiliary task enough importance to drive representation learning, but not so much that the main task is neglected.

Weights need not be fixed, and can vary over time. Hernandez-Leal et al. (2019) compared five different weight parameterizations for the agent modeling task. In some the weight was fixed and in others it decayed. Figure 4 shows that, in the same environment, this choice directly affects the performance.

It also appears that the optimal auxiliary weight is dependent on the domain. For the tasks discussed in this post:

- Terminal prediction used a weight of 0.5 (Kartal et al. 2019).
- Reward prediction used a weight of 1.0 (Jaderberg et al. 2017).
- Auxiliary control varied between 0.0001 and 0.01 (Jaderberg et al. 2017).
- Depth prediction chose from the set {1, 3.33, 10} (Mirowski et al. 2016).

In the last part of this article we discuss the two major benefits of using auxiliary tasks: improving performance and increasing robustness.

The main benefit of auxiliary tasks is to drive representation learning and hence improve the agent's performance; the system learns faster and achieves better performance in terms of rewards with an appropriate auxiliary task.

For example, when auxiliary tasks were added to A3C agents, scores improved in domains such as Pommerman (Figure 5), Q*bert (Figure 6a) and the Bipedal walker (Figure 6b). Similar benefits of auxiliary tasks have been shown in Q-learning style algorithms (Lample & Chaplot 2017; Fedus et al. 2019).

The second benefit is related to robustness, and we believe this has been somewhat under-appreciated. One problem with deep RL is the high variance over different runs. This can even happen in the same experiment while just varying the random seed (Henderson et al. 2018). This is a major complication because algorithms sometimes diverge (i.e., they fail to learn) and thus we prefer robust algorithms that can learn under a variety of different values for their parameters.

Auxiliary tasks have been shown to improve robustness of the learning process. In the UNREAL work (Jaderberg et al. 2017), the authors varied two hyperparameters, entropy cost and learning rate, and kept track of the final performance while adding different auxiliary tasks. The results showed that adding auxiliary tasks increased performance over a variety of hyperparameter values (Figure 7).

In this tutorial we reviewed auxiliary tasks in the context of deep reinforcement learning and we presented examples from a variety of domains. Auxiliary tasks have been used to accelerate and provide robustness to the learning process. However, there are still open questions and challenges such as defining what constitutes a good auxiliary task and forming a better theoretical understanding of how they contribute.

^{1}*Note that the RL agent does not have access to the time stamp or a memory, so it must predict its time relative to the terminal state afresh at each time step.*

Deep reinforcement learning (DRL) has had many successes on complex tasks, but is typically considered a black box. Opening this black box would enable better understanding and trust of the model which can be helpful for researchers and end users to better interact with the learner. In this paper, we propose a new visualization to better analyze DRL agents and present a case study using the Pommerman benchmark domain. This visualization combines two previously proven methods for improving human understanding of systems: saliency mapping and immersive visualization.

In part I of this tutorial we argued that few-shot learning can be made tractable by incorporating prior knowledge, and that this prior knowledge can be divided into three groups:

**Prior knowledge about class similarity:** We learn embeddings from training tasks that allow us to easily separate unseen classes with few examples.

**Prior knowledge about learning:** We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

**Prior knowledge of data:** We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.

We also discussed the family of methods that exploit prior knowledge of class similarity. In part II we will discuss the remaining two families that exploit prior knowledge about learning, and prior knowledge about the data respectively.

Perhaps the most obvious approach to few-shot learning would be transfer learning; we first find a similar task for which there is plentiful data and train a network for this. Then we adapt this network for the few-shot task. We might either (i) fine-tune this network using the few-shot data, or (ii) use the hidden layers as input for a new classifier trained with the few-shot data. Unfortunately, when training data is really sparse, the resulting classifier typically fails to generalize well.

In this section we'll discuss three related methods that are superior for the few-shot scenario. In the first approach ("learning to initialize"), we explicitly learn networks with parameters that can be fine-tuned with a few examples and still generalize well. In the second approach ("learning to optimize"), the optimization scheme becomes the focus of learning. We constrain the optimization algorithm to produce only models that generalize well from small datasets. Finally, the third approach ("sequence methods") learns models that treat the data/label pairs as a temporal sequence; these models learn an algorithm that takes this sequence and predicts missing labels for new data.

Algorithms in this class aim to choose a set of parameters that can be fine-tuned very easily to another task via one or more gradient learning steps. This criterion encourages the network to learn a stable feature set that is applicable to many different domains, with a set of parameters on top of these that can be easily modified to exploit this representation.

*Model agnostic meta-learning* or *MAML* (Finn *et* al. 2017) is a meta-learning framework that can be applied to any model that is trained with a gradient descent procedure. The aim is to learn a general model that can easily be fine-tuned for many different tasks, even when the training data is scarce.

The parameters $\boldsymbol\phi$ of this general model can be adapted to the $j^{th}$ task $\mathcal{T}_{j}$ by taking a single gradient step

\begin{equation}\label{eq:MAML_obj1}

\boldsymbol\phi_{j} = \boldsymbol\phi - \alpha \frac{\partial}{\partial \boldsymbol\phi} \mathcal{L}\left[\mathbf{f}[\boldsymbol\phi],\mathcal{T}_{j}\right], \tag{1}

\end{equation}

to create a task-specific set of parameters $\boldsymbol\phi_{j}$. Here, the network is denoted by $\mathbf{f}[\bullet]$ with parameters $\boldsymbol\phi$. The loss $\mathcal{L}[\bullet, \bullet]$ takes the model $\mathbf{f}[\bullet]$ and the task data $\mathcal{T}_{j}$ as parameters. The parameter $\alpha$ represents the size of the gradient step.^{1}

Our goal is that on average for a number of different tasks, the loss will be small with these parameters. The *meta-cost function* $\mathcal{M}[\bullet]$ encompasses this idea

\begin{equation}

\mathcal{M}[\boldsymbol\phi] = \sum_{j=1}^{J} \mathcal{L}\left[\mathbf{f}[\boldsymbol\phi_{j}],\mathcal{T}_{j}\right], \tag{2}

\end{equation}

where each set of parameters $\boldsymbol\phi_{j}$ is itself a function of $\boldsymbol\phi$ as given by equation 1. We wish to minimize this cost, which we can do by taking gradient descent steps

\begin{equation}\label{eq:MAML_obj2}

\boldsymbol\phi \leftarrow \boldsymbol\phi - \beta \frac{\partial}{\partial \boldsymbol\phi} \mathcal{M}[\boldsymbol\phi], \tag{3}

\end{equation}

where $\beta$ is the step size. This would typically be done in a stochastic fashion, updating the meta-cost function with respect to a few tasks at a time (figure 1a-b). In this way, MAML gradually learns parameters $\boldsymbol\phi$ which can be adapted to many tasks by fine tuning.

MAML has the disadvantage that both the meta-learning objective (equation 3) and the task learning objective within it (equation 1) contain gradients, and so we have to take gradients of gradients (via a Hessian matrix) to perform each update.

To improve the efficiency of learning, Finn *et* al. (2017) introduced *first order model agnostic meta learning* or *FOMAML*, which simply omits the second derivatives. Surprisingly, this did not impact performance. They speculate that this might be because ReLU networks are locally almost linear, and so the second derivatives are close to zero.
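To make the role of the second derivatives concrete, here is a toy sketch in which everything can be computed by hand. We assume scalar parameters and hypothetical tasks with loss $\frac{1}{2}(\phi - c_j)^2$, where $c_j$ is the optimum of task $j$. The inner step (equation 1) gives $\phi_j = \phi - \alpha(\phi - c_j)$, and the chain rule multiplies the outer gradient by $\partial\phi_j/\partial\phi = 1 - \alpha$, which contains the second derivative of the loss. The first-order variant replaces this factor with one.

```python
def inner_step(phi, c, alpha):
    """Equation 1 for the toy loss 0.5 * (phi - c)**2, whose gradient is (phi - c)."""
    return phi - alpha * (phi - c)

def meta_gradient(phi, centers, alpha, first_order=False):
    """Gradient of the meta-cost (equation 2) with respect to phi."""
    total = 0.0
    for c in centers:
        phi_j = inner_step(phi, c, alpha)
        outer_grad = phi_j - c  # gradient of the task loss evaluated at phi_j
        # d(phi_j)/d(phi) = 1 - alpha * (second derivative); the second
        # derivative of this toy loss is 1. The first-order variant drops it.
        jacobian = 1.0 if first_order else 1.0 - alpha
        total += outer_grad * jacobian
    return total

def meta_train(centers, alpha, beta, steps, first_order=False):
    phi = 0.0
    for _ in range(steps):
        phi -= beta * meta_gradient(phi, centers, alpha, first_order)  # equation 3
    return phi

centers = [-1.0, 2.0, 5.0]
phi_maml = meta_train(centers, alpha=0.1, beta=0.1, steps=200)
phi_fomaml = meta_train(centers, alpha=0.1, beta=0.1, steps=200, first_order=True)
```

On this toy problem both variants converge to the mean of the task optima, because the omitted factor only rescales the gradient. This need not hold for general models, where the second-derivative term can also change the direction of the update.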

Nichol *et* al. (2018) introduced a simpler variant of first order MAML which they called *Reptile*. As for MAML, the algorithm repeatedly samples tasks $\mathcal{T}_{j}$, and optimizes the global parameters $\boldsymbol\phi$ to create task specific parameters $\boldsymbol\phi_{j}$. Then it updates the global parameters using the rule:

\begin{equation}

\boldsymbol\phi \longleftarrow \boldsymbol\phi + \alpha (\boldsymbol\phi_{j}-\boldsymbol\phi). \tag{4}

\end{equation}

This is illustrated in figure 1c. One interpretation of this is that we are performing stochastic gradient descent on a task level rather than a data level.
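A minimal sketch of this procedure, again using hypothetical scalar tasks with loss $\frac{1}{2}(\phi - c_j)^2$ so that the gradients are available in closed form. The inner learning rate and number of inner steps are assumptions of this example.

```python
import random

def adapt(phi, c, inner_lr, inner_steps):
    """Run a few gradient steps on the toy task loss 0.5 * (phi - c)**2."""
    for _ in range(inner_steps):
        phi = phi - inner_lr * (phi - c)
    return phi

def reptile(centers, alpha, inner_lr, inner_steps, iterations, rng):
    phi = 0.0
    for _ in range(iterations):
        c = rng.choice(centers)            # sample a task
        phi_j = adapt(phi, c, inner_lr, inner_steps)
        phi = phi + alpha * (phi_j - phi)  # equation 4
    return phi

rng = random.Random(0)
centers = [-1.0, 2.0, 5.0]
phi = reptile(centers, alpha=0.05, inner_lr=0.1, inner_steps=5,
              iterations=2000, rng=rng)
```

Each update moves the global parameters a small fraction of the way towards the task-specific solution, so the result fluctuates around a compromise between the task optima rather than converging exactly.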

Jayathilaka (2019) improved this method by adding two thresholds $\beta$ and $\gamma$ as hyper-parameters. After $\beta$ steps, the gradient is pruned so that if the change in parameters $(\boldsymbol\phi_{j}-\boldsymbol\phi) < \gamma$ then no change is made. The logic of this approach is that the end of the meta-training procedure is over-learning the training tasks and hence this part of the regime is damped.

In the previous section we discussed algorithms that learn good starting positions for optimization. In this section we consider methods that learn the optimization algorithm itself. The idea is that these new optimization schemes will be constrained to produce models that generalize well when trained with few examples.

We will discuss two approaches. In the first we consider the learning rule as the cell-state update in a long short-term memory (LSTM) network; the LSTM is a model used to analyse sequences of data, and here these are sequences of optimization steps. In the second approach we frame the optimization updates in terms of reinforcement learning.

Ravi & Larochelle (2016) propose training an LSTM-based meta-learner, where the cell state represents the model parameters. This was inspired by their realization that the standard gradient descent update rule has a very similar form to the cell update in an LSTM. We'll review each in turn to make this connection explicit.

The gradient descent update rule for a model with parameters $\boldsymbol\phi$ is given by:

\begin{equation}\label{eq:ravi_gradient}

\boldsymbol\phi_{t} = \boldsymbol\phi_{t-1}-\alpha\cdot\mathbf{g}_{t-1}, \tag{5}

\end{equation}

where $t$ represents the time step, $\alpha$ is the learning rate and $\mathbf{g}_{t}$ is the gradient vector. The cell state update rule in an LSTM is given by:

\begin{equation}\label{eq:ravi_optimize}

\mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \mathbf{i}_{t}\odot \tilde{\mathbf{c}}_{t-1}, \tag{6}

\end{equation}

where the cell state $\mathbf{c}_{t}$ at time $t$ is updated based on i) previous cell state $\mathbf{c}_{t-1}$ moderated by the forget gate $\mathbf{f}_{t}$ and ii) the candidate value for the cell state $\tilde{\mathbf{c}}_{t-1}$ moderated by the input gate $\mathbf{i}_{t}$.

The mapping from equation 5 to 6 is now clear. The cell state of the LSTM takes the place of the parameters $\boldsymbol\phi$ and the candidate value for the cell state $\tilde{\mathbf{c}}_{t-1}$ takes the place of the gradient $\mathbf{g}_{t-1}$. For the gradient descent case, the forget gate $\mathbf{f}_{t} = \mathbf{1}$, and the input gate $\mathbf{i}_{t} = -\alpha\mathbf{1}$.
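This correspondence can be checked numerically. The sketch below implements equations 5 and 6 element-wise on plain lists (the function names are illustrative) and confirms that the two updates coincide when the gates take the values given above.

```python
def lstm_cell_update(c_prev, c_candidate, forget_gate, input_gate):
    """Equation 6 applied element-wise: c_t = f_t * c_{t-1} + i_t * candidate."""
    return [f * c + i * cand
            for f, c, i, cand in zip(forget_gate, c_prev, input_gate, c_candidate)]

def gradient_descent_update(phi_prev, grad, alpha):
    """Equation 5: phi_t = phi_{t-1} - alpha * g_{t-1}."""
    return [p - alpha * g for p, g in zip(phi_prev, grad)]

phi_prev = [0.5, -1.0, 2.0]
grad = [1.0, -2.0, 0.5]
alpha = 0.1

via_lstm = lstm_cell_update(
    c_prev=phi_prev,          # cell state plays the role of the parameters
    c_candidate=grad,         # candidate value plays the role of the gradient
    forget_gate=[1.0] * 3,    # f_t = 1
    input_gate=[-alpha] * 3,  # i_t = -alpha * 1
)
via_gd = gradient_descent_update(phi_prev, grad, alpha)
```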

Hence, Ravi & Larochelle (2016) propose representing the parameters of the models by the cell state of an LSTM and learning more general functions for the forget and input gates (figure 3). Each of these is a two-layer neural network that takes a vector containing the previous gradient, previous loss function, previous parameters and previous value of the gate.

At each step of the training, the LSTM sees a sequence that corresponds to iterative optimization of the parameters $\boldsymbol\phi$ for the $j^{th}$ task. The LSTM learns the update rule from these sequences by updating the parameters in the forget and input gates; the parameters of these networks are manipulated to select an update rule that tends to produce good generalization.

In practice, each parameter is updated separately, so that there is a different input and forget gate for each. Similarly to Adam, each parameter has a different learning rate, but now this learning rate is a complex function of the history of the optimization.

For the test task, the LSTM is run to provide gradient updates that incorporate prior knowledge from all of the training tasks and converge fast to a meaningful set of parameters without over-learning. Andrychowicz *et* al. (2016) present a similar scheme although this is not explicitly aimed at the few-shot learning situation.

Li & Malik (2016) observed that an optimization algorithm can be viewed as a Markov decision process. The state consists of the set of relevant quantities for optimization (current parameters, objective values, gradients, etc.). The action is the parameter update $\delta\boldsymbol\phi$ and so the policy is a probabilistic parameter update formula (figure 3).

Inspired by this observation, Li & Malik (2016) modeled the mean of the policy as a recurrent neural network that takes features relevant to the optimization (iterates, gradients and objective values from recent iterations) and the previous memory state, and outputs the action (parameter update).

As with the LSTM system above, this system learns how best to update the model parameters in unseen test tasks, based on experience gained from a diverse collection of training tasks.

Bello *et* al. (2017) developed a reinforcement learning system where the action consists of an optimization update rule in a domain specific language. Each rule consists of two operands, two unary functions to apply to the first and second operand respectively, and a binary function to combine their outputs:

\begin{equation}

\boldsymbol\phi \leftarrow \boldsymbol\phi + \alpha \cdot \mbox{b}\left[\mbox{u}_{1}[o_{1}], \mbox{u}_{2}[o_{2}]\right], \tag{7}

\end{equation}

where $\mbox{u}_{1}[\bullet]$ and $\mbox{u}_{2}[\bullet]$ are the unary functions, $o_{1}$ and $o_{2}$ are the operands and $\mbox{b}[\bullet]$ is the binary function. The term $\alpha$ represents the learning rate.

Examples of operands include the gradient, sign of gradient, random noise, and moving average of gradient. Unary functions include clipping, square rooting, exponentiating and identity. Binary functions include addition, subtraction, multiplication, and division. Many existing optimization schemes can be expressed in this language, including stochastic gradient descent, RMSProp and Adam.

The controller consists of a recurrent neural network which samples strings of length $5$, each of which represents a different rule. A child classification network is trained with this rule and the accuracy is fed back to change the parameters of the RNN so that it is more likely to output better rules.

Perhaps surprisingly, the system finds interpretable optimization rules; for example, the *powersign* update rule compares the sign of the gradient with the sign of its running average and adjusts the step size according to whether those values agree.
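The structure of equation 7 can be sketched as a tiny interpreter. The vocabulary below is a small illustrative subset chosen for this example (in particular, the `negate` function, the constant operand `one` and the `sign_agreement` operand are assumptions), and the second rule is written in the spirit of the sign-agreement idea rather than being a faithful reproduction of any published optimizer.

```python
import math

UNARY = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
    "exp": math.exp,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def sign(x):
    return (x > 0) - (x < 0)

def make_operands(grad, running_avg):
    return {
        "grad": grad,
        "one": 1.0,
        "sign_agreement": float(sign(grad) * sign(running_avg)),
    }

def apply_rule(phi, operands, rule, alpha):
    """Equation 7 for one scalar parameter; `rule` names (o1, o2, u1, u2, b)."""
    o1, o2, u1, u2, b = rule
    update = BINARY[b](UNARY[u1](operands[o1]), UNARY[u2](operands[o2]))
    return phi + alpha * update

# Gradient descent: multiply the negated gradient by the constant one.
sgd_rule = ("grad", "one", "negate", "identity", "mul")
# Scale the step by e when gradient and running average agree in sign,
# and by 1/e when they disagree.
sign_rule = ("sign_agreement", "grad", "exp", "negate", "mul")

agree = make_operands(grad=2.0, running_avg=0.5)
disagree = make_operands(grad=2.0, running_avg=-0.5)
phi_sgd = apply_rule(1.0, agree, sgd_rule, alpha=0.1)
phi_agree = apply_rule(1.0, agree, sign_rule, alpha=0.1)
phi_disagree = apply_rule(1.0, disagree, sign_rule, alpha=0.1)
```

Each rule is a tuple of five names, matching the five symbols that the controller emits per rule. Searching over such tuples is then a discrete problem, which is what makes a reinforcement learning controller a natural fit.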

In the previous two sections, we considered methods that find good initial parameters for fine tuning networks and methods that learn optimization rules that tend to produce good generalization. Both of these methods are obviously directly connected to the optimization process.

In this section, we introduce *sequence methods* for meta-learning. Sequence methods ingest the entire support set as a sequence of tuples $\{\mathbf{x}, y\}$ each containing a data example $\mathbf{x}$ and a label $y$. The last tuple consists of just a data example from the query set and the system is trained to predict the missing label. The parameters of the sequence model are updated so this prediction is consistently accurate over different tasks.

At first sight, this might seem unrelated to the previous methods, but consider the situation when we have already passed the support set into the system. From this perspective the situation is very similar to a standard network. The query example will be passed in, and the query label returned. Working backwards, we can think of passing each support set sequence as optimizing the network for this task, and the training of the sequence model itself (across many different tasks) as meta-learning of how to optimize the model for different tasks.

We'll consider two sequence methods. The first is based on a recurrent neural network (RNN) and the second uses an attention mechanism.

Santoro *et* al. (2016) introduced *memory augmented neural networks.* Their system is trained one task at a time, with each task considered as a sequence of data $\mathbf{x}$ and label $y$ pairs (figure 4). However, the label $y_{t}$ for the data example $\mathbf{x}_{t}$ at time $t$ is not provided until time $t+1$. Hence, the system must learn to classify the current example based on past information. The data is shuffled every time that a task is presented so that the network doesn't erroneously learn the sequence rather than the relation between the data and the label.
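The delayed-label presentation can be sketched as follows. The data layout (a list of `(x, y)` pairs with string placeholders for the images) and the use of `None` for the missing first label are assumptions of this illustration.

```python
import random

def make_episode(examples, rng):
    """Build one shuffled episode in which each label arrives one step late.

    Returns inputs of the form (x_t, y_{t-1}) and the targets y_t. The first
    input has None in place of a previous label.
    """
    order = list(examples)
    rng.shuffle(order)  # reshuffle every time the task is presented
    inputs, targets = [], []
    previous_label = None
    for x, y in order:
        inputs.append((x, previous_label))
        targets.append(y)
        previous_label = y
    return inputs, targets

rng = random.Random(0)
examples = [("img_a", 0), ("img_b", 1), ("img_c", 0), ("img_d", 1)]
inputs, targets = make_episode(examples, rng)
```

At each step the model sees the current example together with the label of the previous one, so it can only classify correctly by using what it has stored from earlier steps.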

The network consists of a controller which stores memories in a network and retrieves them to use for classification. In practice, the controller that is used to place the memories and retrieve them is an LSTM or a feed-forward network. Memories are retrieved based on a key computed from the data which is compared to every memory by cosine similarity; the retrieved memory is a weighted sum of all of the stored memories weighted by the soft-max transformed cosine similarities.
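The read operation described above can be sketched in a few lines. We assume the memory is a list of non-zero row vectors and that the key has already been computed by the controller; writing to memory and the controller itself are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(key, memory):
    """Return the softmax-weighted sum of the memory rows and the read weights."""
    weights = softmax([cosine(key, row) for row in memory])
    width = len(memory[0])
    read = [sum(w * row[d] for w, row in zip(weights, memory))
            for d in range(width)]
    return read, weights

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
read, weights = read_memory(key=[2.0, 0.0], memory=memory)
```

Because the weights come from a soft-max, every memory contributes to the read vector, with the memories most similar to the key contributing most.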

As the sequence for a new task is passed in, the system stores memories from the support set sequence and uses these to predict the subsequent labels. Over time, the memory content becomes more suited to the current task and classification improves. During meta-training, we learn the parameters of the controller so that this process works well on average over many tasks; it learns the algorithm for storing and retrieving memories.

Mishra *et* al. (2017) also described a sequence method, in which the system is trained to take a set of (data, label) tuples $\{\mathbf{x}, y\}$ and then predict the missing label for the last example. Their system is not recurrent, and takes the entire sequence of support data at once. The architecture is based on alternating causal temporal convolutions and soft-attention layers (figure 5). This allows the decision for the query example to depend on the previously observed pairs. They term this architecture the *simple neural attentive meta-learner* or SNAIL.

The final family of few-shot learning methods exploits prior knowledge about the process that generates the classes and their examples. There are two main approaches here. First, we can try to characterize a *generative model* of all possible classes with a small number of parameters. These parameters can then be learned from just a few examples and be used as the basis for discriminating new classes. Second, we can exploit knowledge of the data creation process to synthesize new examples and hence *augment* the dataset. We can then use this expanded dataset to train a conventional model. We consider each approach in turn.

We will describe two generative models. First we consider a model that is specialized to recognizing new classes of hand written characters. The structure of the model contains significant information about how images of characters are created and this is exploited to understand new types of character. Second, we consider a more generic generative model that learns how to generate families of data classes.

Lake *et* al. (2015) construct a hierarchical non-parametric generative model for hand-written characters using a probabilistic programming approach. At the first level, a set of primitives are combined into parts, each of which is a single pen-stroke. Multiple parts are then connected to create characters. This process is illustrated in figure 6a. At the second level, a realisation of the character is created by simulating the pen-strokes for that character under noisy conditions.

During the meta-learning process, the set of primitives are learned such that they can be combined to describe sets of unseen characters. The system has access to the actual pen-strokes for training which makes this learning easier. The support set of the test task is used to describe the new classes in terms of this fixed set of primitives. For a query in the test task, the posterior probability that it was generated from each of the character classes is computed.

The structure of the model (primitive pen strokes, and likely combinations) is hence prior information learned from previous sets of characters that can be exploited to discriminate unseen classes.

Edwards & Storkey (2016) presented a more generic model which they termed the *neural statistician* as it learns the statistics of both classes and examples in a dataset. This model can generate new examples of classes and examples of any data and contains no prior information about the generation process (e.g., about pen-strokes or image transformations).

The model has a similar generative structure to the pen stroke model, but is based on the variational auto-encoder (figure 6b). At the top level is a context vector $\mathbf{c}$. A probability distribution over the context vector describes the statistics of a single class. In the context of few-shot learning, we might have $N$ context vectors indexed as $\mathbf{c}_{n}$ representing the $N$ classes. Each generates $K$ hidden variables $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ and each of these generates a data example $\mathbf{x}_{nk}$. In this way, a single task for the N-way-K-shot problem is generated.^{2}
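The generative hierarchy can be illustrated with a toy sampler. The Gaussian distributions and their scales below are placeholders chosen for this sketch; in the neural statistician the corresponding conditional distributions are parameterized by neural networks and learned from data.

```python
import random

def sample_task(n_way, k_shot, dim, rng):
    """Sample one N-way-K-shot task from a toy hierarchical Gaussian model.

    context c_n ~ N(0, 3^2), hidden z_nk ~ N(c_n, 1), data x_nk ~ N(z_nk, 0.1^2),
    applied independently to each of the `dim` coordinates.
    """
    task = []
    for n in range(n_way):
        context = [rng.gauss(0.0, 3.0) for _ in range(dim)]
        for _ in range(k_shot):
            hidden = [rng.gauss(c, 1.0) for c in context]
            example = [rng.gauss(z, 0.1) for z in hidden]
            task.append((example, n))
    return task

rng = random.Random(0)
task = sample_task(n_way=5, k_shot=2, dim=3, rng=rng)
```

Each class shares one context vector, so examples from the same class are correlated through it while remaining independent given the context.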

The support sets from the training tasks are used to learn the parameters of this model using a modification of the variational auto-encoder that allows inference of both context variables and hidden variables. For the test task, the support set is used to infer the context vectors and hidden variables that explain this dataset. Classification of the query set can be done by evaluating the probability that each data example was generated by the context vector for each class. As for the pen-stroke model, prior knowledge is accumulated in building the structure of the model during meta-learning, which means that unseen classes can be modelled effectively from only a few data examples.

Rezende *et* al. (2016) presented a related model that was also based on the VAE but differed in that (i) it was specialized to images and contained prior knowledge about image transformations (ii) generation was conditioned explicitly on a new class example, as opposed to inferring a hidden variable representing the class.

Hariharan & Girshick (2017) proposed a method for hallucinating new examples to augment datasets where few data examples are available. Their proposed approach is based on the intuition that learned intra-class variations are both transferable and generalize well to novel classes.

They assume that they have a large body of data with many examples per class from which they can learn about intra-class variation. They then exploit this knowledge to create extra examples in the few-shot test scenario. Their learning approach is based on analogy; if in the training data we observe embeddings $\mathbf{z}_{11}$ and $\mathbf{z}_{12}$ for class 1, then perhaps we can use the embedding $\mathbf{z}_{21}$ from class 2 to predict a new variation $\mathbf{z}_{22}$. In other words, we aim to answer the question "if $\mathbf{z}_{11}$ is to $\mathbf{z}_{12}$ then $\mathbf{z}_{21}$ is to what?" (figure 7).

This analogy task is performed using a multi-layer perceptron that takes $\mathbf{z}_{11}$, $\mathbf{z}_{12}$ and $\mathbf{z}_{21}$ and predicts $\mathbf{z}_{22}$. They learn this network from quadruplets of features taken from training tasks with plentiful data.^{3} The loss function encourages both accurate prediction of the missing feature vector and also correct classification of the synthesized example. For few-shot test tasks, the data is augmented using this generator by analogy with the plentiful training classes and it is shown that this significantly improves performance.
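The criterion for selecting training quadruplets (footnote 3) can be sketched as a simple filter. The sketch assumes the four embeddings are given as plain lists and that the two difference vectors are non-zero; the function name `is_valid_analogy` is hypothetical, and the multi-layer perceptron that is trained on the selected quadruplets is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_valid_analogy(z11, z12, z21, z22):
    """Keep a quadruplet when the two intra-class differences point the same way."""
    d1 = [a - b for a, b in zip(z11, z12)]
    d2 = [a - b for a, b in zip(z21, z22)]
    return cosine(d1, d2) > 0.0

# The same variation (a shift along the first axis) applied in both classes.
kept = is_valid_analogy([1.0, 0.0], [0.0, 0.0], [3.0, 1.0], [2.0, 1.0])
# The variation in class 2 points the opposite way, so the quadruplet is rejected.
rejected = is_valid_analogy([1.0, 0.0], [0.0, 0.0], [3.0, 1.0], [4.0, 1.0])
```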

Subsequently, Wang *et* al. (2018) proposed an end-to-end framework that jointly optimizes a meta-learner (such as a prototypical network) and a hallucinator (which generates additional training examples). Samples are hallucinated by a multi-layer perceptron that takes a training example and a noise vector and generates a new example. The new samples are added to the original training set and this augmented training set is used to learn the parameters of the classifier. The loss is back-propagated and the parameters of both the classification algorithm and the hallucinator are updated. A key notion here is that it is not the plausibility of the new examples that is important, but rather their ability to improve classification performance in a few-shot setting.

There is enormous diversity in approaches to meta-learning and few-shot learning, but there is currently no consensus on the best approach. Probably the most thorough empirical comparison was by Chen *et* al. (2019) (figure 8), but this mainly focuses on approaches that learn embeddings. It should also be noted that many of the approaches are complementary to one another and a practical solution would be to combine them.

Few-shot learning and meta-learning are likely to gain in importance as AI penetrates more specific problem domains where the cost of gathering data is too great to justify a brute force approach. They remain interesting open problems in artificial intelligence.

^{1 }*This update can be iterated, but we will describe a single update for simplicity of notation.*

^{2 }*In practice, the model is somewhat more complicated than this with a sequence of dependent hidden variables describing each data example.*

^{3 }*The analogies are actually learned from cluster centers of the data and are chosen so that the cosine similarity between $\mathbf{z}_{11}-\mathbf{z}_{12}$ and $\mathbf{z}_{21}-\mathbf{z}_{22}$ is greater than zero.*


Located at the heart of the University of Waterloo and housed in the new Evolv1 building, a leading space in sustainable design and the first ever zero-carbon office building in Canada, Borealis AI’s office features a unique design that draws inspiration from the campus life with a teacher’s lounge, a science lab, a track, and a field pitch.

It was a natural fit for Borealis AI to establish its fifth research lab in Waterloo, a city anchored by the University of Waterloo, a world-class institution, and flanked by a number of innovative AI start-ups and tech companies. Our Waterloo centre strengthened our existing ties to the city and its strong research community, dating back to Borealis AI’s early days in 2016. Borealis AI is a proud supporter of Waterloo.ai, the university’s artificial intelligence institute.

True to our vision of supporting Waterloo’s AI community, we are also pleased to announce Borealis AI’s support for the Leader’s Prize at True North, powered by COMMUNITECH. The Prize is “a national competition that challenges Canadian thinkers to solve a major societal or industry problem of global proportion and consequence.” Teams will compete in employing AI/ML to produce solutions that automate “the fact-checking process and flag whether a claim is true or false.” Professor Pascal Poupart, Principal Researcher at Borealis AI, will be heading up the scientific committee for the competition.


]]>Today, Borealis AI announced it will collaborate with MILA to support a machine learning research initiative on climate change.

Climate change is indisputably one of the biggest challenges of our time. Global temperature rise, glacial retreat, sea-level rise and extreme weather events are just a few examples of the impact that humans are having on Earth. While modern society is at the centre of this change, there is currently a disconnect between human responsibility and awareness. People have a hard time understanding how climate change affects them personally and what it means for their future.

MILA researchers, led by Prof Yoshua Bengio, have developed computer vision algorithms to personalize the effect of extreme weather events on locations of interest. Given an address, this machine learning model can generate a photo-realistic image that visualises the impact of extreme weather phenomena in that region, as predicted by a climate model associated with that geography. Generative models are used to synthesize hyper-personalized images that show flooding and other weather effects on your own home or street.

This project falls under MILA’s research portfolio on “AI for Humanity” which involves a number of projects that are socially responsible and beneficial to society.

]]>Humans can recognize new object classes from very few instances. However, most machine learning techniques require thousands of examples to achieve similar performance. The goal of *few-shot learning* is to classify new data having seen only a few training examples. In the extreme, there might only be a single example of each class (*one shot learning*). In practice, few-shot learning is useful when training examples are hard to find (e.g., cases of a rare disease), or where the cost of labelling data is high.

Few-shot learning is usually studied using *N-way-K-shot classification*. Here, we aim to discriminate between $N$ classes with $K$ examples of each. A typical problem size might be to discriminate between $N=10$ classes with only $K=5$ samples from each to train from. We cannot train a classifier using conventional methods here; any modern classification algorithm will depend on far more parameters than there are training examples, and will generalize poorly.

If the data is insufficient to constrain the problem, then one possible solution is to gain experience from other similar problems. To this end, most approaches characterize few-shot learning as a *meta-learning* problem.

In the classical learning framework, we learn how to classify from training data and evaluate the results using test data. In the meta-learning framework, we *learn how to learn* to classify given a set of *training tasks* and evaluate using a set of *test tasks* (figure 1). In other words, we use one set of classification problems to help solve other unrelated sets.

Here, each task mimics the few-shot scenario, so for N-way-K-shot classification, each task includes $N$ classes with $K$ examples of each. These are known as the *support set* for the task and are used for learning how to solve this task. In addition, there are further examples of the same classes, known as a *query set*, which are used to evaluate the performance on this task. The tasks can be completely non-overlapping; we may never see the classes from one task in any of the others. The idea is that the system repeatedly sees instances (tasks) during training that match the structure of the final few-shot task, but contain different classes.

At each step of meta-learning, we update the model parameters based on a randomly selected training task. The loss function is determined by the classification performance on the query set of this training task, based on knowledge gained from its support set. Since the network is presented with a different task at each time step, it must learn how to discriminate data classes in general, rather than a particular subset of classes.
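As a rough illustration of how one such training task might be drawn, the sketch below samples a single N-way-K-shot episode. The function name `sample_episode`, the dictionary layout of the data, and the number of query examples per class (`n_query`) are assumptions made for this example and do not come from any particular paper.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way-K-shot task from a dict {class_label: [examples]}.

    Returns (support, query), each a list of (example, episode_label) pairs.
    episode_label runs from 0 to n_way - 1 and is local to this task.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw K support and n_query query examples without overlap.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query

# Toy data: 20 hypothetical classes with 10 examples each.
rng = random.Random(0)
toy_data = {c: [f"class{c}_example{i}" for i in range(10)] for c in range(20)}
support, query = sample_episode(toy_data, n_way=5, k_shot=1, n_query=3, rng=rng)
print(len(support), len(query))
```

In a training loop, the model would be adapted or conditioned on `support`, and the loss would be computed on `query`. Test tasks can be drawn in the same way from a disjoint pool of classes.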

To evaluate few-shot performance, we use a set of test tasks. Each contains only unseen classes that were not in any of the training tasks. For each, we measure performance on the query set based on knowledge of their support set.

Approaches to meta-learning are diverse and there is no consensus on the best approach. However, there are three distinct families, each of which exploits a different type of prior knowledge:

**Prior knowledge about similarity:** We learn embeddings in training tasks that tend to separate different classes even when they are unseen.

**Prior knowledge about learning:** We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

**Prior knowledge of data:** We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.

An overview of these methods can be seen in figure 2. In this review, we will consider each family of methods in turn.

This family of algorithms aims to learn compact representations (embeddings) in which the data vector is mostly unaffected by intra-class variations but retains information about class membership. Early work focused on pairwise comparators which aim to judge whether two data examples are from the same or different classes, even though the system may not have seen these classes before. Subsequent research focused on multi-class comparators which allow assignment of new examples to one of several classes.

Pairwise comparators take two examples and classify them as belonging to either the same class or different classes. This differs from the standard N-way-K-shot configuration and does not obviously map onto the above description of meta-learning, although, as we will see later, there is in fact a close relationship.

Koch *et al.* (2015) trained a model that outputs the probability $Pr(y_a=y_{b})$ that two data examples $\mathbf{x}_{a}$ and $\mathbf{x}_{b}$ belong to the same class (figure 3a). The two examples are passed through identical multi-layer neural networks (hence the name *Siamese networks*) to create two embeddings. The component-wise absolute distance between the embeddings is computed and passed to a subsequent comparison network that reduces this distance vector to a single number. This is passed through a sigmoidal output, which classifies the pair as being the same or different, and the model is trained with a cross-entropy loss.
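The forward pass of such a pairwise comparator can be sketched in a few lines. This is a minimal illustration rather than the architecture of Koch *et al.*: the single tanh layer standing in for the deep embedding network, the linear comparison layer, and all parameter values are assumptions chosen to keep the example small.

```python
import math

def embed(x, weights):
    """Shared embedding network: one tanh layer (a stand-in for a deep net)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def same_class_probability(x_a, x_b, weights, comparison_weights, bias):
    """Pr(y_a = y_b) from the component-wise absolute distance of embeddings."""
    z_a, z_b = embed(x_a, weights), embed(x_b, weights)  # identical weights
    abs_diff = [abs(a - b) for a, b in zip(z_a, z_b)]
    logit = sum(w * d for w, d in zip(comparison_weights, abs_diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def cross_entropy(p_same, is_same):
    """Binary cross-entropy for one pair; is_same is 1 or 0."""
    return -math.log(p_same) if is_same else -math.log(1.0 - p_same)

# Hypothetical parameters: 3-D inputs, 2-D embeddings.
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
comparison_weights = [-2.0, -2.0]  # larger distance -> lower probability
bias = 1.0

p = same_class_probability([1.0, 0.0, 2.0], [1.0, 0.0, 2.0],
                           weights, comparison_weights, bias)
print(p)  # identical inputs give zero distance, so p equals sigmoid(bias)
print(cross_entropy(p, is_same=1))
```

Because both inputs go through the same `weights`, the comparator is symmetric in its two arguments, which is the property that the Siamese construction is designed to give.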

During training, each pair of examples is randomly drawn from a super-set of training classes. Hence, the system learns to discriminate between classes in general, rather than two classes in particular. In testing, completely different classes are used. Although this does not have the formal structure of the N-way-K-shot task, the spirit is similar.

Triplet networks (Hoffer & Ailon 2015) consist of three identical networks that are trained by triplets $\{\mathbf{x}_{+},\mathbf{x}_{a},\mathbf{x}_{-}\}$ of the form (positive, anchor, negative). The positive and anchor samples are from the same class, whereas the negative sample is from a different class. The learning criterion is *triplet loss* which encourages the anchor to be closer to the positive example than it is to the negative example in the embedding space (figure 3b). Hence it is based on two pairwise comparisons.
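The text above does not fix the exact form of the triplet loss, and several variants exist. The sketch below uses one common choice, a hinge on squared Euclidean distances with a margin; the margin value of 0.2 and the function names are assumptions for illustration only.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(z_pos, z_anchor, z_neg, margin=0.2):
    """Hinge-style triplet loss on embeddings.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (in squared Euclidean distance).
    """
    return max(0.0, squared_distance(z_anchor, z_pos)
                    - squared_distance(z_anchor, z_neg) + margin)

# Anchor near the positive and far from the negative: no penalty.
satisfied = triplet_loss([0.0, 0.1], [0.0, 0.0], [1.0, 1.0])
# Roles swapped: the anchor is far from the positive, so the loss is large.
violated = triplet_loss([1.0, 1.0], [0.0, 0.0], [0.0, 0.1])
print(satisfied, violated)
```

The loss combines the two pairwise comparisons (anchor to positive, anchor to negative) into a single term, so gradients pull the positive towards the anchor and push the negative away at the same time.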

After training, the system can take two examples and establish whether they are from the same or different classes by thresholding the distance in the learned embedding space. This was employed in the context of face verification by Schroff *et al.* (2015). This line of work is part of a larger literature on learning distance metrics (see Suarez *et al.* 2018 for an overview).

Pairwise comparators can be adapted to the N-way-K-shot setting by assigning the class for an example in the query set based on its maximum similarity to one of the examples in the support set. However, multi-class comparators attempt to do the same thing in a more principled way; here the representation and final classification are learned in an end-to-end fashion.
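The adaptation of a pairwise comparator to the N-way-K-shot setting amounts to a nearest-match rule, sketched below. The `similarity` argument stands for any trained comparator; the negative squared distance used here is only a placeholder, and the class names and coordinates are made up for the example.

```python
def classify_by_best_match(query, support, similarity):
    """Assign the label of the most similar support example.

    `support` is a list of (example, label) pairs and `similarity` is any
    pairwise comparator that returns larger values for more similar pairs.
    """
    _, best_label = max(support, key=lambda pair: similarity(query, pair[0]))
    return best_label

def negative_squared_distance(u, v):
    # Placeholder for a learned comparator such as a Siamese network.
    return -sum((a - b) ** 2 for a, b in zip(u, v))

support = [([0.0, 0.0], "cat"), ([1.0, 1.0], "dog"), ([0.0, 1.0], "bird")]
prediction = classify_by_best_match([0.9, 0.8], support, negative_squared_distance)
print(prediction)  # dog
```

The comparator here is trained separately from this decision rule, which is the gap that the end-to-end multi-class comparators aim to close.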

In this section, we'll use the notation $\mathbf{x}_{nk}$ to denote the $k$th support example from the $n$th class in the N-Way-K-Shot classification task, and $y_{nk}$ to denote the corresponding label. For simplicity, we'll assume there is a single query example $\hat{\mathbf{x}}$ and the goal is to predict the associated label $\hat{y}$.

Matching networks (Vinyals *et al.* 2016) predict the one-hot encoded query-set label $\hat{\mathbf{y}}$ as a weighted sum of all of the one-hot encoded support-set labels $\{\mathbf{y}_{nk}\}_{n,k=1}^{N,K}$. The weight is based on a computed similarity $a[\mathbf{x}_{nk},\hat{\mathbf{x}}]$ between the query example $\hat{\mathbf{x}}$ and each support example $\{\mathbf{x}_{nk}\}_{n,k=1}^{N,K}$:

\begin{equation}

\hat{\mathbf{y}} = \sum_{n=1}^{N}\sum_{k=1}^{K} a[\mathbf{x}_{nk},\hat{\mathbf{x}}]\mathbf{y}_{nk} \tag{1.1}

\end{equation}

where the similarities have been constrained to be positive and sum to one.

To compute the similarity $a[\mathbf{x}_{nk},\hat{\mathbf{x}}]$, they pass each support example $\mathbf{x}_{nk}$ through a network $\mbox{ f}[\bullet]$ to produce an embedding and pass the query example $\hat{\mathbf{x}}$ through a different network $\mbox{ g}[\bullet]$ to produce a different embedding. They then compute the cosine similarity between these embeddings (figure 5a)

\begin{equation}

d[\mathbf{x}_{nk}, \hat{\mathbf{x}}] = \frac{\mbox{ f}[\mathbf{x}_{nk}]^{T}\mbox{ g}[\hat{\mathbf{x}}]} {||\mbox{ f}[\mathbf{x}_{nk}]||\cdot||\mbox{ g}[\hat{\mathbf{x}}]||}, \tag{1.2}

\end{equation}

and normalise using a softmax function:

\begin{equation}

a[\mathbf{x}_{nk},\hat{\mathbf{x}}] = \frac{\exp[d[\mathbf{x}_{nk},\hat{\mathbf{x}}]]}{\sum_{n'=1}^{N}\sum_{k'=1}^{K}\exp[d[\mathbf{x}_{n'k'},\hat{\mathbf{x}}]]} \tag{1.3}

\end{equation}

to produce positive similarities that sum to one. This system can be trained end-to-end for the N-way-K-shot learning task.^{1} At each learning iteration, the system is presented with a training task; the predicted labels are computed for the query set (the calculation is based on the support set) and the loss function is the cross entropy of the ground truth and predicted labels.
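Equations 1.1 to 1.3 can be traced numerically with a short sketch. It assumes the embeddings $\mbox{ f}[\mathbf{x}_{nk}]$ and $\mbox{ g}[\hat{\mathbf{x}}]$ have already been computed; the embedding networks themselves are omitted, and the toy vectors and function names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matching_network_predict(support_embeddings, support_labels, query_embedding):
    """Equations 1.1-1.3 applied to precomputed embeddings.

    support_embeddings: list of f[x_nk] vectors
    support_labels:     list of one-hot label vectors y_nk
    query_embedding:    the vector g[x_hat]
    """
    # Equation 1.2: cosine similarity to every support example.
    d = [cosine_similarity(z, query_embedding) for z in support_embeddings]
    # Equation 1.3: softmax over all support examples.
    exp_d = [math.exp(value) for value in d]
    total = sum(exp_d)
    a = [value / total for value in exp_d]
    # Equation 1.1: weighted sum of the one-hot support labels.
    n_classes = len(support_labels[0])
    return [sum(a_i * y[c] for a_i, y in zip(a, support_labels))
            for c in range(n_classes)]

def cross_entropy(predicted, true_one_hot):
    return -sum(t * math.log(p) for t, p in zip(true_one_hot, predicted))

# 2-way-2-shot toy task with hypothetical 2-D embeddings.
support_embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
support_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_hat = matching_network_predict(support_embeddings, support_labels, [1.0, 0.2])
loss = cross_entropy(y_hat, [1, 0])
print(y_hat, loss)
```

Since the weights are positive and sum to one, `y_hat` is a valid distribution over the $N$ classes, and the cross-entropy `loss` is what would be back-propagated through both embedding networks.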

Matching networks compute similarities between the embeddings of each support example and the query example. This has the disadvantage that the algorithm is not robust to data imbalance; if there are more support examples for some classes than others (i.e., we have departed from the N-way-K-shot scenario), the classes with more support examples may dominate the weighted sum.

Prototypical networks (Snell *et al.* 2017) are robust to data imbalance by construction; they average the embeddings $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ of the examples for class $n$ to compute their mean embedding or *prototype* $\mathbf{p}_{n}$. They then use the similarity between each prototype and the query embedding (figures 4 and 5b) as a basis for classification.

The similarity is computed as a negative multiple of the Euclidean distance (so that larger distances now give smaller numbers). They pass these similarities to a softmax function to give a probability over classes. This model effectively learns a metric space where the average of a few examples of a class is a good representation of that class and class membership can be assigned based on distance.
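The prototype computation and the distance-based softmax can be sketched as follows. The example assumes precomputed embeddings and uses the squared Euclidean distance; the `scale` multiplier, the function names, and the toy data are assumptions for illustration.

```python
import math

def mean_vector(vectors):
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def prototypical_predict(support_by_class, query_embedding, scale=1.0):
    """Class probabilities from distances to class prototypes.

    support_by_class: list over classes; entry n holds the embeddings z_nk
    scale: hypothetical positive multiplier applied to the negative distance
    """
    prototypes = [mean_vector(embeddings) for embeddings in support_by_class]
    logits = [-scale * sum((q - p) ** 2 for q, p in zip(query_embedding, proto))
              for proto in prototypes]
    largest = max(logits)  # subtract the maximum for numerical stability
    exp_logits = [math.exp(value - largest) for value in logits]
    total = sum(exp_logits)
    return [value / total for value in exp_logits]

# Imbalanced toy task: class 0 has three support examples, class 1 has one.
support_by_class = [
    [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3]],
    [[1.0, 1.0]],
]
probs = prototypical_predict(support_by_class, [0.9, 0.8])
print(probs)
```

Each class contributes exactly one prototype to the softmax regardless of how many support examples it has, which is why the extra examples for class 0 do not outvote class 1 here.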

They noted that (i) the choice of distance function is vital as squared Euclidean distance outperformed cosine distance, (ii) having a higher number of classes in the support set helps to achieve better performance, and that (iii) the system works best when the support size of each class is matched in the training and test tasks.

Ren et al. (2018) extended this system to take advantage of additional unlabeled data which might be from the test task classes or from other distractor classes. Oreshkin et al. (2018) extended this approach by learning a task-dependent metric on the feature space, so that the distance metric changes from place to place in the embedding space.

Matching networks and prototypical networks both focus on learning the embedding and compare examples using a pre-defined metric (cosine and Euclidean distance, respectively). Relation networks (Santoro et al. 2016) additionally learn the metric used to compare the embeddings (figure 5c). Like prototypical networks, the relation network averages the embeddings of each class in the support set to form a single prototype. Each prototype is then concatenated with the query embedding and passed to a *relation module*. This is a learnable non-linear operator that produces a similarity score between 0 and 1, where 1 indicates that the query example belongs to the class of this prototype. This approach is clean and elegant and can be trained end-to-end.
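A relation module of this kind might look like the sketch below. The single ReLU layer followed by a sigmoid is an assumption chosen for brevity, and the parameters are arbitrary untrained values, so the printed scores only demonstrate the data flow and carry no meaning.

```python
import math

def relation_score(prototype, query_embedding, hidden_weights, output_weights):
    """Learnable comparison: concatenate, one ReLU layer, sigmoid output."""
    joint = prototype + query_embedding  # list concatenation
    hidden = [max(0.0, sum(w * x for w, x in zip(row, joint)))
              for row in hidden_weights]
    logit = sum(w * h for w, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # score strictly between 0 and 1

def relation_network_scores(support_by_class, query_embedding,
                            hidden_weights, output_weights):
    # Average the support embeddings of each class into one prototype.
    prototypes = [[sum(c) / len(embs) for c in zip(*embs)]
                  for embs in support_by_class]
    return [relation_score(p, query_embedding, hidden_weights, output_weights)
            for p in prototypes]

# Hypothetical untrained parameters for 2-D embeddings (4-D concatenation).
hidden_weights = [[0.5, -0.5, 0.5, -0.5],
                  [-0.3, 0.8, 0.1, 0.2],
                  [0.2, 0.2, -0.4, 0.6]]
output_weights = [1.0, -1.0, 0.5]
support_by_class = [[[0.0, 0.0], [0.2, 0.2]], [[1.0, 1.0], [0.8, 1.2]]]
scores = relation_network_scores(support_by_class, [0.9, 0.8],
                                 hidden_weights, output_weights)
print(scores)
```

In training, `hidden_weights` and `output_weights` would be learned jointly with the embedding network, so the comparison function itself adapts to the data instead of being fixed in advance.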

All of the pairwise and multi-class comparators are closely related to one another. Each learns an embedding space for data examples. In matching networks, there are different embeddings for support and query examples, but in the other models, they are the same. For prototypical networks and relation networks, multiple embeddings from the same class are averaged to form prototypes. Distances between support set embeddings/prototypes and query set embeddings are computed using either pre-determined distance functions such as Euclidean or cosine distance (triplet networks, matching networks, prototypical networks) or by learning a distance metric (Siamese networks and relation networks).

The multi-class networks have the advantage that they can be trained end-to-end for the N-way-K-shot classification task. This is not true for the pairwise comparators which are trained to produce a similarity or distance between pairs of data examples (which could itself subsequently be used to support multi-class classification).

Although it is not obvious how the pairwise comparators map to the meta-learning framework, it is possible to consider their data as consisting of minimal training and test tasks. For Siamese networks, each pair of examples is a training task, consisting of one support example and one query example, where their classes may not necessarily match. For triplet networks, there are two support examples (from different classes) and one query example (from one of the classes).

In part I of this tutorial we have described the few-shot and meta-learning problems and introduced a taxonomy of methods. We have also discussed methods that use a series of training tasks to learn prior knowledge about the similarity and dissimilarity of classes that can be exploited for future few-shot tasks. This knowledge takes the form of data embeddings that reduce within-class variance relative to between-class variance, and hence make it easier to learn from just a few data points.

In part II of this tutorial, we'll discuss methods that incorporate prior knowledge about how to learn models, and that incorporate prior knowledge about the data itself.

^{1} Vinyals et al. (2016) also introduced a novel *context embedding* method which took the full context of the support set $\mathcal{S}$ into account so that $\mbox{ g}[\bullet] = \mbox{ g}[\mathbf{x}, \mathcal{S}]$. Here, the support set was considered as a sequence and encoded by a bi-directional LSTM. Snell et al. (2017) later argued that this context embedding was problematic and redundant.