**School:** Perimeter Institute for Theoretical Physics, University of Waterloo.

**Research areas:** Theoretical physics.

**Research topic:** Machine learning for physics and physics for machine learning.

**School:** Machine Intelligence Institute and the Department of Computing Science, University of Alberta.

**Research areas:** Machine learning, deep learning and natural language processing.

**Research topic:** Towards empathetic conversational AI.

**School:** Concordia Institute for Information System Engineering (CIISE), Concordia University, Montreal.

**Research areas:** Deep learning in health domain applications.

**Research topic:** Designing optimal deep neural networks for hand gesture recognition and force prediction, and developing domain adaptation algorithms for time-domain features.

**School:** University of Toronto.

**Research areas:** Machine learning.

**Research topic:** The interplay between optimization, generalization and uncertainty in deep learning.

**School:** Centre for Intelligent Machines (CIM), McGill University.

**Research areas:** Computer vision and artificial intelligence.

**Research topic:** Towards building reliable deep neural networks.

**School:** University of British Columbia.

**Research areas:** Deep learning, chemistry, natural language processing, artificial intelligence, generative models, mass spectrometry.

**Research topic:** Automated discovery of unknown molecules using deep neural networks.

**School:** Mila, Université de Montréal.

**Research areas:** Optimization and deep learning.

**Research topic:** A dynamical systems perspective on game optimization.

**School:** Mila, Université de Montréal.

**Research areas:** Natural language processing and deep learning.

**Research topic:** Learning and modeling neural representations of text.

**School:** Artificial Intelligence and Algorithms laboratories, University of British Columbia.

**Research areas:** Stochastic processes, neural networks, and DNA computing.

**Research topic:** Mean first passage time and parameter estimation for continuous-time Markov chains.

**School:** McMaster University.

**Research areas:** Deep learning, watermarking, steganography, information-theoretic principles.

**Research topic:** New deep neural network architectures for blind image watermarking based on information-theoretic principles.

The ten Fellows won the awards for their outstanding research capabilities and represent leading universities and AI institutes from provinces across Canada.

These fellowships are part of Borealis AI’s commitment to support Canadian academic excellence in AI and machine learning. They provide financial assistance for exceptional domestic and international graduate students to carry out fundamental research as they pursue their masters and PhDs in various fields of AI. The program is one of a number of Borealis AI initiatives designed to strengthen the partnership between academia and industry and advance the momentum of Canada’s leadership in the AI space.

This year’s winners demonstrated exceptional talent, vision and passion for high-quality research. Backed by some of Canada’s leading AI professors, the projects range from applying AI in metabolomics and quantum physics to areas such as natural language processing, deep learning, uncertainty estimation, and computer vision.

Speaking about the program, Prof. Geoffrey Hinton, Chief Scientific Advisor at Vector Institute, said:

“Deep learning is poised to change the way we work and live and I am proud of the talent and caliber that our Universities have to offer. Canada is a top destination for research in machine learning globally and the Borealis AI Fellowships demonstrate the continuous support of the industry in that regard. Supporting students with the means to conduct their research is very important for our community.”

Foteini Agrafioti, Head of Borealis AI, said:

“AI was pioneered in Canada, and our universities have trained some of the most prolific experts in the world. There is a huge demand for AI expertise, and that is why we are committed to nurturing talent in this highly critical field in Canada. I’m impressed by the caliber of this year’s winners and am excited to provide them with additional resources to advance their research and kick-start their promising careers.”

Ibtihel Amara, a PhD student at McGill University at the Centre for Intelligent Machines (CIM), was awarded a fellowship this year for her work on addressing uncertainty and integrating reliability and trust into modern AI systems. Speaking about her award, Ibtihel said:

“The Borealis AI fellowship is important to me because it means I can focus completely on my research work. This award has also encouraged me to believe in the significance of my research, especially to be chosen among the best candidates in the AI research community. It compels me to achieve more and dream big!”

Click here to meet the class of 2020.

Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the "shortest successful perturbation", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness.

Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the "correct" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training with the perspective of margin maximization, and provide an alternative interpretation for adversarial training, maximizing either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness.

While sharing data between institutions and communities can help boost AI innovation, this practice runs the risk of exposing sensitive and private information about involved parties. Private and secure sharing of data is imperative and necessary for the AI field to succeed at scale.

Traditionally, a common practice has been to simply delete PII (personally identifiable information)—such as name, social insurance number, home address, or birth date—from data before it is shared. However, “scrubbing,” as it is often called, is no longer a reliable way to protect privacy, because the widespread proliferation of user-generated data has made it possible to reconstruct PII from scrubbed data. For example, in 2009 a Netflix user sued the company because her “anonymous” viewing history could be de-anonymized using other publicly available data sources, revealing her sexual orientation.

Differentially private synthetic data generation, built on the framework of differential privacy, presents an interesting solution to this problem. In a nutshell, this technology adds “noise” to sensitive data while preserving the statistical patterns from which machine learning algorithms learn, allowing data to be shared safely and innovation to proceed rapidly.

Differential privacy preserves the statistical properties of a data set—the patterns and trends that algorithms care about to drive insights or automate processes—while obfuscating the underlying data themselves. The key idea behind data generation is to mask PII by adding statistical noise. The noisy synthetic data can be shared without compromising users’ privacy but still yields the same aggregate insights as the raw, noiseless data. Compare it to a doctor sharing trends and statistics about a patient base without ever revealing individual patients’ specific details.
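To make the idea of adding statistical noise concrete, here is a minimal sketch of the classic Laplace mechanism, the simplest way to privately release an aggregate statistic such as a count. This is a generic illustration, not the specific technology any company uses; the function names and the sensitivity-1 counting query are our own assumptions.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon=1.0):
    """Release a counting query with Laplace noise calibrated to sensitivity 1.

    A count changes by at most 1 when one individual is added or removed,
    so noise of scale 1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller values of `epsilon` mean stronger privacy but noisier (less accurate) answers, which is exactly the trade-off discussed below.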

Many major technology companies are already using differential privacy. For example, Google has applied differential privacy in the form of RAPPOR, a novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users. Apple also applies statistical noise to mask users’ individual data.

Differential privacy is not a free lunch, however: adding noise makes ML algorithms less accurate, especially with smaller datasets. One way to offset this cost is for groups to join forces and safely leverage the combined size of their data to gain new insights. Consider a network of hospitals studying diabetes that needs to use patient records to construct early diagnostic techniques from their collective intelligence. Each hospital could analyze its own patient records independently; however, modern AI systems thrive on massive amounts of data, a reality that can only be practically achieved through large-scale merging of patient records. Differential privacy presents a way of achieving that through the sharing of synthetic data and the creation of a single, massive—but still privacy-preserving—dataset for scientists.

While differential privacy is not a universal solution, it bridges the gap between the need for individual privacy and the need for statistical insights – opening the doors to new possibilities.

- If we use greedy search or beam search to select the output tokens, then the model's output tends to stay in high probability regions and so the outputs are less diverse than realistic human speech.
- If we sample randomly from the predicted distribution over the output tokens then we get more diverse outputs, but can enter a degenerate state where phrases are repeated if a low probability token is chosen.

It is hypothesized that both of these phenomena are side-effects of using a training criterion that is based on predicting one word of the output sequence at a time and comparing to the ground truth.
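The two decoding behaviours above can be sketched as toy routines operating on a single step's output distribution. This is illustrative code, not any particular model's decoder; the temperature parameter in the sampler is our own addition, included because it interpolates between the two regimes.

```python
import random

def greedy_token(probs):
    """Greedy search step: always take the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_token(probs, temperature=1.0):
    """Random sampling step: draw from the (temperature-adjusted) distribution.

    temperature < 1 sharpens the distribution toward greedy behaviour;
    temperature = 1 samples the model's distribution as-is.
    """
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(probs) - 1
```

Greedy decoding always returns the mode (low diversity), while sampling occasionally picks low-probability tokens, which is what can trigger the degenerate repetition described above.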

In the second part of this tutorial (figure 2), we consider alternative training approaches that compare the complete generated sequence to the ground truth at the sequence level. We'll consider two families of methods; in the first, we take models that have been trained using the maximum likelihood criterion and fine-tune them with a sequence-level cost function -- we'll consider using both reinforcement learning and minimum risk training for this fine tuning. In the second family, we consider structured prediction approaches which aim to train the model from scratch using a sequence level cost function.

| Type of method | Training | Inference |
|---|---|---|
| Decoding algorithms | Maximum likelihood | greedy search |
| | Maximum likelihood | beam search |
| | Maximum likelihood | diverse beam search |
| | Maximum likelihood | iterative beam search |
| | Maximum likelihood | top-k sampling |
| | Maximum likelihood | nucleus sampling |
| Sequence-level fine-tuning | Fine-tune with reinforcement learning | greedy search / beam search |
| | Fine-tune with minimum risk training | beam search |
| | Scheduled sampling | greedy search / beam search |
| Sequence-level training | Beam search optimization | beam search |
| | SeaRNN | greedy search / beam search |
| | Reward augmented maximum likelihood | beam search |

Metrics that could provide a sequence-level comparison between the generated sentence and the ground truth include the BLEU (Papineni *et al.* 2002), ROUGE (Lin 2004), and METEOR (Banerjee *et al.* 2005) scores. The best known of these is the BLEU score, which compares the overlap of n-grams between the gold sequence and a generated sequence. This metric is not perfect because valid generated sentences that express the same ideas as the gold ones but with different words will be rated poorly. Despite its imperfections, the BLEU score has often been used to fine-tune language generation models.
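As a rough illustration of what BLEU measures, here is a minimal clipped n-gram precision, the core quantity that BLEU aggregates over several values of n. Real BLEU also combines these precisions with a geometric mean and a brevity penalty, which this sketch omits.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision between token lists: the fraction of candidate
    n-grams that also appear in the reference, with counts clipped so a
    repeated candidate n-gram cannot match more times than it occurs in the
    reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not cand:
        return 0.0
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return matches / len(cand)
```

The clipping is what penalizes degenerate outputs: "the the the" scores only 1/3 unigram precision against "the cat", even though every token appears in the reference.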

Unfortunately, the use of a sequence-level metric introduces a new problem. Unlike the maximum-likelihood criterion, the BLEU score is not easily differentiable. For a given output sentence it can be evaluated but there is no sense of how to smoothly change the sentence to improve it. The solution to this problem is to use reinforcement learning (RL) which, through its reward formalism, allows us to train a model to maximize a non-differentiable quantity. We will train the RL agent to output entire sequences (rather than just the next token) and learn from the feedback given from the BLEU reward function.

Let's briefly recap the reinforcement learning framework. In RL an agent performs actions in an environment and observes the consequences of these actions through (i) changes in the environment and (ii) a numerical reward which indicates whether the agent is achieving its task or not. More formally, an agent can choose actions from a set $\mathcal{A}=\{a_0,..,a_n\}$. At each time $t$ the agent makes an observation $\mathbf{o}_{t}$ and aggregates it with past observations into a state $\mathbf{s}_t$. The agent's behaviour is called a *policy* $\boldsymbol\pi[a|\mathbf{s}_{t}]$ and describes the probability of taking action $a$ in state $\mathbf{s}_{t}$. The agent's goal is to learn a policy which maximizes the expected sum of rewards at each state.

Now let's map the RL framework to neural natural language generation. An action consists of selecting the next token, so there are as many possible actions $a$ as there are words $y\in\mathcal{V}$ in the vocabulary. The policy $\boldsymbol\pi$ is a distribution over possible actions. Hence, we consider the policy to be the likelihood $Pr(y_{t}|\hat{\mathbf{y}}_{<t},\mathbf{x},\boldsymbol\phi)$ of producing different tokens $y_{t}$ given the previous tokens $\hat{\mathbf{y}}_{<t}$, the input sentence $\mathbf{x}$ and the model parameters $\boldsymbol\phi$. The reward is a computed score like BLEU which is only provided after generating the entire translated sentence; intermediate rewards are zero.^{1} The state $\mathbf{s}_{t}$ corresponds to the state of the decoder. Note that it's not clear what constitutes the environment in this description; we will return to this question later.

The REINFORCE algorithm (Williams 1992) is a policy gradient method that describes the policy $\boldsymbol\pi[a|s,\boldsymbol\phi]$ as a function of parameters $\boldsymbol\phi$ and then manipulates these parameters to improve the rewards received. Translating to the NNLG case, this means that we will manipulate the parameters $\boldsymbol\phi$ of the model (technically we backpropagate through the encoder as well) so that it produces sensible distributions over the words at each step, and the resulting BLEU scores are good.

More specifically, the REINFORCE algorithm maximizes the expected discounted sum of rewards when the actions $a$ are drawn from the policy $\boldsymbol\pi$:

\begin{equation}

J[\boldsymbol\phi] = \mathbb{E}_{a\sim \boldsymbol\pi\left[\boldsymbol\phi\right]}\left[\sum_{t=1}^L\gamma^{t-1}r[\mathbf{s}_t,\hat{y}_{t}]\right], \tag{1}

\end{equation}

where $L$ is the length of a decoding episode, and $r[\mathbf{s}_t,\hat{y}_{t}]$ is the reward obtained for generating word $\hat{y}_{t}$ when the decoder has hidden state $\mathbf{s}_{t}$. The term $\gamma\in(0,1]$ is the discount factor which weights rewards less the further they are into the future. For NNLG, we commonly only receive a reward at the end of the decoding when we compare to the BLEU score, and the intermediate rewards are all zero.

We need the derivative of this expression so we can change the parameters to increase this measure and this derivative is found using the *policy gradient theorem* (Sutton *et al.* 2000). The REINFORCE algorithm uses the resulting expression to estimate the derivative from a set of $I$ decoding results sampled with the current policy $\boldsymbol\pi[\boldsymbol\phi]$. The $i^{th}$ decoding result consists of series of tuples $\{\mathbf{s}_{i,t}, \hat{y}_{i,t}, r_{i,t}\}_{t=1}^{L_{i}}$ where each tuple consists of the decoder state $\mathbf{s}_{i,t}$ , the predicted tokens $\hat{y}_{i,t}$, and the rewards $r_{i,t}$ respectively. The estimated gradient is:

\begin{equation}

\nabla J[\boldsymbol\phi] \approx \frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \nabla_{\boldsymbol\phi} \log \left[\boldsymbol\pi\left[\hat{y}_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(Q^{\boldsymbol\pi}[\mathbf{s}_{i,t}, \hat{y}_{i,t}]- b[\mathbf{s}_{i,t}]\right) \tag{2}

\end{equation}

where the state-action value function $Q^{\boldsymbol\pi}[\mathbf{s}_{i,t}, \hat{y}_{i,t}]$ is the expected sum of rewards $\mathbb{E}_{\boldsymbol\pi}[\sum_{t'=t}^{T}\gamma^{t'-t} r[\mathbf{s}_{i,t'},\hat{y}_{i,t'}]|\mathbf{s}_{i,t},\hat{y}_{i,t}]$ after generating the token $\hat{y}_{i,t}$ given the state $\mathbf{s}_{i,t}$, and $b[\bullet]$ is a *baseline* function which helps reduce variance.

It is common that the state-action value function $Q^{\boldsymbol\pi}[\mathbf{s}_{i,t}, \hat{y}_{i,t}]$ is also estimated through Monte-Carlo roll-outs, which leads to:

\begin{equation}

\nabla J[\boldsymbol\phi] \approx \frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \nabla_{\boldsymbol\phi} \log \left[\boldsymbol\pi\left[\hat{y}_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(\sum_{t'=t}^T \gamma^{t'-t} r[\mathbf{s}_{i,t'}, \hat{y}_{i,t'}] - b[\mathbf{s}_{i,t}]\right). \tag{3}

\label{eq:reinforce}

\end{equation}
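To see the mechanics of the REINFORCE update concretely, here is a toy version for a one-step "episode" (a bandit with a terminal reward), with the policy parameterized directly by a vector of logits. This is a deliberately simplified sketch of the gradient estimator above, not the full NNLG setting, where the gradient would be backpropagated through a neural decoder over many time steps.

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, baseline=0.0, lr=0.1):
    """One REINFORCE update: sample an action from the policy, observe a
    terminal reward, and ascend grad log pi(action) * (reward - baseline)."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    advantage = reward_fn(action) - baseline
    # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi_k.
    new_logits = [l + lr * ((1.0 if k == action else 0.0) - probs[k]) * advantage
                  for k, l in enumerate(logits)]
    return new_logits, action
```

Repeating this update with a reward function that favours one action steadily concentrates the policy on that action, which is the same mechanism that pushes an NNLG model toward high-BLEU sequences.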

In practice, any RL fine-tuning approach must choose a decoding method. Wu *et al.* (2018) compare the performance of beam search and greedy search and find that RL fine-tuning seems to bring the largest improvement for greedy search. They explain this result in terms of the *exploration-exploitation trade-off*. At each time step, the agent can either apply only the actions that it thinks are most valuable (exploitation), or take different actions to increase its confidence in what it believes to be the best actions (exploration). Beam search exploits the current policy more efficiently than greedy search; greedy search, being more sub-optimal, permits more exploration of the sequence space.

RL fine-tuning has been shown to improve the performance of NNLG models for many tasks (Ranzato *et al.* 2017; Bahdanau *et al.* 2017; Strub *et al.* 2017; Das *et al.* 2017; Wu *et al.* 2018). However, the improvement over maximum likelihood training has been small (Wu *et al.* 2018) and although it is theoretically simple to move to an RL objective, in practice, it is not uncommon to have to resort to "tricks" to make it work (Bahdanau *et al.* 2017).

If RL is used as a solution to "fix" problems caused by log-likelihood training, then why not train a model with RL from scratch? Such training requires that the model outputs sequences, observes rewards, and updates its behaviour. However, it is highly unlikely that a randomly-initialized model would generate a sequence of tokens relevant to the sentence being translated. Consequently, the model would never observe any significant reward and would be unable to learn to improve.

A different approach might be to use *off-policy training* which observes trajectories from another model and uses these to learn a policy; it differs from on-policy training in that it learns without ever actually generating tokens itself. For machine translation, the "other model" is the human who generated the gold translations. The NMT model observes these translations as well as the associated rewards, and updates its policy to maximize the expected sum of rewards at each state. To account for the differences between the learned policy and the gold policy, we use importance sampling and Equation 3 becomes:

\begin{eqnarray} \label{eq:offpol}

\nabla J[\boldsymbol\phi] &\approx& \\

&&\hspace{-1.3cm}\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \frac{\boldsymbol\pi\left[y_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]}{q(y_{i,t}|\mathbf{s}_{i,t})}\nabla_{\boldsymbol\phi} \log \left[\boldsymbol\pi\left[y_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(\sum_{t'=t}^T \gamma^{t'-t} r[\mathbf{s}_{i,t'}, y_{i,t'}] - b[\mathbf{s}_{i,t'}]\right). \nonumber \tag{4}

\end{eqnarray}

where $q(y_{i,t}|\mathbf{s}_{i,t})$ is the probability that a human would generate token $y_{i,t}$ given state $\mathbf{s}_{i,t}$.

This seems convenient, but there is a problem here too. Since the human-generated translation is the gold standard, the reward is always the maximum BLEU score of 1. Consequently, the sum of rewards in Equation 4 is always 1 and the training objective reduces to the negative log-likelihood. In fact, this method really only differs from ML training when trajectories are of variable quality and have different rewards, or when rewards for partial sequences are available (Zhou *et al.* 2017; Kandasamy *et al.* 2017), and even then we have the additional complication of having to estimate $q[\bullet]$. Overall, training with RL from scratch is not well suited to the usual case where only good generations are available for learning.

Let's return to the question of what exactly constitutes the 'environment' when we frame natural language generation as RL. One of the difficulties of RL training is that the agent must interact with the unknown environment. For each action that the agent takes, it must wait for the environment to provide a new observation to be able to decide its next action. The agent can thus only produce one trajectory at a time.

However, for natural language generation the environment is the decoder itself: it outputs a word and then directly updates its state solely based on this word, without requiring any external input. Therefore, we can run the decoding process as many times as we want since there is no external environment that conditions the states of the decoder.

In this sense, natural language generation might be better framed as *structured prediction*; we treat it as a machine learning problem with multiple outputs that are mutually dependent. For machine translation, these dependencies represent the constraints between words in the output sentence that ensure that it is syntactically correct and semantically meaningful. When we consider natural language generation in this light, we can contemplate methods in which we roll-out as many trajectories as computationally-feasible and learn through sequence-level costs like BLEU. Further discussion of the relationship between RL and structured prediction can be found in Daume (2017), and Kreuzer (2018).

*Minimum risk training* (Shen *et al.* 2016) is a structured prediction technique that is practically very similar to the REINFORCE algorithm except that it does not consider only one generated translation of the input sentence at a time. Instead, it evaluates multiple possible translations $\{\hat{\mathbf{y}}_{j}\}$, computes a cost $\Delta[\hat{\mathbf{y}}_j, \mathbf{y}_i]$ for each, and weighs each cost by the probability $Pr(\hat{\mathbf{y}}_j|\mathbf{x}_i,\boldsymbol\phi)$ of its trajectory. The overall loss function $\mathcal{L}[\boldsymbol\phi]$ is the total cost:

\begin{equation}

\mathcal{L}[\boldsymbol\phi] = \sum_{i=1}^{I}\left(\frac{1}{Z_{i}}\sum_{\hat{\mathbf{y}}_j \in \mathcal{S}[\mathbf{x}_i]} Pr(\hat{\mathbf{y}}_j|\mathbf{x}_i,\boldsymbol\phi)^{\alpha} \Delta[\hat{\mathbf{y}}_j, \mathbf{y}_i] \right) \tag{5}

\end{equation}

where $\mathcal{S}[\mathbf{x}_i]$ is a selected subset of all possible translations of $\mathbf{x}_i$ and $\Delta[\hat{\mathbf{y}}_j, \mathbf{y}_i]$ computes the risk associated with $\hat{\mathbf{y}}_j$ in the form of a distance between the estimated translation $\hat{\mathbf{y}}_j$ and the gold translation $\mathbf{y}_i$ (e.g., the negative BLEU score). The exponent $\alpha$ controls the sharpness of the weightings of each possible translation and $Z_{i}$ is a normalizing factor that ensures that the exponentiated probabilities sum to one. Note that $\mathcal{S}[\mathbf{x}_i]$ is constructed to always contain the ground truth translation $\mathbf{y}_i$. The remaining candidate translations in this subset are generated by sampling from the model distribution.

This model is efficient, because it evaluates multiple trajectories at once and is a viable alternative to reinforcement learning for fine tuning after maximum likelihood pre-training (Shen *et al.* 2016).
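The weighting in Equation 5 can be sketched as follows: given the model's log-probabilities for a handful of candidate translations and their costs, we sharpen with $\alpha$, renormalize, and take the expected cost. The function names are our own, and a real implementation would differentiate this loss through the model rather than just evaluate it.

```python
import math

def mrt_weights(log_probs, alpha=0.005):
    """Normalized, alpha-sharpened candidate weights Pr^alpha / Z.

    Working with log-probabilities, Pr^alpha becomes exp(alpha * log Pr);
    subtracting the max before exponentiating keeps this numerically stable.
    """
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def mrt_loss(log_probs, costs, alpha=0.005):
    """Expected cost (risk) under the renormalized model distribution."""
    weights = mrt_weights(log_probs, alpha)
    return sum(w * c for w, c in zip(weights, costs))
```

A small `alpha` (as used by Shen *et al.*) flattens the weights so that many candidates contribute to the gradient, not just the single most probable one.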

Scheduled sampling tries to address the main problem with the maximum likelihood approach: the decoder is never exposed to its own outputs during the training procedure. The idea is simple: during training, the input to the decoder at time $t+1$ is the ground truth token $y_t$ with probability $\epsilon$ or the previously decoded token $\hat{y}_{t}$ with probability $1-\epsilon$. The probability $\epsilon$ is adjusted during training: it starts at 1 (where the model learns from the ground truth) and progressively decays to 0 (where the model only learns from its own outputs). This is a fine-tuning technique which progressively blends in model predictions.

This approach takes inspiration from the DAgger method in imitation learning (Ross *et al.*, 2011) in which an agent learns from both its own actions and the actions of an oracle which acts expertly. However, unlike imitation learning, scheduled sampling does not rely on a live oracle but instead uses the dataset (Figure 3). Consequently, the corrections provided to the model might not make sense given the model's errors: it optimizes an objective that does not guarantee good behaviour if trained until convergence (Huszár, 2015). Nonetheless, this method is empirically successful and outperforms RL on a paraphrase generation task (Du & Ji, 2019).
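The core of scheduled sampling fits in a few lines: a coin flip per time step chooses between the gold token and the model's own previous prediction, with the coin's bias decayed over training. The exponential decay schedule shown here is one common choice, not the specific schedule of any cited paper.

```python
import random

def decoder_inputs(gold_tokens, model_tokens, epsilon):
    """Per time step, feed the gold token with probability epsilon and the
    model's own previous prediction otherwise."""
    return [g if random.random() < epsilon else m
            for g, m in zip(gold_tokens, model_tokens)]

def epsilon_schedule(step, decay=0.999):
    """One common choice: exponential decay of epsilon from 1 toward 0."""
    return decay ** step
```

At `epsilon=1` this reduces to ordinary teacher forcing; at `epsilon=0` the decoder conditions entirely on its own outputs, matching the test-time regime.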

The preceding methods trained at the token level and then fine-tuned at the sequence level. The methods in this section combine searching and learning to directly train the model at the sequence level. In the next three sections, we consider beam search optimization, SeaRNN, and reward augmented maximum likelihood respectively.

Similarly to scheduled sampling, *Beam Search Optimization* also uses ground truth data as an oracle. However, unlike scheduled sampling, it tries to maintain semantic and syntactic correctness. It (i) uses a sequence-level cost function and (ii) maintains a beam of hypotheses during the learning procedure. At each point in time, the oracle ensures that the ground truth is among these hypotheses.

Let the score of a partial sequence at time $t$ be $\mbox{s}[w,\hat{\mathbf{y}}_{\leq t}, \mathbf{x}_i,\boldsymbol\phi]$, where $\boldsymbol\phi$ are the weights of the model, $w$ is a token in the vocabulary $\mathcal{V}$, $\hat{\mathbf{y}}_{\leq t}$ is the sequence decoded so far, and $\mathbf{x}_i$ is the input sentence to be translated. This score is the output of the decoder *before* passing through the softmax function. During decoding, the $K$ most highly scored sequences are retained. Let $\hat{\mathbf{y}}_{<t}^{(K)}$ be the $K$-th ranked sequence at time $t$, so that there are exactly $K-1$ sequences scored more highly than this. The loss function for a single sequence is as follows:

\begin{eqnarray}

\mathcal{L}[\boldsymbol\phi] &=& \sum_{t=1}^{T-1} \Delta\left[\hat{\mathbf{y}}^{(K)}_{\leq t}\right] \left(1 - \mbox{s}\left[w,\hat{\mathbf{y}}_{\leq t}^{(K)}, \mathbf{x},\boldsymbol\phi\right] + \mbox{s}\left[w,\mathbf{y}_{\leq t}, \mathbf{x},\boldsymbol\phi\right]\right) \\ \nonumber

&&\hspace{1cm}+ \Delta\left[\hat{\mathbf{y}}^{(1)}_{\leq T}\right] \left(1 - \mbox{s}\left[w,\hat{\mathbf{y}}_{\leq T}^{(1)}, \mathbf{x},\boldsymbol\phi\right] + \mbox{s}\left[w,\mathbf{y}_{\leq T}, \mathbf{x},\boldsymbol\phi\right]\right), \tag{6}

\end{eqnarray}

where $\Delta[\bullet]$ is a cost derived from the inverse BLEU score of the partial sequence $\hat{\mathbf{y}}^{(K)}_{\leq t}.$ The first term in the loss function encourages the model to score the ground truth translation $\mathbf{y}_{\leq t}$ higher than the $K$-th ranked sequence by a margin. The second term encourages the ground truth translation to be the most highly scored sequence at the last time step of decoding $T$. The function $\Delta[\bullet]$ is defined so that it returns 0 when there is no margin violation and a positive quantity otherwise.

To train this model, we need to update the set $\mathcal{S}_{t}$ of the $K$ most highly scored sequences at each time step $t$ and make sure that the ground truth sequence is in this set. To do this, Wiseman and Rush (2016) propose the following mechanism:

\begin{equation}

\mathcal{S}_t = \mbox{topK}

\begin{cases}

\mbox{succ}[\mathbf{y}_{<t}] & \text{if violation} \\

\bigcup\limits_{k=1}^K \mbox{succ}[\hat{\mathbf{y}}^{(k)}_{<t}] & \text{otherwise},

\end{cases} \tag{7}

\end{equation}

where $\mbox{succ}[w]$ returns the set of all valid sequences that can be formed by appending a token to the sequence $w$. If the gold sequence is in the top-$K$ results then beam search continues as normal. If the gold sequence falls off the beam, this is a *violation* and the model now uses the ground truth as prefix for subsequent generation (figure 4).

In this way, beam search optimization trains the model to output sequences where the ground truth sequence is always the most highly scored one at the end of decoding and is among the top $K$ sequences during decoding. The mistake-specific scoring function $\Delta$ takes into account a sequence-level cost during training.
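A single margin term of Equation 6, together with the violation test that drives the beam update in Equation 7, can be sketched as below. Real beam search optimization applies this over whole beams with learned neural scores; the function and argument names here are illustrative.

```python
def margin_violation(gold_score, kth_score, margin=1.0):
    """True when the gold prefix fails to beat the K-th ranked hypothesis
    by the required margin (i.e., the gold sequence is falling off the beam)."""
    return gold_score < kth_score + margin

def bso_term(delta, kth_score, gold_score):
    """One hinge-style term of the loss: zero when the gold prefix wins by the
    margin, and a cost-weighted penalty (weighted by delta) otherwise."""
    return delta * max(0.0, 1.0 - gold_score + kth_score)
```

When `margin_violation` fires during training, the search is restarted from the gold prefix, which is exactly the mechanism of Equation 7.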

SeaRNN (LeBlond *et al.* 2018) searches over all possible next tokens, taking into account the subsequent completion of the sequence. It generates the first part of the sequence using the *roll-in* policy, chooses the next token to evaluate and then completes the sentence with a *roll-out* policy. The cost of this full completed sentence can then be evaluated and the best token chosen (figure 5).

In more detail, the roll-in policy is first applied to generate a sequence which is stored together with the corresponding hidden states. The algorithm then steps through this sequence. At step $t$, the roll-out policy takes the partial sequence $\hat{\mathbf{y}}_{<t}$ and a possible next token $w \in \mathcal{V}$, and completes the sentence $\hat{\mathbf{y}}_{>t}$. A cost $\mbox{c}[\{\hat{\mathbf{y}}_{<t},w,\hat{\mathbf{y}}_{>t}\},\mathbf{y}]$ is then computed based on the similarity of the current sequence $\{\hat{\mathbf{y}}_{<t},w,\hat{\mathbf{y}}_{>t}\}$, consisting of partial roll-in, word choice and roll-out, and the ground truth $\mathbf{y}$. In this way we can select the best choice of word $w^{*}$.

LeBlond *et al.* (2018) propose to use the log-loss for training the system:

\begin{align}

\mathcal{L}[\boldsymbol\phi] &= \sum_{t=1}^T -\log \left(\frac{\exp[s\left[w^{*}_{t},\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]]} {\sum_{i \in \mathcal{V}} \exp\left[s\left[i,\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]\right]} \right), \tag{8}

\end{align}

where $\mbox{s}\left[w^{*}_{t},\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]$ is the pre-softmax score for the generated sequence $\hat{\mathbf{y}}_{\leq t} = \{\hat{\mathbf{y}}_{<t}, w^*_t\}$ given the input $\mathbf{x}$.

There are three possible types of roll-in and roll-out policy: we can use the model's predictions, the ground truth tokens, or a mix of the model's predictions and the ground truth tokens. Note that if the roll-in and roll-out policies are the ground truth predictions, then this loss reduces to the negative log-likelihood since for each $t$, $w^*_t$ will be the ground truth token $y_t$. LeBlond *et al.* (2018) ran experiments on all nine combinations and concluded that it was best to use the model's predictions as the roll-in strategy and a mix of the two policies for roll-out.

Although this approach allows us to train the model at the sequence level, the computational cost is too high for practical word-level generation: at each time step of decoding, SeaRNN has to generate sequences for every possible next token and so in practice it is only applied to a reduced vocabulary.
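The roll-in/roll-out costing step can be sketched as follows, and the sketch also shows why the method is expensive: the roll-out runs once per candidate token. The `rollout` and `cost` callables stand in for the roll-out policy and the sequence-level cost; they are our own abstractions, not SeaRNN's actual interfaces.

```python
def best_next_token(prefix, vocab, rollout, cost):
    """Score every candidate next token by completing the sequence with the
    roll-out policy and evaluating a sequence-level cost; keep the cheapest.

    prefix:  tokens produced so far by the roll-in policy
    vocab:   candidate next tokens (restricted in practice for tractability)
    rollout: callable mapping a partial sequence to its completion
    cost:    callable mapping a full sequence to a sequence-level cost
    """
    return min(vocab, key=lambda w: cost(prefix + [w] + rollout(prefix + [w])))
```

One roll-out per vocabulary entry per time step is why, as noted above, SeaRNN is only applied to a reduced vocabulary in practice.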

The preceding approaches have relied on the model's own predictions to support sequence-level optimization. Instead, *Reward Augmented Maximum Likelihood* or *RAML* (Norouzi *et al.* 2016) uses sequence-level information not to train the model to learn from its own output, but to teach the model about the space of good solutions. To this end, the algorithm augments the ground truth dataset with sequences which are known to maximize a reward $r$.

The algorithm exploits the fact that the global minimum of the REINFORCE loss function with reward $r$ and with entropy regularization is achieved when the policy $\boldsymbol\pi[\boldsymbol\phi]$ matches the *exponential payoff distribution*:

\begin{equation}

q(\hat{\mathbf{y}}_i|\mathbf{y}_i,\tau) = \frac{1}{Z(\mathbf{y}_i,\tau)} \exp(r[\hat{\mathbf{y}}_i, \mathbf{y}_i]/\tau), \tag{9}

\end{equation}

where $\tau$ is a temperature parameter, $Z$ is a normalizing term, and $r[\hat{\mathbf{y}}_i, \mathbf{y}_i]$ is a sequence-level reward (e.g., the BLEU score).

It follows that if we sample values $\hat{\mathbf{y}}_i$ from $q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau)$ we will maximize the regularized expected sum of rewards. Hence, we draw samples from $q$ and train the model to maximize the likelihood of these samples. The resulting training objective is:

\begin{equation}

\mathcal{L}[\boldsymbol\phi] = \sum_{i=1}^{I} \left(-\sum_{\hat{\mathbf{y}}_i \in \mathcal{Y}} q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau) \log\left[\boldsymbol\pi[\hat{\mathbf{y}}_i|\mathbf{x}_i,\boldsymbol\phi]\right]\right), \tag{10}

\end{equation}

where $\mathcal{Y}$ is the set of all possible translations. To minimize $\mathcal{L}$, one must minimize the negative log likelihood of samples weighted according to $q(\bullet|\mathbf{y}_i, \tau)$.

RAML thus serves as a data-augmentation method which not only presents ground truth samples to the model, but also samples that maximize the chosen reward function. Note that sampling from $q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau)$ is not straightforward. See Norouzi *et al.* (2016) for more details.
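Over a finite candidate set, equations 9 and 10 amount to softmax-weighting the candidates by their rewards and minimizing a weighted negative log-likelihood. A small sketch under that simplifying assumption (the candidate rewards and model log-probabilities are hypothetical inputs; it sidesteps the sampling difficulty noted above):

```python
import numpy as np

def exponential_payoff(rewards, tau=1.0):
    """Exponential payoff distribution q of equation 9 over a candidate set."""
    logits = np.asarray(rewards, dtype=float) / tau
    logits -= logits.max()          # numerical stability
    q = np.exp(logits)
    return q / q.sum()              # divide by the normalizer Z

def raml_loss(rewards, model_log_probs, tau=1.0):
    """RAML objective of equation 10 for one training pair: negative
    log-likelihood of each candidate, weighted by q."""
    q = exponential_payoff(rewards, tau)
    return -np.sum(q * np.asarray(model_log_probs, dtype=float))
```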

In the first part of this tutorial, we considered training with maximum-likelihood and then presented different decoding algorithms. We discussed that one of the main problems with maximum likelihood training is *exposure bias*: during training the sequential decoder only sees ground truth tokens, whereas during testing it must generate new words based on its own previous outputs.

If the model samples one token from the tail of the distribution at some point during decoding, it enters a space that it has not observed during training, so it does not know how to continue the generation and ends up outputting a low-quality sequence.^{2} Approaches to avoid this problem fall into four categories:

- During inference, force the model to stay in high-likelihood space to avoid errors that take the model to an unknown part of the output space (top-k sampling, nucleus sampling);
- During inference, force the model to stay in high-likelihood space as it learns from its own outputs (fine-tuning with RL, MRT or scheduled sampling);
- Make the model learn to recover from its own mistakes during training (BSO, SeaRNN);
- Teach the model about the space of *good* solutions according to a sequence-level reward (RAML).

We speculate that the most successful future approaches will probably combine learning to recover from mistakes during training and preventing the solution from straying away from the known output space during inference.

To conclude the tutorial, we draw attention to several areas that we believe would benefit from further investigation.

**Exploring the space of solutions**: During inference, it is unclear how to constrain exploration. Beam search, top-k sampling, and nucleus sampling all rely on setting a threshold. However, instead of setting an arbitrary threshold, it might be interesting to try to quantify the model's confidence. In particular, approaches inspired by the *safe reinforcement learning* literature might be useful (García *et al.* 2015).

**Model distribution**: Technically, encoder-decoder NNLG models are trained to model the distribution $Pr(\bullet|\mathbf{y}_{<t})$ at each time step $t$. However, it is unclear whether they are really successful at learning this distribution and, in particular, whether they use all of the history to predict the next token. Since this is a fundamental assumption of NNLG, work that explores when models learn the joint distribution and when they do not would be very enlightening.

**Encoder-decoder relations**: All the work that we have presented here has focused on what is happening in the decoder. The underlying assumption is that it is possible to decouple encoder and decoder representations when investigating decoding strategies. This assumption might not always hold and phenomena induced by the encoder representations might explain some of the behaviour observed during decoding (see El Asri & Trischler, 2019).

**Metrics**: One hindrance to progress in this field has been the lack of reliable automatic metrics (Liu *et al.* 2016) and the fact that the community does not seem to have gathered around a small common set of benchmarks. There is an important body of work on the topic of evaluation (Holtzman *et al.* 2018; Xu *et al.* 2019), and some metrics to measure various aspects such as coherence and entailment have been proposed and might become more widely used.

**Linguistic considerations**: In this post, we have only focused on the NNLG problem from a machine learning point of view, without any linguistic considerations. We encourage the reader to look into other approaches such as Shen *et al.* (2019) (which received a best paper award at ICLR 2019!) which modifies the RNN architecture to take into account linguistic properties. Following the *#BenderRule*, we should also mention that the work we have described has only been applied to a few languages and that these approaches might not be universal.

In conclusion, on high-resource languages, the field of NNLG has seen great empirical success and has made significant progress towards generating coherent text. We hope that this post has provided a useful overview for young researchers or anybody who is considering doing research in this area. This is a truly exciting area where, we believe, most of the building blocks needed for coherent and robust models are already in place.

^{1}Note that this is a common setting, but it is possible to use intermediate rewards, e.g., the BLEU scores on partial sequences.

^{2}This phenomenon is known as the problem of cascading errors in the imitation learning literature (Bagnell 2015).

We will not focus on the input type; we assume that the input has been processed by a suitable *encoder* to create an embedding in a latent space. Instead, we concentrate on the *decoder* which takes this embedding and generates sequences of natural language tokens.

We will use the running example of *neural machine translation*: given a sentence in language A, we aim to generate a translation in language B. The input sentence is processed by the encoder to create a fixed-length embedding. The decoder then uses this embedding to output the translation word-by-word.

We'll now describe the *encoder-decoder architecture* for this translation model in more detail (figure 1). The encoder takes the sequence $\mathbf{x}$ which consists of $K$ words $\{x_t\}_{t=1}^{K}$ and outputs the latent space embedding $\mathbf{h}$. The *decoder* takes this latent representation $\mathbf{h}$ and outputs a sequence of $L$ word tokens $\{\hat{y}_t\}_{t=1}^{L}$ one by one.

We consider an encoder which converts each word token to a fixed length word embedding using a method such as the SkipGram algorithm (Mikolov *et al.* 2013). These word embeddings are then combined by a neural architecture such as a recurrent neural network (Sundermeyer *et al.* 2012), a self-attention network (Vaswani *et al.* 2017), or a convolutional neural network (Dauphin *et al.* 2017) to create a fixed-length hidden state $\mathbf{h}$ that describes the whole input sequence.

At each step $t$, the decoder takes this hidden state and an input token, and outputs a probability distribution over the vocabulary $\mathcal{V}$ from which the output word $\hat{y}_{t}$ can be chosen. The hidden state itself is also updated, so that it knows about the history of the generated words. The first input token is always a special *start of sentence* (SOS) token. Subsequent tokens correspond to the predicted output word from the previous time-step during inference, or the ground truth word during training. There is a special *end of sentence* (EOS) token. When this is generated, it signifies that no more tokens will be produced.

This tutorial is divided into two parts. In this first part, we assume that the system has been trained with a maximum likelihood criterion and discuss algorithms for the decoder. We will see that maximum likelihood training has some inherent problems related to the fact that the cost function does not consider the whole output sequence at once and we'll consider some possible solutions.

In the second part of the tutorial we change our focus to consider alternative training methods. We consider fine-tuning the system using reinforcement learning or minimum risk training which use sequence-level cost functions. Finally, we review a series of methods that frame the problem as structured prediction. A summary of these methods is given in table 1.

Type of method | Training | Inference
---|---|---
Decoding algorithms | Maximum likelihood | greedy search
 | Maximum likelihood | beam search
 | Maximum likelihood | diverse beam search
 | Maximum likelihood | iterative beam search
 | Maximum likelihood | top-k sampling
 | Maximum likelihood | nucleus sampling
Sequence-level fine-tuning | Fine-tune with reinforcement learning | greedy search / beam search
 | Fine-tune with minimum risk training | beam search
 | Scheduled sampling | greedy search / beam search
Sequence-level training | Beam search optimization | beam search
 | SeaRNN | greedy search / beam search
 | Reward augmented maximum likelihood | beam search

In this section, we describe the standard approach to train encoder-decoder architectures, which uses the maximum likelihood criterion. Specifically, the model is trained to maximize the conditional log-likelihood for each of $I$ input-output sequence pairs $\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^I$ in the training corpus so that:

\begin{equation}\label{eq:local_max_like}

L = \sum_{i=1}^{I} \log\left[ Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)\right] \tag{1}

\end{equation}

where $\boldsymbol\phi$ are the weights of the model.

The probabilities of the output words are evaluated sequentially and each depends on the previously generated tokens, so the probability term in equation 1 takes the form of an auto-regressive model and can be decomposed as:

\begin{equation}

Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)= \prod_{t=1}^{L_i} Pr(y_{i,t} |\mathbf{y}_{i,<t}, \mathbf{x}_i, \boldsymbol\phi). \tag{2}

\end{equation}

Here, the probability of token $y_{i,t}$ from the $i^{th}$ sequence at time $t$ depends on the tokens $\mathbf{y}_{i,<t} = \{y_{i,0}, y_{i,1},\ldots y_{i,t-1}\}$ seen up to this point as well as the input sentence $\mathbf{x}_{i}$ (via the latent embedding from the encoder).
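Teacher forcing makes this objective easy to compute: feed the ground-truth prefix at every step and sum the per-step negative log-probabilities. A minimal sketch, where the per-step distributions are hypothetical stand-ins for the decoder's softmax outputs:

```python
import numpy as np

def teacher_forced_nll(step_distributions, target_ids):
    """Negative conditional log-likelihood of equations 1-2 for one sequence.
    step_distributions[t]: the decoder's distribution over the vocabulary at
    step t, computed with the ground-truth prefix y_{<t} fed in (teacher forcing).
    target_ids[t]: the ground-truth token y_t."""
    return -sum(np.log(step_distributions[t][y])
                for t, y in enumerate(target_ids))
```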

This training criterion seems straightforward, but there is a subtle problem. In training, the previously seen ground truth tokens $\mathbf{y}_{i,<t}$ are used to compute the probability of the current token $y_{i,t}$. However, when we perform inference with the model and generate text, we no longer have access to the ground truth tokens; we only know the actual tokens $\hat{\mathbf{y}}_{i,<t}$ that we have generated so far (figure 2).

The approach of using the ground truth tokens for training is known as *teacher forcing*. In a sense, it means that the training scenario is unrealistic and does not map to the real situation when we perform inference. In training, the model is only exposed to sequences of ground truth tokens, but sees its own output when deployed. As we shall see in the following discussion, this *exposure bias* may result in some problems in the decoding process.

We now return to the decoding (inference) process. At each time step, the system predicts the probability $Pr(\hat{y}_t |\hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)$ over items in the vocabulary, and we have to select a particular word from this distribution to feed into the next decoding step. The goal is to pick the sequence with the highest overall probability:

\begin{equation}

Pr(\hat{\mathbf{y}} | \mathbf{x}, \boldsymbol\phi) = \prod_t Pr(\hat{y}_t | \hat{\mathbf{y}}_{<t},\mathbf{x}, \boldsymbol\phi). \tag{3}

\end{equation}

In principle, we can simply compute the probability of every possible sequence by brute force. However, there are as many choices as there are words $|\mathcal{V}|$ in the vocabulary for each position and so a sentence of length $L$ would have $|\mathcal{V}|^{L}$ possible sequences. The vocabulary size $|\mathcal{V}|$ might be as large as 50,000 words and so this might not be practical.

We can improve the situation by re-using partial computations; there are many sequences which start in the same way and so there is no need to re-compute these partial likelihoods. A dynamic programming approach can exploit this structure to produce an algorithm with complexity $\mathcal{O}[L|\mathcal{V}|^2]$ but this is still very expensive.

Since we cannot find the sequence with the maximum probability, we must resort to tractable search strategies that produce a reasonable approximation of this maximum. Two common approximations are *greedy search* and *beam search* which we discuss respectively in the next two sections.

The simplest strategy is *greedy search*. It consists of picking the most likely token according to the model at each decoding time step $t$ (figure 3a).

\begin{equation}

\hat{y}_t =\underset{w \in \mathcal{V}}{\mathrm{argmax}}\left[ Pr(y_{t} = w | \hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)\right] \tag{4}

\end{equation}

Note that this does not guarantee that the complete output $\hat{\mathbf{y}}$ will have high overall probability relative to others. For example, having selected the most likely first token $\hat{y}_{0}$, it may transpire that there is no token $y_{1}$ for which the probability $Pr(y_{1}|\hat{y}_{0},\mathbf{x},\boldsymbol\phi)$ is high. It might have been better overall to choose a less probable first token that is more compatible with a second token.
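Greedy search needs nothing beyond an argmax at each step. A sketch, where `step_fn` is a hypothetical stand-in for the decoder network (prefix in, vocabulary distribution out):

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=50):
    """Greedy search (equation 4): append the argmax token at each step,
    feeding the model's own previous outputs back in, until EOS."""
    prefix = [sos_id]
    for _ in range(max_len):
        probs = step_fn(prefix)                          # distribution over vocab
        next_id = max(range(len(probs)), key=probs.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]                                    # drop the SOS token
```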

We have seen that searching over all possible sequences is intractable, and that greedy search does not necessarily produce a good solution. Beam search seeks a compromise between these extremes by performing a restricted search over possible sequences. In this regard, it produces a solution that is both tractable and superior in quality to greedy search (figure 3b).

At each step of decoding $t$, the $B$ most probable sequences $\mathcal{B}^{t-1} =\{\hat{\mathbf{y}}_{<t,b}\}_{b=1}^{B}$ are stored as candidate outputs. For each of these hypotheses the log probability is computed for each possible next token $w$ in the vocabulary $\mathcal{V}$, so $B|\mathcal{V}|$ probabilities are computed in all. From these, the new $B$ most probable sequences $\mathcal{B}^t =\{\hat{\mathbf{y}}_{<t+1,b}\}_{b=1}^{B}$ are retained. By analogy with the formula for greedy search, we have:

\begin{equation}\label{eq:beam_search}

\mathcal{B}^t =\underset{w \in \mathcal{V},b\in1\ldots B}{\mathrm{argtopk}}\left[B, Pr(y_{t} = w | \hat{\mathbf{y}}_{<t,b}, \mathbf{x}, \boldsymbol\phi)\right] \tag{5}

\end{equation}

where the function $\mathrm{argtopk}[K,\bullet]$ returns the set of the top $K$ items that maximize the second argument. The process is repeated until EOS tokens are produced or the maximum decoding length is reached. Finally, we return the most likely overall result.

The integer $B$ is known as the *beam width*. As this increases, the search becomes more thorough but also more computationally expensive. In practice, it is common to see values of $B$ in the range 5 to 200. When the beam width is 1, the method becomes equivalent to greedy search.
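The procedure above can be sketched compactly. Here `step_fn` is again a hypothetical stand-in for the decoder, and log-probabilities are summed so that long prefixes do not underflow:

```python
import math

def beam_search(step_fn, sos_id, eos_id, beam_width=3, max_len=20):
    """Beam search (equation 5): keep the B most probable partial sequences,
    extend each with every vocabulary token, and retain the top B again."""
    beams = [([sos_id], 0.0)]                        # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for w, p in enumerate(step_fn(prefix)):
                if p > 0.0:
                    candidates.append((prefix + [w], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:beam_width]:   # the top B survive
            (finished if prefix[-1] == eos_id else beams).append((prefix, lp))
        if not beams:                                # every hypothesis ended
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0][1:]                               # drop the SOS token
```

Note that a hypothesis that emits EOS leaves the beam and is kept aside; the most probable completed sequence is returned at the end.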

When we train a decoder with a maximum-likelihood criterion, the resulting sentences can exhibit a lack of diversity. This happens at both (i) the beam level (many sentences in the same beam may be very similar) and (ii) the decoding level (words are repeated during one iteration of decoding). In the next two sections we look at methods that have been proposed to ameliorate these issues.

While beam search is superior to greedy search, it often produces sentences that have the same or a very similar start (figure 4a). Decoding is done in a left-to-right fashion and the probability weights are often concentrated at the beginning; even if search is performed over $B$ sentences, the weight of the first few words will mean that most of these sentences start with these words and there is little diversity.

4a) Beam search:
- A steam engine train travelling down train tracks.
- A steam engine train travelling down tracks.
- A steam engine train travelling through a forest.
- A steam engine train travelling through a lush green forest.
- A steam engine train travelling through a lush green countryside.
- A train on a train track with a sky background.

4b) Diverse beam search:
- A steam engine train travelling down train tracks.
- A steam engine train travelling through a forest.
- An old steam engine train travelling down train tracks.
- An old steam engine train travelling through a forest.
- A black train is on the tracks in a wooded area.
- A black train is on the tracks in a rural area.

This raises the question of whether the likelihood objective correlates with our end-goals. We might care about other criteria such as diversity, which is important for chatbots: if a chatbot always said the same thing in response to a generic input such as "how are you today?", it would become quickly dull. Hence, we might want to factor in other criteria of quality when decoding.

To counter beam-level repetition, Vijayakumar *et al.* (2018) proposed a variant of beam search, called *diverse beam search*, which encourages more variation in the generated sentences than pure beam search (figure 4b). The beam is divided into $G$ groups. Regular beam search is performed in the first group to generate $B'=\frac{B}{G}$ sentences. For the second group, at step $t$ of decoding, the beam search criterion is augmented with a factor that penalizes token sequences that are similar to the first $t$ words of the hypotheses in the first group. For the third group, sequences that are similar to those in either of the first two groups are penalized, and so on (figure 5).

Vijayakumar *et al.* (2018) investigate several similarity metrics, including *Hamming diversity*, which penalizes tokens based on their number of occurrences in the previous groups.
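The Hamming-diversity variant is simple to state in code: a candidate token's score is its log-probability minus a penalty proportional to how often the previous groups chose that token at the same time step. A sketch (the penalty weight `lam` is a hypothetical hyperparameter name):

```python
from collections import Counter

def hamming_penalty(token, previous_groups_tokens_at_t):
    """Number of times `token` was selected at this time step by the
    hypotheses of the already-extended groups."""
    return Counter(previous_groups_tokens_at_t)[token]

def diverse_score(log_prob, token, previous_groups_tokens_at_t, lam=0.5):
    """Diverse beam search scoring: model log-probability minus a weighted
    diversity penalty that discourages repeating earlier groups' choices."""
    return log_prob - lam * hamming_penalty(token, previous_groups_tokens_at_t)
```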

Diverse beam search has the disadvantage that it only discourages sequences that are close to the final sequences found in previous beams. However, there may be significant portions of the space that were searched to find these hypotheses and since we didn't store the intermediate results there is nothing to stop us from redundantly considering the same part of the search space again (figure 6).

Kulikov *et al.* (2018) introduced *iterative beam search*, which aims to solve this problem. It resembles diverse beam search in that beams (groups of hypotheses) are computed and recorded. These beams are ordered and each is affected by the previous beams. However, unlike diverse beam search, we do not wait for a beam search to complete before computing the others. Instead, they are computed concurrently.

Consider the situation where at output time $t-1$ we have $G$ groups of beams, each of which contains $B^{\prime}$ hypotheses. We extend the first beam to length $t$ in the usual way; we consider concatenating every possible vocabulary word with each of the $B^{\prime}$ hypotheses, evaluate the probabilities, and retain the best overall $B^{\prime}$ solutions of length $t$.

When we extend the second beam to length $t$, we follow the same procedure. However, we now set any hypotheses that are too close in Hamming distance to those in the first beam to have zero probability. Likewise, when we extend the third beam, we discount any hypotheses that are close to those in the first two beams. The result is that each beam is forced to explore a different part of the search space and the final results have increased diversity (figure 7).

Until this point, we have assumed that the best way to decode is by maximizing the probability of the output words, using either greedy search, beam search, or a variation on these techniques. However, Holtzman *et al.* (2019) demonstrate that human speech does not stay in high probability zones and is often much more surprising than the text generated by these methods.

This raises the question of whether we should sample randomly from the output probability distribution rather than search for likely decodings. Unfortunately, this can also lead to degenerate cases. Holtzman *et al.* (2019) conjecture that at some point during decoding, the model is likely to sample from the tail of the distribution (i.e., from the set of tokens which are much less probable than the gold token). Once the model samples from this tail, it might not know how to recover. In the next two sections, we consider two methods that aim to repress such behavior.

Fan *et al.* (2018) proposed *top-$k$ sampling* as a possible remedy. Consider one iteration of decoding at the $t$-th time step. Let us define $\mathcal{V}_K^t$ as the set of the $K$ most probable next tokens according to $Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi)$. Let us also define the sum $Z$ of these $K$ probabilities:

\begin{equation}

Z = \sum_{w \in \mathcal{V}_K^t} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi). \tag{6}

\end{equation}

Top-$k$ sampling proposes to re-scale these probabilities and ignore the other possible tokens:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_K^t \\

0 & \text{otherwise}. \tag{7}

\end{cases}

\end{equation}

With this strategy, we only sample from the $K$ most likely tokens and thus avoid tokens from the tail of the distribution (their probability is set to zero).
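Equations 6 and 7 translate directly into a few lines: keep the $K$ largest probabilities, renormalize by their sum $Z$, and sample. A minimal numpy sketch:

```python
import numpy as np

def top_k_sample(probs, k, rng=None):
    """Top-k sampling (equations 6-7): zero out all but the k most probable
    tokens, renormalize the survivors by their total mass Z, then sample."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]          # indices of the k largest entries
    z = probs[top].sum()                  # the normalizer Z of equation 6
    return int(rng.choice(top, p=probs[top] / z))
```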

Unfortunately, it can be difficult to fix $K$ in practice. There exist extreme cases where the distribution is very peaked and so the top-$K$ tokens include tokens from the tail. Similarly, there may be cases where the distribution is very flat and valid tokens are excluded from the top-$K$ list.

Nucleus sampling (Holtzman *et al.* 2019) aims to solve this problem by retaining a fixed proportion of the probability mass. The method defines $\mathcal{V}_\tau^t$ as the smallest set such that:

\begin{equation*}

Z = \sum_{w \in \mathcal{V}^t_\tau} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t}, \boldsymbol\phi) \geq \tau,

\end{equation*}

where $\tau$ is a fixed threshold. Then, as in top-$k$ sampling, probabilities are re-scaled to:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_\tau^t \\

0 & \text{otherwise}. \tag{8}

\end{cases}

\end{equation}

Since the set $\mathcal{V}_\tau^t$ is chosen so that the cumulative probability mass is at least $\tau$, nucleus sampling does not suffer in the case of a flat or a peaked distribution.
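A sketch of nucleus sampling under the same conventions as the top-$k$ example: sort the tokens by probability, take the shortest prefix whose cumulative mass reaches $\tau$, renormalize, and sample.

```python
import numpy as np

def nucleus_sample(probs, tau, rng=None):
    """Nucleus sampling (equation 8): sample from the smallest set of tokens
    whose cumulative probability mass is at least tau."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # most probable tokens first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, tau)) + 1   # smallest prefix with mass >= tau
    nucleus = order[:cutoff]
    z = probs[nucleus].sum()
    return int(rng.choice(nucleus, p=probs[nucleus] / z))
```

Unlike top-$k$, the size of the sampled set adapts to the shape of the distribution: one token when it is peaked, many when it is flat.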

Part I of this tutorial has highlighted some of the problems that occur during decoding. We can classify these problems into two categories.

First, maximum-likelihood trains the model to stay in high probability regions of the token space. As shown by Holtzman *et al.* (2019), this differs significantly from human speech. If we want to take into account other criteria of quality, such as diversity, search strategies must be put in place to explore the space of likely outputs; greedy search and vanilla beam search are not enough.

Second, the exposure bias introduced by teacher forcing makes NNLG models myopic: they only look for the next most likely token given a ground truth prefix. As a consequence, if the model samples a token from the long tail, it might enter a "degenerate" case. When this happens, the model "makes an error" by sampling a low-probability token and does not know how to recover from it.

We have presented some sampling strategies that alleviate these issues at inference time. However, since these problems are both side-effects of the maximum likelihood teacher-forcing methodology for training, another way to approach this is to modify the training method. In part II of this tutorial, we describe how Reinforcement Learning (RL) and structured prediction help with both maximum-likelihood and teacher-forcing induced issues.

The structure of this post is as follows. First, we briefly review knowledge graphs and knowledge graph completion in static graphs. Second, we discuss the extension to temporal knowledge graphs. Third, we present our new method for knowledge completion in temporal knowledge graphs and demonstrate the efficacy of this method in a series of experiments. Finally, we draw attention to some possible future directions for work in this area.

Knowledge graphs are knowledge bases of facts where each fact is of the form $(Alice, Likes, Dogs)$. Here $Alice$ and $Dogs$ are called the head and tail entities respectively and $Likes$ is a relation. An example knowledge graph is depicted in figure 1.

KG completion is the problem of inferring new facts from a KG given the existing ones. This may be possible because the new fact is logically implied as in:

\begin{equation*}

(Alice, BornIn, London) \land (London, CityIn, England) \implies (Alice, BornIn, England)

\end{equation*}

or it may just be based on observed correlations. If $(Alice, Likes, Dogs)$ and $(Alice, Likes, Cats)$ then there's a high probability that $(Alice, Likes, Rabbits)$.

For a single relation-type, the problem of knowledge graph completion can be visualised in terms of completing a binary matrix. To see this, consider the simpler knowledge graph depicted in figure 2a, where there are only two types of entities and one type of relation. We can define a binary matrix with the head entities in the rows and the tail entities in the columns (figure 2b). Each known positive relation corresponds to an entry of '1' in this matrix. We do not typically know negative relations. However, we can generate putative negative relations by randomly sampling combinations of head entity, tail entity and relation. This is reasonable for large graphs where almost all combinations are false. This process is known as negative sampling and these negatives correspond to entries of '0' in the matrix. The remaining missing values in the matrix are the relations that we wish to infer in the KG completion process.

This matrix representation of the single-relation knowledge graph completion problem suggests a way forward. We can consider factoring the binary matrix as $\mathbf{M} = \mathbf{A}^{T}\mathbf{B}$, the product of a portrait matrix $\mathbf{A}^{T}$ in which each row corresponds to a head entity and a landscape matrix $\mathbf{B}$ in which each column corresponds to a tail entity. This is illustrated in figure 3. Now the binary value representing whether a given fact is true is approximated by the dot product of the vector (embedding) corresponding to the head entity and the vector corresponding to the tail entity. Hence, the problem of knowledge graph completion becomes equivalent to learning these embeddings.

More formally, we might define the likelihood of a relation being true as:

\begin{eqnarray}

Pr(a_{i}, Likes, b_{j}) &=& \mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right]\nonumber \\

Pr(a_{i}, \lnot Likes, b_{j}) &=& 1-\mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right] \tag{1}

\end{eqnarray}

$\mbox{sig}[\bullet]$ is a sigmoid function. The term $\mathbf{a}_{i}$ is the embedding for the $i^{th}$ head entity (from the $i^{th}$ row of the portrait matrix $\mathbf{A}^{T}$) and $\mathbf{b}_{j}$ is the embedding for the $j^{th}$ tail entity $b$ (from the $j^{th}$ column of the landscape matrix $\mathbf{B}$). We can hence learn the embedding matrices $\mathbf{A}$ and $\mathbf{B}$ by maximizing the log likelihood of all of the known relations.
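This learning problem is a logistic matrix factorization, and a toy version fits in a few lines of numpy. A sketch, not the paper's implementation: plain gradient ascent on the log-likelihood of equation 1, with `np.nan` marking unknown relations and all sizes and learning rates illustrative:

```python
import numpy as np

def sig(x):
    """Sigmoid function sig[x] = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def factorize(M, d=2, lr=0.1, steps=2000, seed=0):
    """Learn head embeddings A and tail embeddings B by gradient ascent on
    the log-likelihood of the observed entries of the binary relation
    matrix M (np.nan marks an unknown relation to be inferred)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = 0.1 * rng.standard_normal((n, d))
    B = 0.1 * rng.standard_normal((m, d))
    observed = ~np.isnan(M)
    target = np.nan_to_num(M)
    for _ in range(steps):
        err = observed * (target - sig(A @ B.T))  # gradient w.r.t. the logits
        gA, gB = err @ B, err.T @ A               # chain rule to A and B
        A += lr * gA
        B += lr * gB
    return A, B
```

After training, `sig(A @ B.T)` gives a probability for every cell, including the missing ones we wish to infer.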

The above discussion considered only the simplified case where there is a single type of relation between entities. However, this general strategy can be extended to the case of multiple relations by considering a three dimensional binary tensor in which the third dimension represents the type of relation (figure 4). During the factorization process, we now also generate a matrix containing embeddings for each type of relation.

In the previous section, we considered KG completion in terms of factorizing a matrix or tensor into matrices of embeddings for the head entity, tail entity and relation.

We can generalize this idea by retaining the notion of embeddings but using more general *score functions* than the one implied by factorization to provide scores for each tuple. For example, TransE (Bordes *et al.* 2013) maps each entity and each relation to a vector of size $d$ and defines the score for a tuple $(Alice, Likes, Fish)$ as:

\[-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||\]

where ${z}_{Alice},{z}_{Likes},{z}_{Fish}\in\mathbb{R}^d$, corresponding to the embeddings for $Alice$, $Likes$, and $Fish$, are vectors with learnable parameters. To train, we define a likelihood such that $-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||$ becomes large if $(Alice, Likes, Fish)$ is in the KG and small if $(Alice, Likes, Fish)$ is likely to be false.
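The score function itself is one line. A sketch (numpy, with the Euclidean norm assumed; TransE can also use the L1 norm):

```python
import numpy as np

def transe_score(z_head, z_rel, z_tail):
    """TransE score: -|| z_head + z_rel - z_tail ||.
    A score near zero means the relation vector maps the head almost
    exactly onto the tail, i.e. the triple is plausible."""
    return -np.linalg.norm(np.asarray(z_head) + np.asarray(z_rel) - np.asarray(z_tail))
```

For a perfect translation the score is exactly 0; implausible triples score far below zero.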

Other models map entities and relations to different spaces and/or use different score functions. For a comprehensive list of existing approaches and their advantages and disadvantages, see Nguyen (2017).

Temporal KGs are KGs where each fact can have a timestamp associated with it. An example of a fact in a temporal KG is $(Alice, Liked, Fish, 1995)$. Temporal KG completion (TKGC) is the problem of inferring new temporal facts from a KG based on the existing ones.

Existing approaches for TKGC usually extend (static) KG embedding models by mapping the timestamps to latent representations and updating the score function to take into account the timestamps as well. As an example, TTransE extends TransE by mapping each entity, relation, and timestamp to a vector in $\mathbb{R}^d$ and defining the score function for a tuple $(Alice, Liked, Fish, 1995)$ as:

\[-|| z_{Alice} + z_{Liked} + z_{1995} - z_{Fish}||\]

For a comprehensive list of existing approaches for TKGC and their advantages and disadvantages, see Kazemi *et al.* (2019).

We develop models for TKGC based on an intuitive assumption: to provide a score for $(Alice, Liked, Fish, 1995)$, one needs to know $Alice$'s and $Fish$'s features in $1995$; providing a score based on their current features, or on an aggregation of their features over time, may be misleading. That is because $Alice$'s personality and the sentiment towards $Fish$ may have been quite different in 1995 than they are now (figure 5). Consequently, learning a static embedding for each entity - as is done by existing approaches - may be sub-optimal, as such a representation only captures an aggregation of entity features over time.

To provide entity features at any given time, we define the entity embedding as a function which takes an entity and a timestamp as input and provides a hidden representation for the entity at that time. Inspired by diachronic word embeddings, we call our proposed embedding a *diachronic embedding (DE)*. In particular, we define the diachronic embedding for an entity $E$ using vector(s) defined as follows:

\begin{equation}

\label{eq:demb}

z^t_E[n]=\begin{cases}

a_E[n] \sigma(w_E[n] t + b_E[n]), & \text{if $1 \leq n\leq \gamma d$}. \\

a_E[n], & \text{if $\gamma d < n \leq d$}. \tag{2}

\end{cases}

\end{equation}

where $a_E\in\mathbb{R}^{d}$ and $w_E,b_E\in\mathbb{R}^{\gamma d}$ are (entity-specific) vectors with learnable parameters, $z^t_E[n]$ indicates the $n^{th}$ element of $z^t_E$ (similarly for $a_E$, $w_E$ and $b_E$), and $\sigma$ is an activation function.

Intuitively, entities may have some features that change over time and some features that remain fixed (figure 6). The first $\gamma d$ elements of $z^t_E$ in Equation (2) capture temporal features and the other $(1-\gamma)d$ elements capture static features. The hyperparameter $\gamma\in[0,1]$ controls the percentage of temporal features. In principle, static features could be obtained from the temporal ones if the optimizer set some elements of $w_E$ in Equation (2) to zero. However, explicitly modeling static features helps reduce the number of learnable parameters and avoid overfitting to temporal signals.
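As a concrete illustration, Equation (2) is straightforward to express in code. The sketch below is our own simplified NumPy version, with the sine activation and the temporal features placed in the first $\gamma d$ positions; the variable names are ours, not from the paper:

```python
import numpy as np

def diachronic_embedding(a, w, b, t, gamma_d):
    """Equation (2): the first gamma_d features oscillate with time,
    the remaining features are static.

    a: shape (d,) amplitudes; w, b: shape (gamma_d,) frequencies and
    phases; t: scalar timestamp."""
    z = a.copy()
    z[:gamma_d] = a[:gamma_d] * np.sin(w * t + b)
    return z

# Toy entity with d = 4 features, half of them temporal (gamma = 0.5).
a = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 1.0])
b = np.array([0.0, 0.3])
z_1995 = diachronic_embedding(a, w, b, t=1995.0, gamma_d=2)
```

In a full model, `a`, `w` and `b` would be rows of learnable per-entity parameter matrices, trained end-to-end together with the score function.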

Intuitively, by learning $w_E$s and $b_E$s, the model learns how to turn entity features on and off at different points in time so accurate temporal predictions can be made about them at any time. The terms $a_E$s control the importance of the features. We mainly use $\sin[\bullet]$ as the activation function for Equation (2) because one sine function can model several on and off states (figure 7). Our experiments explore other activation functions as well and provide more intuition.

It is possible to take any static KG embedding model and make it temporal by replacing the entity embeddings with diachronic entity embeddings as in Equation (2). For instance, TransE can be extended to TKGC by changing the score function for a tuple $(Alice, Liked, Fish, 1995)$ to:

\[-|| z^{1995}_{Alice} + z_{Liked} + z^{1995}_{Fish}||\]

where $z^{1995}_{Alice}$ and $z^{1995}_{Fish}$ are defined as in Equation (2). We call the above model DE-TransE where $DE$ stands for diachronic embedding. Besides TransE, we also test extensions of DistMult and SimplE, two effective models for static KG completion. We name the extensions DE-DistMult and DE-SimplE respectively.
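Given diachronic head and tail embeddings already evaluated at the fact's timestamp, together with a static relation embedding, the DE-TransE score above can be sketched as follows (a minimal version with our own variable names):

```python
import numpy as np

def de_transe_score(z_head_t, z_rel, z_tail_t):
    """DE-TransE score: negative distance between the translated head
    and the tail, where the head and tail embeddings are evaluated at
    the fact's timestamp and the relation embedding is static."""
    return -np.linalg.norm(z_head_t + z_rel - z_tail_t)

# The score is maximal (zero) when the relation exactly translates
# the head embedding onto the tail embedding.
head = np.array([1.0, 0.0])
rel = np.array([0.5, 0.5])
tail = head + rel
```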

**Table 1** Results on ICEWS14, ICEWS05-15, and GDELT. Best results are in bold blue.

**Datasets: **Our datasets are subsets of two temporal KGs that have become standard benchmarks for TKGC: ICEWS and GDELT. For ICEWS, we use the two subsets generated by García-Durán et al. (2018): (1) *ICEWS14*, corresponding to the facts in 2014, and (2) *ICEWS05-15*, corresponding to the facts between 2005 and 2015. For GDELT, we use the subset extracted by Trivedi et al. (2017) corresponding to the facts from April 1, 2015 to March 31, 2016. We changed the train/validation/test sets following a similar procedure as in Bordes et al. (2013) to turn the problem into TKGC rather than extrapolation.

**Baselines:** Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE, DistMult and SimplE where the timestamps are ignored. From the temporal KG embedding models, we compare to TTransE, HyTE, ConT, and TA-DistMult.

**Metrics:** We report filtered MRR and filtered Hit@k measures. These essentially create queries such as $(v, r, ?, t)$ and measure how well the model ranks the correct answer among the candidate entities. See Bordes et al. (2013) for details.
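Given the rank of the correct entity for each test query (with the other known true entities already filtered out of the candidate list), both metrics are simple averages. A minimal sketch, with hypothetical input ranks:

```python
import numpy as np

def mrr_and_hits(ranks, k=10):
    """Filtered MRR and Hit@k, given the (filtered) rank of the
    correct entity for each test query (1 = ranked first)."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))      # mean reciprocal rank
    hits = float(np.mean(ranks <= k))      # fraction ranked in the top k
    return mrr, hits

# Hypothetical ranks for three test queries.
mrr, hits3 = mrr_and_hits([1, 2, 4], k=3)
```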

Table 1 and figure 8 show the performance of our models compared to several baselines. According to the results, the temporal versions of different models outperform the static counterparts in most cases, thus providing evidence for the merit of capturing temporal information.

DE-TransE outperforms the other TransE-based baselines (TTransE and HyTE) on ICEWS14 and GDELT and gives on-par results with HyTE on ICEWS05-15. This result shows the superiority of our diachronic embeddings compared to TTransE and HyTE. DE-DistMult outperforms TA-DistMult, the only DistMult-based baseline, showing the superiority of our diachronic embedding compared to TA-DistMult. Moreover, DE-DistMult outperforms all TransE-based baselines. Finally, just as SimplE beats TransE and DistMult due to its higher expressivity, our results show that DE-SimplE beats DE-TransE, DE-DistMult, and the other baselines due to its higher expressivity.

We perform several studies to provide a better understanding of our models. Our ablation studies include i) different choices of activation function, ii) using diachronic embeddings for both entities and relations as opposed to using it only for entities, iii) testing the ability of our models in generalizing to timestamps unseen during training, iv) the importance of model parameters in Equation (2), v) balancing the number of static and temporal features in Equation (2), and vi) examining training complications due to the use of sine functions in the model. We refer the readers to the full paper for these experiments.

Our work opens several avenues for future research including:

- We proposed diachronic embeddings for KGs having timestamped facts. Future work may consider extending diachronic embeddings to KGs having facts with time intervals.
- We considered the ideal scenario where every fact in the KG is timestamped. Future work can propose ways of dealing with missing timestamps, or ways of dealing with a combination of static and temporal facts.
- We proposed a specific diachronic embedding in Equation (2). Future work can explore other possible functions.
- An interesting avenue for future research is to use Equation (2) to learn diachronic word embeddings and see if it can perform well in the context of word embeddings as well.


It is common to talk about the variational autoencoder as if it *is* the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$. The final model contains neither the 'variational' nor the 'autoencoder' parts and is better described as a *non-linear latent variable model*.

We'll start this tutorial by discussing latent variable models in general and then the specific case of the non-linear latent variable model. We'll see that maximum likelihood learning of this model is not straightforward, but we can define a lower bound on the likelihood. We then show how the autoencoder architecture can approximate this bound using a Monte Carlo (sampling) method. To maximize the bound, we need to compute derivatives, but unfortunately, it's not possible to compute the derivative of the sampling component. We'll show how to side-step this problem using the reparameterization trick. Finally, we'll discuss extensions of the VAE and some of its drawbacks.

Latent variable models take an indirect approach to describing a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. Instead of directly writing the expression for $Pr(\mathbf{x})$ they model a joint distribution $Pr(\mathbf{x}, \mathbf{h})$ of the data $\mathbf{x}$ and an unobserved latent (or hidden) variable $\mathbf{h}$. They then describe the probability $Pr(\mathbf{x})$ as a marginalization of this joint probability so that

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}, \mathbf{h}) d\mathbf{h}.\tag{1}

\end{equation}

Typically we describe the joint probability $Pr(\mathbf{x}, \mathbf{h})$ as the product of the *likelihood* $Pr(\mathbf{x}|\mathbf{h})$ and the *prior* $Pr(\mathbf{h})$, so that the model becomes

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}| \mathbf{h}) Pr(\mathbf{h}) d\mathbf{h}.\tag{2}

\end{equation}

It is reasonable to question why we should take this indirect approach to describing $Pr(\mathbf{x})$. The answer is that relatively simple expressions for $Pr(\mathbf{x}| \mathbf{h})$ and $Pr(\mathbf{h})$ can describe a very complex distribution for $Pr(\mathbf{x})$.

A well known latent variable model is the mixture of Gaussians. Here the latent variable $h$ is discrete and the prior $Pr(h)$ is a discrete distribution with one probability $\lambda_{k}$ for each of the $K$ component Gaussians. The likelihood $Pr(\mathbf{x}|h)$ is a Gaussian with a mean $\boldsymbol\mu_{k}$ and covariance $\boldsymbol\Sigma_{k}$ that depends on the value $k$ of the latent variable $h$:

\begin{eqnarray}

Pr(h=k) &=& \lambda_{k}\nonumber \\

Pr(\mathbf{x} |h = k) &=& \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}].\label{eq:mog_like_prior}\tag{3}

\end{eqnarray}

where $\mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu, \boldsymbol\Sigma]$ represents a multivariate normal distribution over $\mathbf{x}$ with mean $\boldsymbol\mu$ and covariance $\boldsymbol\Sigma$.

As in equation 2, the probability $Pr(\mathbf{x})$ is given by the marginalization over the latent variable $h$. In this case, this is a sum as the latent variable is discrete:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \sum_{k=1}^{K} Pr(\mathbf{x}, h=k) \nonumber \\

&=& \sum_{k=1}^{K} Pr(\mathbf{x}| h=k) Pr(h=k)\nonumber \\

&=& \sum_{k=1}^{K} \lambda_{k} \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}]. \tag{4}

\end{eqnarray}

This is illustrated in figure 1. From the simple expressions for the likelihood and prior in equation 3, we can describe a complex multi-modal probability distribution.
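The 1-D version of equation 4 can be evaluated directly by summing the weighted component densities. A small sketch with made-up mixture parameters:

```python
import numpy as np

def mog_density(x, lambdas, mus, sigmas):
    """1-D mixture of Gaussians (equation 4):
    Pr(x) = sum_k lambda_k Norm_x[mu_k, sigma_k^2]."""
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    for lam, mu, sig in zip(lambdas, mus, sigmas):
        p += lam * np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    return p

# A bimodal density built from two simple components.
x = np.linspace(-10.0, 10.0, 20001)
p = mog_density(x, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5])
```

Even though each component is a simple Gaussian, the mixture is multi-modal, which is exactly the point of the latent variable construction.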

Now let's consider the non-linear latent variable model, which is what the VAE actually learns. This differs from the mixture of Gaussians in two main ways. First, the latent variable $\mathbf{h}$ is continuous rather than discrete and has a standard normal prior (i.e., one with mean zero and identity covariance). Second, the likelihood is a normal distribution as before, but the variance is constant and spherical. The mean is a non-linear function $\mathbf{f}[\mathbf{h},\bullet]$ of the hidden variable $\mathbf{h}$ and this gives rise to the name. The prior and likelihood terms are:

\begin{eqnarray}

Pr(\mathbf{h}) &=& \mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]\nonumber \\

Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi) &=& \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}], \tag{5}

\end{eqnarray}

where the function $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$ is a deep neural network with parameters $\boldsymbol\phi$. This model is illustrated in figure 2.

The model can be viewed as an infinite mixture of spherical Gaussians with different means; as before, we build a complex distribution by weighting and summing these Gaussians in the marginalization process. In the next three sections we consider three operations that we might want to perform with this model: computing the posterior distribution, sampling, and evaluating the likelihood.

The likelihood $Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi)$ tells us how to compute the distribution over the observed data $\mathbf{x}$ given hidden variable $\mathbf{h}$. We might however want to move in the other direction; given an observed data example $\mathbf{x}$ we might wish to understand what possible values of the hidden variable $\mathbf{h}$ were responsible for it (figure 3). This information is encompassed in the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$. In principle, we can compute this using Bayes's rule

\begin{eqnarray}

Pr(\mathbf{h}|\mathbf{x}) = \frac{Pr(\mathbf{x}|\mathbf{h})Pr(\mathbf{h})}{Pr(\mathbf{x})}. \tag{6}

\end{eqnarray}

However, in practice, there is no closed form expression for the left hand side of this equation. In fact, as we shall see shortly, we cannot evaluate the denominator $Pr(\mathbf{x})$ and so we can't even compute the numerical value of the posterior for a given pair $\mathbf{h}$ and $\mathbf{x}$.

Although computing the posterior is intractable, it is easy to generate a new sample $\mathbf{x}^{*}$ from this model using ancestral sampling; we draw $\mathbf{h}^{*}$ from the prior $Pr(\mathbf{h})$, pass this through the network $\mathbf{f}[\mathbf{h}^{*},\boldsymbol\phi]$ to compute the mean of the likelihood $Pr(\mathbf{x}|\mathbf{h}^{*})$ and then draw $\mathbf{x}^{*}$ from this distribution. Both the prior and the likelihood are normal distributions and so sampling from them in each step is easy. This process is illustrated in figure 4.
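Ancestral sampling is easy to sketch in code. Here `f` is only a stand-in for the deep network $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$; in the real model it would be a trained network, whereas ours is an arbitrary fixed non-linearity:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h, phi):
    """Stand-in for the deep network f[h, phi]: an arbitrary fixed
    non-linear map from the 1-D latent to 2-D data space."""
    return np.tanh(phi @ h)

def ancestral_sample(phi, sigma=0.1):
    h_star = rng.standard_normal(1)                      # h* ~ Pr(h) = N(0, I)
    mu = f(h_star, phi)                                  # mean of Pr(x | h*)
    return mu + sigma * rng.standard_normal(mu.shape)    # x* ~ N(mu, sigma^2 I)

phi = np.array([[1.0], [-0.5]])
x_star = ancestral_sample(phi)
```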

Finally, let's consider evaluating the likelihood of a data example $\mathbf{x}$ under the model. As before, the likelihood is given by:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \int Pr(\mathbf{x}, \mathbf{h}|\boldsymbol\phi) d\mathbf{h} \nonumber \\

&=& \int Pr(\mathbf{x}| \mathbf{h},\boldsymbol\phi) Pr(\mathbf{h})d\mathbf{h}\nonumber \\

&=& \int \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}]\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]d\mathbf{h}. \tag{7}

\end{eqnarray}

Unfortunately, there is no closed form for this integral, so we cannot easily compute the probability for a given example $\mathbf{x}$. This is a major problem for two reasons. First, evaluating the probability $Pr(\mathbf{x})$ was one of the main reasons for modelling the probability distribution in the first place. Second, to learn the model, we maximize the log likelihood, which is obviously going to be hard if we cannot compute it. In the next section we'll introduce a lower bound on the log likelihood which can be computed and which we can use to learn the model.
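To make the difficulty concrete: the integral in equation 7 *can* be approximated by brute-force Monte Carlo, drawing samples of $\mathbf{h}$ from the prior and averaging the likelihood terms, but this estimator becomes hopelessly inefficient as the dimensionality grows. A sketch of the naive estimator (our own toy code, with `f` a stand-in for the network):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_likelihood(x, f, sigma, n_samples=100000):
    """Naive Monte Carlo estimate of equation 7:
    Pr(x) ~ (1/N) sum_n Norm_x[f(h_n), sigma^2 I], with h_n ~ N(0, I)."""
    d = x.shape[0]
    h = rng.standard_normal((n_samples, 1))   # samples from the prior
    mus = f(h)                                # (n_samples, d) likelihood means
    sq = np.sum((x - mus) ** 2, axis=1)
    log_p = -0.5 * sq / sigma ** 2 - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(np.exp(log_p)))

# Sanity check with f = identity, where the marginal is N(0, 1 + sigma^2).
estimate = mc_likelihood(np.array([0.0]), lambda h: h, sigma=0.5)
```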

During learning we are given training data $\{\mathbf{x}_{i}\}_{i=1}^{I}$ and want to find the parameters $\boldsymbol\phi$ of the model that maximize the log likelihood. For simplicity we'll assume that the variance term $\sigma^2$ in the likelihood expression is known and just concentrate on learning $\boldsymbol\phi$:

\begin{eqnarray}

\hat{\boldsymbol\phi} &=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[Pr(\mathbf{x}_{i}|\boldsymbol\phi) \right]\right] \nonumber \\

&=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[\int Pr(\mathbf{x}_{i}, \mathbf{h}_{i}|\boldsymbol\phi) d\mathbf{h}_{i}\right]\right].\label{eq:log_like} \tag{8}

\end{eqnarray}

As we noted above, we cannot write a closed form expression for the integral and so we can't just build a network to compute this and let Tensorflow or PyTorch optimize it.

To make some progress we define a lower bound on the log likelihood. This is a function that is always less than or equal to the log likelihood for a given value of $\boldsymbol\phi$ and will also depend on some other parameters $\boldsymbol\theta$. Eventually we will build a network to compute this lower bound and optimize it. To define this lower bound, we need to use Jensen's inequality which we quickly review in the next section.

Jensen's inequality concerns what happens when we pass values through a concave function $g[\bullet]$. Specifically, it says that if we compute the expectation (mean) of these values and pass this mean through the function, the result will be greater than or equal to what we get if we pass the values themselves through the function and then compute the expectation of the results. In mathematical terms:

\begin{equation}

g[\mathbf{E}[y]] \geq \mathbf{E}[g[y]], \tag{9}

\end{equation}

for any concave function $g[\bullet]$. Some intuition as to why this is true is given in figure 5. In our case, the concave function in question is the logarithm so we have:

\begin{equation}

\log[\mathbf{E}[y]]\geq\mathbf{E}[\log[y]], \tag{10}

\end{equation}

or writing out the expression for expectation in full we have:

\begin{equation}

\log\left[\int Pr(y) y dy\right]\geq \int Pr(y)\log[y]dy. \tag{11}

\end{equation}
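Equation 10 is easy to check numerically; any positive random sample will do:

```python
import numpy as np

rng = np.random.default_rng(2)

# Jensen's inequality for the concave logarithm: log(E[y]) >= E[log(y)].
y = rng.uniform(0.1, 2.0, size=100000)   # any positive random sample
lhs = np.log(np.mean(y))                 # log of the mean
rhs = np.mean(np.log(y))                 # mean of the logs
```

For a non-degenerate sample the inequality is strict, and the gap grows with the spread of `y`.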

We will now use Jensen's inequality to derive the lower bound for the log likelihood. We start by multiplying and dividing the log likelihood by an arbitrary probability distribution $q(\mathbf{h})$ over the hidden variables

\begin{eqnarray}

\log[Pr(\mathbf{x}|\boldsymbol\phi)] &=& \log\left[\int Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)d\mathbf{h} \tag{12} \right] \\

&=& \log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \tag{13} \right].

\end{eqnarray}

We then use Jensen's inequality for the logarithm (equation 11) to find a lower bound:

\begin{eqnarray}

\log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \right]

&\geq& \int q(\mathbf{h}) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})} \right]d\mathbf{h}, \tag{14}

\end{eqnarray}

where the term on the right hand side is known as the *evidence lower bound* or *ELBO*. It gets this name because the term $Pr(\mathbf{x}|\boldsymbol\phi)$ is known as the evidence when viewed in the context of Bayes' rule (equation 6).

In practice, the distribution $q(\mathbf{h})$ will have some parameters $\boldsymbol\theta$ as well and so the ELBO can be written as:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] = \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}. \tag{15}

\end{equation}

To learn the non-linear latent variable model, we'll maximize this quantity as a function of both $\boldsymbol\phi$ and $\boldsymbol\theta$. The neural architecture that computes this quantity (and hence is used to optimize it) is the variational autoencoder. Before we introduce that, we first consider some of the properties of the ELBO.

When first encountered, the ELBO can be a somewhat mysterious object. In this section we'll provide some intuition about its properties. Consider that the original log likelihood of the data is a function of the parameters $\boldsymbol\phi$ and we want to find its maximum. For any fixed $\boldsymbol\theta$, the ELBO is still a function of the parameters, but one that must lie below the original likelihood function. When we change $\boldsymbol\theta$ we modify this function and depending on our choice, it may move closer or further from the log likelihood. When we change $\boldsymbol\phi$ we move along this function. These perturbations are illustrated in figure 6.

The ELBO is described as being *tight* when for a fixed value of $\boldsymbol\phi$ we choose parameters $\boldsymbol\theta$ so that the ELBO and the likelihood function coincide. We can show that this happens when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ is equal to the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$ over the hidden variables. We start by expanding out the joint probability in the numerator of the fraction in the ELBO using the definition of conditional probability:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)Pr(\mathbf{x}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta)

\log\left[Pr(\mathbf{x}|\boldsymbol\phi)\right]d\mathbf{h} +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber\nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] -\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right].\label{eq:ELBOEvidenceKL} \tag{16}

\end{eqnarray}

This equation shows that the ELBO is the original log likelihood minus the Kullback-Leibler divergence $\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right]$ which will be zero when these distributions are the same. Hence the bound is tight when $q(\mathbf{h}|\boldsymbol\theta) =Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. Since the KL divergence can only take non-negative values it is easy to see that the ELBO is a lower bound on $\log[Pr(\mathbf{x} |\boldsymbol\phi)]$ from this formulation.

In the previous section we saw that the bound is tight when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. This observation is the basis of the *expectation maximization* (*EM*) algorithm. Here, we alternately (i) choose $\boldsymbol\theta$ so that $q(\mathbf{h}|\boldsymbol\theta)$ equals the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ and (ii) change $\boldsymbol\phi$ to maximize the lower bound (figure 7a). This is viable for models like the mixture of Gaussians where we can compute the posterior distribution in closed form. Unfortunately, for the non-linear latent variable model there is no closed form expression for the posterior distribution and so this method is inapplicable.

We've already seen two different ways to write the ELBO (equations 15 and 16). In fact, there are several more ways to re-express this function (see Hoffman & Johnson 2016). The one that is important for the VAE is:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

+ \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)}\right]d\mathbf{h}

\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\boldsymbol\theta)\,||\, Pr(\mathbf{h})] \tag{17}

\end{eqnarray}

In this formulation, the first term measures the average agreement $Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)$ between the hidden variable and the data (the reconstruction term) and the second measures the degree to which the auxiliary distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the prior. This formulation is the one that will be used in the variational autoencoder.

We have seen that the ELBO is tight when we choose the distribution $q(\mathbf{h}|\boldsymbol\theta)$ to be the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ but for the non-linear latent variable model, we cannot write an expression for this posterior.

The solution to the problem is to make a variational approximation: we just choose a simple parametric form for $q(\mathbf{h}|\boldsymbol\theta)$ and use this as an approximation to the true posterior. In this case we'll choose a normal distribution with parameters $\boldsymbol\mu$ and $\boldsymbol\Sigma$. This distribution is not always going to be a great match to the posterior, but will be better for some values of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ than others. When we optimize this model, we will be finding the normal distribution that is "closest" to the true posterior $Pr(\mathbf{h}|\mathbf{x})$ (figure 8). This corresponds to minimizing the KL divergence in equation 16.

Since the optimal choice for $q(\mathbf{h}|\boldsymbol\theta)$ was the posterior $Pr(\mathbf{h}|\mathbf{x})$ and this depended on the data example $\mathbf{x}$, it makes sense that our variational approximation should do the same and so we choose

\begin{equation}\label{eq:posterior_pred}

q(\mathbf{h}|\boldsymbol\theta,\mathbf{x}) = \mbox{Norm}_{\mathbf{h}}[g_{\mu}[\mathbf{x},\boldsymbol\theta], g_{\sigma}[\mathbf{x},\boldsymbol\theta]], \tag{18}

\end{equation}

where $g[\mathbf{x},\boldsymbol\theta]$ is a neural network with parameters $\boldsymbol\theta$ that predicts the mean and variance of the normal variational approximation.

Finally, we are in a position to describe the variational autoencoder. We will build a network that computes the ELBO:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi]

= \int q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)\,||\, Pr(\mathbf{h})] \tag{19}

\end{equation}

where the distribution $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ is the approximation from equation 18.

The first term in equation 19 still involves an integral that we cannot compute, but since it represents an expectation, we can approximate it with a set of samples:

\begin{equation}

\mathbf{E}[f[\mathbf{h}]] \approx \frac{1}{N}\sum_{n=1}^{N}f[\mathbf{h}^{*}_{n}]\tag{20}

\end{equation}

where $\mathbf{h}^{*}_{n}$ is the $n^{th}$ sample. In the extreme case, we might use just a single sample $\mathbf{h}^{*}$ from $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ as a very approximate estimate of the expectation, in which case the ELBO looks like:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &\approx& \log\left[ Pr(\mathbf{x}|\mathbf{h}^{*},\boldsymbol\phi) \right]- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)\,||\, Pr(\mathbf{h})] \tag{21}

\end{eqnarray}

The second term is just the KL divergence between the variational Gaussian $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) = \mbox{Norm}_{\mathbf{h}}[\boldsymbol\mu,\boldsymbol\Sigma]$ and the prior $Pr(\mathbf{h}) =\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]$. The KL divergence between two Gaussians can be calculated in closed form and for this case is given by:

\begin{equation}

\mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)\,||\, Pr(\mathbf{h})] = \frac{1}{2}\left(\mbox{Tr}[\boldsymbol\Sigma] + \boldsymbol\mu^T\boldsymbol\mu - D - \log\left[\mbox{det}[\boldsymbol\Sigma]\right]\right). \tag{22}

\end{equation}

where $D$ is the dimensionality of the hidden space.
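Equation 22 translates directly into code, for example (a sketch assuming a full covariance matrix `Sigma`; in practice $\boldsymbol\Sigma$ is often taken to be diagonal, which simplifies the trace and log-determinant):

```python
import numpy as np

def kl_to_standard_normal(mu, Sigma):
    """Equation 22: KL[N(mu, Sigma) || N(0, I)] for a D-dimensional
    Gaussian variational posterior."""
    D = mu.shape[0]
    return 0.5 * (np.trace(Sigma) + mu @ mu - D - np.log(np.linalg.det(Sigma)))
```

As expected, the divergence is zero when the variational posterior coincides with the prior, and positive otherwise.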

So, to compute the ELBO for a point $\mathbf{x}$ we first estimate the mean $\boldsymbol\mu$ and variance $\boldsymbol\Sigma$ of the posterior distribution $q(\mathbf{h}|\boldsymbol\theta,\mathbf{x})$ for this data point $\mathbf{x}$ using the network $\mbox{g}[\mathbf{x},\boldsymbol\theta]$. Then we draw a sample $\mathbf{h}^{*}$ from this distribution. Finally, we compute the ELBO using equation 21.

The architecture to compute this is shown in figure 9. Now it's clear why it is called a variational autoencoder. It is an autoencoder because it starts with a data point $\mathbf{x}$, computes a lower dimensional latent vector $\mathbf{h}$ from this and then uses this to recreate the original vector $\mathbf{x}$ as closely as possible. It is variational because it computes a Gaussian approximation to the posterior distribution along the way.

The VAE computes the ELBO as a function of the parameters $\boldsymbol\phi$ and $\boldsymbol\theta$. When we maximize this bound as a function of both of these parameters, we gradually move the parameters $\boldsymbol\phi$ to values that give the data a higher likelihood under the non-linear latent variable model (figure 7b).

In this section, we've described how to compute the ELBO for a single point, but actually we want to maximize its sum over all of the data examples. As in most deep learning methods, we accomplish this with stochastic gradient descent, by running mini-batches of points through our network.

You might think that we are done; we set up this architecture, then we allow PyTorch / Tensorflow to perform automatic differentiation via the backpropagation algorithm and hence optimize the cost function. However, there's a problem. The network involves a sampling step and there is no way to differentiate through this. Consequently, it's impossible to make updates to the parameters $\boldsymbol\theta$ that occur earlier in the network than this.

Fortunately, there is a simple solution; we can move the stochastic part into a branch of the network which draws a sample from $\mbox{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]$ and then use the relation

\begin{equation}

\mathbf{h}^{*} = \boldsymbol\mu + \boldsymbol\Sigma^{1/2}\epsilon, \tag{23}

\end{equation}

to draw from the intended Gaussian. Now we can compute the derivatives as usual because there is no need for the backpropagation algorithm to pass down the stochastic branch. This is known as the reparameterization trick and is illustrated in figure 10.
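The trick itself is one line of code: sample $\epsilon$ from a standard normal and transform it deterministically (a sketch, where `Sigma_sqrt` is any matrix square root $\boldsymbol\Sigma^{1/2}$, e.g. a Cholesky factor):

```python
import numpy as np

rng = np.random.default_rng(3)

def reparameterized_sample(mu, Sigma_sqrt):
    """Equation 23: h* = mu + Sigma^{1/2} eps with eps ~ N(0, I).
    All the randomness lives in eps, so gradients can flow through
    mu and Sigma^{1/2} unimpeded."""
    eps = rng.standard_normal(mu.shape)
    return mu + Sigma_sqrt @ eps
```

In an autodiff framework, `mu` and `Sigma_sqrt` would be outputs of the encoder network, and `eps` would be treated as a constant input during backpropagation.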

Variational autoencoders were first introduced by Kingma & Welling (2013). Since then, they have been extended in several ways. First, they have been adapted to other data types including discrete data (van den Oord et al. 2017, Razavi et al. 2019), word sequences (Bowman et al. 2015), and temporal data (Gregor & Besse 2018). Second, researchers have experimented with different forms for the variational distribution, most notably using normalizing flows, which can approximate the true posterior much more closely than a Gaussian (Rezende & Mohamed 2015). Third, there is a strand of work investigating more complex likelihood models $Pr(\mathbf{x}|\mathbf{h})$. For example, Gulrajani et al. (2016) used an auto-regressive relation between output variables and Dorta et al. (2018) modeled the covariance as well as the mean.

Finally, there is a large body of work that attempts to improve the properties of the latent space. Here, one popular goal is to learn a *disentangled* representation in which each dimension of the latent space represents an independent real world factor. For example, when modeling face images, we might hope to uncover head pose or hair color as independent factors. These methods generally add regularization terms to either the posterior $q(\mathbf{h}|\mathbf{x})$ or the aggregated posterior $q(\mathbf{h}) = \frac{1}{I}\sum_{i=1}^{I}q(\mathbf{h}|\mathbf{x}_{i})$ so that the new loss function is

\begin{equation}

L_{new} = \mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] - \lambda_{1} \mathbb{E}_{Pr(\mathbf{x})}\left[\mbox{R}_{1}\left[q(\mathbf{h}|\mathbf{x}) \right]\right] - \lambda_{2} \mbox{R}_{2}[q(\mathbf{h})], \tag{24}

\end{equation}

where $\lambda_{1}$ and $\lambda_{2}$ are weights and $\mbox{R}_{1}[\bullet]$ and $\mbox{R}_{2}[\bullet]$ are functions of the posterior and aggregated posterior respectively. This class of methods includes the BetaVAE (Higgins et al. 2017), InfoVAE (Zhao et al. 2017) and many others (*e.g.*, Kim & Mnih 2018, Kumar et al. 2017, Chen et al. 2018).

VAEs have several drawbacks. First, we cannot compute the likelihood of a new point $\mathbf{x}$ under the probability distribution efficiently, because this involves integrating over the hidden variable $\mathbf{h}$. We can approximate this integral using a Markov chain Monte Carlo method, but this is very inefficient. Second, samples from VAEs are generally not perfect (figure 11). The naive spherical Gaussian noise model which is independent for each variable generally produces noisy samples (or overly smooth ones if we do not add in the noise).

In practice, training VAEs (particularly sequence VAEs) can be brittle. It's possible that the system converges to a local minimum in which the latent variable is completely ignored and the encoder always predicts the prior. This phenomenon is known as *posterior collapse* (Bowman et al. 2015). One way to avoid this problem is to only gradually introduce the second term in the cost function (equation 19) using an annealing schedule.
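A common form of such a schedule ramps the weight on the KL term linearly from 0 to 1 over the first phase of training. A sketch (the warm-up length is a hyperparameter of our choosing, not a value from any particular paper):

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL-annealing schedule: the weight on the KL term ramps
    from 0 to 1 over the first warmup_steps updates, then stays at 1."""
    return min(1.0, step / warmup_steps)
```

During training, the loss at update `step` would then be the reconstruction term minus `kl_weight(step)` times the KL term.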

The VAE is an architecture to learn a probability model over $Pr(\mathbf{x})$. This model can generate samples, and interpolate between them (by manipulating the latent variable $\mathbf{h}$) but it is not easy to compute the likelihood for a new example. There are two main alternatives to the VAE. The first is generative adversarial networks. These are good for sampling but their samples have quite different properties from those produced by the VAE. Similarly to the VAE, they cannot evaluate the likelihood of new data points. The second alternative is normalizing flows, for which both sampling and likelihood evaluation are tractable.

In December 2019, many Borealis employees travelled to Vancouver to attend NeurIPS 2019. With almost 1500 accepted papers, there’s a lot of great work to sift through. In this post, some of our researchers describe the papers that they thought were especially important.

by Alex Radovic

Related Papers:

- Neural ordinary differential equations
- GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series
- Legendre memory units: continuous-time representation in recurrent neural networks

**What problem does it solve?** It is a new neural network architecture optimized for irregularly-sampled time series.

**Why this is important?** Real world data is often sparse and/or irregular in time. For example, sometimes data is recorded when a sensor updates in response to some external stimulus, or an agent decides to make a measurement. The timing of these data points can itself be a powerful predictor, so we would like a neural network architecture that is designed to extract that signal.

Zero padding and other workarounds are sometimes used so that a standard LSTM or GRU can be applied. However, this produces a less efficient, sparser data representation that those recurrent networks are not well equipped to handle.

**The approach taken and how it relates to previous work:** Neural ODEs are only a year old but are paving the way for a number of fascinating applications. Neural ODEs reinterpret repeating neural network layers as approximations to a differential equation expressed as a function of depth. As pointed out in the original paper, a Neural ODE can also be a function of some variable of interest (e.g., time).

The original Neural ODE paper does touch on potential uses for time series data, and uses a Neural ODE to generate time series data. This paper describes a modified RNN that has a hidden state that changes both when a new data point comes in and as a function of time between observations. The architecture is a development of a well explored idea where the hidden state decays as some function of time between observations (Cao *et* al., 2018; Mozer *et* al., 2017; Rajkomar *et* al., 2018; Che *et* al., 2018). Now instead of a preset function, the hidden state between observations is the solution to a Neural ODE. Figure 1 shows how information is updated in an ODE-RNN in contrast to other common approaches.
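A minimal sketch of this update loop, with a fixed-step Euler integrator standing in for the adaptive ODE solver used in the paper, and `f_ode` / `gru_update` as hypothetical stand-ins for the learned ODE dynamics and the RNN cell:

```python
import numpy as np

def ode_rnn(times, observations, f_ode, gru_update, h0, dt=0.001):
    """ODE-RNN sketch: the hidden state evolves continuously between
    observations (dh/dt = f_ode(h)) and jumps via a discrete RNN update
    whenever a new observation arrives."""
    h, t_prev = h0, 0.0
    states = []
    for t, x in zip(times, observations):
        # Evolve the hidden state from t_prev to t with fixed-step Euler
        n_steps = max(1, int((t - t_prev) / dt))
        for _ in range(n_steps):
            h = h + dt * f_ode(h)
        # Discrete update at the observation time
        h = gru_update(h, x)
        states.append(h)
        t_prev = t
    return states
```

The key difference from a decaying-state RNN is visible here: the inter-observation behaviour is a learned ODE rather than a preset decay function.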

**Results:** They show state-of-the-art performance at both interpolation and extrapolation on the MuJoCo simulation and PhysioNet datasets. The PhysioNet dataset is particularly exciting as it represents an important real-world scenario consisting of patients' intensive care unit data. The extrapolation results are particularly impressive, as successful interpolation methods often don’t extrapolate well. Figure 2 shows, on a toy dataset, how using an ODE-RNN encoder rather than an RNN encoder leads to much better extrapolation in the ODE decoder.

Related Papers:

- On spectral clustering: analysis and an algorithm
- Sparse subspace clustering: algorithm, theory, and applications

**What problem does it solve?** High-dimensional data (e.g., videos, text) often lies on a low dimensional manifold. Subspace clustering algorithms such as the Sparse Subspace Clustering (SSC) algorithm assume that this manifold can be approximated by a union of lower dimensional subspaces. Such algorithms try to identify these subspaces and associate them with individual data points. This paper improves the speed and memory cost of the SSC algorithm while retaining theoretical guarantees by introducing an algorithm called Selective Sampling-based Scalable Sparse Subspace Clustering (S$^5$C).

**Why this is important? **The method scales to large datasets which was a big practical limitation of the SSC algorithm. It also comes with theoretical guarantees and these empirically translate to improved performance.

**Previous Work:** The original SSC algorithm consists of two steps: representation learning and spectral clustering. The first step learns an affinity matrix $\mathbf{W}$. Intuitively, the $ij$-th component of $\mathbf{W}$ encodes the "similarity" between points $i$ and $j$ (with $W_{ii} = 0$ for all $i$). The matrix $\mathbf{W} = |\mathbf{C}| + |\mathbf{C}|^T$ is made sparse by imposing an $\ell_1$ norm regularizer on the objective functions:

\begin{equation}\label{eq:sccobjective}

\underset{\left(C_{j i}\right)_{j \in[N]} \in \mathbb{R}^{N}}{\operatorname{minimize}} \frac{1}{2}\left\|\mathbf{x}_{i}-\sum_{j \in[N]} C_{j i} \mathbf{x}_{j}\right\|_{2}^{2}+\lambda \sum_{j \in[N]}\left|C_{j i}\right|, \text { subject to } C_{i i}=0. \tag{1}

\end{equation}

where $\mathbf{x}_i \in \mathbb{R}^M$ is a data point in the dataset and the $N$ different objective functions determine the $N$ columns of $\mathbf{C}$.

This $\ell_1$ regularizer should not affect the ability of $\mathbf{W}$ to minimize the unregularized objective if the data points lie in subspaces of lower dimension than the original embedding space. The regularization produces a $\mathbf{W}$ whose non-zero elements suggest that the linked data points lie within the same subspace.

The second step of the SSC algorithm applies spectral clustering to $\mathbf{W}$. The eigenvectors of the Laplacian $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ are computed, where $\mathbf{D}$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. For $N_{c}$ clusters, the eigenvectors associated with the $N_{c}$ smallest non-zero eigenvalues are chosen, normalised, and stacked into a matrix $\mathbf{X} \in \mathbb{R}^{N\times N_c }$. Each row represents a data point and their cluster memberships are determined by further clustering in this $\mathbb{R}^{N_c}$ space using k-means.
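This spectral step can be sketched in a few lines of numpy (a common variant that takes the eigenvectors with the smallest eigenvalues; a production implementation would use a sparse eigensolver and a proper k-means initialization):

```python
import numpy as np

def spectral_clustering(W, n_clusters, n_iter=50, seed=0):
    """Second step of SSC (sketch): embed each point with eigenvectors of the
    normalized Laplacian of the affinity matrix W, then run a basic k-means
    in that embedding."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    X = eigvecs[:, :n_clusters]                 # smallest-eigenvalue eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    # Basic k-means in the spectral embedding
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(n_clusters)])
    return labels
```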

Many other approaches aim to improve SSC. Some algorithms like OMP (Dyer *et* al., 2013; You and Vidal 2015), PIC (Lin and Cohen, 2010), and nearest neighbor SSC (Park *et* al., 2014) retain theoretical guarantees but still suffer from scalability issues. Other fast methods such as EnSC-ORGEN (You *et* al., 2013), and SSSC (Peng *et* al., 2013) exist at the expense of theoretical guarantees and justification.

**Approach taken: **Each of the $N$ components of the objective function in equation 1 has $O(N)$ parameters. The minimization procedure has time and space complexity $O(N^2)$, so it does not scale well in $N$. Instead, S$^5$C aims to solve $N$ problems having $T$ parameters by selecting $T$ vectors that can be used to represent the rest of the dataset. The intuition is that data points from each subspace can be reconstructed from the span of a few selected vectors, denoted by the set $\mathcal{S}$. The vectors in $\mathcal{S}$ are selected incrementally by stochastic approximation of a sub-gradient; a subsample $I\subset [N]$ (with $|I| \ll N$) of vectors from the full dataset is selected and used to estimate which data point in $[N]$ best improves the data representation spanned by $\mathcal{S}$. This step has $\mathcal{O}(\left|I\right|N)$ complexity and is repeated $T$ times to build $\mathcal{S}$, giving a complexity of $\mathcal{O}(\left|I\right|TN)$ instead of $\mathcal{O}(N^3)$ to build the affinity matrix $\mathbf{W}$. This construction ensures that $\mathbf{W}$ has only $\mathcal{O}(N)$ non-zero elements, and hence the eigenvector decomposition can also be done in $\mathcal{O}(N)$ time using orthogonal iteration.

**Results: **Figure 3 shows the linear increase in time as a function of dataset size. Figure 4 shows improvements on the clustering error on many datasets when comparing to other fast algorithms without theoretical guarantees.

Dataset | Nyström | dKK | SSC | SSC-OMP | SSC-ORGEN | SSSC | S$^5$C |
---|---|---|---|---|---|---|---|
Yale B | 76.8 | 85.7 | 33.8 | 35.9 | 37.4 | 59.6 | 39.3 (1.8) |
Hopkins 155 | 21.8 | 20.6 | 4.1 | 23.0 | 20.5 | 21.1 | 14.6 (0.4) |
COIL-100 | 54.5 | 53.1 | 42.5 | 57.9 | 89.7 | 67.8 | 45.9 (0.5) |
Letter-rec | 73.3 | 71.7 | / | 95.2 | 68.6 | 68.4 | 67.7 (1.3) |
CIFAR-10 | 76.6 | 75.6 | / | / | 82.4 | 82.4 | 75.1 (0.8) |
MNIST | 45.7 | 44.6 | / | / | 28.7 | 48.7 | 40.4 (2.3) |
Devanagari | 73.5 | 72.8 | / | / | 58.6 | 84.9 | 67.2 (1.3) |

^{Figure 4. Clustering error in $\%$. Error bars (if available) are in parentheses. Experiments where a time limit of 24 hours or memory limit of 16 GB was exceeded are denoted by $/$. }

Related Papers:

- Variational Inference with normalizing flows
- Density estimation using Real NVP
- The graph neural network model
- Graph attention networks
- GraphRNN: Generating realistic graphs with deep auto-regressive models

**What problem does it solve?** It introduces a new invertible graph neural network that can be used for supervised tasks such as node classification and unsupervised tasks such as graph generation.

**Why this is important?** Graph representation learning has diverse applications from bioinformatics to social networks and transportation. It is challenging due to diverse possible representations and complex structural dependencies among nodes. In recent years, graph neural networks have been the state-of-the-art model for graph representation learning. This paper proposes a new graph neural model with a smaller memory footprint, better scalability, and room for parallel computation. Additionally, this model can be used for graph generation.

**The approach taken and how it relates to previous work: **The model builds on both message passing in graph neural networks and normalizing flows. The idea is to adapt normalizing flows for node feature transformation and generation.

Given a graph with $N$ nodes and node features $\mathbf{H} \in \mathbb{R}^{N \times d_n}$, graph neural networks transform the raw features $\mathbf{H}$ to embedded features that capture the contextual and structural information of the graph. This transformation consists of a series of message passing steps, where step $t$ consists of i) message generation using function $\mathbf{M}_t[\bullet]$ and ii) updating the node features with the aggregated messages of the neighboring nodes using function $\mathbf{U}_t[\bullet]$:

\begin{align}\label{message-passing}

&\mathbf{m}_{t+1}^{(v)} = \mathbf{Agg}\left[\{\mathbf{M}_t[\mathbf{h}_{t}^{(v)}, \mathbf{h}_{t}^{(u)}]\}_{u \in \mathcal{N}_v}\right] \nonumber\\

&\mathbf{h}_{t+1}^{(v)} = \mathbf{U}_t[\mathbf{h}_{t}^{(v)}, \mathbf{m}_{t+1}^{(v)}]. \tag{2}

\end{align}
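A minimal numpy sketch of one such step, using sum aggregation and caller-supplied functions `M` and `U` (hypothetical placeholders for $\mathbf{M}_t[\bullet]$ and $\mathbf{U}_t[\bullet]$):

```python
import numpy as np

def message_passing_step(H, A, M, U):
    """One message-passing step (equation 2) with sum aggregation: node v
    collects messages M(h_v, h_u) from each neighbor u (read off the rows of
    the adjacency matrix A), then updates its features with U."""
    H_new = np.zeros_like(H)
    for v in range(H.shape[0]):
        neighbors = np.nonzero(A[v])[0]
        m = sum(M(H[v], H[u]) for u in neighbors)   # aggregated message m_{t+1}^{(v)}
        H_new[v] = U(H[v], m)
    return H_new
```

In practice the loop over nodes is vectorized, and $\mathbf{M}_t$ and $\mathbf{U}_t$ are small learned networks, but the information flow is exactly this.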

Normalizing flows are generative models which find an invertible mapping $\mathbf{z}=\mathbf{f}[\mathbf{x}]$ to transform the data $\mathbf{x}$ to a latent variable $\mathbf{z}$ with a simple prior distribution (e.g., $\mbox{Norm}_{\mathbf{z}}[\mathbf{0}, \mathbf{I}]$). By sampling $\mathbf{z}$ and applying the inverse function $\mathbf{f}^{-1}[\bullet]$, we can generate data from the target distribution. This paper is based on the RealNVP which uses the mapping:

\begin{align}\label{realnvp_map}

\mathbf{z}^{(0)}&= \mathbf{x}^{(0)} \exp{\left[\mathbf{f}_1\left[\mathbf{x}^{(1)}\right]\right]} + \mathbf{f}_2\left[\mathbf{x}^{(1)}\right] \nonumber\\

\mathbf{z}^{(1)}&=\mathbf{x}^{(1)},\nonumber

\end{align}

where $\mathbf{x}^{(0)}$ and $\mathbf{x}^{(1)}$ are partitions of the input $\mathbf{x}$, and the output $\mathbf{z}$ is obtained by concatenating $\mathbf{z}^{(0)}$ and $\mathbf{z}^{(1)}$. The functions $\mathbf{f}_1[\bullet]$ and $\mathbf{f}_2[\bullet]$ are neural networks. RealNVP cascades these mapping functions with random partitioning at each step.
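This coupling is invertible by construction, because $\mathbf{x}^{(1)}$ (and hence the scale and shift) is still available on the output side. A minimal sketch, with $\mathbf{f}_1$ and $\mathbf{f}_2$ passed in as arbitrary functions:

```python
import numpy as np

def coupling_forward(x0, x1, f1, f2):
    """RealNVP affine coupling: scale and shift x0 conditioned on x1;
    x1 passes through unchanged."""
    z0 = x0 * np.exp(f1(x1)) + f2(x1)
    return z0, x1

def coupling_inverse(z0, z1, f1, f2):
    """Exact inverse: undo the shift, then the scale, using z1 = x1."""
    x0 = (z0 - f2(z1)) * np.exp(-f1(z1))
    return x0, z1
```

Because the Jacobian of this mapping is triangular, its log-determinant is simply the sum of the elements of $\mathbf{f}_1[\mathbf{x}^{(1)}]$, which is what makes likelihood evaluation tractable.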

This paper extends normalizing flows to graphs by applying the RealNVP mapping functions to the node feature matrix $\mathbf{H}$ (figure 5a):

\begin{align}\label{realnvp_graph}

\mathbf{H}_{t+1}^{(0)}&= \mathbf{H}_{t}^{(0)} \exp{\left[\mathbf{F}_1\left[\mathbf{H}_{t}^{(1)}\right]\right]} + \mathbf{F}_2\left[\mathbf{H}_{t}^{(1)}\right] \nonumber\\

\mathbf{H}_{t+1}^{(1)}&=\mathbf{H}_{t}^{(1)},\nonumber

\end{align}

where the functions $\mathbf{F}_1[\bullet]$ and $\mathbf{F}_2[\bullet]$ can be any message-based transformation and are chosen here to be graph attention layers. In practice, an alternating pattern is used for consecutive operations.

In the supervised setting, the raw features are transformed through these layers to perform downstream tasks such as node classification. The resulting network is called *GRevNet*. Ordinary graph neural networks need to store the hidden states after each message passing step to do backpropagation. However, the reversible functions of GRevNet can save memory by reconstructing the hidden states in the backpropagation phase.

In addition, the graph normalizing flow can generate graphs via a two step process. First, a permutation-invariant graph autoencoder is trained to encode the graph to continuous node embeddings $\mathbf{X}$ and use these to reconstruct the adjacency matrix (figure 5b). Here, the encoder is a graph neural network and the decoder is a fixed function that makes nodes adjacent if their respective columns of $\mathbf{X}$ are similar. Second, a graph normalizing flow is trained to map from $\mathbf{z} \sim \mbox{Norm}_{\mathbf{z}}[\mathbf{0},\mathbf{I}]$ to a target distribution of node embeddings $\mathbf{X} \in \mathbb{R}^{N \times d_e}$. We generate from this distribution and use the decoder to generate new adjacency matrices (figure 5c).
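A fixed similarity-based decoder of this kind can be sketched as follows. The precise functional form (and the `threshold` / `temperature` parameters) are our assumptions for illustration, not necessarily the one used in the paper:

```python
import numpy as np

def decode_adjacency(X, threshold=1.0, temperature=0.5):
    """Fixed decoder (sketch): edge probability between two nodes increases as
    their embeddings get closer, via a sigmoid of the squared distance."""
    sq_dist = ((X[:, None] - X[None]) ** 2).sum(-1)
    A = 1.0 / (1.0 + np.exp((sq_dist - threshold) / temperature))
    np.fill_diagonal(A, 0.0)   # no self-loops
    return A
```

Because the decoder has no learned parameters, all of the generative capacity lives in the normalizing flow over node embeddings.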

**Results: **In the supervised context, GRevNet is used to classify documents (on the Cora and Pubmed datasets) and to predict protein-protein interactions (on the PPI dataset), and compares favorably with other approaches. In the unsupervised context, the GNF model is used for graph generation on two datasets, COMMUNITY-SMALL and EGO-SMALL, and is competitive with the popular GraphRNN.

by Jimmy Chen

Related Papers:

- Distinctive image features from scale-invariant keypoints
- L2-net: Deep learning of discriminative patch descriptor in Euclidean space

**What problem does it solve?** Keypoints are pixel locations where local image patches are quasi-invariant to camera viewpoint changes, photometric transformations, and partial occlusion. The goal of this paper is to detect keypoints and extract visual feature vectors from the surrounding image patches.

**Why this is important?** Keypoint detection and local feature description are the foundation of many applications such as image matching and 3D reconstruction.

**The approach taken and how it relates to previous work:** R2D2 proposes a three-branch network to predict keypoint reliability, repeatability and image patch descriptors simultaneously (figure 6). Repeatability is a measure of the degree to which a keypoint can be detected at different scales, under different illuminations, and with different camera angles. Reliability is a measure of how easily the feature descriptor can be distinguished from others. R2D2 proposes a learning process that improves both repeatability and reliability.

Figure 7 shows a toy example of repeatability and reliability that are predicted by R2D2 in two images. The corners of the triangle in the first image are both repeatable and reliable. The grid corners in the second image are repeatable but not reliable as there are many similar corners nearby.

**Results: **R2D2 is tested on the HPatches dataset for image matching. Performance is measured by mean matching accuracy. Figure 8 shows that R2D2 significantly outperforms previous work at nearly all error thresholds. R2D2 is also tested on the Aachen Day-Night dataset for camera relocalization. R2D2 achieves state-of-the-art accuracy with a smaller model size (figure 9). The paper also provides qualitative results and an ablation study.

Although the paper demonstrated impressive results, the audience raised concerns about the keypoint sub-pixel accuracy and computation cost for large images.

Method | #kpts | dim | #weights | 0.5m, 2° | 1m, 5° | 5m, 10° |
---|---|---|---|---|---|---|
RootSIFT [24] | 11K | 128 | - | 33.7 | 52.0 | 65.3 |
HAN+HN [30] | 11K | 128 | 2 M | 37.8 | 54.1 | 75.5 |
SuperPoint [9] | 7K | 256 | 1.3 M | 42.8 | 57.1 | 75.5 |
DELF (new) [32] | 11K | 1024 | 9 M | 39.8 | 61.2 | 85.7 |
D2-Net [11] | 19K | 512 | 15 M | 44.9 | 66.3 | 88.8 |
R2D2, $N$ = 16 | 5K | 128 | 0.5 M | 45.9 | 65.3 | 86.7 |
R2D2, $N$ = 8 | 10K | 128 | 1.0 M | 45.9 | 66.3 | 88.8 |

^{Figure 9. Results for Aachen Day-Night visual localization task. }


**What problem do they solve?** Both papers incorporate unlabeled data into adversarial training to improve adversarial robustness of neural networks.

**Why this is important?** We would like to be able to train neural networks in such a way that they are robust to adversarial attack. This is difficult, but we do not fully understand why. It could be that we need to use significantly larger models than we can currently train. Alternatively, it might be that we need a new training algorithm that has not yet been discovered. Another possibility is that adversarially robust networks have a very high sample complexity and so we just don't use enough data to train a robust model.

These two papers pertain to the latter sample complexity issue. They ask whether we can exploit additional unlabeled data to boost the adversarial robustness of a neural network. Since unlabeled data is relatively abundant this potentially provides a practical way to train adversarially robust models.

**The approach taken and how it relates to previous work: **We assume that each data-label pair $(\mathbf{x}, y)$ is sampled from distribution $\mathcal{D}$, and we are learning a model $Pr(y|\mathbf{x},\boldsymbol\theta)$ that predicts the probability of the label from the data and has parameters $\boldsymbol\theta$. The standard training objective is

\begin{equation}

\min_{\boldsymbol\theta}\left[ \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x},\boldsymbol\theta)\right]\right]\right] \tag{3}

\label{eq:clean_training}

\end{equation}

where $\mbox{xent}[\bullet, \bullet]$ is the cross-entropy loss.

When we train for adversarial robustness, we want the model to make the same prediction within a neighborhood of each data point, and we train the model using the min-max formulation (Madry *et* al., 2017):

\begin{equation}

\min_{\boldsymbol\theta}\left[ \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\max_{\boldsymbol\delta\in B_\epsilon}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x}+\boldsymbol\delta,\boldsymbol\theta)\right]\right]\right]\right] \tag{4}

\label{eq:minmax_training}

\end{equation}

where $B_\epsilon$ is a ball with radius $\epsilon$. In other words, we minimize the maximum cross-entropy loss within a small neighborhood to achieve robustness.
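The inner maximization is typically approximated with projected gradient ascent on the loss. A minimal numpy sketch for a logistic model (a stand-in for a neural network), with an $\ell_\infty$ ball:

```python
import numpy as np

def pgd_attack(x, y, w, b, epsilon, step_size=0.01, n_steps=40):
    """Approximate the inner max of equation 4 for a logistic model
    Pr(y=1|x) = sigmoid(w.x + b): repeatedly step in the direction that
    increases the cross-entropy loss, projecting back into the l_inf ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w                            # d xent / d x for this model
        x_adv = x_adv + step_size * np.sign(grad)     # ascent step (l_inf geometry)
        x_adv = x + np.clip(x_adv - x, -epsilon, epsilon)   # project onto the ball
    return x_adv
```

The outer minimization then trains the model parameters on these worst-case perturbed examples.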

TRADES improved adversarial training by separating the inner maximization term into a classification loss and a regularization loss:

\begin{equation}

\min_\theta \left[\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}\left[\mbox{xent}\left[y, Pr(y|\mathbf{x},\boldsymbol\theta)\right] + \frac{1}{\lambda}\max_{\boldsymbol\delta\in B_\epsilon}\left[\mbox{D}_{KL}\left[Pr(y|\mathbf{x},\boldsymbol\theta)|| Pr(y|\mathbf{x}+\boldsymbol\delta,\boldsymbol\theta)\right]\right]\right]\right] \tag{5}

\label{eq:trades_training}

\end{equation}

where $\mbox{D}_{KL}[\bullet||\bullet]$ is the Kullback--Leibler divergence, and $\lambda$ is a scalar weight.

Both Carmon *et* al. (2019) and Uesato *et* al. (2019) exploit the observation that the regularization term in TRADES doesn't need the true label $y$; it only encourages the label prediction to be similar before and after the perturbation. This makes incorporating unlabeled data very easy: for unlabeled data, we only train on the regularization loss, whereas for labeled data, we train on both the classification loss and the regularization loss.
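A sketch of this loss combination. The helpers below are simplified placeholders (in particular, the inner maximization is done by random search here, whereas the papers use gradient-based attacks):

```python
import numpy as np

def xent(y, p):
    """Cross-entropy for one example, given predicted class probabilities p."""
    return -np.log(p[y])

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def max_kl_over_ball(model, x, epsilon=0.1, n_trials=20, seed=0):
    """Crude stand-in for the inner maximization of the TRADES regularizer:
    random search for the perturbation that most changes the prediction."""
    rng = np.random.default_rng(seed)
    p = model(x)
    return max(kl(p, model(x + rng.uniform(-epsilon, epsilon, size=x.shape)))
               for _ in range(n_trials))

def semisupervised_robust_loss(x_lab, y_lab, x_unlab, model, lam):
    """Labeled examples contribute the classification term plus the robust KL
    term; unlabeled examples contribute only the robust KL term (no label
    needed)."""
    loss = sum(xent(y, model(x)) + (1.0 / lam) * max_kl_over_ball(model, x)
               for x, y in zip(x_lab, y_lab))
    loss += sum((1.0 / lam) * max_kl_over_ball(model, x) for x in x_unlab)
    return loss / (len(x_lab) + len(x_unlab))
```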

In addition to this formulation, Uesato *et* al. (2019) propose an alternative way of using unlabeled data. They first perform natural training on labeled data, and then use the trained model to generate labels $\hat y(\mathbf{x})$ for unlabeled data. Then they combine all the labeled and unlabeled data to perform min-max adversarial training as in equation 4.

Carmon *et* al. (2019) also propose a method that replaces the maximization over a small neighborhood $B_\epsilon(\mathbf{x})$ with a larger additive noise sampled from $\mbox{Norm}_{\mathbf{x}}[\mathbf{0}, \sigma^2\mathbf{I}]$. This alternative is specifically designed for a certified $\ell_2$ defense via randomized smoothing (Cohen *et* al., 2019).

**Results:** After adding unlabeled data into adversarial training, robustness was improved by around 4%. As adversarial robustness needs to be evaluated systematically under different types of attacks and settings, we refer the reader to the original papers for details.

by Leo Long

Related Papers:

**What problem does it solve?** Program synthesis aims to generate source code from a natural language description of a task. This paper presents a program synthesis approach ('Patois') that operates at different levels of abstraction and explicitly interleaves high-level and low-level reasoning at each generation step.

**Why is this important? **Many existing systems operate only at a low level of abstraction, generating one token of the target program at a time. In contrast, humans constantly switch between high-level reasoning (e.g., deciding to use a list comprehension) and token-level reasoning when writing a program.

The system, called Patois, achieves this high/low-level separation by automatically mining common code idioms from a corpus of source code and incorporating them into the model used for synthesizing programs.

Moreover, we can use the mined code idioms as a way to leverage other unlabelled source code corpora since the amount of supervised data for program synthesis (i.e., paired source code and descriptions) is often limited and it is very time-consuming to obtain additional data.

**The approach taken and how it relates to previous work:** The system consists of two steps (figure 10). The goal of the first step is to obtain a set of frequent and useful program fragments, referred to as code idioms. Programs can be equivalently expressed as abstract syntax trees (ASTs). Hence, mining code idioms is treated as a non-parametric problem (Allamanis *et* al., 2018) and framed as inference over a probabilistic tree substitution grammar (pTSG).

The second step exploits these code idioms to augment the synthesis model. This model consists of a natural language encoder and an AST decoder (Yin and Neubig, 2017). At each step, the AST decoder has three possible actions. The first is to expand production rules defined in the original CFG of the source code language, which expands the sub-trees of the program AST. The second is to generate terminal nodes in the AST, such as reserved keywords and variable names. The third type of action is to expand the commonly used code idioms. They are hence effectively added to the output action space of the decoder at each step of generation. The resulting synthesis model is trained to maximize the log-likelihood of the action sequences that construct the right ASTs of the target programs given their natural language specifications.

**Results:** The paper presents experimental results on the Hearthstone and Spider datasets (figure 11). The experimental results show a noticeable improvement of the Patois system over the baseline model, which does not take advantage of the mined code idioms. For a more qualitative analysis, figure 12 presents a few examples of code idioms mined from the Hearthstone and Spider datasets, which correspond nicely to some of the high-level programming patterns for each language.

Model | Exact match | Sentence BLEU | Corpus BLEU |
---|---|---|---|
Baseline | 0.152 | 0.743 | 0.723 |
PATOIS | 0.197 | 0.780 | 0.766 |

^{Figure 11. Results on the Hearthstone dataset. }

- def __init__(self) : super().__init__($\ell_0$ : str, $\ell_1$ : int , CHARACTER_CLASS.$\ell_3$ : id, CARD_RARITY.$\ell_4$ : id, $\ell_5^?$ )
- $\ell_0$ : id = copy.copy($\ell_1$ : expr) class $\ell_0$ : id ($\ell_1$ : id) : def __init__(self):
- SELECT COUNT ( $\ell_0$ : col ), $\ell_1^*$ WHERE $\ell_2^*$ INTERSECT $\ell_4^?$ : sql EXPECT $\ell_5^?$ : sql WHERE $\ell_0$ : col = $terminal

^{Figure 12. Examples of code idioms mined from the Hearthstone and Spider datasets. Adapted from Shin et al. (2019).}

Reinforcement learning (RL) can now produce super-human performance on a variety of tasks, including board games such as chess and go, video games, and multi-player games such as poker. However, current algorithms require enormous quantities of data to learn these tasks. For example, OpenAI Five generates 180 years of gameplay data per day, and AlphaStar used 200 years of Starcraft II gameplay data.

It follows that one of the biggest challenges for RL is *sample efficiency*. In many realistic scenarios, the reward signals are sparse, delayed or noisy which makes learning particularly inefficient; most of the collected experiences do not produce a learning signal. This problem is exacerbated because RL simultaneously learns both the policy (i.e., to make decisions) and the representation on which these decisions are based. Until the representation is reasonable, the system will be unable to develop a sensible policy.

This article focuses on the use of *auxiliary tasks* to improve the speed of learning. These are additional tasks that are learned simultaneously with the main RL goal and that generate a more consistent learning signal. The system uses these signals to learn a shared representation and hence speed up the progress on the main RL task.

An auxiliary task is an additional cost-function that an RL agent can predict and observe from the environment in a self-supervised fashion. This means that losses are defined via surrogate annotations that are synthesized from unlabeled inputs, even in the absence of a strong reward signal.

Auxiliary tasks usually consist of estimating quantities that are relevant to solving the main RL problem. For example, we might estimate depth in a navigation task. However, in other cases, they can be more general. For example, we might try to predict how close the agent is to a terminal state. Accordingly, they may take the form of classification and regression algorithms, or alternatively may maximize reinforcement learning objectives.

We note that auxiliary tasks are different from model-based RL, where a model of how the environment transitions between states given the actions is used to support planning (Oh et al. 2015; Leibfried et al. 2016) and hence ultimately to directly improve the main RL objective. In contrast, auxiliary tasks do not directly improve the main RL objective, but are used to facilitate the representation learning process (Bellemare et al. 2019) and improve learning stability (Jaderberg et al. 2017).

Auxiliary tasks were originally developed for neural networks, where they were referred to as *hints*. Suddarth & Kergosien (1990) argued that for a hint to be effective, it needs to have a "special relationship with the original input-output being learned." They demonstrated that adding auxiliary tasks to a minimal neural network effectively removed local minima.

The idea of adding supplementary cost functions was first used in reinforcement learning by Sutton et al. (2011) in the form of *general value functions* (GVFs). As the name suggests, GVFs are similar to the well-known value functions of reinforcement learning. However, instead of considering environmental rewards, they consider other signals. They differ from auxiliary tasks in that they usually predict long-term features; hence, they employ summation across multiple time-steps, similar to the state-value computation from rewards in standard RL.

Auxiliary tasks are naturally and succinctly implemented by splitting the last layer of the network into multiple parts (heads), each working on a different task. The multiple heads propagate errors back to the shared part of the network, which forms the representations that support all the tasks (Sutton & Barto 2018).

To see how this works in practice, we'll consider Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. 2016), which is a popular and representative actor-critic algorithm. The loss function of A3C is composed of two terms: the policy loss (actor), $\mathcal{L}_{\pi}$, and the value loss (critic), $\mathcal{L}_{v}$. An entropy term $H[\boldsymbol\pi]$ for the policy $\boldsymbol\pi$ is also commonly added; this helps discourage premature convergence to sub-optimal deterministic policies (Mnih et al. 2016). The complete loss function is given by:

\begin{equation}

\mathcal{L}_{\text{A3C}} = \lambda_v \mathcal{L}_{v} + \lambda_{\pi} \mathcal{L}_{\pi} - \lambda_{H} \mathbb{E}_{s \sim \pi} \left[H[\boldsymbol\pi[s]]\right]\tag{1}

\end{equation}

where $s$ is the state and the scalars $\lambda_{v}$, $\lambda_{\pi}$, and $\lambda_{H}$ weight the component losses.

Auxiliary tasks are introduced to A3C via the Unsupervised Reinforcement and Auxiliary Learning (UNREAL) framework (Jaderberg et al. 2017). UNREAL optimizes the loss function:

\begin{equation}

\mathcal{L}_{\text{UNREAL}} = \mathcal{L}_{\text{A3C}} + \sum_i \lambda_{AT_i} \mathcal{L}_{AT_i}\tag{2}

\end{equation}

that combines the A3C loss, $\mathcal{L}_{\text{A3C}}$, together with auxiliary task losses $\mathcal{L}_{AT_i}$, where $\lambda_{AT_i}$ are weight terms (Figure 1). For a single auxiliary task, the loss computation code might look like:

```python
def loss_func_a3c(self, s, a, v_t):
    self.train()
    # Logits from the policy head and values from the value head
    logits, values = self.forward(s)
    # Critic loss: squared TD error
    td = v_t - values
    c_loss = td.pow(2)
    # Actor loss: log-probability of the chosen action, weighted by the
    # TD error (used as the advantage estimate)
    probs = F.softmax(logits, dim=1)
    m = self.distribution(probs)
    a_loss = -m.log_prob(a) * td.detach()
    # Entropy of the policy, to discourage premature convergence
    entropy = -(F.log_softmax(logits, dim=1) * F.softmax(logits, dim=1)).sum(1)
    a3c_loss = CRITIC_LOSS_W * c_loss + ACTOR_LOSS_W * a_loss - ENT_LOSS_W * entropy
    # Auxiliary task loss, added with its own weight
    aux_task_loss = self.aux_task_computation()
    total_loss = (a3c_loss + AUX_TASK_WEIGHT_LOSS * aux_task_loss).mean()
    return total_loss
```

The use of auxiliary tasks is not limited to actor-critic algorithms; they have also been implemented on top of Q-learning algorithms such as DRQN (Hausknecht & Stone 2015). For example, Lample & Chaplot (2017) extended the DRQN architecture with another head used to predict game features. In this case, the loss is the sum of the standard DRQN loss and the cross-entropy loss of the auxiliary task.

We now consider five different auxiliary tasks that have obtained good results in various RL domains. We provide insights as to the applicability of these tasks.

Sutton et al. (2011) speculated:

"

Suppose we are playing a game for which base terminal rewards are +1 for winning and -1 for losing. In addition to this, we might pose an independent question about how many more moves the game will last. This could be posed as a general value function."

The first part of the quote refers to the standard RL problem where we learn to maximize rewards (winning the game). The second part describes an auxiliary task in which we predict how many moves remain before termination.

Kartal et al. (2019) investigated this idea of *terminal prediction*. The agent predicts how close it is to a terminal state while learning the standard policy, with the goal of facilitating representation learning. Kartal et al. (2019) added this auxiliary task to A3C and named this A3C-TP. The architecture was identical to A3C, except for the addition of the terminal state prediction head.

The loss $\mathcal{L}_{TP}$ for terminal state prediction is the mean squared error between the estimated closeness $\hat{y}$ of a given state to the terminal state and the target values $y$, approximately computed from completed episodes:

\begin{equation}

\mathcal{L}_{TP}= \frac{1}{N} \sum_{i=0}^{N}(y_{i} - \hat{y}_{i})^2\tag{3}

\end{equation}

where $N$ represents the episode length during training. The target for the $i^{th}$ state is approximated as $y_{i} = i/N$, implying $y_{N}=1$ for the actual terminal state and $y_{0}=0$ for the initial state of each episode.
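These targets and the loss of equation 3 are straightforward to compute once an episode length is available (a minimal numpy sketch):

```python
import numpy as np

def terminal_prediction_loss(y_hat, N):
    """Terminal prediction loss (equation 3): targets ramp linearly from
    y_0 = 0 at the start of the episode to y_N = 1 at the terminal state."""
    y = np.arange(N + 1) / N            # targets y_i = i / N
    return np.sum((y - y_hat) ** 2) / N
```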

Kartal et al. (2019) initially used the actual current episode length for $N$ to compute the targets $y_{i}$. However, this delays access to the labels until the episode is over, and it did not provide a significant benefit in practice. As an alternative, they approximated the current episode length by the *running average* of episode lengths computed from the most recent $100$ episodes, which provides a dense signal.^{1} This improves learning performance, and it is memory-efficient for distributed on-policy deep RL, as CPU workers do not have to retain the computation graph until episode termination to compute the terminal prediction loss.
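As a minimal sketch (with illustrative helper names of our own), the targets $y_i = i/N$ and the loss of Equation 3 can be computed as follows; in practice, `episode_length` would be the running average of recent episode lengths rather than the true length:

```python
import numpy as np

def terminal_prediction_targets(episode_length):
    """Targets y_i = i / N: 0 at the start of an episode, 1 at the terminal state.

    In practice, episode_length can be the running average of recent episode
    lengths, so targets are available before the episode actually ends.
    """
    return np.arange(episode_length + 1) / episode_length

def terminal_prediction_loss(y_hat, y):
    """Mean squared error between predicted and target closeness to termination."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return np.mean((y - y_hat) ** 2)
```

For example, `terminal_prediction_targets(4)` yields `[0.0, 0.25, 0.5, 0.75, 1.0]`, with the terminal state mapped to 1.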

Since terminal prediction targets are computed in a self-supervised fashion, they have the advantage that they are independent of reward sparsity or any other domain dynamics that might render representation learning challenging (such as drastic changes in domain visuals, which happen in some Atari games). However, terminal prediction is applicable only for episodic environments.

*Agent modeling* (Hernandez-Leal et al. 2019) is an auxiliary task that is designed to work in a multi-agent setting. It takes ideas from game theory and in particular from the concept of *best response*: the strategy that produces the most favorable outcome for a player, taking other players' strategies as given.

The goal of agent modeling is to learn other agents' policies while itself learning a policy. For example, consider a game in which you face an opponent. Here, learning the opponent's behavior is useful to develop a strategy against it. However, agent modeling is not limited to only opponents; it can also model teammates, and can be applied to an arbitrary number of them.

There are two main approaches to implementing the agent modeling task. The first uses the conventional approach of adding new heads for the auxiliary task to a shared network base, as discussed in previous sections. The second uses a more sophisticated architecture in which latent features from the auxiliary network are used as inputs to the main value/policy prediction stream. We consider each in turn.

In this scheme, agents share the same network base, but the outputs represent different agent actions (Foerster et al. 2017). The goal is to predict opponent policies as well as the standard actor and critic, with the key characteristic that the previous layers share parameters (Figure 2a).

This architecture builds on the concept of *parameter sharing*, where the idea is to perform centralized learning.

The AMS architecture uses the loss function:

\begin{equation}

\mathcal{L}_{\text{AMS}}= \mathcal{L}_{\text{A3C}} + \frac{1}{\mathcal{N}} \sum_i^{\mathcal{N}} \lambda_{AM_i} \mathcal{L}_{AM_i}\tag{4}

\end{equation}

where $\lambda_{AM_i}$ is a weight term and $\mathcal{L}_{AM_i}$ is the auxiliary loss for opponent $i$:

\begin{equation}

\mathcal{L}_{AM_i}= -\frac{1}{M} \sum_j^M \sum_{k}^{K} a^j_{ik} \log [\hat{a}^j_{ik}]\tag{5}

\end{equation}

which is the cross entropy loss between the observed one-hot encoded opponent action, $\mathbf{a}^j_{i}$, and the prediction over opponent actions, $\hat{\mathbf{a}}^j_{i}$. Here $i$ indexes the opponents, $j$ indexes time for a trajectory of length $M$, and $k$ indexes the $K$ possible actions.
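A minimal sketch of the per-opponent loss in Equation 5, over a trajectory of $M$ one-hot encoded actions and $K$ possible actions (the function name is our own):

```python
import numpy as np

def agent_modeling_loss(actions_onehot, predicted_probs, eps=1e-8):
    """Cross-entropy between observed one-hot opponent actions (shape M x K)
    and the predicted distribution over opponent actions (shape M x K),
    averaged over the trajectory, as in Equation 5."""
    actions_onehot = np.asarray(actions_onehot, dtype=np.float64)
    predicted_probs = np.asarray(predicted_probs, dtype=np.float64)
    # eps guards against log(0) for actions assigned zero probability.
    return -np.mean(np.sum(actions_onehot * np.log(predicted_probs + eps), axis=1))
```

A uniform prediction over $K$ actions gives a loss of $\log K$, while a perfect prediction drives the loss towards zero.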

*Policy features* (Hong et al. 2018) are intermediate features from the latent space that are used to predict the opponent policy. The AMF framework exploits these features to improve the main reward prediction.

In this architecture, convolutional layers are shared, but the fully connected layers are divided into two sections (Figure 2b). The first is specialized for the actor and critic of the learning agent and the second for the opponent policies. The intermediate opponent policy features, $\mathbf{h}_{i}$, from the second path are used to condition (via an element-wise multiplication) the computation of the actor and critic. The loss function is similar to that for AMS.

Note that both AMS and AMF need to observe the opponent's actions to generate ground truth and for the auxiliary loss function. This is a limitation, and further research is required to handle partially observable environments.

In the previous two sections, we considered auxiliary tasks that related to the structure of the learning (terminal prediction) and to other agents' actions (agent modeling). In this section, we consider predicting the reward received at the next time-step — an idea that seems quite natural in the context of RL. More precisely, given the state sequence $\{\mathbf{s}_{t-3}, \mathbf{s}_{t-2}, \mathbf{s}_{t-1}\}$, we aim to predict the reward $r_t$. Note that this is similar to value learning with $\gamma=0$, so that the agent only cares about the immediate reward.

Jaderberg et al. (2017) formulated this task as multi-class classification with three classes: positive reward, negative reward, or zero. To mitigate data imbalance problems, the same number of samples with zero and non-zero rewards were provided during training.
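A minimal sketch of such class-balanced sampling from a buffer of stored rewards (the helper name is our own, and sampling with replacement is assumed):

```python
import numpy as np

def sample_balanced_batch(rewards, batch_size, rng=None):
    """Sample transition indices so that zero-reward and non-zero-reward
    transitions appear in equal proportion, mitigating the class imbalance
    of the reward-prediction task."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards)
    zero_idx = np.flatnonzero(rewards == 0)
    nonzero_idx = np.flatnonzero(rewards != 0)
    half = batch_size // 2
    return np.concatenate([
        rng.choice(zero_idx, half, replace=True),
        rng.choice(nonzero_idx, batch_size - half, replace=True),
    ])
```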

In general, data imbalance is a disadvantage of reward prediction. This is particularly troublesome for hard-exploration problems with sparse rewards. For example, in the Pommerman game, an episode can last up to 800 timesteps, and the only non-zero reward is obtained at episode termination. Here, class-balancing would require many episodes, and this is in contradiction with the stated goal of using auxiliary tasks (i.e., to speed up learning).

Mirowski et al. (2016) studied auxiliary tasks in a navigation problem in which the agent needs to reach a goal in first-person 3D mazes from a random starting location. If the goal is reached, the agent is re-spawned to a new start location and must return to the goal. The 8 discrete actions permitted rotation and acceleration.

The agent sees RGB images as input. However, the authors speculated that depth information might supply valuable information about how to navigate the 3D environment. Thus, one of the auxiliary tasks is to predict depth, which can be cast as a regression or as a classification problem.

Mirowski et al. (2016) performed different experiments, and we highlight two of these. In the first, they considered using the auxiliary task directly as input to the network, instead of just using it for computing the loss. In the second, they considered where to add the auxiliary task within the network. For example, the auxiliary task module can be set just after the convolutional layers, or after the convolutional and recurrent layers (Figure 3).

The results showed that:

- Using depth as input to the CNN (not shown in the figure above) resulted in worse performance than predicting the depth.
- Treating depth estimation as classification (discretizing over 8 regions) outperformed casting it as regression.
- Placing the auxiliary task after the convolutional and recurrent networks obtained better results than moving it before the recurrent layers.
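The classification variant requires discretizing the depth map into classes. A sketch of this step is shown below; note that we assume uniformly spaced bins over a normalized depth range for simplicity, whereas the original work may use a different quantization:

```python
import numpy as np

def depth_to_classes(depth, num_classes=8, max_depth=1.0):
    """Discretize a (normalized) depth map into num_classes bins so that
    depth prediction can be cast as classification rather than regression.
    Assumes uniformly spaced bin edges over [0, max_depth]."""
    edges = np.linspace(0.0, max_depth, num_classes + 1)[1:-1]
    return np.digitize(np.clip(depth, 0.0, max_depth), edges)
```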

The auxiliary tasks discussed so far have involved estimating various quantities. A *control task* actually tries to manipulate the environment in some way. Jaderberg et al. (2017) proposed *pixel control* auxiliary tasks. An auxiliary control task $c$ is defined by a reward function $r^{c}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ where $\mathcal{S}$ is the space of possible states and $\mathcal{A}$ is the space of available actions. Given a base policy $\pi$, and auxiliary control task with policy $\pi^c$, the learning objective becomes:

\begin{equation}

\mathcal{L}_{pc}= \mathbb{E}_{\pi}[R] + \lambda_{c} \mathbb{E}_{\pi^c}[R^{c}],\tag{6}

\end{equation}

where $R^c$ is the return obtained from the auxiliary control task and $\lambda_{c}$ is a weight. As for previous auxiliary tasks, some parameters are shared with the main task.

The system used off-policy learning; the data was replayed from an experience buffer and the system was optimized using an n-step Q-learning loss:

\begin{equation}

\mathcal{L}^c= \left(R_{t:t+n} + \gamma^n \max_{a'} Q^c(s',a',\boldsymbol\theta^{-})- Q^c(s,a,\boldsymbol\theta)\right)^2\tag{7}

\end{equation}

where $\boldsymbol\theta$ and $\boldsymbol\theta^{-}$ are the current and previous parameters, respectively.
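For a single n-step transition sequence, the squared TD error of Equation 7 can be sketched as follows (names are illustrative):

```python
def n_step_q_loss(rewards, gamma, max_q_next_target, q_current):
    """Squared n-step Q-learning TD error of Equation 7:
    (R_{t:t+n} + gamma^n * max_a' Q^c(s', a'; theta^-) - Q^c(s, a; theta))^2,
    where rewards holds the n immediate rewards r_t, ..., r_{t+n-1} and
    max_q_next_target is max_a' Q^c(s', a') under the previous parameters."""
    n = len(rewards)
    n_step_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    target = n_step_return + gamma ** n * max_q_next_target
    return (target - q_current) ** 2
```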

Jaderberg et al. (2017) noted that changes in the perceptual stream are important since they generally correspond to important events in an environment. Thus, the agent learns a policy for maximally changing the pixels in each cell of an $n \times n$ non-overlapping grid superimposed on the input image. The immediate reward in each cell was defined as the average absolute difference from the previous frame. Results show that these types of auxiliary task can significantly improve learning.
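The per-cell pixel-control reward described above can be sketched as follows, assuming frames whose height and width are divisible by the cell size (the function name is our own):

```python
import numpy as np

def pixel_control_rewards(frame, prev_frame, cell=4):
    """Per-cell auxiliary reward: the average absolute pixel difference from
    the previous frame over each cell x cell patch of a non-overlapping grid.
    Frames are assumed to be arrays of shape (H, W) or (H, W, C), with H and
    W divisible by `cell`."""
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    h, w = diff.shape[:2]
    # Group pixels into non-overlapping cells and average within each cell
    # (and over channels, if present).
    patches = diff.reshape(h // cell, cell, w // cell, cell, -1)
    return patches.mean(axis=(1, 3, 4))
```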

In this section we consider some unresolved challenges associated with using auxiliary tasks in deep RL.

We have seen that adding an auxiliary task is relatively simple. However, the first and most important challenge is to define a *good* auxiliary task. Exactly how to do this remains an open question.

As a step towards this, Du et al. (2018) devised a method to detect *when* a given auxiliary task might be useful. Their approach uses the intuition that an algorithm should take advantage of the auxiliary task when it is helpful for the main task and block it otherwise.

Their proposal has two parts. First, they determine whether the auxiliary task and the main task are related. Second, they modulate how useful the auxiliary task is with respect to the main task using a weight. In particular, they propose to detect when an auxiliary loss $\mathcal{L}_{aux}$ is helpful to the main loss $\mathcal{L}_{main}$ by using the cosine similarity between gradients of the two losses:

**Algorithm 1:** Use of auxiliary task by gradient similarity

**if **$\cos[\nabla_{\theta}\mathcal{L}_{main},\nabla_{\theta} \mathcal{L}_{aux}]\ge 0$ **then**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main} +\nabla_{\theta} \mathcal{L}_{aux}$

**else**

| Update $\theta$ using $\nabla_{\theta}\mathcal{L}_{main}$

**end**

The goal of this algorithm is to avoid adding an auxiliary loss that impedes learning progress for the main task. This is sensible, but would ideally be augmented by a better theoretical understanding of the benefits (Bellemare et al. 2019).
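A minimal sketch of the gate in Algorithm 1, operating on flattened gradient vectors (the function name is our own):

```python
import numpy as np

def combine_gradients(grad_main, grad_aux):
    """Algorithm 1: include the auxiliary gradient only when its cosine
    similarity with the main-task gradient is non-negative; otherwise
    update with the main-task gradient alone."""
    g_main = np.asarray(grad_main, dtype=np.float64)
    g_aux = np.asarray(grad_aux, dtype=np.float64)
    denom = np.linalg.norm(g_main) * np.linalg.norm(g_aux)
    cos_sim = float(g_main @ g_aux) / denom if denom > 0 else 0.0
    return g_main + g_aux if cos_sim >= 0 else g_main
```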

The auxiliary task is always accompanied by a weight, but it is not obvious how to choose its value. Ideally, this weight should give the auxiliary task enough importance to drive representation learning, while being careful that it does not overwhelm the main task.

Weights need not be fixed, and can vary over time. Hernandez-Leal et al. (2019) compared five different weight parameterizations for the agent modeling task. In some the weight was fixed and in others it decayed. Figure 4 shows that, in the same environment, this choice directly affects the performance.

It also appears that the optimal auxiliary weight is dependent on the domain. For the tasks discussed in this post:

- Terminal prediction used a weight of 0.5 (Kartal et al. 2019).
- Reward prediction used a weight of 1.0 (Jaderberg et al. 2017).
- Auxiliary control varied between 0.0001 and 0.01 (Jaderberg et al. 2017).
- Depth prediction chose from the set {1, 3.33, 10} (Mirowski et al. 2016).

In the last part of this article we discuss the two major benefits of using auxiliary tasks: improving performance and increasing robustness.

The main benefit of auxiliary tasks is to drive representation learning and hence improve the agent's performance; the system learns faster and achieves better performance in terms of rewards with an appropriate auxiliary task.

For example, when auxiliary tasks were added to A3C agents, scores improved in domains such as Pommerman (Figure 5), Q*bert (Figure 6a) and the Bipedal walker (Figure 6b). Similar benefits of auxiliary tasks have been shown in Q-learning style algorithms (Lample & Chaplot 2017; Fedus et al. 2019).

The second benefit is related to robustness, and we believe this has been somewhat under-appreciated. One problem with deep RL is the high variance over different runs. This can even happen in the same experiment while just varying the random seed (Henderson et al. 2018). This is a major complication because algorithms sometimes diverge (i.e., they fail to learn) and thus we prefer robust algorithms that can learn under a variety of different values for their parameters.

Auxiliary tasks have been shown to improve robustness of the learning process. In the UNREAL work (Jaderberg et al. 2017), the authors varied two hyperparameters, entropy cost and learning rate, and kept track of the final performance while adding different auxiliary tasks. The results showed that adding auxiliary tasks increased performance over a variety of hyperparameter values (Figure 7).

In this tutorial we reviewed auxiliary tasks in the context of deep reinforcement learning and we presented examples from a variety of domains. Auxiliary tasks have been used to accelerate and provide robustness to the learning process. However, there are still open questions and challenges, such as defining what constitutes a good auxiliary task and forming a better theoretical understanding of how they contribute.

^{1} *Note that the RL agent does not have access to the time stamp or a memory, so it must predict its time relative to the terminal state afresh at each time step.*