The 38th Conference on Computer Vision and Pattern Recognition was held on June 14-19 and, for the first time, was held as a fully virtual conference. The first CVPR was held in 1977 (although under a different name) and had only 63 papers which certainly would have made going through the proceedings a much easier endeavour. Since then it has grown into the premier conference in the field of computer vision with a massive technical program. This year there were the usual highlights at CVPR: the plenary talks (fireside chats with Satya Nadella and Charlie Bell), the award winning papers (presented by Greg Mori, CVPR 2020 program chair and Borealis AI’s Senior Research Director) and the retrospective Longuet-Higgins Prize. (Be sure to check out this excellent post by Michael J Black about one of the Longuet-Higgins Prize winning papers.) But there’s much more to CVPR 2020; this year alone had nearly 1,500 papers along with a wide array of tutorials and workshops. Our researchers have picked out a few of their favourites to highlight.

*Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J. Radke, and Octavia Camps.*

- Grad-CAM: Visual explanations from deep networks via gradient-based localization
- Disentangling by Factorising
- Adapting Grad-CAM for embedding networks

The paper proposed a gradient-based method to explain Variational Autoencoders (VAE) for images. The proposed method is also able to localize anomalies in images, which are examples that are not seen in training.

The method extends the widely-used Grad-CAM method to explain generative models, specifically VAEs. It can produce visual explanations for images that are generated from VAEs. For example, what is the most important image region for the digit 5? Which region of the digit 7 is different from that of digit 1? Visual explanations for these questions provide a clear understanding of the reasoning behind an algorithms' predictions and adds robustness and performance guarantees.

The method is based on Grad-CAM in which the key technique is the choice of differentiable activation. The differentiable activation will be back-propagated to the last CNN layer to obtain channel-wise weights and thus a visual attention map. This paper uses the latent vector $\mathbf{z}$ as the differentiable activation. Specifically, each element $\mathbf{z}_i$ is backpropagated independently to generate an element-wise attention map. Then, the overall attention map is the mean of element-wise attention maps. Figure 1 illustrates the process of element-wise attention generation.

By modifying the differentiable activation to the sum of all elements in the (inferred) mean vector, the method is also able to generate anomaly attention maps. Further, the method can help the VAE to learn improved latent space disentanglement by adding an attention disentanglement loss.

The performance of the method is demonstrated qualitatively and quantitatively.

Figure 2 shows qualitative results on the MNIST dataset for anomaly attention explanations. The method correctly highlights the difference between the training digit and the testing digits. For example, the heatmap highlights a key difference region between the "1'' and the "7'', which is the top-horizontal bar in "7''.

The paper also shows quantitative results on a pedestrian video dataset (UCSD Ped 1) and a more comprehensive anomaly detection dataset (MVTec AD) using pixel-level segmentation scores. Its performance is significantly better than vanilla-VAE on UCSD Ped 1 and moderately better than previous work on MVTec AD.

*Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton and Andrea Tagliasacchi.*

**Related Papers:**

- BSP-Net: Generating Compact Meshes via Binary Space Partitioning
- OctNet: Learning Deep 3D Representations at High Resolutions

This paper presents a new, differentiable representation of 3D shape based on convex polytopes.

There are numerous shape representations that are widely used in computer vision and computer graphics. However, most are extremely expensive to work with, not easily differentiable or impractical to use in some settings. A good, compact representation of 3D shape which is differentiable and efficient to work with would be a boon to a wide range of applications which work with 3D shape, e.g., 3D reconstruction from images, recognition of 3D objects, and rendering and physical simulation of 3D shapes.

There are many different shape representations in common use including voxel grids, signed distance functions, implicit surfaces, meshes, etc. In computer vision one of the most common is in terms of voxels where a 3D object is represented by a 3D grid of values which indicate whether the object is occupying a given point. Voxel grids are convenient because modern machine learning tools like convolutional networks can be naturally used, just as with images. Unfortunately, the memory requirements for voxel representations grows cubicly (i.e., $O(n^3)$) with resolution $n$, quickly outgrowing the available memory on GPUs. This has led to the development of OctTrees and other specialized methods to attempt to save memory. Other approaches, including shape primitives, implicit shape representations and meshes have been used but come with their own challenges, for instance being too simplistic to capture real geometry, non-differentiable or computationally expensive.

Instead, this paper proposes to represent a shape by a combination of convex shapes which are themselves defined by a set of planes. The signed distance from a plane $h$ is defined as

\[

\mathcal{H}_h(\mathbf{x}) = \mathbf{n}_h \cdot \mathbf{x} + d_h

\]

where $\mathbf{n}_h$ is the (unit) normal of the plane and $d_h$ is the distance of the plane from the origin. A convex shape can then be defined as the set of all points which are on the positive side of every plane. To do this smoothly, CvxNet makes use of the Sigmoid and LogSumExp functions, i.e.,

\[

\mathcal{C}(\mathbf{x}) = \textrm{Sigmoid}(-\sigma \textrm{LogSumExp}_h \delta \mathcal{H}_h(\mathbf{x}) )

\]

where $\mathcal{C}(\mathbf{x})$ is close to 1 for points inside the shape and close to 0 for points on the outside. The values $\sigma$ and $\delta$ control the smoothness of the shape. This construction is show in Figure 4. The result of this is a function $\mathcal{C}(\mathbf{x})$ which is fast to compute to determine inside/outside for a given convex shape. More complex (i.e., non-convex) shapes can then be constructed by a union of these convex parts. Further, the underlying planes themselves can be used directly to efficiently construct other representations like polygonal meshes which are convenient for use in simulation.

The paper then goes on to train an autoencoder model of these planes on a database of shapes. The result is a low dimensional latent space and a decoder which can be used for other tasks such as reconstructing 3D shapes from depth or RGB images.

They demonstrate both the fidelity of the overall representation as well as it's usefulness in 3D estimation tasks. Quantitatively, the results shows significant improvement in the depth-to-3D task and competitive performance on RGB-to-3D task. The traditional table of numbers can be found in the paper. However more interestingly, the latent space representation finds natural and semantically meaningful parts (i.e., the individual convex pieces) in an entirely unsupervised manner. These can be seen in Figure 4.

*Organizers: Stephen Gould, Anoop Cherian, Dylan Campbell and Richard Hartley*

- Deep Declarative Networks
- Deep Equilibrium Models
- Residual Flows for Invertible Generative Modeling
- Differentiable Convex Optimization Layers

"Deep Declarative Networks" are a new class of machine learning model which indirectly defines a transformation.

The current success of machine learning to date has been driven by explicitly defining parametric functions which transform inputs into the desired output. However, these models are growing increasingly large (e.g., the recently released GPT-3 has over 175 billion parameters) and data hungry. Further, some recent work has suggested that many of the parameters in these massive models are redundant. In contrast, **declarative networks** operate by defining these transformations indirectly. For instance, the transformation may be the solution to Ordinary Differential Equation, the minimum of an energy function or the root to a non-linear equation. The result are methods which are more compact and efficient and may be the future of machine learning.

This workshop brought together researchers who have been pushing in this direction in one place to both review the results to date and discuss the outstanding problems in the area. For people unfamiliar with this exciting new direction, the talks and papers of this workshop are an excellent starting point. Further, they contain previews of exciting new work to come.

Deep Equilibrium Models, Neural ODEs and Differentiable Convex Optimization Layers were proposed in 2019 and 2018 respectively and this workshop is a natural outgrowth of this interest in finding new ways to define the transformations we aim to learn.

The presented papers and invited talks at this workshop contained many exciting results. However, one notable result was in the talk by Zico Kolter. There he previewed the latest results with Multiscale Deep Equilibrium Models which was just recently released on arXiv. The results, some of which can be seen in Figure 5, show that these models are both competitive with state-of-the-art methods on image problems like classification and semantic segmentation but result in models which are smaller. Further, for size-constrained models (e.g., with limited numbers of parameters or memory usage), these models can significantly outperform existing techniques.

]]>

The partnership is part of Borealis AI’s ongoing commitment to advancing AI in Canada and fostering diversity and inclusion in the field. Borealis AI will be providing mentorship, career advice and online workshops for the 30 women selected to participate in this year’s program, as well as ongoing support for the AI4Good team.

The AI4Good Lab was founded in 2017 in Montreal by Angelique Mannella, Global Alliance Lead at Amazon Web Services, and Dr. Doina Precup, researcher at Mila, McGill University and Deepmind. It's the first program of its kind to combine rigorous teaching in Artificial Intelligence (AI) with tackling diversity and inclusion in research and development, while promoting AI as a tool for social good.

The 2020 AI4Good Lab cohort marks the 4th year of training the next generation of diverse AI leaders, with 110 participants and alumni from across Canada. This year the lab will be held virtually from June 8th to July 28th due to the COVID-19 pandemic.

The 7-week program consists of two parts:

- intensive machine learning training through workshops and lectures by AI experts from academia and industry;
- a prototype development phase, during which the participants will work on AI products to tackle a social good problem of their choosing.

Borealis AI will actively be involved in both parts of the program with presentations and mentorship for the students as well as advice on how to navigate the job market in the AI space.

Speaking about the Lab, co-founder Angelique Mannella explained:

“Creating more diversity in technical environments is hard. While progress is being made, the only way to make sustainable, lasting change is to take an ecosystem approach, where organizations work together to surface new ways of working, new ways of knowledge sharing, and new ways of nurturing talent.”

Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI, said:

“Increasing the number of women working in technology and science is a priority for Borealis AI. We are delighted to support the AI4Good Lab program. We hope this program will provide the participants with new skills and tools to help them develop their careers in this exciting and evolving industry.”

Maya Marcus-Sells, Executive Director at the AI4Good Lab, said:

“Our partnership with Borealis AI helps bring women across Canada into the fast-moving tech ecosystem. Through mentorship, speaking, and career guidance, Borealis AI will provide the participants of the AI4Good Lab with insights and networks into AI careers that will help them grow into the AI leaders of tomorrow, ultimately leading towards a more diverse and representative AI talent pool.”

“Borealis AI's commitment from day one to foster an environment for gender diverse talent and to work with us to create new opportunities for knowledge sharing and mentorship has been invaluable to our participants and alumni and also to our ability to amplify our collective impact across Canada,”added Mannella.

“There is a lot more to be done in this area”said Seiradaki.“Borealis AI will continue to partner with universities, government, and industry to further narrow the gender gap and improve the talent pool through a larger presentation of women in AI.”

Borealis AI is a world-class AI Research center backed by RBC. Recognized for scientific excellence, Borealis AI uses the latest in machine learning capabilities to solve challenging problems in the financial industry. Led by award-winning inventor and entrepreneur, Foteini Agrafioti, and with top North America scientists and engineers, Borealis AI is at the core of the bank’s innovation strategy and benefits from RBC’s scale, data, and trusted brand.

With a focus on responsible AI, natural language processing, and reinforcement learning, Borealis AI is committed to building solutions using machine learning and artificial intelligence that will transform the way individuals manage their finances and their futures. For more information please visit www.borealisai.com.

]]>This talk will provide an update on recent progress in this area. It will start out with novel state-of-the-art methods for the self-play setting. Next, it will introduce the Zero-Shot Coordination setting as a new frontier for multi-agent research. Finally it will introduce Other-Play as a novel learning algorithm, which allows agents to coordinate ad-hoc and biases learning towards more human compatible policies.

]]>However, other optimization problems are much more challenging. Consider *hyperparameter search* in a neural network. Before we a train the network, we must choose the architecture, optimization algorithm, and cost function. These choices are encoded numerically as a vector of hyperparameters. To get the best performance, we must find the hyperparameters for which the resulting trained network is best. This hyperparameter optimization problem has many challenging characteristics:

**Evaluation cost:** Evaluating the function that we wish to maximize (i.e., the network performance) in hyperparameter search is very expensive; we have to train the neural network model and then run it on the validation set to measure the network performance for a given set of hyperparameters.

**Multiple local optima:** The function is not convex and there may be many combinations of hyperparameters that are locally optimal.

**No derivatives:** We do not have access to the gradient of the function with respect to the hyperparameters; there is no easy and inexpensive way to propagate gradients back through the model training / validation process.

**Variable types:** There are a mixture of discrete variables (e.g., the number of layers, number of units per layer and type of non-linearity) and continuous variables (e.g., the learning rate and regularization weights).

**Conditional variables:** The existence of some variables depends on the settings of others. For example, the number of units in layer $3$ is only relevant if we already chose $\geq 3$ layers.

**Noise:** The function may return different values for the same input hyperparameter set. The neural network training process relies on stochastic gradient descent and so we typically don't get exactly the same result every time.

Bayesian optimization is a framework that can deal with optimization problems that have all of these challenges. The core idea is to build a model of the entire function that we are optimizing. This model includes both our current estimate of that function and the uncertainty around that estimate. By considering this model, we can choose where next to sample the function. Then we update the model based on the observed sample. This process continues until we are sufficiently certain of where the best point on the function is.

Let's now put aside the specific example of hyperparameter search and consider Bayesian optimization in its more general form. Bayesian optimization addresses problems where the aim is to find the parameters $\hat{\mathbf{x}}$ that maximize a function $\mbox{f}[\mathbf{x}]$ over some domain $\mathcal{X}$ consisting of finite lower and upper bounds on every variable:

\begin{equation}

\hat{\mathbf{x}} = \mathop{\rm argmax}_{\mathbf{x} \in \mathcal{X}} \left[ \mbox{f}[\mathbf{x}]\right]. \tag{1}

\label{eq:global-opt}

\end{equation}

At iteration $t$, the algorithm can learn about the function by choosing parameters $\mathbf{x}_t$ and receiving the corresponding function value $f[\mathbf{x}_t]$. The goal of Bayesian optimization is to find the maximum point on the function using the minimum number of function evaluations. More formally, we want to minimize the number of iterations $t$ before we can guarantee that we find parameters $\hat{\mathbf{x}}$ such $f[\hat{\mathbf{x}}]$ is less than $\epsilon$ from the true maximum $\hat{f}$.

We'll assume for now that all parameters are continuous, that their existences are not conditional on one another, and that the cost function is deterministic so that it always returns the same value for the same input. We'll return these complications later in this document. To help understand the basic optimization problem let's consider some simple strategies:

**Grid Search:** One obvious approach is to quantize each dimension of $\mathbf{x}$ to form an input grid and then evaluate each point in the grid (figure 1). This is simple and easily parallelizable, but suffers from the curse of dimensionality; the size of the grid grows exponentially in the number of dimensions.

**Random Search:** Another strategy is to specify probability distributions for each dimension of $\mathbf{x}$ and then randomly sample from these distributions (Bergstra and Bengio, 2012). This addresses a subtle inefficiency of grid search that occurs when one of the parameters has very little effect on the function output (see figure 1 for details). Random search is also simple and parallelizable. However, if we are unlucky, we can may either (i) make many similar observations that provide redundant information, or (ii) never sample close to the global maximum.

**Sequential search strategies:** One obvious deficiency of both grid search and random search is that they do not take into account previous measurements. If the measurements are made sequentially then we could use the previous results to decide where it might be strategically best to sample next (figure 2). One idea is that we could *explore* areas where there are few samples so that we are less likely to miss the global maximum entirely. Another approach could *exploit* what we have learned so far by sampling more in relatively promising areas. An optimal strategy would recognize that there is a trade-off between *exploration* and *exploitation* and combine both ideas.

Bayesian optimization is a sequential search framework that incorporates both exploration and exploitation and can be considerably more efficient than either grid search or random search. It can easily be motivated from figure 2; the goal is to build a probabilistic model of the underlying function that will know both (i) that $\mathbf{x}_{1}$ is a good place to sample because the function will probably return a high value here and (ii) that $\mathbf{x}_{2}$ is a good place to sample because the uncertainty here is very large.

A Bayesian optimization algorithm has two main components:

**A probabilistic model of the function:**Bayesian optimization starts with an initial probability distribution (the prior) over the function $f[\bullet]$ to be optimized. Usually this just reflects the fact that we are extremely uncertain about what the function is. With each observation of the function $(\mathbf{x}_t, f[\mathbf{x}_t])$, we learn more and the distribution over possible functions (now called the posterior) becomes narrower.**An acquisition function:**This is computed from the posterior distribution over the function and is defined on the same domain. The acquisition indicates the desirability of sampling each point next and depending on how it is defined it can favor exploration or exploitation.

In the next two sections, we consider each of these components in turn.

There are several ways to model the function and its uncertainty, but the most popular approach is to use Gaussian processes (GPs). We will present other models (Bernoulli-Beta bandits, random forests, and Tree-Parzen estimators) later in this document.

A Gaussian Process is a collection of random variables, where any finite number of these are jointly normally distributed. It is defined by (i) a mean function $\mbox{m}[\mathbf{x}]$ and (ii) a covariance function $k[\mathbf{x},\mathbf{x}']$ that returns the similarity between two points. When we model our function as $\mbox{f}[\mathbf{x}]\sim \mbox{GP}[\mbox{m}[\mathbf{x}],k[\mathbf{x},\mathbf{x}^\prime]]$ we are saying that:

\begin{eqnarray}

\mathbb{E}[\mbox{f}[\mathbf{x}]] &=& \mbox{m}[\mathbf{x}] \tag{2}

\end{eqnarray}

\begin{eqnarray}

\mathbb{E}[(\mbox{f}[\mathbf{x}]-\mbox{m}[\mathbf{x}])(f[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}']. \tag{3}

\end{eqnarray}

The first equation states that the expected value of the function is given by some function $\mbox{m}[\mathbf{x}]$ of $\mathbf{x}$ and the second equation tells us how to compute the covariance of any two points $\mathbf{x}$ and $\mathbf{x}'$. As a concrete example, let's choose:

\begin{eqnarray}

\mbox{m}[\mathbf{x}] &=& 0 \tag{4}

\end{eqnarray}

\begin{eqnarray}

k[\mathbf{x}, \mathbf{x}']

&=&\mbox{exp}\left[-\frac{1}{2}\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)\right], \tag{5}

\end{eqnarray}

so here the expected function values are all zero and the covariance decreases as a function of distance between two points. In other words, points very close to one another of the function will tend to have similar values and those further away will be less similar.

Given observations $\mathbf{f} = [f[\mathbf{x}_{1}], f[\mathbf{x}_{2}],\ldots, f[\mathbf{x}_{t}]]$ at $t$ points, we would like to make a prediction about the function value at a new point $\mathbf{x}^{*}$. This new function value $f^{*} = f[\mathbf{x}^{*}]$ is jointly normally distributed with the observations $\mathbf{f}$ so that:

\begin{equation}

Pr\left(\begin{bmatrix}\label{eq:GP_Joint}

\mathbf{f}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}] & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{6}

\end{equation}

where $\mathbf{K}[\mathbf{X},\mathbf{X}]$ is a $t\times t$ matrix where element $(i,j)$ is given by $k[\mathbf{x}_{i},\mathbf{x}_{j}]$, $\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]$ is a $t\times 1$ vector where element $i$ is given by $k[\mathbf{x}_{i},\mathbf{x}^{*}]$ and so on.

Since the function values in equation 6 are jointly normal, the conditional distribution $Pr(f^{*}|\mathbf{f})$ must also be normal, and we can use the standard formula for the mean and variance of this conditional distribution:

\begin{equation}\label{eq:gp_posterior}

Pr(f^*|\mathbf{f}) = \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{7}

\end{equation}

where

\begin{eqnarray}\label{eq:GP_Conditional}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}]&=&\mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{8}

\end{eqnarray}

Using this formula, we can estimate the distribution of the function at any new point $\mathbf{x}^{*}$. The best estimate of the function value is given by the mean $\mu[\mathbf{x}]$, and the uncertainty is given by the variance $\sigma^{2}[\mathbf{x}]$. Figure 3 shows an example of measuring several points on a function sequentially and showing how the predicted mean and variance changes for other points.

Now that we have a model of the function and its uncertainty, we will use this to choose which point to sample next. The *acquisition* function takes the mean and variance at each point $\mathbf{x}$ on the function and computes a value that indicates how desirable it is to sample next at this position. A good acquisition function should trade off exploration and exploitation.

In the following sections we'll describe four popular acquisition functions: the upper confidence bound (Srinivas *et* al., 2010), expected improvement (Močkus, 1975), probability of improvement (Kushner, 1964), and Thompson sampling (Thompson, 1933). Note that there are several other approaches which are not discussed here including those based on entropy search (Villemonteix *et* al., 2009, Hennig and Schuler, 2012) and the knowledge gradient (Wu *et* al., 2017).

**Upper confidence bound:** This acquisition function (figure 4a) is defined as:

\begin{align}

\mbox{UCB}[\mathbf{x}^{*}] = \mu[\mathbf{x}^{*}] + \beta^{1/2} \sigma[\mathbf{x}^{*}]. \label{eq:UCB-def} \tag{9}

\end{align}

This favors either (i) regions where $\mu[\mathbf{x}^{*}]$ is large (for exploitation) or (ii) regions where $\sigma[\mathbf{x}^{*}]$ is large (for exploration). The positive parameter $\beta$ trades off these two tendencies.

**Probability of improvement:** This acquisition function computes the likelihood that the function at $\mathbf{x}^{*}$ will return a result higher than the current maximum $\mbox{f}[\hat{\mathbf{x}}]$. For each point $\mathbf{x}^{*}$, we integrate the part of the associated normal distribution that is above the current maximum (figure 4b) so that:

\begin{equation}

\mbox{PI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} \mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}] \tag{10}

\end{equation}

**Expected improvement:** The main disadvantage of the probability of improvement function is that it does not take into account how much the improvement will be; we do not want to favor small improvements (even if they are very likely) over larger ones. Expected improvement (figure 4c) takes this into account. It computes the expectation of the improvement $f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}]$ over the part of the normal distribution that is above the current maximum to give:

\begin{equation}

\mbox{EI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} (f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}])\mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{11}

\end{equation}

There also exist methods to allow us to trade-off exploitation and exploration for probability of improvement and expected improvement (see Brochu *et* al., 2010).

**Thompson sampling:** When we introduced Gaussian processes, we only talked about how to compute the probability distribution for a single new point $\mathbf{x}^{*}$. However, it's also possible to draw a sample from the joint distribution of many new points that could collectively represent the entire function. Thompson sampling (figure 4d) exploits this by drawing such a sample from the posterior distribution over possible functions and then chooses the next point $\mathbf{x}$ according to the position of the maximum of this sampled function. To draw the sample, we append an equally spaced set of points to the observed ones as in equation 6, use the conditional formula to find a Gaussian distribution over these points as in equation 8, and then draw a sample from this Gaussian.

Figure 5 shows a complete worked example of Bayesian optimization in one dimension using the upper confidence bound. As we sample more points, the function becomes steadily more certain. The method explores the function but also focuses on promising areas, exploiting what it has already learned.

In the previous section, we summarized the main ideas of Bayesian optimization with Gaussian processes. In this section, we'll dig a bit deeper into some of the practical aspects. We consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, how to exploit parallel sampling of the function, and finally we'll discuss some limitations of the approach.

Until this point, we have assumed that the function that we are estimating is noise-free and always returns the same value $\mbox{f}[\mathbf{x}]$ for a given input $\mathbf{x}$. To incorporate a stochastic output with variance $\sigma_{n}^{2}$, we add an extra noise term to the expression for the Gaussian process covariance:

\begin{eqnarray}

\mathbb{E}[(y[\mathbf{x}]-\mbox{m}[\mathbf{x}])(y[\mathbf{x}]-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}'] + \sigma^{2}_{n}. \tag{12}

\end{eqnarray}

We no longer observe the function values $\mbox{f}[\mathbf{x}]$ directly, but observe noisy corruptions $y[\mathbf{x}] = \mbox{f}[\mathbf{x}]+\epsilon$ of them. The joint distribution of previously observed noisy function values $\mathbf{y}$ and a new unobserved point $f^{*}$ becomes:

\begin{equation}

Pr\left(\begin{bmatrix}

\mathbf{y}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I} & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{13}

\end{equation}

and the conditional probability of a new point becomes:

\begin{eqnarray}\label{eq:noisy_gp_posterior}

Pr(f^{*}|\mathbf{y}) &=& \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{14}

\end{eqnarray}

where

\begin{eqnarray}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}] &=& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{15}

\end{eqnarray}

Incorporating noise means that there is uncertainty about the function even where we have already sampled points (figure 6), and so sampling twice at the same position or at very similar positions could be sensible.

When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.

We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1},& \mathbf{x}_{2},&\cdots,& \mathbf{x}_{I}\end{bmatrix}$, drawing a sample from $\mbox{Norm}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]]$ and then plotting the results. In this section, we'll consider several different choices of covariance function, and use this method to visualize each.

**Squared Exponential Kernel:** In our example above, we used the squared exponential kernel, but more properly we should have included the amplitude $\alpha$ which controls the overall amount of variability and the length scale $\lambda$ which controls the amount of smoothness:

\begin{equation}\label{eq:bo_squared_exp}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \mbox{exp}\left[-\frac{d^{2}}{2\lambda}\right],\nonumber

\end{equation}

where $d$ is the Euclidean distance between the points:

\begin{equation}

d = \sqrt {\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)}. \tag{16}

\end{equation}

When the amplitude $\alpha^{2}$ is small, the function does not vary too much in the vertical direction. When it is larger, there is more variation. When the length scale $\lambda$ is small, the function is assumed to be less smooth and we quickly become uncertain about the state of the function as we move away from known positions. When it is large, the function is assumed to be more smooth and we are increasingly confident about what happens away from these observations (figure 7). Samples from the squared exponential kernel are visualized in figure 8a-c.

**Matérn kernel:** The squared exponential function assumes that the function is infinitely differentiable. The Matérn kernel (figure 8d-l) relaxes this constraint by assuming a certain degree of smoothness $\nu$. The Matérn kernel with $\nu=0.5$ is once differentiable and is defined as

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \exp\left[-\frac{d}{\lambda^{2}}\right], \tag{17}

\end{equation}

where once again, $d$ is the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}'$, $\alpha$ is the amplitude, and $\lambda$ is the length scale. The Matérn kernel with $\nu=1.5$ is twice differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{3}d}{\lambda}\right)\exp\left[-\frac{\sqrt{3}d}{\lambda}\right]. \tag{18}

\end{equation}

The Matérn kernel with $\nu=2.5$ is three times differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{5}d}{\lambda} + \frac{5d^{2}}{3\lambda^{2}}\right)\exp\left[-\frac{\sqrt{5}d}{\lambda}\right]. \tag{19}

\end{equation}

The Matérn kernel with $\nu=\infty$ is infinitely differentiable and is identical to the squared exponential kernel (equation 16).

**Periodic Kernel:** If we believe that the underlying function is oscillatory, we use the periodic function:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}^\prime] = \alpha^{2} \cdot \exp \left[ \frac{-2(\sin[\pi d/\tau])^{2}}{\lambda^2} \right], \tag{20}

\end{equation}

where $\tau$ is the period of the oscillation and the other parameters have the same meanings as before.

A common application for Bayesian optimization is to search for the best hyperparameters of a machine learning model. However, in an ironic twist, the kernel functions used in Bayesian optimization themselves contain unknown hyper-hyperparameters like the amplitude $\alpha$, length scale $\lambda$ and noise $\sigma^{2}_{n}$. There are several possible approaches to choosing these hyperparameters:

**1. Maximum likelihood:** similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):

\begin{eqnarray}\label{eq:bo_learning}

Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)&=&\int Pr(\mathbf{y}|\mathbf{f},\mathbf{x},\boldsymbol\theta)d\mathbf{f}\nonumber\\

&=& \mbox{Norm}_{y}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I}], \tag{21}

\end{eqnarray}

where $\boldsymbol\theta$ contains the unknown parameters in the kernel function and the measurement noise $\sigma^{2}_{n}$.

In Bayesian optimization, we are collecting the observations sequentially, and where we collect them will depend on the kernel parameters, and we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.

**2. Full Bayesian approach:** here we would choose a prior distribution $Pr(\boldsymbol\theta)$ on the kernel parameters of the Gaussian process and combine this with the likelihood in equation 21 to compute the posterior. We then weight the acquisition functions according to this posterior:

\begin{equation}\label{eq:snoek_post}

\hat{a}[\mathbf{x}^{*}]\propto \int a[\mathbf{x}^{*}|\boldsymbol\theta]Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)Pr(\boldsymbol\theta). \tag{22}

\end{equation}

In practice this would usually be done using an Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek *et* al., 2012) and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).

For practical applications like hyperparameter search, we would want to make multiple function evaluations in parallel. In this case, we must consider how to prevent the algorithm from starting a new function evaluation in a place that is already being explored by a parallel thread.

One solution is to use a stochastic acquisition function. For example, Thompson sampling draws from the posterior distribution over the function and samples where this sample is maximal (figure 4d). When we sample several times, we will get different draws from the posterior and hence different values of $\mathbf{x}$.

A more sophisticated approach is to treat the problem in a fully Bayesian way (Snoek et al., 2012). The optimization algorithm keeps track of both the points that have been evaluated and the points that are pending, marginalizing over the uncertainty in the pending points. This can be done using a sampling approach similar to the method in figure 9 for incorporating different length scales. We draw samples from the Gaussians representing the possible pending results and build an acquisition function for each. We then average together these acquisition functions weighted by the probability of observing those results.

In practice, Bayesian optimization with Gaussian Processes works best if we start with a number of points from the function that have already been evaluated. A rule of thumb might be to use random sampling for $\sqrt{d}$ iterations where $d$ is the number of dimensions and then start the Bayesian optimization process. A second useful trick is to occasionally incorporate a random sample into the scheme. This can stop the Bayesian optimization process getting waylaid examining unproductive regions of the space and forces a certain degree of exploration. A typical approach might be to use a random sample every 10 iterations.

The main limitation of Bayesian optimization with GPs is efficiency. As the dimensionality increases, more points need to be evaluated. Unfortunately, the cost of exact inference in the Gaussian process scales as $\mathcal{O}[n^3]$ where $n$ is the number of data points. There has been some work to reduce this cost through different approximations such as:

**Inducing points:**This approach tries to summarize the large number of observed points into a smaller subset known as inducing points (Snelson*et*al., 2006).**Decomposing the kernel:**This approach decomposes the "big" kernel in high dimension into "small" kernels that act on small dimensions (Duvenaud*et*al., 2011).**Using random projections:**This approach relies on random embedding to solve the optimization problem in a lower dimension (Wang*et*al., 2013).

So far, we have considered optimizing continuous variables. What does Bayesian optimization look like in the discrete case? Perhaps we wish to choose which of $K$ discrete conditions (parameter values) yields the best output. In the absence of noise, this problem is trivial; we simply try all $K$ conditions in turn and choose the one that returns the maximum. However, when there is noise on the output, we can use Bayesian optimization to find the best condition efficiently.

The basic approach is model each condition independently. For continuous observations, we could model each output $f_{k}$ with a normal distribution, choose a prior over the mean of the normal and then use the measurements to compute a posterior over this mean. We'll leave developing this model as an exercise for the reader. Instead and for a bit of variety, we'll move to a different setting where the observations are binary we wish to find the configuration that produces the highest proportion of '1's in the output. This setting motivates the Beta-Bernoulli bandit model.

Consider the problem of choosing which of $K$ graphics to present to the user for a web-advert. We assume that for the $k^{th}$ graphic, there is a fixed probability $f_{k}$ that the person will click, but these parameters are unknown. We would like to efficiently choose the graphic that prompts the most clicks.

To solve this problem, we treat the parameters $f_{1}\ldots f_{K}$ as uncertain and place an uninformative Beta distribution prior with $\alpha,\beta=1$ over their values:

\begin{equation}

Pr(f_{k}) = \mbox{Beta}_{f_{k}}\left[1.0, 1.0\right]. \tag{23}

\end{equation}

The likelihood of showing the $k^{th}$ graphic $n_{k}$ times and receiving $c_{k}$ clicks is then

\begin{equation}

Pr(c_{k}|f, n_{k}) = f_{k}^{c_{k}}(1-f_{k})^{n_{k}-c_{k}}, \tag{24}

\end{equation}

and we can combine these two equations via Bayes rule to compute the posterior distribution over the parameter $f_{k}$ (see chapter 4 of Prince, 2012) which will be given by

\begin{equation}

Pr(f_{k}|c_{k},n_{k}) = \mbox{Beta}_{f_{k}}\left[1.0 + c_{k}, 1.0 + n_{k}-c_{k} \right]. \tag{25}

\end{equation}

Now we must choose which value of $k$ to try next given the $k$ posterior distributions over the probabilities $f_{k}$ of getting a click (figure 10). As before, we choose an acquisition function and sample the value of $k$ that maximizes this. In this case, the most practical approach is to use Thompson sampling. We sample from each posterior distribution separately (they are independent) and choose $k$ based on the highest sampled value.

As in the continuous case, this method will trade off exploiting existing knowledge by showing graphics that it knows will generate a high rate of clicks and exploring graphics where the click rate is very uncertain. This model and algorithm are part of a more general literature on bandit algorithms. More information can be found in this book.

When we have many discrete variables (e.g., the orientation, color, font size in an advert graphic), we could treat each combination of variables as one value of $k$ and use the above approach in which each condition is treated independently. However, the number of combinations may be very large and so this is not necessarily practical.

If the discrete variables have a natural order (e.g., font size) then one approach is to treat them as continuous. We amalgamate them into an observation vector $\mathbf{x}$ and use a Gaussian process model. The only complication is that we now only compute the acquisition function at the discrete values that are valid.

If the discrete variables have no natural order then we are in trouble. Gaussian processes depend on the kernel 'distance' between points and it is hard to define such kernels for discrete variables. One approach is to use a one-hot encoding, apply a kernel for each dimension and let the overall kernel be defined by the product of these sub-kernels (Duvenaud *et* al., 2014). However, this is not ideal because there is no way for the model to know about the invalid input values which will be assigned some probability and may be selected as new points to evaluate. One way to move forward is to consider a different underling probabilistic model.

The approaches up to this point can deal with most of the problems that we outline in the introduction, but are not suited to the case where there are many discrete variables (possibly in combination with continuous variables). Moreover, they cannot elegantly handle the case of conditional variables where the existence of some variables is contingent on the settings of others. In this section we consider random forest models and tree-Parzen estimators, both of which can handle these situations.

The *Sequential Model-based Algorithm Configuration* (SMAC) algorithm uses a random forest as an alternative to Gaussian processes. Consider the case where we have made some observations and trained a regression forest. For any point, we can measure the mean of the trees' predictions and their variance (figure 11). This mean and variance are then treated similarly to the equivalent outputs from a Gaussian process model. We apply an acquisition function to choose which of a set of candidate points to sample next. In practice, the forest must be updated as we go along and a simple way to do that is just to split a leaf when it accumulates a certain number of training examples.

Random forests based on binary splits can easily cope with combinations of discrete and continuous variables; it is just as easy to split the data by thresholding a continuous value as it is to split it by dividing a discrete variable into two non-overlapping sets. Moreover, the tree structure makes it easy to accommodate conditional parameters: we do not consider splitting on contingent variables until they are guaranteed by prior choices to exist.

The Tree-Parzen estimator (Bergstra *et* al., 2011) works quite differently from the models that we have considered so far. It describes the likelihood $Pr(\mathbf{x}|y)$ of the data $\mathbf{x}$ given the noisy function value $y$ rather than the posterior $Pr(y|\mathbf{x})$.

More specifically, the goal is to build two separate models $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ where the set $\mathcal{L}$ contains the lowest values of $y$ seen so far and the set $\mathcal{H}$ contains the highest. These sets are created by partitioning the values according to whether they fall below or above some fixed quantile.

The likelihoods $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ are modelled with kernel density estimators; for example, we might describe the likelihood as a sum of Gaussians with a mean on each observed data point $\mathbf{x}$ and fixed variance (figure 12). It can be shown that expected improvement is then maximized by choosing the point that maximizes the ratio $Pr(\mathbf{x}|y\in\mathcal{H})/Pr(\mathbf{x}|y\in\mathcal{L})$.

Tree-Parzen estimators work when we have a mixture of discrete and continuous spaces, and when some parameters are contingent on others. Moreover, the computation scales linearly with the number of data points as opposed to with their cube as for Gaussian processes.

In this tutorial, we have discussed Bayesian optimization, its key components, and applications. For further information, consult the recent surveys by Shahriari *et* al. (2016) and Frazier 2018. Python packages for Bayesian optimization include BoTorch, Spearmint, GPFlow, and GPyOpt.

Code for hyperparameter optimization can be found in the Hyperopt and HPBandSter packages. A popular application of Bayesian optimization is for AutoML which broadens the scope of hyperparameter optimization to also compare different model types as well as choosing their parameters and hyperparameters. Python packages for AutoML include Auto-sklearn, Hyperop-sklearn, and NNI.

]]>The ten Fellows won the awards for their outstanding research capabilities and represent leading universities and AI institutes from provinces across Canada.

These fellowships are part of Borealis AI’s commitment to support Canadian academic excellence in AI and machine learning. They provide financial assistance for exceptional domestic and international graduate students to carry out fundamental research as they pursue their masters and PhDs in various fields of AI. The program is one of a number of Borealis AI initiatives designed to strengthen the partnership between academia and industry and advance the momentum of Canada’s leadership in the AI space.

This year’s winners demonstrated exceptional talent, vision and passion for high-quality research. Backed by some of Canada’s leading AI professors, the projects range from using AI in metabolomics, quantum physics and in areas such as natural language processing, deep learning, uncertainty, and computer vision.

Speaking about the program, Prof. Geoffrey Hinton, Chief Scientific Advisor at Vector Institute, said:

“Deep learning is poised to change the way we work and live and I am proud of the talent and caliber that our Universities have to offer. Canada is a top destination for research in machine learning globally and the Borealis AI Fellowships demonstrate the continuous support of the industry in that regard. Supporting students with the means to conduct their research is very important for our community.”

Foteini Agrafioti, Head of Borealis AI, said:

“

AI was pioneered in Canada, and our universities have trained some of the most prolific experts in the world. There is a huge demand for AI expertise, and that is why we are committed to nurturing talent in this highly critical field in Canada. I’m impressed by the caliber of this year’s winners and am excited to provide them with additional resources to advance their research and kick start their promising careers.”

Ibtihel Amara, a PhD student at McGill University at the Centre for Intelligent Machines (CIM), was awarded a fellowship this year for her work on addressing uncertainty and integrating reliability and trust into modern AI systems. Speaking about her award, Ibtihel said:

“The Borealis AI fellowship is important to me because it means I can focus completely on my research work. This award has also encouraged me to believe in the significance of my research, especially to be chosen among the best candidates in the AI research community. It compels me to achieve more and dream big!”

Click here to meet the class of 2020.

]]>

**School: **Perimeter Institute for Theoretical Physics, University of Waterloo.

**Research areas: **Theoretical physics.

Research topic: Machine Learning for physics and physics for machine learning.

**School: **AMII, University of Alberta.

**Research areas: **Machine learning, deep learning and natural language processing.

Research topic: Towards empathetic conversational AI.

**School: **Concordia Institute for Information System Engineering (CIISE), Concordia University, Montreal.

**Research areas: **Deep learning in health domain applications.

Research topic: Designing optimal deep neural networks for hand gesture recognition and force prediction, and developing domain adaptation algorithms for time-domain features.

**School: **University of Toronto.

**Research areas: **Machine learning.

Research topic: Interplay between optimization, generalization and uncertainty of deep learning.

**School: **Centre for Intelligent Machines (CIM), McGill University.

**Research areas: **Computer vision and artificial intelligence.

Research topic: Towards building reliable deep neural network.

**School: **University of British Columbia.

**Research areas: **Deep learning, chemistry, natural language processing, artificial intelligence, generative models, mass spectrometry.

Research topic: Automated discovery of unknown molecules using deep neural networks.

**School: **Mila, Université de Montréal.

**Research areas: **Optimization and deep learning.

Research topic: A dynamical systems perspective into game optimization.

**School: **Mila, Université de Montréal.

**Research areas: **Natural language processing and deep learning.

Research topic: Learning and modeling neural representations of text.

**School: **Artificial Intelligence and Algorithms laboratories at the University of British Columbia.

**Research areas: **Stochastic processes - neural networks - DNA computing.

Research topic: Mean first passage time and parameter estimation for continous-time Markov Chains.

**School: **McMaster University.

**Research areas: **Deep learning, watermarking, steganography, information-theoretical principles.

Research topic: New deep neural network architectures for blind image watermarking based on the information-theoretic principles.

Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the "shortest successful perturbation", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness.

Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the "correct" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training with the perspective of margin maximization, and provide an alternative interpretation for adversarial training, maximizing either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness.

]]>While sharing data between institutions and communities can help boost AI innovation, this practice runs the risk of exposing sensitive and private information about involved parties. Private and secure sharing of data is imperative and necessary for the AI field to succeed at scale.

Traditionally, a common practice has been to simply delete PII (personal identifiable information)—such as name, social insurance number, home address, or birth date—from data before it is shared. However, “Scrubbing,” as it is often called, is no longer a reliable way to protect privacy, because widespread proliferation of user-generated data has made it possible to reconstruct PII from scrubbed data. For example, in 2009 a Netflix user sued the company because her “anonymous” viewing history could be de-anonymized using other publicly-available data sources, inferring her sexual orientation.

Differentially private synthetic data generation (differential privacy) presents an interesting solution to this problem. In a nutshell, this technology adds “noise” to sensitive data while preserving the statistical patterns from which machine learning algorithms learn, allowing data to be shared safely and innovation to proceed rapidly.

Differential privacy preserves the statistical properties of a data set—the patterns and trends that algorithms care about to drive insights or automate processes—while obfuscating the underlying data themselves. The key idea behind data generation is to mask PII by adding statistical noise. The noisy synthetic data can be shared without compromising users’ privacy but still yields the same aggregate insights as the raw, noiseless data. Compare it to a doctor sharing trends and statistics about a patient base without ever revealing individual patients’ specific details.

Many major technology companies are already using differential privacy. For example, Google has applied differential privacy in the form of RAPPOR, a novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users. Apple also applies statistical noise to mask users’ individual data.

Differential privacy is not a free lunch, however: adding noise makes ML algorithms less accurate, especially with smaller datasets. This allows groups to join forces and safely leverage the combined size of their data to gain new insights. Consider a network of hospitals studying diabetes and needing to use their patient records to construct early diagnostics techniques from their collective intelligence. Each hospital could analyze its own patient records independently however, modern AI systems thrive on massive amounts of data, a reality that can only be practically achieved through large scale merging of patient records. Differential privacy presents a way of achieving that through the sharing of synthetic data and creation of a single, massive—but still privacy-preserving—dataset for scientists.

While differential privacy is not a universal solution, it bridges the gap between the need for individual privacy and the need for statistical insights – opening the doors to new possibilities.

]]>- If we use greedy search or beam search to select the output tokens, then the model's output tends to stay in high probability regions and so the outputs are less diverse than realistic human speech.
- If we sample randomly from the predicted distribution over the output tokens then we get more diverse outputs, but can enter a degenerate state where phrases are repeated if a low probability token is chosen.

It is hypothesized that both of these phenomena are side-effects of using a training criterion that is based on predicting one word of the output sequence at a time and comparing to the ground truth.

In the second part of this tutorial (figure 2), we consider alternative training approaches that compare the complete generated sequence to the ground truth at the sequence level. We'll consider two families of methods; in the first, we take models that have been trained using the maximum likelihood criterion and fine-tune them with a sequence-level cost function -- we'll consider using both reinforcement learning and minimum risk training for this fine tuning. In the second family, we consider structured prediction approaches which aim to train the model from scratch using a sequence level cost function.

Type of method | Training | Inference |
---|---|---|

Decoding algorithms | Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood |
greedy search beam search diverse beam search iterative beam search top-k sampling nucleus sampling |

Sequence-level fine-tuning | Fine-tune with reinforcement learning Fine-tune with minimum risk training Scheduled sampling |
greedy search/beam search beam search |

Sequence-level training | Beam search optimization SeaRNN Reward augmented max likelihood |
greedy search / beam search beam search greedy search / beam search beam search |

Metrics that could provide a sequence-level comparison between the generated sentence and the ground truth include the BLEU (Papineni et al, 2002), ROUGE (Lin 2004), and METEOR (Bannerjee *et al.* 2005) scores. The best known of these is the BLEU score which compares the overlap of n-grams between the gold sequence and a generated sequence. This metric is not perfect because valid generated sentences that express the same ideas as the gold ones but with different words will be rated poorly. Despite its imperfections, the BLEU score has often been used to fine tune language generation models.

Unfortunately, the use of a sequence-level metric introduces a new problem. Unlike the maximum-likelihood criterion, the BLEU score is not easily differentiable. For a given output sentence it can be evaluated but there is no sense of how to smoothly change the sentence to improve it. The solution to this problem is to use reinforcement learning (RL) which, through its reward formalism, allows us to train a model to maximize a non-differentiable quantity. We will train the RL agent to output entire sequences (rather than just the next token) and learn from the feedback given from the BLEU reward function.

Let's briefly recap the reinforcement learning framework. In RL an agent performs actions in an environment and observes the consequences of these actions through (i) changes in the environment and (ii) a numerical reward which indicates whether the agent is achieving its task or not. More formally, an agent can choose actions from a set $\mathcal{A}=\{a_0,..,a_n\}$. At each time $t$ the agent makes an observation $\mathbf{o}_{t}$ and aggregates it with past observations into a state $\mathbf{s}_t$. The agent's behaviour is called a *policy* $\boldsymbol\pi[a|\mathbf{s}_{t}]$ and describes the probability of taking action $a$ in state $\mathbf{s}_{t}$. The agent's goal is to learn a policy which maximizes the expected sum of rewards at each state.

Now let's map the RL framework to neural natural language generation. An action consists of selecting the next token, so there as many possible actions $a$ as there are words $y\in|\mathcal{V}|$ in the vocabulary. The policy $\boldsymbol\pi$ is a distribution over possible actions. Hence, we consider the policy to be the likelihood $Pr(y_{t}|\hat{\mathbf{y}}_{<t},\mathbf{x},\boldsymbol\phi)$ of producing different tokens $y_{t}$ given the previous tokens $\hat{\mathbf{y}}_{<t}$, the input sentence $\mathbf{x}$ and the model parameters $\boldsymbol\phi$. The reward is a computed score like BLEU which is only provided after generating the entire translated sentence; intermediate rewards are zero.^{1} The state $\mathbf{s}_{t}$ corresponds to the state of the decoder. Note that it's not clear what constitutes the environment in this description; we will return to this question later.

The REINFORCE algorithm (Williams 1992) is a policy gradient method that describes the policy $\boldsymbol\pi[a|s,\boldsymbol\phi]$ as a function of parameters $\boldsymbol\phi$ and then manipulates these parameters to improve the rewards received. Translating to the NNLG case, this means that we will manipulate the parameters $\boldsymbol\phi$ of the model (technically we backpropagate through the encoder as well) so that it produces sensible distributions over the words at each step, and the resulting BLEU scores are good.

More specifically, the REINFORCE algorithm maximizes the expected discounted sum of rewards when the actions $a$ are drawn from the policy $\boldsymbol\pi$:

\begin{equation}

J[\boldsymbol\phi] = \mathbb{E}_{a\sim \boldsymbol\pi\left[\boldsymbol\phi\right]}\left[\sum_{t=1}^L\gamma^{t-1}r[\mathbf{s}_t,\hat{y}_{t}]\right], \tag{1}

\end{equation}

where $L$ is the length of a decoding episode, and $r[s_t,\hat{y}_{t}]$ is the reward obtained for generating word $\hat{y}_{t}$ when the decoder has hidden state $\mathbf{s}^t$. The term $\gamma\in(0,1]$ is the discount factor which weights rewards less the further they are into the future. For NNLG, we commonly only receive a reward at the end of the decoding when we compare to the BLEU score, and the intermediate rewards are all zero.

We need the derivative of this expression so we can change the parameters to increase this measure and this derivative is found using the *policy gradient theorem* (Sutton *et al.* 2000). The REINFORCE algorithm uses the resulting expression to estimate the derivative from a set of $I$ decoding results sampled with the current policy $\boldsymbol\pi[\boldsymbol\phi]$. The $i^{th}$ decoding result consists of series of tuples $\{\mathbf{s}_{i,t}, \hat{y}_{i,t}, r_{i,t}\}_{t=1}^{L_{i}}$ where each tuple consists of the decoder state $\mathbf{s}_{i,t}$ , the predicted tokens $\hat{y}_{i,t}$, and the rewards $r_{i,t}$ respectively. The estimated gradient is:

\begin{equation}

\nabla J[\theta] \approx \frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \nabla_{\theta} \log \left[\boldsymbol\pi\left[\hat{y}_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(Q^{\boldsymbol\pi}[\mathbf{s}_{i,t}, \hat{y}_{i,t}]- b[\mathbf{s}_{i,t}]\right) \tag{2}

\end{equation}

where the state-action value function $Q^{\boldsymbol\pi}[\mathbf{s}_{i,t}, \hat{y}_{i,t}]$ is the expected sum of rewards $\mathbb{E}_{\sim \boldsymbol\pi}[\sum_{t'\geq t}^{T}\gamma^{t'-t} r[\mathbf{s}_{i,t'},\hat{y}_{i,t'}]|\mathbf{s}_{i,t},\hat{y}_{i,t}]$ after generating the token $\hat{y}_{i,t}$ given the state $\mathbf{s}_{i,t}$, and $b[\bullet]$ is a *baseline* function which helps reduce variance.

It is common that the state-action value function $Q^{\boldsymbol\pi}(s_{i,t}, \hat{y}_{i,t})$ is also estimated through Monte-Carlo roll-outs, which leads to:

\begin{equation}

\nabla J[\theta] \approx \frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \nabla_{\theta} \log \left[\boldsymbol\pi\left[\hat{y}_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(\sum_{t'=t}^T \gamma^{t'-t} r[\mathbf{s}_{i,t'}, \hat{y}_{i,t'}] - b[\mathbf{s}_{i,t'}]\right). \tag{3}

\label{eq:reinforce}

\end{equation}

In practice, any RL fine-tuning approach must choose a decoding method. Wu *et al.* (2018) compare the performance of beam search and greedy search and find that RL fine-tuning seems to bring the largest improvement for greedy search. They explain this result in terms of *exploration-exploitation trade-off*. At each time step, the agent could choose to apply only the actions that it thinks are most valuable (exploitation), or it could choose to take different actions to increase its confidence in what it believes to be the best actions (exploration). Beam search exploits the current policy more efficiently than greedy search, and to the extent that greedy search is sub-optimal, it also permits more exploration of the sequence space.

RL fine-tuning has been shown to improve the performance of NNLG models for many tasks (Ranzato *et al.* 2017; Bahdanau *et al.* 2017; Strub *et al.* 2017; Das *et al.* 2017; Wu *et al.* 2018). However, the improvement over maximum likelihood training has been small (Wu *et al.* 2018) and although it is theoretically simple to move to an RL objective, in practice, it is not uncommon to have to resort to "tricks" to make it work (Bahdanau *et al.* 2017).

If RL is used as a solution to "fix" problems caused by log-likelihood training, then why not train a model with RL from scratch? Such training requires that the model outputs sequences, observes rewards, and updates its behaviour. However, it is highly unlikely that a randomly-initialized model would generate a sequence of tokens that would be relevant to a sentence to translate. Consequently, the model would never observe any significant reward and would be unable to learn to improve.

A different approach might be to use *off-policy training* which observes trajectories from another model and uses these to learn a policy; it differs from on-policy training in that it learns without ever actually generating tokens itself. For machine translation, the "other model" is the human who generated the gold translations. The NMT model observes these translations as well as the associated rewards, and updates its policy to maximize the expected sum of rewards at each state. To account for the differences between the learned policy and the gold policy, we use importance sampling and Equation 3 becomes:

\begin{eqnarray} \label{eq:offpol}

\nabla J[\theta] &\approx& \\

&&\hspace{-1.3cm}\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{L_{i}} \frac{\boldsymbol\pi\left[y_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]}{q(y_{i,t}|\mathbf{s}_{i,t})}\nabla_{\boldsymbol\phi} \log \left[\boldsymbol\pi\left[y_{i,t}|\mathbf{s}_{i,t},\boldsymbol\phi\right]\right] \left(\sum_{t'=t}^T \gamma^{t'-t} r[\mathbf{s}_{i,t'}, y_{i,t'}] - b[\mathbf{s}_{i,t'}]\right). \nonumber \tag{4}

\end{eqnarray}

where $q(y_{i,t}|\mathbf{s}_{i,t})$ is the probability that a human would generate token $y_{i,t}$ given state $\mathbf{s}_{i,t}$.

This seems convenient, but there is a problem here too. Since the human-generated translation is the gold standard, the reward is always going to be the maximum BLEU score of 1. Consequently, the sum of rewards in Equation 4 is always 1 and the training objective reduces to negative log-likelihood. In fact, this method really only differs from ML training when trajectories are of variable quality and have different rewards, or when rewards for partial sequences are available (Zhou *et al.* 2017; Kandasamy *et al. *2017), and even then we have the additional complication of having to estimate $q[\bullet].$ Overall, training with RL from scratch is not adapted to the usual case where only good generations are available for learning.

Let's return to the question of what exactly constitutes the 'environment' when we frame natural language generation as RL. One of the difficulties of RL training is that the agent must interact with the unknown environment. For each action that the agent takes, it must wait for the environment to provide a new observation to be able to decide its next action. The agent can thus only produce one trajectory at a time.

However, for natural language generation the environment is the decoder itself: it outputs a word and then directly updates its state solely based on this word, without requiring any external input. Therefore, we can run the decoding process as many times as we want since there is no external environment that conditions the states of the decoder.

In this sense, natural language generation might be better framed as *structured prediction*; we treat it as a machine learning problem with multiple outputs that are mutually dependent. For machine translation, these dependencies represent the constraints between words in the output sentence that ensure that it is syntactically correct and semantically meaningful. When we consider natural language generation in this light, we can contemplate methods in which we roll-out as many trajectories as computationally-feasible and learn through sequence-level costs like BLEU. Further discussion of the relationship between RL and structured prediction can be found in Daume (2017), and Kreuzer (2018).

*Minimum risk Training* (Shen *et al.* 2016) is a structured prediction technique that is practically very similar to the REINFORCE algorithm except that is does not only consider one generated translation of the input sentence at a time. Instead, it evaluates multiple possible translations $\{\hat{\mathbf{y}}_{j}\}$, computes a cost $\Delta[\hat{\mathbf{y}}_j, \mathbf{y}_i]$ for each, and weighs each cost by the probability $Pr(\hat{\mathbf{y}}_j|\mathbf{x}_i,\boldsymbol\phi)$ of its trajectory. The overall loss function $\mathcal{L}[\boldsymbol\phi]$ is the total cost:

\begin{equation}

\mathcal{L}[\boldsymbol\phi] = \sum_{i=1}^{I}\left(\frac{1}{Z_{i}}\sum_{\hat{\mathbf{y}}_j \in \mathcal{S}[\mathbf{x}_i]} Pr(\hat{\mathbf{y}}_j|\mathbf{x}_i,\boldsymbol\phi)^{\alpha} \Delta[\hat{\mathbf{y}}_j, \mathbf{y}_i] \right) \tag{5}

\end{equation}

where $\mathcal{S}[\mathbf{x}_i]$ is a selected subset of all possible translations of $\mathbf{x}_i$ and $\Delta[\hat{\mathbf{y}}_i, \mathbf{y}_i]$ computes the risk associated with $\hat{\mathbf{y}}_i$ in the form of a distance between the estimated translation $\hat{\mathbf{y}}_i$ and the gold translation $\mathbf{y}_i$ (e.g., the negative BLEU score). The exponent $\alpha$ controls the sharpness of the weightings of each possible translation and $Z_{i}$ is a normalizing factor that ensures that the exponentiated probabilities sum to one. Note that $\mathcal{S}[\mathbf{x}_i]$ is constructed to always contain the ground truth translation $\mathbf{y}_i$. The remaining candidate translations in this subset are generated by sampling from the model distribution.

This model is efficient, because it evaluates multiple trajectories at once and is a viable alternative to reinforcement learning for fine tuning after maximum likelihood pre-training (Shen *et al.* 2016).

Scheduled sampling tries to address the main problem with the maximum likelihood approach; the decoder is never exposed to its own outputs during the training procedure. The idea is simple: during training, the input to the decoder at time t+1 is the ground truth token $y_t$ with probability $\epsilon$ or the previously decoded token $\hat{y}_{t}$ with probability $1−\epsilon$. The probability $\epsilon$ is adjusted during training: it starts at 1 (where the model learns from ground truth) and progressively decays to 0 (where the model only learns from its own outputs). This is a fine-tuning technique which progressively blends in model predictions.

This approach takes inspiration from the DAgger method in imitation learning (Ross *et al.*, 2011) in which an agent learns from both its own actions and the actions of an oracle which acts expertly. However, unlike imitation learning, scheduled sampling does not rely on a live oracle but instead uses the dataset (Figure 3). Consequently, the corrections provided to the model might not make sense given the model's errors: it optimizes an objective that does not guarantee good behaviour if trained until convergence (Huszár, 2015). Nonetheless, this method is empirically successful and outperforms RL on a paraphrase generation task (Du & Ji, 2019).

The preceding methods trained at the token level and then fine-tuned at the sequence level. The methods in this section combine searching and learning to directly train the model at the sequence level. In the next three sections, we consider beam search optimization, SeaRNN, and reward augmented maximum likelihood respectively.

Similarly to scheduled sampling, *Beam Search Optimization* also uses ground truth data as an oracle. However, unlike scheduled sampling, it tries to maintain semantic and syntactic correctness. It (i) uses a sequence-level cost function and (ii) maintains a beam of hypotheses during the learning procedure. At each point in time, the oracle ensures that the ground truth is among these hypotheses.

Let the score of a partial sequence at time $t$ be $\mbox{s}[w,\hat{\mathbf{y}}_{\leq t}, \mathbf{x}_i,\boldsymbol\phi]$, where $\boldsymbol\phi$ are the weights of the model, $w$ is a token in the vocabulary $\mathcal{V}$, $\hat{\mathbf{y}}_{<t}$ is the sequence decoded so far, and $\mathbf{x}_i$ is the input sentence to be translated. This score is the output of the decoder *before* passing through the softmax function. During decoding, the $K$ most highly scored sequences are retained. Let $\hat{\mathbf{y}}_{<t}^{(K)}$ be the $K$-th ranked sequence at time $t$, so that there are exactly $K-1$ sequences scored more highly than this. The loss function for a single sequence is as follows:

\begin{eqnarray}

\mathcal{L}[\boldsymbol\phi] &=& \sum_{t=1}^{T-1} \Delta\left[\hat{\mathbf{y}}^{(K)}_{\leq t}\right] \left(1 - \mbox{s}\left[w,\hat{\mathbf{y}}_{\leq t}^{(K)}, \mathbf{x},\boldsymbol\phi\right] + \mbox{s}\left[w,\mathbf{y}_{\leq t}, \mathbf{x},\boldsymbol\phi\right]\right) \\ \nonumber

&&\hspace{1cm}+ \Delta\left[\hat{\mathbf{y}}^{(1)}_{\leq T}\right] \left(1 - \mbox{s}\left[w,\hat{\mathbf{y}}_{\leq T}^{(1)}, \mathbf{x},\boldsymbol\phi\right] + \mbox{s}\left[w,\mathbf{y}_{\leq T}, \mathbf{x},\boldsymbol\phi\right]\right), \tag{6}

\end{eqnarray}

where $\Delta[\bullet]$ is a cost that derives from inverse BLEU score of the partial sequence $\hat{\mathbf{y}}^{(K)}_{\leq t}.$ The first term in the loss function encourages the model to score the ground truth translation $\mathbf{y}_{\leq t}$ higher than the $K$-th ranked sequence with a margin. The second term encourages the ground truth translation to be the most highly scored sequence at the last time step of decoding $T$. The function $\Delta[\bullet]$ is defined so that it returns 0 when there is no margin violation and a positive quantity otherwise.

To train this model, we need to update the set $\mathcal{S}_{t}$ of the $K$ most highly scored sequences at each time step $t$ and make sure that the ground truth sequence is in this set. To do this, Wiseman and Rush (2016) propose the following mechanism:

\begin{equation}

\mathcal{S}_t = \mbox{topK}

\begin{cases}

\mbox{succ}[\mathbf{y}_{<t}] & \text{if violation} \\

\bigcup\limits_{k=1}^K \mbox{succ}[\hat{\mathbf{y}}^{(k)}_{<t}] & \text{otherwise},

\end{cases} \tag{7}

\end{equation}

where $\mbox{succ}[w]$ returns the set of all valid sequences that can be formed by appending a token to the sequence $w$. If the gold sequence is in the top-$K$ results then beam search continues as normal. If the gold sequence falls off the beam, this is a *violation* and the model now uses the ground truth as prefix for subsequent generation (figure 4).

In this way, beam search optimization trains the model to output sequences where the ground truth sequence is always the most highly scored one at the end of decoding and is among the top $K$ sequences during decoding. The mistake-specific scoring function $\Delta$ takes into account a sequence-level cost during training.

SeaRNN (LeBlond *et al.* 2018) searches over all possible next tokens, taking into account the subsequent completion of the sequence. It generates the first part of the sequence using the *roll-in* policy, chooses the next token to evaluate and then completes the sentence with a *roll-out* policy. The cost of this full completed sentence can then be evaluated and the best token chosen (figure 5).

In more detail, the roll-in policy is first applied to generate a sequence which is stored together with the corresponding hidden states. The algorithm then steps through this sequence. At step $t$, the roll-out policy takes the partial sequence $\hat{\mathbf{y}}_{<t}$ and a possible next token $w \in \mathcal{V}$, and completes the sentence $\hat{\mathbf{y}}_{>t}$. A cost $\mbox{c}[\{\hat{\mathbf{y}}_{<t},w,\hat{\mathbf{y}}_{>t}\},\mathbf{y}]$ is then computed based on the similarity of the current sequence $\{\hat{\mathbf{y}}_{<t},w,\hat{\mathbf{y}}_{>t}\}$, consisting of partial roll-in, word choice and roll-out, and the ground truth $\mathbf{y}$. In this way we can select the best choice of word $w^{*}$.

LeBlond {*et al.* (2018) propose to use the log-loss for training the system:

\begin{align}

\mathcal{L}[\boldsymbol\phi] &= \sum_{t=1}^T -\log \left(\frac{\exp[s\left[w^{*}_{t},\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]]} {\sum_{i \in \mathcal{V}} \exp\left[s\left[i,\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]\right]} \right), \tag{8}

\end{align}

where $\mbox{s}\left[w^{*}_{t},\hat{\mathbf{y}}_{<t}, \mathbf{x},\boldsymbol\phi\right]$ is the pre-softmax score for the generated sequence $\hat{\mathbf{y}}_{\leq t} = \{\hat{\mathbf{y}}_{<t}, w^*_t\}$ given the input $\mathbf{x}$.

There are three possible types of roll-in and roll-out policy: we can use the model's predictions, the ground truth tokens, or a mix of the model's predictions and the ground truth tokens. Note that if the roll-in and roll-out policies are the ground truth predictions, then this loss reduces to negative log-likelihood since for each $t$, $w^*_t$ will be the ground truth token $y^t$. LeBlond *et al.* (2018) ran experiments for all 9 different combinations and they concluded that it was best to use the model's predictions as roll-in strategy and a mix of the two policies for roll-out.

Although this approach allows us to train the model at the sequence level, the computational cost is too high for practical word-level generation: at each time step of decoding, SeaRNN has to generate sequences for every possible next token and so in practice it is only applied to a reduced vocabulary.

The preceding approaches have relied on the model's own predictions to support sequence-level optimization. Instead, *Reward Augmented Maximum Likelihood* or *RAML* (Norouzi *et al.* 2016) aims to teach the model about the space of good solutions. To this end, the algorithm augments the ground truth dataset with sequences which are known to maximize a reward $r$. RAML uses sequence-level information not to train the model to learn from its own output, but to teach the model about the space of good solutions.

The algorithm exploits the fact that the global minimum of the REINFORCE loss function with reward $r$ and with entropy regularization is achieved when the policy $\boldsymbol\pi[\boldsymbol\phi]$ matches the *exponential payoff distribution*:

\begin{equation}

q(\hat{\mathbf{y}}_i|\mathbf{y}_i,\tau) = \frac{1}{Z(\mathbf{y}_i,\tau)} \exp(r[\hat{\mathbf{y}}_i, \mathbf{y}_i]/\tau), \tag{9}

\end{equation}

where $\tau$ is a temperature parameter, Z is a normalizing term, and $r[\hat{y}_i,y_i]$ is a sequence-level reward (e.g., the BLEU score).

It follows that if we sample values $\hat{\mathbf{y}}_i$ from $q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau)$ we will maximize the regularized expected sum of rewards. Hence, we draw samples from $q$ and train the model to maximize the likelihood of these samples. The resulting training objective is:

\begin{equation}

\mathcal{L}[\boldsymbol\phi] = \sum_{i=1}^{I} \left(-\sum_{\hat{\mathbf{y}}_i \in \mathcal{Y}} q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau) \log[\boldsymbol\pi[\mathbf{y}_i|\mathbf{x}_i,\boldsymbol\phi]]\right), \tag{10}

\end{equation}

where $\mathcal{Y}$ is the set of all possible translations. To minimize $\mathcal{L}$, one must minimize the negative log likelihood of samples weighted according to $q(\bullet|\mathbf{y}_i, \tau)$.

RAML thus serves as a data-augmentation method which not only presents ground truth samples to the model, but also samples that maximize the chosen reward function. Note that sampling from $q(\hat{\mathbf{y}}_i|\mathbf{y}_i, \tau)$ is not straightforward. See Norouzi *et al.* (2016) for more details.

In the first part of this tutorial, we considered training with maximum-likelihood and then presented different decoding algorithms. We discussed that one of the main problems with maximum likelihood training is *exposure bias*: during training the sequential decoderonly sees ground truth tokens whereas during testing it must generate new words based on its own previous outputs.

If the model samples one token from the tail of the distribution at some point during decoding, it enters a space that it has not observed during training so it does know not how to continue the generation and it ends up outputting a sequence of low-quality.^{2} Approaches to avoid this problem fall into four categories:

- During inference, force the model to stay in high-likelihood space to avoid errors that take the model to an unknown part of the output space (top-k sampling, nucleus sampling);
- During inference, force the model to stay in high-likelihood space as it learns from its own outputs (fine-tuning with RL, MRT or scheduled sampling);
- Make the model learn to recover from its own mistakes during training (BSO, SeaRNN);
- Teach the model about the space of
*good*solutions according to a sequence-level reward (RAML).

We speculate that the most successful future approaches will probably combine learning to recover from mistakes during training and preventing the solution from straying away from the known output space during inference.

To conclude the tutorial, we draw attention to several areas that we believe would benefit from further investigation.

**Exploring the space of solutions**: During inference, it is unclear how to constrain exploration. Beam search, top-k sampling, and nucleus sampling all rely on setting a threshold. However, instead of setting an arbitrary threshold, it might be interesting to try to quantify the model's confidence. In particular, approaches inspired from the *safe reinforcement learning* literature might be useful (García *et al.* 2015).

**Model distribution**: Technically encoder-decoder NNLG models are trained to model the distribution $Pr(\bullet|\mathbf{y}_{<t})$ at each time step $t$. However, it is unclear whether they are really successful at learning this distribution and in particular, whether they do use all the history to predict the next token. Since this is a fundamental assumption of NNLG, work that explore when models learn the joint distribution and when they do not would be very enlightening.

**Encoder-decoder relations**: All the work that we have presented here has focused on what is happening in the decoder. The underlying assumption is that it is possible to decouple encoder and decoder representations when investigating decoding strategies. This assumption might not always hold and phenomena induced by the encoder representations might explain some of the behaviour observed during decoding (see El Asri & Trischler, 2019).

**Metrics**: One hindrance to progress in this field has been the lack of reliable automatic metrics (Liu *et al.* 2016) and the fact that the community does not seem to have gathered around a small common set of benchmarks. There is an important body of work on the topic of evaluation (Holtzman *et al.* 2018; Xu *et al.* 2019}) and some metrics to measure various aspects such as coherence and entailment have been proposed and might become more widely used.

**Linguistic considerations**: In this post, we have only focused on the NNLG problem from a machine learning point of view, without any linguistic considerations. We encourage the reader to look into other approaches such as Shen *et al.* (2019) (which received a best paper award at ICLR 2019!) which modifies the RNN architecture to take into account linguistic properties. Following the *#BenderRule*, we should also mention that the work we have described has only been applied to a few languages and that these approaches might not be universal.

In conclusion, on high-resource languages, the field of NNLG has known great empirical success and has made significant progress towards generating coherent text. We hope that this post has provided a useful overview for young researchers or anybody who is considering doing research in this area. This is a truly exciting area where, we believe, most of the building blocks to build coherent and robust models are already in place.

^{1}Note that this is a common setting, but it is possible to use intermediate rewards, e.g., the BLEU scores on partial sequences.

^{2}This phenomenon is known as the problem of cascading errors in the imitation learning literature (Bagnell 2015).

We will not focus on the input type; we assume that the input has been processed by a suitable *encoder* to create an embedding in a latent space. Instead, we concentrate on the *decoder* which takes this embedding and generates sequences of natural language tokens.

We will use the running example of *neural machine translation*: given a sentence in language A, we aim to generate a translation in language B. The input sentence is processed by the encoder to create a fixed-length embedding. The decoder then uses this embedding to output the translation word-by-word.

We'll now describe the *encoder-decoder architecture* for this translation model in more detail (figure 1). The encoder takes the sequence $\mathbf{x}$ which consists of $K$ words $\{x_t\}_{t=1}^{K}$ and outputs the latent space embedding $\mathbf{h}$. The *decoder* takes this latent representation $\mathbf{h}$ and outputs a sequence of $L$ word tokens $\{\hat{y}_t\}_{t=1}^{L}$ one by one.

We consider an encoder which converts each word token to a fixed length word embedding using a method such as the SkipGram algorithm (Mikolov *et* al. 2013). Then these word embeddings are combined by a neural architecture such as a recurrent neural network (Sundermeyer *et* al. 2012), self-attention network (Vaswani *et* al. 2017), or convolutional neural network (Dauphin *et* al. 2017) to create a fixed-length hidden state $\mathbf{h}$ that describes the whole input sequence.

At each step $t$, the decoder takes this hidden state and an input token, and outputs a probability distribution over the vocabulary $\mathcal{V}$ from which the output word $\hat{y}_{t}$ can be chosen. The hidden state itself is also updated, so that it knows about the history of the generated words. The first input token is always a special *start of sentence* (SOS) token. Subsequent tokens correspond to the predicted output word from the previous time-step during inference, or the ground truth word during training. There is a special *end of sentence* (EOS) token. When this is generated, it signifies that no more tokens will be produced.

This tutorial is divided into two parts. In this first part, we assume that the system has been trained with a maximum likelihood criterion and discuss algorithms for the decoder. We will see that maximum likelihood training has some inherent problems related to the fact that the cost function does not consider the whole output sequence at once and we'll consider some possible solutions.

In the second part of the tutorial we change our focus to consider alternative training methods. We consider fine-tuning the system using reinforcement learning or minimum risk training which use sequence-level cost functions. Finally, we review a series of methods that frame the problem as structured prediction. A summary of these methods is given in table 1.

Type of method | Training | Inference |
---|---|---|

Decoding algorithms | Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood |
greedy search beam search diverse beam search iterative beam search top-k sampling nucleus sampling |

Sequence-level fine-tuning | Fine-tune with reinforcement learning Fine-tune with minimum risk training Scheduled sampling |
greedy search/beam search beam search |

Sequence-level training | Beam search optimization SeaRNN Reward augmented max likelihood |
greedy search / beam search beam search greedy search / beam search beam search |

In this section, we describe the standard approach to train encoder-decoder architectures, which uses the maximum likelihood criterion. Specifically, the model is trained to maximize the conditional log-likelihood for each of $I$ input-output sequence pairs $\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^I$ in the training corpus so that:

\begin{equation}\label{eq:local_max_like}

L = \sum_{i=1}^{I} \log\left[ Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)\right] \tag{1}

\end{equation}

where $\boldsymbol\phi$ are the weights of the model.

The probabilities of the output words are evaluated sequentially and each depends on the previously generated tokens, so the probability term in equation 1 takes the form of an auto-regressive model and can be decomposed as:

\begin{equation}

Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)= \prod_{t=1}^{L_i} Pr(y_{i,t} |\mathbf{y}_{i,<t}, \mathbf{x}_i, \boldsymbol\phi). \tag{2}

\end{equation}

Here, the probability of token $y_{i,t}$ from the $i^{th}$ sequence at time $t$ depends on the tokens $\mathbf{y}_{i,<t} = \{y_{i,0}, y_{i,1},\ldots y_{i,t-1}\}$ seen up to this point as well as the input sentence $\mathbf{x}_{i}$ (via the latent embedding from the encoder).

This training criterion seems straightforward, but there is a subtle problem. In training, the previously seen ground truth tokens $\mathbf{y}_{i,<t}$ are used to compute the probability of the current token $y_{i,t}$. However, when we perform inference with the model and generate text, we no longer have access to the ground truth tokens; we only know the actual tokens $\hat{\mathbf{y}}_{i,<t}$ that we have generated so far (figure 2).

The approach of using the ground truth tokens for training is known as *teacher forcing*. In a sense, it means that the training scenario is unrealistic and does not map to the real situation when we perform inference. In training, the model is only exposed to sequences of ground truth tokens, but sees its own output when deployed. As we shall see in the following discussion, this *exposure bias* may result in some problems in the decoding process.

We now return to the decoding (inference) process. At each time step, the system predicts the probability $Pr(\hat{y}_t |\hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)$ over items in the vocabulary, and we have to select a particular word from this distribution to feed into the next decoding step. The goal is to pick the sequence with the highest overall probability:

\begin{equation}

Pr(\hat{\mathbf{y}} | \mathbf{x}, \boldsymbol\phi) = \prod_t Pr(\hat{y}_t | \hat{\mathbf{y}}_{<t},\mathbf{x}, \boldsymbol\phi). \tag{3}

\end{equation}

In principle, we can simply compute the probability of every possible sequence by brute force. However, there are as many choices as there are words $|\mathcal{V}|$ in the vocabulary for each position and so a sentence of length $L$ would have $|\mathcal{V}|^{L}$ possible sequences. The vocabulary size $|\mathcal{V}|$ might be as large as 50,000 words and so this might not be practical.

We can improve the situation by re-using partial computations; there are many sequences which start in the same way and so there is no need to re-compute these partial likelihoods. A dynamic programming approach can exploit this structure to produce an algorithm with complexity $\mathcal{O}[L|\mathcal{V}|^2]$ but this is still very expensive.

Since we cannot find the sequence with the maximum probability, we must resort to tractable search strategies that produce a reasonable approximation of this maximum. Two common approximations are *greedy search* and *beam search* which we discuss respectively in the next two sections.

The simplest strategy is *greedy search*. It consists of picking the most likely token according to the model at each decoding time step $t$ (figure 3a).

\begin{equation}

\hat{y}_t =\underset{w \in \mathcal{V}}{\mathrm{argmax}}\left[ Pr(y_{t} = w | \hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)\right] \tag{4}

\end{equation}

Note that this does not guarantee that the complete output $\hat{\mathbf{y}}$ will have high overall probability relative to others. For example, having selected the most likely first token $y_{0}$, it may transpire that there is no token $y_{1}$ for which the probability $Pr(y_{1}|y_{0},\mathbf{x},\boldsymbol\phi)$ is high. It might have been overall better to choose a less probable first token, which is more compatible with a second token.

We have seen that searching over all possible sequences is intractable, and that greedy search does not necessarily produce a good solution. Beam search seeks a compromise between these extremes by performing a restricted search over possible sequences. In this regard, it produces a solution that is both tractable and superior in quality to greedy search (figure 3b).

At each step of decoding $t$, the $B$ most probable sequences $\mathcal{B}^{t-1} =\{\hat{\mathbf{y}}_{<t,b}\}_{b=1}^{B}$ are stored as candidate outputs. For each of these hypotheses the log probability is computed for each possible next token $w$ in the vocabulary $\mathcal{V}$, so $B|\mathcal{V}|$ probabilities are computed in all. From these, the new $B$ most probable sequences $\mathcal{B}^t =\{\hat{\mathbf{y}}_{<t+1,b}\}_{b=1}^{B}$ are retained. By analogy with the formula for greedy search, we have:

\begin{equation}\label{eq:beam_search}

\mathcal{B}^t =\underset{w \in \mathcal{V},b\in1\ldots B}{\mathrm{argtopk}}\left[B, Pr(y_{t} = w | \hat{\mathbf{y}}_{<t,b}, \mathbf{x}, \boldsymbol\phi)\right] \tag{5}

\end{equation}

where the function $\mathrm{argtopk}[K,\bullet]$ returns the set of the top $K$ items that maximize the second argument. The process is repeated until EOS tokens are produced or the maximum decoding length is reached. Finally, we return the most likely overall result.

The integer $B$ is known as the *beam width*. As this increases, the search becomes more thorough but also more computationally expensive. In practice, it is common to see values of $B$ in the range 5 to 200. When the beam width is 1, the method becomes equivalent to greedy search.

When we train a decoder with a maximum-likelihood criterion, the resulting sentences can exhibit a lack of diversity. This happens at both (i) the beam level (many sentences in the same beam may be very similar) and (ii) the decoding level (words are repeated during one iteration of decoding). In the next two sections we look at methods that have been proposed to ameliorate these issues.

While beam search is superior to greedy search, it often produces sentences that have the same or a very similar start (figure 4a). Decoding is done in a left-to-right fashion and the probability weights are often concentrated at the beginning; even if search is performed over $B$ sentences, the weight of the first few words will mean that most of these sentences start with these words and there is little diversity.

4a) Beam SearchA steam engine train travelling down train tracks.A steam engine train travelling down tracks.A steam engine train travelling through a forest.A steam engine train travelling through a lush green forest.A steam engine train travelling through a lush green countryside.A train on a train track with a sky background. |

4b) Diverse Beam SearchA steam engine train travelling down train tracks.A steam engine train travelling through a forest. An old steam engine train travelling down train tracks. An old steam engine train travelling through a forest. A black train is on the tracks in a wooded area. A black train is on the tracks in a rural area. |

This raises the question of whether the likelihood objective correlates with our end-goals. We might care about other criteria such as diversity, which is important for chatbots: if a chatbot always said the same thing in response to a generic input such as "how are you today?", it would become quickly dull. Hence, we might want to factor in other criteria of quality when decoding.

To counter beam-level repetition, Vijayakumar *et* al. (2018) proposed a variant of beam search, called *diverse beam search*, which encourages more variation in the generated sentences than pure beam search (figure 4b). The beam is divided into $G$ groups. Regular beam search is performed in the first group to generate $B'=\frac{B}{G}$ sentences. For the second group, at step $t$ of decoding, the beam search criterion is augmented with a factor that penalizes token sequences that are similar to the first $t$ words of the hypotheses in the first group. For the third group, sequences that are similar to those in either of the first two groups are penalized, and so on (figure 5).

Vijayakumar *et* al. (2018) investigate several similarity metrics including *Hamming diversity* which penalizes tokens based on their number of occurrences in the previous groups.

Diverse beam search has the disadvantage that it only discourages sequences that are close to the final sequences found in previous beams. However, there may be significant portions of the space that were searched to find these hypotheses and since we didn't store the intermediate results there is nothing to stop us from redundantly considering the same part of the search space again (figure 6).

Kulikov *et* al. (2018) introduced *iterative beam search* which aims to solve this problem. It resembles diverse beam search in that beams (groups of hypotheses) are computed and recorded. These beams are ordered and each is affected by the previous beams. However, unlike diverse beam search we do not wait for a beam search to complete before computing the others. Instead, they are computed concurrently.

Consider the situation where at ouput time $t-1$ we have $G$ groups of beams, each of which contains $B^{\prime}$ hypotheses. We extend the first beam to length $t$ in the usual way; we consider concatenating every possible vocabulary word with each of the $B^{\prime}$ hypotheses, evaluate the probabilities and retain the best overall $B^{\prime}$ solutions of length $t$.

When we extend the second beam to length $t$, we follow the same procedure. However, we now set any hypotheses that are too close in Hamming distance to those in the first beam to have zero probability. Likewise, when we extend the third beam, we discount any hypotheses that are close to those in the first two beams. The result is that each beam is forced to explore a different part of the search space and the final results have increased diversity (figure 7).

Until this point, we have assumed that the best way to decode is by maximizing the probability of the output words, using either greedy search, beam search, or a variation on these techniques. However, Holtzman *et* al. (2019) demonstrate that human speech does not stay in high probability zones and is often much more surprising than the text generated by these methods.

This raises the question of whether we should sample randomly from the output probability distribution rather than search for likely decodings. Unfortunately, this can also lead to degenerate cases. Holtzman *et* al. (2019) conjecture that at some point during decoding, the model is likely to sample from the tail of the distribution (i.e., from the set of tokens which are much less probable than the gold token). Once the model samples from this tail, it might not know how to recover. In the next two sections, we consider two methods that aim to repress such behavior.

Fan *et* al. (2018) proposed *top-$k$ sampling* as a possible remedy. Consider one iteration of decoding at the $t$-th time step. Let us define $\mathcal{V}_{K,t}$ as the set of the $K$ most probable next tokens according to $Pr(y_{t}=w|\hat{y}_{<t},\boldsymbol\phi)$. Let us also define the sum $Z$ of these $K$ probabilities:

\begin{equation}

Z = \sum_{w \in \mathcal{V}_K^t} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi). \tag{6}

\end{equation}

Top-$k$ sampling proposes to re-scale these probabilities and ignore the other possible tokens:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t}|\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_K^t \\

0 & \text{otherwise}. \tag{7}

\end{cases}

\end{equation}

With this strategy, we only sample from the $K$ most likely tokens and thus avoid tokens from the tail of the distribution (their probability is set to zero).

Unfortunately, it can be difficult to fix $K$ in practice. There exist extreme cases where the distribution is very peaked and so the top-$K$ tokens include tokens from the tail. Similarly, there may be cases where the distribution is very flat and valid tokens are excluded from the top-$K$ list.

Nucleus sampling (Holtzman *et* al. 2019) aims to solve this problem by retaining a fixed proportion of the probability mass. They define $\mathcal{V}_\tau^t$ as the smallest set such that:

\begin{equation*}

Z = \sum_{w \in \mathcal{V}^t_\tau} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t}, \boldsymbol\phi) \geq \tau,

\end{equation*}

where $\tau$ is a fixed threshold. Then, as in top-$k$ sampling, probabilities are re-scaled to:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t}|\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_\tau^t \\

0 & \text{otherwise}. \tag{8}

\end{cases}

\end{equation}

Since the set $\mathcal{V}_\tau^t$ is chosen so that the cumulative probability mass is at least $\tau$, nucleus sampling does not suffer in the case of a flat or a peaked distribution.

Part I of this tutorial has highlighted some of the problems that occur during decoding. We can classify these problems into two categories.

First, maximum-likelihood trains the model to stay in high probability regions of the token space. As shown by Holtzman *et* al. (2019), this differs significantly from human speech. If we want to take into account other criteria of quality, such as diversity, search strategies must be put in place to explore the space of likely outputs and greedy sampling or vanilla beam search are not enough.

Second, the exposure bias introduced by teacher forcing forces NNLG models to be myopic, and only look for the next most likely token given a ground truth prefix. As a consequence, if the model samples a token from the long tail, it might enter a "degenerate'' case. When this happens, the model "makes an error'' by sampling a low-probability token and does not know how to recover from this error.

We have presented some sampling strategies that alleviate these issues at inference time. However, since these problems are both side-effects of the maximum likelihood teacher-forcing methodology for training, another way to approach this is to modify the training method. In part II of this tutorial, we describe how Reinforcement Learning (RL) and structured prediction help with both maximum-likelihood and teacher-forcing induced issues.

]]>The structure of this post is as follows. First, we briefly review knowledge graphs and knowledge graph completion in static graphs. Second, we discuss the extension to temporal knowledge graphs. Third, we present our new method for knowledge completion in temporal knowledge graphs and demonstrate the efficacy of this method in a series of experiments. Finally, we draw attention to some possible future directions for work in this area.

Knowledge graphs are knowledge bases of facts where each fact is of the form $(Alice, Likes, Dogs)$. Here $Alice$ and $Dogs$ are called the head and tail entities respectively and $Likes$ is a relation. An example knowledge graph is depicted in figure 1.

KG completion is the problem of inferring new facts from a KG given the existing ones. This may be possible because the new fact is logically implied as in:

\begin{equation*}

(Alice, BornIn, London) \land (London, CityIn, England) \implies (Alice, BornIn, England)

\end{equation*}

or it may just be based on observed correlations. If $(Alice, Likes, Dogs)$ and $(Alice, Likes, Cats)$ then there's a high probability that $(Alice, Likes, Rabbits)$.

For a single relation-type, the problem of knowledge graph completion can be visualised in terms of completing a binary matrix. To see this, consider the simpler knowledge graph depicted in figure 2a, where there are only two types of entities and one type of relation. We can define a binary matrix with the head entities in the rows and the tail entities in the columns (figure 2b). Each known positive relation corresponds to an entry of '1' in this matrix. We do not typically know negative relations. However, we can generate putative negative relations, by randomly sampling combinations of head entity, tail entity and relation. This is reasonable for large graphs where almost all combinations are false. This process is known as negative sampling and these negatives correspond to entries of '0' in the matrix. The remainin missing values in the matrix are the relations that we wish to infer in the KG completion process.

This matrix representation of the single-relation knowledge graph completion problem suggests a way forward. We can consider factoring the binary matrix $\mathbf{M} = \mathbf{A}^{T}\mathbf{B}$ into the outer product of a portrait matrix $\mathbf{A}^{T}$ in which each row corresponds to the head entity and a landscape matrix $\mathbf{B}$ in which each column corresponds to the tail entity. This is illustrated in figure 3. Now the binary value representing whether a given fact is true or not is approximated by the dot product of the vector (embedding) corresponding to the head entity and the vector corresponding to the tail entity. Hence, the problem of knowledge graph embedding becomes equivalent to learning these embeddings.

More formally, we might define the likelihood of a relation being true as:

\begin{eqnarray}

Pr(a_{i}, Likes, b_{j}) &=& \mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right]\nonumber \\

Pr(a_{i}, \lnot Likes, b_{j}) &=& 1-\mbox{sig}\left[\mathbf{a}_{i}^{T}\mathbf{b}_{j} \right] \tag{1}

\end{eqnarray}

$\mbox{sig}[\bullet]$ is a sigmoid function. The term $\mathbf{a}_{i}$ is the embedding for the $i^{th}$ head entity (from the $i^{th}$ row of the portrait matrix $\mathbf{A}^{T}$) and $\mathbf{b}_{j}$ is the embedding for the $j^{th}$ tail entity $b$ (from the $j^{th}$ column of the landscape matrix $\mathbf{B}$). We can hence learn the embedding matrices $\mathbf{A}$ and $\mathbf{B}$ by maximizing the log likelihood of all of the known relations.

The above discussion considered only the simplified case where there is a single type of relation between entities. However, this general strategy can be extended to the case of multiple relations by considering a three dimensional binary tensor in which the third dimension represents the type of relation (figure 4). During the factorization process, we now also generate a matrix containing embeddings for each type of relation.

In the previous section, we considered KG completion in terms of factorizing a matrix or tensor into matrices of embeddings for the head entity, tail entity and relation.

We can generalize this idea, by retaining the notion of embeddings, but use more general *score functions* than the one implied by factorization to provide scores for each tuple. For example, TransE (Bordes *et* al. 2013) maps each entity and each relation to a vector of size $d$ and defines the score for a tuple $(Alice, Likes, Fish)$ as:

\[-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||\]

where ${z}_{Alice},{z}_{Likes},{z}_{Fish}\in\mathbb{R}^d$, corresponding to the embeddings for $Alice$, $Likes$, and $Fish$, are vectors with learnable parameters. To train, we define a likelihood such that $-|| {z}_{Alice} + {z}_{Likes} - {z}_{Fish}||$ becomes large if $(Alice, Likes, Fish)$ is in the KG and small if $(Alice, Likes, Fish)$ is likely to be false.

Other models map entities and relations to different spaces and/or use different score functions. For a comprehensive list of existing approaches and their advantages and disadvantages, see Nguyen (2017).

Temporal KGs are KGs where each fact can have a timestamp associated with it. An example of a fact in a temporal KG is $(Alice, Liked, Fish, 1995)$. Temporal KG completion (TKGC) is the problem of inferring new temporal facts from a KG based on the existing ones.

Existing approaches for TKGC usually extend (static) KG embedding models by mapping the timestamps to latent representations and updating the score function to take into account the timestamps as well. As an example, TTransE extends TransE by mapping each entity, relation, and timestamp to a vector in $\mathbb{R}^d$ and defining the score function for a tuple $(Alice, Liked, Fish, 1995)$ as:

\[-|| z_{Alice} + z_{Liked} + z_{1995} - z_{Fish}||\]

For a comprehensive list of existing approaches for TKGC and their advantages and disadvantages, see Kazemi *et* al. (2019).

We develop models for TKGC based on an intuitive assumption: to provide a score for, $(Alice, Liked, Fish, 1995)$, one needs to know $Alice$'s and $Fish$'s features in $1995$; providing a score based on their current features or an aggregation of their features over time may be misleading. That is because $Alice$'s personality and the sentiment towards $Fish$ may have been quite different in 1995 as compared to now (figure 5). Consequently, learning a static embedding for each entity - as is done by existing approaches - may be sub-optimal as such a representation only captures an aggregation of entity features during time.

To provide entity features at any given time, we define the entity embedding as a function which takes an entity and a timestamp as input and provides a hidden representation for the entity at that time. Inspired by diachronic word embeddings, we call our proposed embedding a *diachronic embedding (DE)*. In particular, we define the diachronic embedding for an entity $E$ using vector(s) defined as follows:

\begin{equation}

\label{eq:demb}

z^t_E[n]=\begin{cases}

a_E[n] \sigma(w_E[n] t + b_E[n]), & \text{if $1 \leq n\leq \gamma d$}. \\

a_E[n], & \text{if $\gamma d < n \leq d$}. \tag{2}

\end{cases}

\end{equation}

where $a_E\in\mathbb{R}^{d}$ and $w_E,b_E\in\mathbb{R}^{\gamma d}$ are (entity-specific) vectors with learnable parameters, $z^t_E[n]$ indicates the $n^{th}$ element of $z^t_E$ (similarly for $a_E$, $w_E$ and $b_E$), and $\sigma$ is an activation function.

Intuitively, entities may have some features that change over time and some features that remain fixed (figure 6). The first $\gamma d$ elements of $z^t_E$ in Equation (2) capture temporal features and the other $(1-\gamma)d$ elements capture static features. The hyperparameter $\gamma\in[0,1]$ controls the percentage of temporal features. In principle static features can be potentially obtained from the temporal ones if the optimizer sets some elements of $w_E$ in Equation (2) to zero. However, explicitly modeling static features helps reduce the number of learnable parameters and avoid overfitting to temporal signals.

Intuitively, by learning $w_E$s and $b_E$s, the model learns how to turn entity features on and off at different points in time so accurate temporal predictions can be made about them at any time. The terms $a_E$s control the importance of the features. We mainly use $\sin[\bullet]$ as the activation function for Equation (2) because one sine function can model several on and off states (figure 7). Our experiments explore other activation functions as well and provide more intuition.

It is possible to take any static KG embedding model and make it temporal by replacing the entity embeddings with diachronic entity embeddings as in Equation (2). For instance, TransE can be extended to TKGC by changing the score function for a tuple $(Alice, Liked, Fish, 1995)$ as:

\[-|| z^{1995}_{Alice} + z_{Liked} + z^{1995}_{Fish}||\]

where $z^{1995}_{Alice}$ and $z^{1995}_{Fish}$ are defined as in Equation (2). We call the above model DE-TransE where $DE$ stands for diachronic embedding. Besides TransE, we also test extensions of DistMult and SimplE, two effective models for static KG completion. We name the extensions DE-DistMult and DE-SimplE respectively.

**Table 1** Results on ICEWS14, ICEWS05-15, and GDELT. Best results are in bold blue.

**Datasets: **Our datasets are subsets of two temporal KGs that have become standard benchmarks for TKGC: ICEWS and GDELT. For ICEWS, we use the two subsets generated by García-Durán *et* al. (2018): 1- *ICEWS14* corresponding to the facts in 2014 and 2- *ICEWS05-15* corresponding to the facts between 2005 to 2015. For GDELT, we use the subset extracted by Trivedi *et* al. (2017) corresponding to the facts from April 1, 2015 to March 31, 2016. We changed the train/validation/test sets following a similar procedure as in Bordes *et* al. (2013) to make the problem into a TKGC rather than an extrapolation problem.

**Baselines:** Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE, DistMult and SimplE where the timestamps are ignored. From the temporal KG embedding models, we compare to TTransE, HyTE, ConT, and TA-DistMult.

**Metrics:** We report filtered MRR and filtered hit@k measures. These essentially create queries such as $(v, r, ?)$ and measure how well the model predicts the correct answer among possible entities $u'$. See Bordes *et* al. 2013 for details.

Table 1 and figure 8 show the performance of our models compared to several baselines. According to the results, the temporal versions of different models outperform the static counterparts in most cases, thus providing evidence for the merit of capturing temporal information.

DE-TransE outperforms the other TransE-based baselines (TTransE and HyTE) on ICEWS14 and GDELT and gives on-par results with HyTE on ICEWS05-15. This result shows the superiority of our diachronic embeddings compared to TTransE and HyTE. DE-DistMult outperforms TA-DistMult, the only DistMult-based baseline, showing the superiority of our diachronic embedding compared to TA-DistMult. Moreover, DE-DistMult outperforms all TransE-based baselines. Finally, just as SimplE beats TransE and DistMult due to its higher expressivity, our results show that DE-SimplE beats DE-TransE, DE-DistMult, and the other baselines due to its higher expressivity.

We perform several studies to provide a better understanding of our models. Our ablation studies include i) different choices of activation function, ii) using diachronic embeddings for both entities and relations as opposed to using it only for entities, iii) testing the ability of our models in generalizing to timestamps unseen during training, iv) the importance of model parameters in Equation (2), v) balancing the number of static and temporal features in Equation (2), and vi) examining training complications due to the use of sine functions in the model. We refer the readers to the full paper for these experiments.

Our work opens several avenues for future research including:

- We proposed diachronic embeddings for KGs having timestamped facts. Future work may consider extending diachronic embeddings to KGs having facts with time intervals.
- We considered the ideal scenario where every fact in the KG is timestamped. Future work can propose ways of dealing with missing timestamps, or ways of dealing with a combination of static and temporal facts.
- We proposed a specific diachronic embedding in Equation (2). Future work can explore other possible functions.
- An interesting avenue for future research is to use Equation (2) to learn diachronic word embeddings and see if it can perform well in the context of word embeddings as well.

View the code here.

]]>It is common to talk about the variational autoencoder as if it *is* the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$. The final model contains neither the 'variational' nor the 'autoencoder' parts and is better described as a *non-linear latent variable model*.

We'll start this tutorial by discussing latent variable models in general and then the specific case of the non-linear latent variable model. We'll see that maximum likelihood learning of this model is not straightforward, but we can define a lower bound on the likelihood. We then show how the autoencoder architecture can approximate this bound using a Monte Carlo (sampling) method. To maximize the bound, we need to compute derivatives, but unfortunately, it's not possible to compute the derivative of the sampling component. We'll show how to side-step this problem using the reparameterization trick. Finally, we'll discuss extensions of the VAE and some of its drawbacks.

Latent variable models take an indirect approach to describing a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. Instead of directly writing the expression for $Pr(\mathbf{x})$ they model a joint distribution $Pr(\mathbf{x}, \mathbf{h})$ of the data $\mathbf{x}$ and an unobserved latent (or hidden) variable $\mathbf{h}$. They then describe the probability of $Pr(\mathbf{x})$ as a marginalization of this joint probability so that

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}, \mathbf{h}) d\mathbf{h}.\tag{1}

\end{equation}

Typically we describe the joint probability $Pr(\mathbf{x}, \mathbf{h})$ as the product of the *likelihood* $Pr(\mathbf{x}|\mathbf{h})$ and the *prior* $Pr(\mathbf{h})$, so that the model becomes

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}| \mathbf{h}) Pr(\mathbf{h}) d\mathbf{h}.\tag{2}

\end{equation}

It is reasonable to question why we should take this indirect approach to describing $Pr(\mathbf{x})$. The answer is that relatively simple expressions for $Pr(\mathbf{x}| \mathbf{h})$ and $Pr(\mathbf{h})$ can describe a very complex distribution for $Pr(\mathbf{x})$.

A well known latent variable model is the mixture of Gaussians. Here the latent variable $h$ is discrete and the prior $Pr(h)$ is a discrete distribution with one probability $\lambda_{k}$ for each of the $K$ component Gaussians. The likelihood $Pr(\mathbf{x}|h)$ is a Gaussian with a mean $\boldsymbol\mu_{k}$ and covariance $\boldsymbol\Sigma_{k}$ that depends on the value $k$ of the latent variable $h$:

\begin{eqnarray}

Pr(h=k) &=& \lambda_{k}\nonumber \\

Pr(\mathbf{x} |h = k) &=& \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}].\label{eq:mog_like_prior}\tag{3}

\end{eqnarray}

where $\mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu, \boldsymbol\Sigma]$ represents a multivariate probability distribution over $\mathbf{x}$ with mean $\boldsymbol\mu$ and covariance $\boldsymbol\Sigma$.

As in equation 2, the likelihood $Pr(x)$ is given by the marginalization over the latent variable $h$. In this case, this is a sum as the latent variable is discrete:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \sum_{k=1}^{K} Pr(\mathbf{x}, h=k) \nonumber \\

&=& \sum_{k=1}^{K} Pr(\mathbf{x}| h=k) Pr(h=k)\nonumber \\

&=& \sum_{k=1}^{K} \lambda_{k} \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}]. \tag{4}

\end{eqnarray}

This is illustrated in figure 1. From the simple expressions for the likelihood and prior in equation 3, we can describe a complex multi-modal probability distribution.

Now let's consider the non-linear latent variable model, which is what the VAE actually learns. This differs from the mixture of Gaussians in two main ways. First, the latent variable $\mathbf{h}$ is continuous rather than discrete and has a standard normal prior (i.e., one with mean zero and identity covariance). Second, the likelihood is a normal distribution as before, but the variance is constant and spherical. The mean is a non-linear function $\mathbf{f}[\mathbf{h},\bullet]$ of the hidden variable $\mathbf{h}$ and this gives rise to the name. The prior and likelihood terms are:

\begin{eqnarray}

Pr(\mathbf{h}) &=& \mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]\nonumber \\

Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi) &=& \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}], \tag{5}

\end{eqnarray}

where the function $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$ is a deep neural network with parameters $\boldsymbol\phi$. This model is illustrated in figure 2.

The model can be viewed as an infinite mixture of spherical Gaussians with different means; as before, we build a complex distribution by weighting and summing these Gaussians in the marginalization process. In the next three sections we consider three operations that we might want to perform with this model: computing the posterior distribution, sampling, and evaluating the likelihood.

The likelihood $Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi)$ tells us how to compute the distribution over the observed data $\mathbf{x}$ given hidden variable $\mathbf{h}$. We might however want to move in the other direction; given an observed data example $\mathbf{x}$ we might wish to understand what possible values of the hidden variable $\mathbf{h}$ were responsible for it (figure 3). This information is encompassed in the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$. In principle, we can compute this using Bayes's rule

\begin{eqnarray}

Pr(\mathbf{h}|\mathbf{x}) = \frac{Pr(\mathbf{x}|\mathbf{h})Pr(\mathbf{h})}{Pr(\mathbf{x})}. \tag{6}

\end{eqnarray}

However, in practice, there is no closed form expression for the left hand side of this equation. In fact, as we shall see shortly, we cannot evaluate the denominator $Pr(\mathbf{x})$ and so we can't even compute the numerical value of the posterior for a given pair $\mathbf{h}$ and $\mathbf{x}$.

Although computing the posterior is intractable, it is easy to generate a new sample $\mathbf{x}^{*}$ from this model using ancestral sampling; we draw $\mathbf{h}^{*}$ from the prior $Pr(\mathbf{h})$, pass this through the network $f[\mathbf{h}^{*},\boldsymbol\phi]$ to compute the mean of the likelihood $Pr(\mathbf{x}|\mathbf{h})$ and then draw $\mathbf{h}$ from this distribution. Both the prior and the likelihood are normal distributions and so sampling from them in each step is easy. This process is illustrated in figure 4.

Finally, let's consider evaluating the likelihood of a data example $\mathbf{x}$ under the model. As before, the likelihood is given by:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \int Pr(\mathbf{x}, \mathbf{h}|\boldsymbol\phi) d\mathbf{h} \nonumber \\

&=& \int Pr(\mathbf{x}| \mathbf{h},\boldsymbol\phi) Pr(\mathbf{h})d\mathbf{h}\nonumber \\

&=& \int \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}]\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]d\mathbf{h}. \tag{7}

\end{eqnarray}

Unfortunately, there is no closed form for this integral, so we cannot easily compute the probability for a given example $\mathbf{x}$. This is a major problem for two reasons. First, evaluating the probability $Pr(\mathbf{x})$ was of the main reasons for modelling the probability distribution in the first place. Second, to learn the model, we maximize the log likelihood, which is obviously going to be hard if we cannot compute it. In the next section we'll introduce a lower bound on the log likelihood which can be computed and which we can use to learn the model.

During learning we are given training data $\{\mathbf{x}_{i}\}_{i=1}^{I}$ and want to maximize the parameters $\boldsymbol\phi$ of the model with respect to the log likelihood. For simplicity we'll assume that the variance term $\sigma^2$ in the likelihood expression is known and just concentrate on learning $\boldsymbol\phi$:

\begin{eqnarray}

\hat{\boldsymbol\phi} &=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[Pr(\mathbf{x}_{i}|\boldsymbol\phi) \right]\right] \nonumber \\

&=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[\int Pr(\mathbf{x}_{i}, \mathbf{h}_{i}|\boldsymbol\phi) d\mathbf{h}_{i}\right]\right].\label{eq:log_like} \tag{8}

\end{eqnarray}

As we noted above, we cannot write a closed form expression for the integral and so we can't just build a network to compute this and let Tensorflow or PyTorch optimize it.

To make some progress we define a lower bound on the log likelihood. This is a function that is always less than or equal to the log likelihood for a given value of $\boldsymbol\phi$ and will also depend on some other parameters $\boldsymbol\theta$. Eventually we will build a network to compute this lower bound and optimize it. To define this lower bound, we need to use Jensen's inequality which we quickly review in the next section.

Jensen’s inequality concerns what happens when we pass values through a concave function $g[\bullet]$. Specifically, it says that if we compute the expectation (mean) of these values and pass this mean through the function, the results will be greater than if we pass the values themselves through the function and then compute the expectation of the results. In mathematical terms:

\begin{equation}

g[\mathbf{E}[y]] \geq \mathbf{E}[g[y]], \tag{9}

\end{equation}

for any concave function $g[\bullet]$. Some intuition as to why this is true is given in figure 5. In our case, the concave function in question is the logarithm so we have:

\begin{equation}

\log[\mathbf{E}[y]]\geq\mathbf{E}[\log[y]], \tag{10}

\end{equation}

or writing out the expression for expectation in full we have:

\begin{equation}

\log\left[\int Pr(y) y dy\right]\geq \int Pr(y)\log[y]dy. \tag{11}

\end{equation}

We will now use Jensen's inequality to derive the lower bound for the log likelihood. We start by multiplying and dividing the log likelihood by an arbitrary probability distribution $q(\mathbf{h})$ over the hidden variables

\begin{eqnarray}

\log[Pr(\mathbf{x}|\boldsymbol\phi)] &=& \log\left[\int Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)d\mathbf{h} \tag{12} \right] \\

&=& \log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \tag{13} \right],

\end{eqnarray}

We then use Jensen's inequality for the logarithm (equation 11) to find a lower bound:

\begin{eqnarray}

\log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \right]

&\geq& \int q(\mathbf{h}) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})} \right]d\mathbf{h}, \tag{14}

\end{eqnarray}

where the term on the right hand side is known as the *evidence lower bound* or *ELBO*. It gets this name because the term $Pr(\mathbf{x}|\boldsymbol\phi)$ is known as the evidence when viewed in the context of Bayes' rule (equation 6).

In practice, the distribution $q(\mathbf{h})$ will have some parameters $\boldsymbol\theta$ as well and so the ELBO can be written as:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] = \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}. \tag{15}

\end{equation}

To learn the non-linear latent variable model, we'll maximize this quantity as a function of both $\boldsymbol\phi$ and $\boldsymbol\theta$. The neural architecture that computes this quantity (and hence is used to optimize it) is the variational autoencoder. Before we introduce that, we first consider some of the properties of the ELBO.

When first encountered, the ELBO can be a somewhat mysterious object. In this section we'll provide some intuition about its properties. Consider that the original log likelihood of the data is a function of the parameters $\boldsymbol\phi$ and we want to find its maximum. For any fixed $\boldsymbol\theta$, the ELBO is still a function of the parameters, but one that must lie below the original likelihood function. When we change $\boldsymbol\theta$ we modify this function and depending on our choice, it may move closer or further from the log likelihood. When we change $\boldsymbol\phi$ we move along this function. These perturbations are illustrated in figure 6.

The ELBO is described as being *tight* when for a fixed value of $\boldsymbol\phi$ we choose parameters $\boldsymbol\theta$ so that the ELBO and the likelihood function coincide. We can show that this happens when the distribution $q(\boldsymbol\theta)$ is equal to the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$ over the hidden variables. We start by expanding out the joint probability numerator of the fraction in the ELBO using the definition of conditional probability:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)Pr(\mathbf{x}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta)

\log\left[Pr(\mathbf{x}|\boldsymbol\phi)\right]d\mathbf{h} +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber\nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] -\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right].\label{eq:ELBOEvidenceKL} \tag{16}

\end{eqnarray}

This equation shows that the ELBO is the original log likelihood minus the Kullback-Leibler divergence $\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right]$ which will be zero when these distributions are the same. Hence the bound is tight when $q(\mathbf{h}|\boldsymbol\theta) =Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. Since the KL divergence can only take non-negative values it is easy to see that the ELBO is a lower bound on $\log[Pr(\mathbf{x} |\boldsymbol\phi)]$ from this formulation.

In the previous section we saw that the bound is tight when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. This observation is the basis of the *expectation maximization* (*EM*) algorithm. Here, we alternately (i) choose $\boldsymbol\theta$ so that $q(\mathbf{h}|\boldsymbol\theta)$ equals the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ and (ii) change $\boldsymbol\phi$ to maximize the upper bound (figure 7a). This is viable for models like the mixture of Gaussians where we can compute the posterior distribution in closed form. Unfortunately, for the non-linear latent variable model there is no closed form expression for the posterior distribution and so this method is inapplicable.

We've already seen two different ways to write the ELBO (equations 15 and 16). In fact, there are several more ways to re-express this function (see Hoffman & Johnson 2016). The one that is important for the VAE is:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

+ \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)}\right]d\mathbf{h}

\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\boldsymbol\theta), Pr(\mathbf{h})] \tag{17}

\end{eqnarray}

In this formulation, the first term measures the average agreement $Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)$ of the hidden variable and the data (reconstruction loss) and the second one measures the degree to which the auxiliary distribution $q(\mathbf{h}, \boldsymbol\theta)$ matches the prior. This formulation is the one that will be used in the variational autoencoder.

We have seen that the ELBO is tight when we choose the distribution $q(\mathbf{h}|\boldsymbol\theta)$ to be the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ but for the non-linear latent variable model, we cannot write an expression for this posterior.

The solution to the problem is to make a variational approximation: we just choose a simple parametric form for $q(\mathbf{h}|\boldsymbol\theta)$ and use this as an approximation to the true posterior. In this case we'll choose a normal distribution with parameters $\boldsymbol\mu$ and $\boldsymbol\Sigma$. This distribution is not always going to be a great match to the posterior, but will be better for some values of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ than others. When we optimize this model, we will be finding the normal distribution that is "closest" to the true posterior $Pr(\mathbf{h}|\mathbf{x})$ (figure 8). This corresponds to minimizing the KL divergence in equation 16.

Since the optimal choice for $q(\mathbf{h}|\boldsymbol\theta)$ was the posterior $Pr(\mathbf{h}|\mathbf{x})$ and this depended on the data example $\mathbf{x}$, it makes sense that our variational approximation should do the same and so we choose

\begin{equation}\label{eq:posterior_pred}

q(\mathbf{h}|\boldsymbol\theta,\mathbf{x}) = \mbox{Norm}_{\mathbf{h}}[g_{\mu}[\mathbf{x}|\boldsymbol\theta], g_{\sigma}[\mathbf{x}|\boldsymbol\theta]], \tag{18}

\end{equation}

where $g[\mathbf{x},\boldsymbol\theta]$ is a neural network with parameters $\boldsymbol\theta$ that predicts the mean and variance of the normal variational approximation.

Finally, we are in a position to describe the variational autoencoder. We will build a network that computes the ELBO:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi]

= \int q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] \tag{19}

\end{equation}

where the distribution $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ is the approximation from equation 18.

The first term in equation 19 still involves an integral that we cannot compute, but since it represents an expectation, we can approximate it with a set of samples:

\begin{equation}

E[f[\mathbf{h}]] \approx \frac{1}{N}\sum_{n=1}^{N}f[\mathbf{h}^{*}_N]\tag{20}

\end{equation}

where $\mathbf{h}^{*}_{n}$ is the $n^{th}$ sample. In the limit, we might only use a single sample $\mathbf{h}^{*}$ from $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ as a very approximate estimate of the expectation and here the ELBO will look like:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &\approx& \log\left[ Pr(\mathbf{x}|\mathbf{h}^{*}|\boldsymbol\phi) \right]- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] \tag{21}

\end{eqnarray}

The second term is just the KL divergence between the variational Gaussian $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) = \mbox{Norm}_{\mathbf{h}}[\boldsymbol\mu,\boldsymbol\Sigma]$ and the prior $Pr(h) =\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]$. The KL divergence between two Gaussians can be calculated in closed form and for this case is given by:

\begin{equation}

\mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] = \frac{1}{2}\left(\mbox{Tr}[\boldsymbol\Sigma] + \boldsymbol\mu^T\boldsymbol\mu - D - \log\left[\mbox{det}[\boldsymbol\Sigma]\right]\right). \tag{22}

\end{equation}

were $D$ is the dimensionality of the hidden space.

So, to compute the ELBO for a point $\mathbf{x}$ we first estimate the mean $\boldsymbol\mu$ and variance $\boldsymbol\Sigma$ of the posterior distribution $q(\mathbf{h}|\boldsymbol\theta,\mathbf{x})$ for this data point $\mathbf{x}$ using the network $\mbox{g}[\mathbf{x},\boldsymbol\theta]$. Then we draw a sample $\mathbf{h}^{*}$ from this distribution. Finally, we compute the ELBO using equation 21.

The architecture to compute this is shown in figure 9. Now it's clear why it is called a variational autoencoder. It is an autoencoder because it starts with a data point $\mathbf{x}$, computes a lower dimensional latent vector $\mathbf{h}$ from this and then uses this to recreate the original vector $\mathbf{x}$ as closely as possible. It is variational because it computes a Gaussian approximation to the posterior distribution along the way.

The VAE computes the ELBO bound as a function of the parameters $\boldsymbol\phi$ and $\boldsymbol\theta$. When we maximize this bound as a function of both of these parameters, we gradually move the parameters $\boldsymbol\phi$ to values that have give the data a higher likelihood under the non-linear latent variable model (figure 7b).

In this section, we've described how to compute the ELBO for a single point, but actually we want to maximize its sum over all of the data examples. As in most deep learning methods, we accomplish this with stochastic gradient descent, by running mini-batches of points through our network.

You might think that we are done; we set up this architecture, then we allow PyTorch / Tensorflow to perform automatic differentiation via the backpropagation algorithm and hence optimize the cost function. However, there's a problem. The network involves a sampling step and there is no way to differentiate through this. Consequently, it's impossible to make updates to the parameters $\boldsymbol\theta$ that occur earlier in the network than this.

Fortunately, there is a simple solution; we can move the stochastic part into a branch of the network which draws a sample from $\mbox{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]$ and then use the relation

\begin{equation}

\mathbf{h}^{*} = \boldsymbol\mu + \boldsymbol\Sigma^{1/2}\epsilon, \tag{23}

\end{equation}

to draw from the intended Gaussian. Now we can compute the derivatives as usual because there is no need for the backpropagation algorithm to pass down the stochastic branch. This is known as the reparameterization trick and is illustrated in figure 10.

Variational autoencoders were first introduced by Kingma &Welling (2013). Since then, they have been extended in several ways. First, they have been adapted to other data types including discrete data (van den Oord *et* al. 2017, Razavi *et* al. 2019), word sequences (Bowman *et* al. 2015), and temporal data (Gregor & Besse 2018). Second, researchers have experimented with different forms for the variational distribution, most notably using normalizing flows which can approximate the true posterior much more closely than a Gaussian (Rezende & Mohamed 2015). Third, there is a strand of work investigating more complex likelihood models $Pr(\mathbf{x}|\mathbf{h})$. For example, Gulrajani *et *al. (2016) used an auto-regressive relation between output variables and Dorta *et* al. (2018) modeled the covariance as well as the mean.

Finally, there is a large body of work that attempts to improve the properties of the latent space. Here, one popular goal is to learn a *disentangled* representation in which each dimension of the latent space represents an independent real world factor. For example, when modeling face images, we might hope to uncover head pose or hair color as independent factors. These methods generally add regularization terms to either the posterior $q(\mathbf{h}|\mathbf{x})$ or the aggregated posterior $q(\mathbf{h}) = \frac{1}{J}\sum_{i=1}^{I}q(\mathbf{h}|\mathbf{x}_{i})$ so that the new loss function is

\begin{equation}

L_{new} = \mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] - \lambda_{1} \mathbb{E}_{Pr(\mathbf{x})}\left[\mbox{R}_{1}\left[q(\mathbf{h}|\mathbf{x}) \right]\right] - \lambda_{2} \mbox{R}_{2}[q(\mathbf{h})], \tag{24}

\end{equation}

where $\lambda_{1}$ and $\lambda_{2}$ are weights and $\mbox{R}_{1}[\bullet]$ and $\mbox{R}_{2}[\bullet]$ are functions of the posterior and aggregated posterior respectively. This class of methods includes the BetaVAE (Higgins *et* al. 2017), InfoVAE (Zhao *et* al. 2017) and many others (*e.g.*, Kim & Mnih 2018, Kumar *et* al. 2017, Chen *et* al. 2018).

VAEs have several drawbacks. First, we cannot compute the likelihood of a new point $\mathbf{x}$ under the probability distribution efficiently, because this involves integrating over the hidden variable $\mathbf{h}$. We can approximate this integral using a Markov chain Monte Carlo method, but this is very inefficient. Second, samples from VAEs are generally not perfect (figure 11). The naive spherical Gaussian noise model which is independent for each variable generally produces noisy samples (or overly smooth ones if we do not add in the noise).

In practice, training VAEs (particularly sequence VAEs) can be brittle. It's possible that that the system converges to a local minimum in which the latent variable is completely ignored and the encoder always predicts the prior. This phenomenon is known as *posterior collapse* (Bowman *et* al. 2015). One way to avoid this problem is to only gradually introduce the second term in the cost function (equation 19) using an annealing schedule.

The VAE is an architecture to learn a probability model over $Pr(\mathbf{x})$. This model can generate samples, and interpolate between them (by manipulating the latent variable $\mathbf{h}$) but it is not easy to compute the likelihood for a new example. There are two main alternatives to the VAE. The first is generative adversarial networks. These are good for sampling but their samples have quite different properties from those produced by the VAE. Similarly to the VAE they cannot evaluate the likelihood of new data points. The second alternative is normalizing flows for which the both sampling and likelihood evaluation are tractable.

]]>