Jan. 28, 2020

The goal of the variational autoencoder (VAE) is to learn a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. There are two main reasons for modelling distributions. First, we might want to draw samples (generate) from the distribution to create new plausible values of $\mathbf{x}$. Second, we might want to measure the likelihood that a new vector $\mathbf{x}^{*}$ was created by this probability distribution. In fact, it turns out that the variational autoencoder is well-suited to the former task but not for the latter.

It is common to talk about the variational autoencoder as if it *is* the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$. The final model contains neither the 'variational' nor the 'autoencoder' parts and is better described as a *non-linear latent variable model*.

We'll start this tutorial by discussing latent variable models in general and then the specific case of the non-linear latent variable model. We'll see that maximum likelihood learning of this model is not straightforward, but we can define a lower bound on the likelihood. We then show how the autoencoder architecture can approximate this bound using a Monte Carlo (sampling) method. To maximize the bound, we need to compute derivatives, but unfortunately, it's not possible to compute the derivative of the sampling component. We'll show how to side-step this problem using the reparameterization trick. Finally, we'll discuss extensions of the VAE and some of its drawbacks.

Latent variable models take an indirect approach to describing a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. Instead of directly writing the expression for $Pr(\mathbf{x})$ they model a joint distribution $Pr(\mathbf{x}, \mathbf{h})$ of the data $\mathbf{x}$ and an unobserved latent (or hidden) variable $\mathbf{h}$. They then describe the probability of $Pr(\mathbf{x})$ as a marginalization of this joint probability so that

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}, \mathbf{h}) d\mathbf{h}.\tag{1}

\end{equation}

Typically we describe the joint probability $Pr(\mathbf{x}, \mathbf{h})$ as the product of the *likelihood* $Pr(\mathbf{x}|\mathbf{h})$ and the *prior* $Pr(\mathbf{h})$, so that the model becomes

\begin{equation}

Pr(\mathbf{x}) = \int Pr(\mathbf{x}| \mathbf{h}) Pr(\mathbf{h}) d\mathbf{h}.\tag{2}

\end{equation}

It is reasonable to question why we should take this indirect approach to describing $Pr(\mathbf{x})$. The answer is that relatively simple expressions for $Pr(\mathbf{x}| \mathbf{h})$ and $Pr(\mathbf{h})$ can describe a very complex distribution for $Pr(\mathbf{x})$.

A well known latent variable model is the mixture of Gaussians. Here the latent variable $h$ is discrete and the prior $Pr(h)$ is a discrete distribution with one probability $\lambda_{k}$ for each of the $K$ component Gaussians. The likelihood $Pr(\mathbf{x}|h)$ is a Gaussian with a mean $\boldsymbol\mu_{k}$ and covariance $\boldsymbol\Sigma_{k}$ that depends on the value $k$ of the latent variable $h$:

\begin{eqnarray}

Pr(h=k) &=& \lambda_{k}\nonumber \\

Pr(\mathbf{x} |h = k) &=& \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}].\label{eq:mog_like_prior}\tag{3}

\end{eqnarray}

where $\mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu, \boldsymbol\Sigma]$ represents a multivariate probability distribution over $\mathbf{x}$ with mean $\boldsymbol\mu$ and covariance $\boldsymbol\Sigma$.

As in equation 2, the likelihood $Pr(x)$ is given by the marginalization over the latent variable $h$. In this case, this is a sum as the latent variable is discrete:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \sum_{k=1}^{K} Pr(\mathbf{x}, h=k) \nonumber \\

&=& \sum_{k=1}^{K} Pr(\mathbf{x}| h=k) Pr(h=k)\nonumber \\

&=& \sum_{k=1}^{K} \lambda_{k} \mbox{Norm}_{\mathbf{x}}[\boldsymbol\mu_{k},\boldsymbol\Sigma_{k}]. \tag{4}

\end{eqnarray}

This is illustrated in figure 1. From the simple expressions for the likelihood and prior in equation 3, we can describe a complex multi-modal probability distribution.

Now let's consider the non-linear latent variable model, which is what the VAE actually learns. This differs from the mixture of Gaussians in two main ways. First, the latent variable $\mathbf{h}$ is continuous rather than discrete and has a standard normal prior (i.e., one with mean zero and identity covariance). Second, the likelihood is a normal distribution as before, but the variance is constant and spherical. The mean is a non-linear function $\mathbf{f}[\mathbf{h},\bullet]$ of the hidden variable $\mathbf{h}$ and this gives rise to the name. The prior and likelihood terms are:

\begin{eqnarray}

Pr(\mathbf{h}) &=& \mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]\nonumber \\

Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi) &=& \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}], \tag{5}

\end{eqnarray}

where the function $\mathbf{f}[\mathbf{h},\boldsymbol\phi]$ is a deep neural network with parameters $\boldsymbol\phi$. This model is illustrated in figure 2.

The model can be viewed as an infinite mixture of spherical Gaussians with different means; as before, we build a complex distribution by weighting and summing these Gaussians in the marginalization process. In the next three sections we consider three operations that we might want to perform with this model: computing the posterior distribution, sampling, and evaluating the likelihood.

The likelihood $Pr(\mathbf{x} |\mathbf{h},\boldsymbol\phi)$ tells us how to compute the distribution over the observed data $\mathbf{x}$ given hidden variable $\mathbf{h}$. We might however want to move in the other direction; given an observed data example $\mathbf{x}$ we might wish to understand what possible values of the hidden variable $\mathbf{h}$ were responsible for it (figure 3). This information is encompassed in the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$. In principle, we can compute this using Bayes's rule

\begin{eqnarray}

Pr(\mathbf{h}|\mathbf{x}) = \frac{Pr(\mathbf{x}|\mathbf{h})Pr(\mathbf{h})}{Pr(\mathbf{x})}. \tag{6}

\end{eqnarray}

However, in practice, there is no closed form expression for the left hand side of this equation. In fact, as we shall see shortly, we cannot evaluate the denominator $Pr(\mathbf{x})$ and so we can't even compute the numerical value of the posterior for a given pair $\mathbf{h}$ and $\mathbf{x}$.

Although computing the posterior is intractable, it is easy to generate a new sample $\mathbf{x}^{*}$ from this model using ancestral sampling; we draw $\mathbf{h}^{*}$ from the prior $Pr(\mathbf{h})$, pass this through the network $f[\mathbf{h}^{*},\boldsymbol\phi]$ to compute the mean of the likelihood $Pr(\mathbf{x}|\mathbf{h})$ and then draw $\mathbf{h}$ from this distribution. Both the prior and the likelihood are normal distributions and so sampling from them in each step is easy. This process is illustrated in figure 4.

Finally, let's consider evaluating the likelihood of a data example $\mathbf{x}$ under the model. As before, the likelihood is given by:

\begin{eqnarray}

Pr(\mathbf{x}) &=& \int Pr(\mathbf{x}, \mathbf{h}|\boldsymbol\phi) d\mathbf{h} \nonumber \\

&=& \int Pr(\mathbf{x}| \mathbf{h},\boldsymbol\phi) Pr(\mathbf{h})d\mathbf{h}\nonumber \\

&=& \int \mbox{Norm}_{\mathbf{x}}[\mathbf{f}[\mathbf{h},\boldsymbol\phi],\sigma^{2}\mathbf{I}]\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]d\mathbf{h}. \tag{7}

\end{eqnarray}

Unfortunately, there is no closed form for this integral, so we cannot easily compute the probability for a given example $\mathbf{x}$. This is a major problem for two reasons. First, evaluating the probability $Pr(\mathbf{x})$ was of the main reasons for modelling the probability distribution in the first place. Second, to learn the model, we maximize the log likelihood, which is obviously going to be hard if we cannot compute it. In the next section we'll introduce a lower bound on the log likelihood which can be computed and which we can use to learn the model.

During learning we are given training data $\{\mathbf{x}_{i}\}_{i=1}^{I}$ and want to maximize the parameters $\boldsymbol\phi$ of the model with respect to the log likelihood. For simplicity we'll assume that the variance term $\sigma^2$ in the likelihood expression is known and just concentrate on learning $\boldsymbol\phi$:

\begin{eqnarray}

\hat{\boldsymbol\phi} &=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[Pr(\mathbf{x}_{i}|\boldsymbol\phi) \right]\right] \nonumber \\

&=& argmax_{\boldsymbol\phi} \left[\sum_{i=1}^{I}\log\left[\int Pr(\mathbf{x}_{i}, \mathbf{h}_{i}|\boldsymbol\phi) d\mathbf{h}_{i}\right]\right].\label{eq:log_like} \tag{8}

\end{eqnarray}

As we noted above, we cannot write a closed form expression for the integral and so we can't just build a network to compute this and let Tensorflow or PyTorch optimize it.

To make some progress we define a lower bound on the log likelihood. This is a function that is always less than or equal to the log likelihood for a given value of $\boldsymbol\phi$ and will also depend on some other parameters $\boldsymbol\theta$. Eventually we will build a network to compute this lower bound and optimize it. To define this lower bound, we need to use Jensen's inequality which we quickly review in the next section.

Jensen’s inequality concerns what happens when we pass values through a concave function $g[\bullet]$. Specifically, it says that if we compute the expectation (mean) of these values and pass this mean through the function, the results will be greater than if we pass the values themselves through the function and then compute the expectation of the results. In mathematical terms:

\begin{equation}

g[\mathbf{E}[y]] \geq \mathbf{E}[g[y]], \tag{9}

\end{equation}

for any concave function $g[\bullet]$. Some intuition as to why this is true is given in figure 5. In our case, the concave function in question is the logarithm so we have:

\begin{equation}

\log[\mathbf{E}[y]]\geq\mathbf{E}[\log[y]], \tag{10}

\end{equation}

or writing out the expression for expectation in full we have:

\begin{equation}

\log\left[\int Pr(y) y dy\right]\geq \int Pr(y)\log[y]dy. \tag{11}

\end{equation}

We will now use Jensen's inequality to derive the lower bound for the log likelihood. We start by multiplying and dividing the log likelihood by an arbitrary probability distribution $q(\mathbf{h})$ over the hidden variables

\begin{eqnarray}

\log[Pr(\mathbf{x}|\boldsymbol\phi)] &=& \log\left[\int Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)d\mathbf{h} \tag{12} \right] \\

&=& \log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \tag{13} \right],

\end{eqnarray}

We then use Jensen's inequality for the logarithm (equation 11) to find a lower bound:

\begin{eqnarray}

\log\left[\int q(\mathbf{h}) \frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})}d\mathbf{h} \right]

&\geq& \int q(\mathbf{h}) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h})} \right]d\mathbf{h}, \tag{14}

\end{eqnarray}

where the term on the right hand side is known as the *evidence lower bound* or *ELBO*. It gets this name because the term $Pr(\mathbf{x}|\boldsymbol\phi)$ is known as the evidence when viewed in the context of Bayes' rule (equation 6).

In practice, the distribution $q(\mathbf{h})$ will have some parameters $\boldsymbol\theta$ as well and so the ELBO can be written as:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] = \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}. \tag{15}

\end{equation}

To learn the non-linear latent variable model, we'll maximize this quantity as a function of both $\boldsymbol\phi$ and $\boldsymbol\theta$. The neural architecture that computes this quantity (and hence is used to optimize it) is the variational autoencoder. Before we introduce that, we first consider some of the properties of the ELBO.

When first encountered, the ELBO can be a somewhat mysterious object. In this section we'll provide some intuition about its properties. Consider that the original log likelihood of the data is a function of the parameters $\boldsymbol\phi$ and we want to find its maximum. For any fixed $\boldsymbol\theta$, the ELBO is still a function of the parameters, but one that must lie below the original likelihood function. When we change $\boldsymbol\theta$ we modify this function and depending on our choice, it may move closer or further from the log likelihood. When we change $\boldsymbol\phi$ we move along this function. These perturbations are illustrated in figure 6.

The ELBO is described as being *tight* when for a fixed value of $\boldsymbol\phi$ we choose parameters $\boldsymbol\theta$ so that the ELBO and the likelihood function coincide. We can show that this happens when the distribution $q(\boldsymbol\theta)$ is equal to the posterior distribution $Pr(\mathbf{h}|\mathbf{x})$ over the hidden variables. We start by expanding out the joint probability numerator of the fraction in the ELBO using the definition of conditional probability:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)Pr(\mathbf{x}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta)

\log\left[Pr(\mathbf{x}|\boldsymbol\phi)\right]d\mathbf{h} +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber\nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] +\int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h} \nonumber \\

&=& \log[Pr(\mathbf{x} |\boldsymbol\phi)] -\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right].\label{eq:ELBOEvidenceKL} \tag{16}

\end{eqnarray}

This equation shows that the ELBO is the original log likelihood minus the Kullback-Leibler divergence $\mbox{D}_{KL}\left[ q(\mathbf{h}|\boldsymbol\theta) ||Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)\right]$ which will be zero when these distributions are the same. Hence the bound is tight when $q(\mathbf{h}|\boldsymbol\theta) =Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. Since the KL divergence can only take non-negative values it is easy to see that the ELBO is a lower bound on $\log[Pr(\mathbf{x} |\boldsymbol\phi)]$ from this formulation.

In the previous section we saw that the bound is tight when the distribution $q(\mathbf{h}|\boldsymbol\theta)$ matches the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$. This observation is the basis of the *expectation maximization* (*EM*) algorithm. Here, we alternately (i) choose $\boldsymbol\theta$ so that $q(\mathbf{h}|\boldsymbol\theta)$ equals the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ and (ii) change $\boldsymbol\phi$ to maximize the upper bound (figure 7a). This is viable for models like the mixture of Gaussians where we can compute the posterior distribution in closed form. Unfortunately, for the non-linear latent variable model there is no closed form expression for the posterior distribution and so this method is inapplicable.

We've already seen two different ways to write the ELBO (equations 15 and 16). In fact, there are several more ways to re-express this function (see Hoffman & Johnson 2016). The one that is important for the VAE is:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x},\mathbf{h}|\boldsymbol\phi)}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)} \right]d\mathbf{h}\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

+ \int q(\mathbf{h}|\boldsymbol\theta) \log\left[\frac{Pr(\mathbf{h})}{q(\mathbf{h}|\boldsymbol\theta)}\right]d\mathbf{h}

\nonumber \\

&=& \int q(\mathbf{h}|\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\boldsymbol\theta), Pr(\mathbf{h})] \tag{17}

\end{eqnarray}

In this formulation, the first term measures the average agreement $Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi)$ of the hidden variable and the data (reconstruction loss) and the second one measures the degree to which the auxiliary distribution $q(\mathbf{h}, \boldsymbol\theta)$ matches the prior. This formulation is the one that will be used in the variational autoencoder.

We have seen that the ELBO is tight when we choose the distribution $q(\mathbf{h}|\boldsymbol\theta)$ to be the posterior $Pr(\mathbf{h}|\mathbf{x},\boldsymbol\phi)$ but for the non-linear latent variable model, we cannot write an expression for this posterior.

The solution to the problem is to make a variational approximation: we just choose a simple parametric form for $q(\mathbf{h}|\boldsymbol\theta)$ and use this as an approximation to the true posterior. In this case we'll choose a normal distribution with parameters $\boldsymbol\mu$ and $\boldsymbol\Sigma$. This distribution is not always going to be a great match to the posterior, but will be better for some values of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ than others. When we optimize this model, we will be finding the normal distribution that is "closest" to the true posterior $Pr(\mathbf{h}|\mathbf{x})$ (figure 8). This corresponds to minimizing the KL divergence in equation 16.

Since the optimal choice for $q(\mathbf{h}|\boldsymbol\theta)$ was the posterior $Pr(\mathbf{h}|\mathbf{x})$ and this depended on the data example $\mathbf{x}$, it makes sense that our variational approximation should do the same and so we choose

\begin{equation}\label{eq:posterior_pred}

q(\mathbf{h}|\boldsymbol\theta,\mathbf{x}) = \mbox{Norm}_{\mathbf{h}}[g_{\mu}[\mathbf{x}|\boldsymbol\theta], g_{\sigma}[\mathbf{x}|\boldsymbol\theta]], \tag{18}

\end{equation}

where $g[\mathbf{x},\boldsymbol\theta]$ is a neural network with parameters $\boldsymbol\theta$ that predicts the mean and variance of the normal variational approximation.

Finally, we are in a position to describe the variational autoencoder. We will build a network that computes the ELBO:

\begin{equation}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi]

= \int q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) \log\left[ Pr(\mathbf{x}|\mathbf{h},\boldsymbol\phi) \right]d\mathbf{h}

- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] \tag{19}

\end{equation}

where the distribution $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ is the approximation from equation 18.

The first term in equation 19 still involves an integral that we cannot compute, but since it represents an expectation, we can approximate it with a set of samples:

\begin{equation}

E[f[\mathbf{h}]] \approx \frac{1}{N}\sum_{n=1}^{N}f[\mathbf{h}^{*}_N]\tag{20}

\end{equation}

where $\mathbf{h}^{*}_{n}$ is the $n^{th}$ sample. In the limit, we might only use a single sample $\mathbf{h}^{*}$ from $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta)$ as a very approximate estimate of the expectation and here the ELBO will look like:

\begin{eqnarray}

\mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] &\approx& \log\left[ Pr(\mathbf{x}|\mathbf{h}^{*}|\boldsymbol\phi) \right]- \mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] \tag{21}

\end{eqnarray}

The second term is just the KL divergence between the variational Gaussian $q(\mathbf{h}|\mathbf{x},\boldsymbol\theta) = \mbox{Norm}_{\mathbf{h}}[\boldsymbol\mu,\boldsymbol\Sigma]$ and the prior $Pr(h) =\mbox{Norm}_{\mathbf{h}}[\mathbf{0},\mathbf{I}]$. The KL divergence between two Gaussians can be calculated in closed form and for this case is given by:

\begin{equation}

\mbox{D}_{KL}[ q(\mathbf{h}|\mathbf{x},\boldsymbol\theta), Pr(\mathbf{h})] = \frac{1}{2}\left(\mbox{Tr}[\boldsymbol\Sigma] + \boldsymbol\mu^T\boldsymbol\mu - D - \log\left[\mbox{det}[\boldsymbol\Sigma]\right]\right). \tag{22}

\end{equation}

were $D$ is the dimensionality of the hidden space.

So, to compute the ELBO for a point $\mathbf{x}$ we first estimate the mean $\boldsymbol\mu$ and variance $\boldsymbol\Sigma$ of the posterior distribution $q(\mathbf{h}|\boldsymbol\theta,\mathbf{x})$ for this data point $\mathbf{x}$ using the network $\mbox{g}[\mathbf{x},\boldsymbol\theta]$. Then we draw a sample $\mathbf{h}^{*}$ from this distribution. Finally, we compute the ELBO using equation 21.

The architecture to compute this is shown in figure 9. Now it's clear why it is called a variational autoencoder. It is an autoencoder because it starts with a data point $\mathbf{x}$, computes a lower dimensional latent vector $\mathbf{h}$ from this and then uses this to recreate the original vector $\mathbf{x}$ as closely as possible. It is variational because it computes a Gaussian approximation to the posterior distribution along the way.

The VAE computes the ELBO bound as a function of the parameters $\boldsymbol\phi$ and $\boldsymbol\theta$. When we maximize this bound as a function of both of these parameters, we gradually move the parameters $\boldsymbol\phi$ to values that have give the data a higher likelihood under the non-linear latent variable model (figure 7b).

In this section, we've described how to compute the ELBO for a single point, but actually we want to maximize its sum over all of the data examples. As in most deep learning methods, we accomplish this with stochastic gradient descent, by running mini-batches of points through our network.

You might think that we are done; we set up this architecture, then we allow PyTorch / Tensorflow to perform automatic differentiation via the backpropagation algorithm and hence optimize the cost function. However, there's a problem. The network involves a sampling step and there is no way to differentiate through this. Consequently, it's impossible to make updates to the parameters $\boldsymbol\theta$ that occur earlier in the network than this.

Fortunately, there is a simple solution; we can move the stochastic part into a branch of the network which draws a sample from $\mbox{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]$ and then use the relation

\begin{equation}

\mathbf{h}^{*} = \boldsymbol\mu + \boldsymbol\Sigma^{1/2}\epsilon, \tag{23}

\end{equation}

to draw from the intended Gaussian. Now we can compute the derivatives as usual because there is no need for the backpropagation algorithm to pass down the stochastic branch. This is known as the reparameterization trick and is illustrated in figure 10.

Variational autoencoders were first introduced by Kingma &Welling (2013). Since then, they have been extended in several ways. First, they have been adapted to other data types including discrete data (van den Oord *et* al. 2017, Razavi *et* al. 2019), word sequences (Bowman *et* al. 2015), and temporal data (Gregor & Besse 2018). Second, researchers have experimented with different forms for the variational distribution, most notably using normalizing flows which can approximate the true posterior much more closely than a Gaussian (Rezende & Mohamed 2015). Third, there is a strand of work investigating more complex likelihood models $Pr(\mathbf{x}|\mathbf{h})$. For example, Gulrajani *et *al. (2016) used an auto-regressive relation between output variables and Dorta *et* al. (2018) modeled the covariance as well as the mean.

Finally, there is a large body of work that attempts to improve the properties of the latent space. Here, one popular goal is to learn a *disentangled* representation in which each dimension of the latent space represents an independent real world factor. For example, when modeling face images, we might hope to uncover head pose or hair color as independent factors. These methods generally add regularization terms to either the posterior $q(\mathbf{h}|\mathbf{x})$ or the aggregated posterior $q(\mathbf{h}) = \frac{1}{J}\sum_{i=1}^{I}q(\mathbf{h}|\mathbf{x}_{i})$ so that the new loss function is

\begin{equation}

L_{new} = \mbox{ELBO}[\boldsymbol\theta, \boldsymbol\phi] - \lambda_{1} \mathbb{E}_{Pr(\mathbf{x})}\left[\mbox{R}_{1}\left[q(\mathbf{h}|\mathbf{x}) \right]\right] - \lambda_{2} \mbox{R}_{2}[q(\mathbf{h})], \tag{24}

\end{equation}

where $\lambda_{1}$ and $\lambda_{2}$ are weights and $\mbox{R}_{1}[\bullet]$ and $\mbox{R}_{2}[\bullet]$ are functions of the posterior and aggregated posterior respectively. This class of methods includes the BetaVAE (Higgins *et* al. 2017), InfoVAE (Zhao *et* al. 2017) and many others (*e.g.*, Kim & Mnih 2018, Kumar *et* al. 2017, Chen *et* al. 2018).

VAEs have several drawbacks. First, we cannot compute the likelihood of a new point $\mathbf{x}$ under the probability distribution efficiently, because this involves integrating over the hidden variable $\mathbf{h}$. We can approximate this integral using a Markov chain Monte Carlo method, but this is very inefficient. Second, samples from VAEs are generally not perfect (figure 11). The naive spherical Gaussian noise model which is independent for each variable generally produces noisy samples (or overly smooth ones if we do not add in the noise).

In practice, training VAEs (particularly sequence VAEs) can be brittle. It's possible that that the system converges to a local minimum in which the latent variable is completely ignored and the encoder always predicts the prior. This phenomenon is known as *posterior collapse* (Bowman *et* al. 2015). One way to avoid this problem is to only gradually introduce the second term in the cost function (equation 19) using an annealing schedule.

The VAE is an architecture to learn a probability model over $Pr(\mathbf{x})$. This model can generate samples, and interpolate between them (by manipulating the latent variable $\mathbf{h}$) but it is not easy to compute the likelihood for a new example. There are two main alternatives to the VAE. The first is generative adversarial networks. These are good for sampling but their samples have quite different properties from those produced by the VAE. Similarly to the VAE they cannot evaluate the likelihood of new data points. The second alternative is normalizing flows for which the both sampling and likelihood evaluation are tractable.