March 26, 2020

*Neural natural language generation* (NNLG) refers to the problem of generating coherent and intelligible text using neural networks. Example applications include response generation in dialogue, summarization, image captioning, and question answering. In this tutorial, we assume that the generated text is conditioned on an input. For example, the system might take a structured input like a chart or table and generate a concise description. Alternatively, it might take an unstructured input like a question in text form and generate an output which is the answer to this question.

We will not focus on the input type; we assume that the input has been processed by a suitable *encoder* to create an embedding in a latent space. Instead, we concentrate on the *decoder* which takes this embedding and generates sequences of natural language tokens.

We will use the running example of *neural machine translation*: given a sentence in language A, we aim to generate a translation in language B. The input sentence is processed by the encoder to create a fixed-length embedding. The decoder then uses this embedding to output the translation word-by-word.

We'll now describe the *encoder-decoder architecture* for this translation model in more detail (figure 1). The encoder takes the sequence $\mathbf{x}$ which consists of $K$ words $\{x_t\}_{t=1}^{K}$ and outputs the latent space embedding $\mathbf{h}$. The *decoder* takes this latent representation $\mathbf{h}$ and outputs a sequence of $L$ word tokens $\{\hat{y}_t\}_{t=1}^{L}$ one by one.

We consider an encoder which converts each word token to a fixed length word embedding using a method such as the SkipGram algorithm (Mikolov *et* al. 2013). Then these word embeddings are combined by a neural architecture such as a recurrent neural network (Sundermeyer *et* al. 2012), self-attention network (Vaswani *et* al. 2017), or convolutional neural network (Dauphin *et* al. 2017) to create a fixed-length hidden state $\mathbf{h}$ that describes the whole input sequence.

At each step $t$, the decoder takes this hidden state and an input token, and outputs a probability distribution over the vocabulary $\mathcal{V}$ from which the output word $\hat{y}_{t}$ can be chosen. The hidden state itself is also updated, so that it knows about the history of the generated words. The first input token is always a special *start of sentence* (SOS) token. Subsequent tokens correspond to the predicted output word from the previous time-step during inference, or the ground truth word during training. There is a special *end of sentence* (EOS) token. When this is generated, it signifies that no more tokens will be produced.

This tutorial is divided into two parts. In this first part, we assume that the system has been trained with a maximum likelihood criterion and discuss algorithms for the decoder. We will see that maximum likelihood training has some inherent problems related to the fact that the cost function does not consider the whole output sequence at once and we'll consider some possible solutions.

In the second part of the tutorial we change our focus to consider alternative training methods. We consider fine-tuning the system using reinforcement learning or minimum risk training which use sequence-level cost functions. Finally, we review a series of methods that frame the problem as structured prediction. A summary of these methods is given in table 1.

Type of method | Training | Inference |
---|---|---|

Decoding algorithms | Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood Maximum likelihood |
greedy search beam search diverse beam search iterative beam search top-k sampling nucleus sampling |

Sequence-level fine-tuning | Fine-tune with reinforcement learning Fine-tune with minimum risk training |
greedy search/beam search beam search |

Sequence-level training | Scheduled sampling Beam search optimization SeaRNN Reward augmented max likelihood |
greedy search / beam search beam search greedy search / beam search beam search |

In this section, we describe the standard approach to train encoder-decoder architectures, which uses the maximum likelihood criterion. Specifically, the model is trained to maximize the conditional log-likelihood for each of $I$ input-output sequence pairs $\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^I$ in the training corpus so that:

\begin{equation}\label{eq:local_max_like}

L = \sum_{i=1}^{I} \log\left[ Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)\right] \tag{1}

\end{equation}

where $\boldsymbol\phi$ are the weights of the model.

The probabilities of the output words are evaluated sequentially and each depends on the previously generated tokens, so the probability term in equation 1 takes the form of an auto-regressive model and can be decomposed as:

\begin{equation}

Pr(\mathbf{y}_i | \mathbf{x}_i, \boldsymbol\phi)= \prod_{t=1}^{L_i} Pr(y_{i,t} |\mathbf{y}_{i,<t}, \mathbf{x}_i, \boldsymbol\phi). \tag{2}

\end{equation}

Here, the probability of token $y_{i,t}$ from the $i^{th}$ sequence at time $t$ depends on the tokens $\mathbf{y}_{i,<t} = \{y_{i,0}, y_{i,1},\ldots y_{i,t-1}\}$ seen up to this point as well as the input sentence $\mathbf{x}_{i}$ (via the latent embedding from the encoder).

This training criterion seems straightforward, but there is a subtle problem. In training, the previously seen ground truth tokens $\mathbf{y}_{i,<t}$ are used to compute the probability of the current token $y_{i,t}$. However, when we perform inference with the model and generate text, we no longer have access to the ground truth tokens; we only know the actual tokens $\hat{\mathbf{y}}_{i,<t}$ that we have generated so far (figure 2).

The approach of using the ground truth tokens for training is known as *teacher forcing*. In a sense, it means that the training scenario is unrealistic and does not map to the real situation when we perform inference. In training, the model is only exposed to sequences of ground truth tokens, but sees its own output when deployed. As we shall see in the following discussion, this *exposure bias* may result in some problems in the decoding process.

We now return to the decoding (inference) process. At each time step, the system predicts the probability $Pr(\hat{y}_t |\hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)$ over items in the vocabulary, and we have to select a particular word from this distribution to feed into the next decoding step. The goal is to pick the sequence with the highest overall probability:

\begin{equation}

Pr(\hat{\mathbf{y}} | \mathbf{x}, \boldsymbol\phi) = \prod_t Pr(\hat{y}_t | \hat{\mathbf{y}}_{<t},\mathbf{x}, \boldsymbol\phi). \tag{3}

\end{equation}

In principle, we can simply compute the probability of every possible sequence by brute force. However, there are as many choices as there are words $|\mathcal{V}|$ in the vocabulary for each position and so a sentence of length $L$ would have $|\mathcal{V}|^{L}$ possible sequences. The vocabulary size $|\mathcal{V}|$ might be as large as 50,000 words and so this might not be practical.

We can improve the situation by re-using partial computations; there are many sequences which start in the same way and so there is no need to re-compute these partial likelihoods. A dynamic programming approach can exploit this structure to produce an algorithm with complexity $\mathcal{O}[L|\mathcal{V}|^2]$ but this is still very expensive.

Since we cannot find the sequence with the maximum probability, we must resort to tractable search strategies that produce a reasonable approximation of this maximum. Two common approximations are *greedy search* and *beam search* which we discuss respectively in the next two sections.

The simplest strategy is *greedy search*. It consists of picking the most likely token according to the model at each decoding time step $t$ (figure 3a).

\begin{equation}

\hat{y}_t =\underset{w \in \mathcal{V}}{\mathrm{argmax}}\left[ Pr(y_{t} = w | \hat{\mathbf{y}}_{<t}, \mathbf{x}, \boldsymbol\phi)\right] \tag{4}

\end{equation}

Note that this does not guarantee that the complete output $\hat{\mathbf{y}}$ will have high overall probability relative to others. For example, having selected the most likely first token $y_{0}$, it may transpire that there is no token $y_{1}$ for which the probability $Pr(y_{1}|y_{0},\mathbf{x},\boldsymbol\phi)$ is high. It might have been overall better to choose a less probable first token, which is more compatible with a second token.

We have seen that searching over all possible sequences is intractable, and that greedy search does not necessarily produce a good solution. Beam search seeks a compromise between these extremes by performing a restricted search over possible sequences. In this regard, it produces a solution that is both tractable and superior in quality to greedy search (figure 3b).

At each step of decoding $t$, the $B$ most probable sequences $\mathcal{B}^{t-1} =\{\hat{\mathbf{y}}_{<t,b}\}_{b=1}^{B}$ are stored as candidate outputs. For each of these hypotheses the log probability is computed for each possible next token $w$ in the vocabulary $\mathcal{V}$, so $B|\mathcal{V}|$ probabilities are computed in all. From these, the new $B$ most probable sequences $\mathcal{B}^t =\{\hat{\mathbf{y}}_{<t+1,b}\}_{b=1}^{B}$ are retained. By analogy with the formula for greedy search, we have:

\begin{equation}\label{eq:beam_search}

\mathcal{B}^t =\underset{w \in \mathcal{V},b\in1\ldots B}{\mathrm{argtopk}}\left[B, Pr(y_{t} = w | \hat{\mathbf{y}}_{<t,b}, \mathbf{x}, \boldsymbol\phi)\right] \tag{5}

\end{equation}

where the function $\mathrm{argtopk}[K,\bullet]$ returns the set of the top $K$ items that maximize the second argument. The process is repeated until EOS tokens are produced or the maximum decoding length is reached. Finally, we return the most likely overall result.

The integer $B$ is known as the *beam width*. As this increases, the search becomes more thorough but also more computationally expensive. In practice, it is common to see values of $B$ in the range 5 to 200. When the beam width is 1, the method becomes equivalent to greedy search.

When we train a decoder with a maximum-likelihood criterion, the resulting sentences can exhibit a lack of diversity. This happens at both (i) the beam level (many sentences in the same beam may be very similar) and (ii) the decoding level (words are repeated during one iteration of decoding). In the next two sections we look at methods that have been proposed to ameliorate these issues.

While beam search is superior to greedy search, it often produces sentences that have the same or a very similar start (figure 4a). Decoding is done in a left-to-right fashion and the probability weights are often concentrated at the beginning; even if search is performed over $B$ sentences, the weight of the first few words will mean that most of these sentences start with these words and there is little diversity.

4a) Beam SearchA steam engine train travelling down train tracks.A steam engine train travelling down tracks.A steam engine train travelling through a forest.A steam engine train travelling through a lush green forest.A steam engine train travelling through a lush green countryside.A train on a train track with a sky background. |

4b) Diverse Beam SearchA steam engine train travelling down train tracks.A steam engine train travelling through a forest. An old steam engine train travelling down train tracks. An old steam engine train travelling through a forest. A black train is on the tracks in a wooded area. A black train is on the tracks in a rural area. |

This raises the question of whether the likelihood objective correlates with our end-goals. We might care about other criteria such as diversity, which is important for chatbots: if a chatbot always said the same thing in response to a generic input such as "how are you today?", it would become quickly dull. Hence, we might want to factor in other criteria of quality when decoding.

To counter beam-level repetition, Vijayakumar *et* al. (2018) proposed a variant of beam search, called *diverse beam search*, which encourages more variation in the generated sentences than pure beam search (figure 4b). The beam is divided into $G$ groups. Regular beam search is performed in the first group to generate $B'=\frac{B}{G}$ sentences. For the second group, at step $t$ of decoding, the beam search criterion is augmented with a factor that penalizes token sequences that are similar to the first $t$ words of the hypotheses in the first group. For the third group, sequences that are similar to those in either of the first two groups are penalized, and so on (figure 5).

Vijayakumar *et* al. (2018) investigate several similarity metrics including *Hamming diversity* which penalizes tokens based on their number of occurrences in the previous groups.

Diverse beam search has the disadvantage that it only discourages sequences that are close to the final sequences found in previous beams. However, there may be significant portions of the space that were searched to find these hypotheses and since we didn't store the intermediate results there is nothing to stop us from redundantly considering the same part of the search space again (figure 6).

Kulikov *et* al. (2018) introduced *iterative beam search* which aims to solve this problem. It resembles diverse beam search in that beams (groups of hypotheses) are computed and recorded. These beams are ordered and each is affected by the previous beams. However, unlike diverse beam search we do not wait for a beam search to complete before computing the others. Instead, they are computed concurrently.

Consider the situation where at ouput time $t-1$ we have $G$ groups of beams, each of which contains $B^{\prime}$ hypotheses. We extend the first beam to length $t$ in the usual way; we consider concatenating every possible vocabulary word with each of the $B^{\prime}$ hypotheses, evaluate the probabilities and retain the best overall $B^{\prime}$ solutions of length $t$.

When we extend the second beam to length $t$, we follow the same procedure. However, we now set any hypotheses that are too close in Hamming distance to those in the first beam to have zero probability. Likewise, when we extend the third beam, we discount any hypotheses that are close to those in the first two beams. The result is that each beam is forced to explore a different part of the search space and the final results have increased diversity (figure 7).

Until this point, we have assumed that the best way to decode is by maximizing the probability of the output words, using either greedy search, beam search, or a variation on these techniques. However, Holtzman *et* al. (2019) demonstrate that human speech does not stay in high probability zones and is often much more surprising than the text generated by these methods.

This raises the question of whether we should sample randomly from the output probability distribution rather than search for likely decodings. Unfortunately, this can also lead to degenerate cases. Holtzman *et* al. (2019) conjecture that at some point during decoding, the model is likely to sample from the tail of the distribution (i.e., from the set of tokens which are much less probable than the gold token). Once the model samples from this tail, it might not know how to recover. In the next two sections, we consider two methods that aim to repress such behavior.

Fan *et* al. (2018) proposed *top-$k$ sampling* as a possible remedy. Consider one iteration of decoding at the $t$-th time step. Let us define $\mathcal{V}_{K,t}$ as the set of the $K$ most probable next tokens according to $Pr(y_{t}=w|\hat{y}_{<t},\boldsymbol\phi)$. Let us also define the sum $Z$ of these $K$ probabilities:

\begin{equation}

Z = \sum_{w \in \mathcal{V}_K^t} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t},\boldsymbol\phi). \tag{6}

\end{equation}

Top-$k$ sampling proposes to re-scale these probabilities and ignore the other possible tokens:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t}|\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_K^t \\

0 & \text{otherwise}. \tag{7}

\end{cases}

\end{equation}

With this strategy, we only sample from the $K$ most likely tokens and thus avoid tokens from the tail of the distribution (their probability is set to zero).

Unfortunately, it can be difficult to fix $K$ in practice. There exist extreme cases where the distribution is very peaked and so the top-$K$ tokens include tokens from the tail. Similarly, there may be cases where the distribution is very flat and valid tokens are excluded from the top-$K$ list.

Nucleus sampling (Holtzman *et* al. 2019) aims to solve this problem by retaining a fixed proportion of the probability mass. They define $\mathcal{V}_\tau^t$ as the smallest set such that:

\begin{equation*}

Z = \sum_{w \in \mathcal{V}^t_\tau} Pr(y_{t}=w|\hat{\mathbf{y}}_{<t}, \boldsymbol\phi) \geq \tau,

\end{equation*}

where $\tau$ is a fixed threshold. Then, as in top-$k$ sampling, probabilities are re-scaled to:

\begin{equation}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t}|\boldsymbol\phi) \leftarrow

\begin{cases}

Pr(y_{t}=w | \hat{\mathbf{y}}_{<t},\boldsymbol\phi) / Z & \text{if } w \in \mathcal{V}_\tau^t \\

0 & \text{otherwise}. \tag{8}

\end{cases}

\end{equation}

Since the set $\mathcal{V}_\tau^t$ is chosen so that the cumulative probability mass is at least $\tau$, nucleus sampling does not suffer in the case of a flat or a peaked distribution.

Part I of this tutorial has highlighted some of the problems that occur during decoding. We can classify these problems into two categories.

First, maximum-likelihood trains the model to stay in high probability regions of the token space. As shown by Holtzman *et* al. (2019), this differs significantly from human speech. If we want to take into account other criteria of quality, such as diversity, search strategies must be put in place to explore the space of likely outputs and greedy sampling or vanilla beam search are not enough.

Second, the exposure bias introduced by teacher forcing forces NNLG models to be myopic, and only look for the next most likely token given a ground truth prefix. As a consequence, if the model samples a token from the long tail, it might enter a "degenerate'' case. When this happens, the model "makes an error'' by sampling a low-probability token and does not know how to recover from this error.

We have presented some sampling strategies that alleviate these issues at inference time. However, since these problems are both side-effects of the maximum likelihood teacher-forcing methodology for training, another way to approach this is to modify the training method. In part II of this tutorial, we describe how Reinforcement Learning (RL) and structured prediction help with both maximum-likelihood and teacher-forcing induced issues.