Graph neural networks (GNNs) excel at semi-supervised classification with the aid of a graph structure over the set of input examples. In recent years, researchers have begun applying GNNs to semi-supervised datasets that don’t include graph-structure information, by generating a graph structure from the data. The paper identifies and solves a ‘supervision starvation’ problem in a promising approach called latent graph learning, which jointly optimizes a graph structure and a GNN classifier.
Much of the great success of modern neural network methods comes from applying architectural inductive biases like convolution and attention to structured inputs. One reason that graph neural networks (GNNs) are an exciting area of research is the promise of applying similar architectural techniques to structure in the input domain as a whole, by treating input examples as nodes in a graph. Previous work demonstrates that it’s sometimes possible to leverage the strengths of GNNs even when no graph information is available, by using label and feature information to construct an input-example graph with GNN-friendly properties. Since GNN classifiers are known to excel on graphs with high label homophily (i.e. connected nodes often belong to the same class), graph construction typically focuses on label homophily, with feature homophily as a proxy. The simplest methods pre-construct a graph with a kNN algorithm, while more sophisticated methods try to infer a graph during training.
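To make the pre-construction step concrete, here is a minimal sketch of building a kNN input-example graph from feature vectors (plain Python; the function name and the choice of Euclidean distance are illustrative assumptions, and cosine similarity is also common in practice):

```python
def knn_graph(features, k):
    """Pre-construct an undirected kNN input-example graph: connect each
    example to its k nearest neighbors in feature space."""
    edges = set()
    for i, xi in enumerate(features):
        # squared Euclidean distance to every other example
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(features)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store undirected edges once
    return edges

# Three 1-D examples: the two nearby points link to each other,
# and the far point links to its nearest neighbor.
print(knn_graph([[0.0], [0.1], [5.0]], k=1))
```

Note that because each node's k nearest neighbors are chosen independently, some nodes can end up with degree greater than k, which is why learned graph generators can be preferable.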
In existing methods for latent graph learning, a neural network that assigns a graph structure to the input set is optimized together with a graph convolutional network (GCN) that classifies input examples based on their graph neighbors. The paper’s central observation is that in this setting, training for classification can’t properly optimize the graph: some edges, which the paper calls ‘starved edges,’ have no impact on the training loss but do affect the classifier’s predictions at test time. Since the values of these edges are learned without any training feedback, the model is at risk of making poor predictions at test time.
The existence of starved edges follows from the fact that an n-layer GCN makes predictions for input examples based on their n-hop neighbors. Edges between examples that are each ≥ n hops from any labeled example cannot affect any supervised prediction, and will therefore be ‘starved’ for supervision. Furthermore, it turns out that in typical latent-graph learning scenarios we should expect most edges to be starved: the authors observe that in a random graph (a.k.a. Erdős–Rényi graph) or scale-free network with the statistics of graph-structured datasets like Cora, Citeseer, and Pubmed, a random edge is more likely to be starved than unstarved. In particular, they give a simple proof that for an Erdős–Rényi graph with n nodes and m edges, if we have labels for q nodes selected uniformly at random then the probability of an edge being a starved edge is:
\begin{equation}\left ( 1-\frac{q}{n} \right )\left ( 1-\frac{q}{n-1} \right )\prod_{i=1}^{2q}\left ( 1- \frac{m-1}{\binom{n}{2}-i} \right)\end{equation}
Furthermore, the proportion of starved edges in the actual Cora, Citeseer, and Pubmed datasets is very close to their probability in the analogue Erdős–Rényi graphs, so there is reason to believe that Erdős–Rényi graphs are a good model of natural graphs in this regard.
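The closed-form expression is easy to evaluate numerically. The sketch below (pure Python; the function name and the Cora-like statistics are illustrative assumptions) computes the starved-edge probability for a graph at roughly Cora's scale:

```python
from math import comb

def starved_edge_probability(n, m, q):
    """Probability that a random edge of an Erdos-Renyi graph with n nodes
    and m edges is starved, given q labeled nodes sampled uniformly at
    random, following the closed-form expression above."""
    p = (1 - q / n) * (1 - q / (n - 1))
    for i in range(1, 2 * q + 1):
        p *= 1 - (m - 1) / (comb(n, 2) - i)
    return p

# Roughly Cora-scale statistics: ~2708 nodes, ~5278 edges, 140 labels
print(starved_edge_probability(2708, 5278, 140))
```

As expected, the probability decreases as the number of labeled nodes q grows, since more edges fall within reach of a supervised prediction.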
The paper proposes to solve the starved-edge problem with SLAPS (Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision), a multi-task learning framework that supplements the classification task with a self-supervised task. The self-supervised task is based on the hypothesis that a graph structure suitable for predicting the features of input examples is also suitable for predicting their labels. The authors add a denoising autoencoder GNN downstream of the graph generator, optimizing the full system (graph generator, classifier, and denoising autoencoder) with a mixture of autoencoder loss and classifier loss. The training process thus encourages the graph generator to produce a graph structure that provides the denoising autoencoder with useful auxiliary information for denoising the input examples.
The authors compare their multi-task framework with existing graph-construction and latent-graph learning methods on a range of semi-supervised classification datasets. The main experiment uses a ‘graphless’ version of the graph-structured semi-supervised classification benchmarks Cora, Citeseer, and Pubmed, showing that SLAPS recovers a useful graph structure from input examples alone. SLAPS strongly outperforms the baseline MLP, and significantly outperforms GNN methods that rely on pre-constructed graphs or supervised latent graph learning:
| Model | Cora | Citeseer | Cora390 | Citeseer370 | Pubmed | ogbn-arxiv |
| --- | --- | --- | --- | --- | --- | --- |
| MLP | 56.1 ± 1.6† | 56.7 ± 1.7† | 65.8 ± 0.4 | 67.1 ± 0.5 | 71.4 ± 0.0 | 51.7 ± 0.1 |
| MLP-GAM* | 70.7‡ | 70.3‡ | – | – | 71.9‡ | – |
| LP | 37.6 ± 0.0 | 23.2 ± 0.0 | 36.2 ± 0.0 | 29.1 ± 0.0 | 41.3 ± 0.0 | OOM |
| kNN-GCN | 66.5 ± 0.4† | 68.3 ± 1.3† | 23.2 ± 0.0 | 71.8 ± 0.8 | 70.4 ± 0.4 | 49.1 ± 0.3 |
| LDS | – | – | 71.5 ± 0.8† | 71.5 ± 1.1† | OOM | OOM |
| GRCN | 67.4 ± 0.3 | 67.3 ± 0.8 | 71.3 ± 0.9 | 70.9 ± 0.7 | 67.3 ± 0.3 | OOM |
| DGCNN | 56.5 ± 1.2 | 55.1 ± 1.4 | 67.3 ± 0.7 | 66.6 ± 0.8 | 70.1 ± 1.3 | OOM |
| IDGL | 70.9 ± 0.6 | 68.2 ± 0.6 | 73.4 ± 0.5 | 72.7 ± 0.4 | 72.3 ± 0.4 | OOM |
| kNN-GCN + AdaEdge | 67.7 ± 1.0 | 68.8 ± 0.3 | 72.2 ± 0.4 | 71.8 ± 0.6 | OOT | OOM |
| kNN-GCN + self-training | 67.3 ± 0.3 | 69.8 ± 0.3 | 71.1 ± 0.3 | 72.4 ± 0.2 | 72.7 ± 0.1 | NA |
| SLAPS (FP) | 72.4 ± 0.4 | 70.7 ± 0.4 | 76.6 ± 0.4 | 73.1 ± 0.6 | OOM | OOM |
| SLAPS (MLP) | 72.8 ± 0.8 | 70.5 ± 1.1 | 75.3 ± 1.0 | 73.0 ± 0.9 | 74.4 ± 0.6 | 56.6 ± 0.1 |
| SLAPS (MLP-D) | 73.4 ± 0.3 | 72.6 ± 0.6 | 75.1 ± 0.5 | 73.9 ± 0.4 | 73.1 ± 0.7 | 52.9 ± 0.1 |
| SLAPS (MLP) + AdaEdge | 72.8 ± 0.7 | 72.6 ± 1.5 | 75.2 ± 0.6 | 72.6 ± 1.4 | OOT | OOT |
| SLAPS (MLP) + self-training | 74.2 ± 0.5 | 73.1 ± 1.0 | 75.5 ± 0.7 | 73.3 ± 0.6 | 74.3 ± 1.4 | NA |
The authors also study the application of SLAPS to semi-supervised classification on datasets with noisy or corrupted graph-structure information, and find that using the noisy graph information to initialize SLAPS is greatly preferable to feeding it directly to a classifier GNN.
The paper studies the use of latent stochastic differential equations (SDEs) together with normalizing flows to learn continuous time-series dynamics. When learning continuous time-series dynamics, the objective is to maximize the observational log-likelihood of an inhomogeneous collection of training sequences with varying lengths and time stamps. At test time, in addition to the maximization of observational log-likelihoods, we are also interested in sampling trajectories in a manner consistent with these log-likelihoods.
The authors improve on the state of the art in the field by employing a normalizing flow as a time-dependent decoder for a flexible latent SDE, achieving greater expressivity compared to methods that rely on a normalizing flow alone. The price of the increase in expressivity is that the observational log-likelihood becomes intractable, making variational approximations necessary. The authors formulate a principled variational approximation of the observational log-likelihood, based on a piecewise construction of the posterior distribution of the latent SDE.
Sparse and irregular observations of continuous dynamics are common in many areas of science, including finance, healthcare, and physics. Time-series models driven by stochastic differential equations provide an elegant framework for this challenging scenario and have recently gained popularity in the machine learning community. The SDEs are typically implemented by neural networks with trainable parameters, and the latent processes defined by the SDEs are then decoded into an observable space with complex structure.
Despite great recent progress in the field, it remains challenging to produce models that are both computationally tractable and flexible. Cutting-edge methods built around combining a simple latent process with invertible transformations have the benefit of giving exact and efficient likelihood evaluation of observations, but can only model a limited class of stochastic processes. In particular, the otherwise highly effective flow-based method CTFP (‘continuous-time flow process’) is provably incapable of modeling some commonplace stochastic processes, from simple processes like the Ornstein–Uhlenbeck (OU) process to more complex non-Markov processes. A more formal difficulty with models like CTFP is that standard neural-network practice for constructing reversible transformations implies Lipschitz continuity, and Lipschitz-continuous reversible transformations of simple processes are especially limited. Transforming a simple stochastic process into Brownian motion, for example, requires a non-Lipschitz function and therefore a nonstandard network architecture.
The paper introduces Continuous Latent Process Flows (CLPF). CLPF treats an observed sequence as a partial realization of a continuous-time observable stochastic process X_t, and treats X_t in turn as a function of the trajectory of a flexible SDE process Z_t together with the trajectory of a simple stochastic process O_t (e.g., an OU process).
Concretely, the CLPF framework models the evolution of an m-dimensional time-continuous latent state Zₜ in the time interval [0, T] using a flexible stochastic differential equation
\begin{equation}\mathrm{d} \boldsymbol{Z}_{t}=\boldsymbol{\mu}_{\gamma}\left(\boldsymbol{Z}_{t}, t\right) \mathrm{d} t+\sigma_{\gamma}\left(\boldsymbol{Z}_{t}, t\right) \mathrm{d} \boldsymbol{W}_{t}\end{equation}
where Wₜ is an m-dimensional Wiener process and γ denotes the (shared) learnable parameters of the drift function µ and diffusion function σ.
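For intuition, a latent trajectory of such an SDE can be simulated on a discrete time grid with the Euler–Maruyama scheme. The sketch below (pure Python, scalar state; in CLPF the drift and diffusion would be neural networks, so the toy dynamics here are purely illustrative) shows the basic mechanics:

```python
import random
from math import sqrt

def euler_maruyama(mu, sigma, z0, t_grid):
    """Simulate one trajectory of dZ_t = mu(Z_t, t) dt + sigma(Z_t, t) dW_t
    with the Euler-Maruyama scheme on a discrete time grid."""
    z, path = z0, [z0]
    for t0, t1 in zip(t_grid, t_grid[1:]):
        dt = t1 - t0
        dw = random.gauss(0.0, sqrt(dt))  # Wiener-process increment
        z = z + mu(z, t0) * dt + sigma(z, t0) * dw
        path.append(z)
    return path

# Toy Ornstein-Uhlenbeck-style dynamics on [0, 1]: dZ = -Z dt + 0.5 dW
grid = [i / 100 for i in range(101)]
path = euler_maruyama(lambda z, t: -z, lambda z, t: 0.5, 1.0, grid)
```

Because the grid can be chosen freely, the same mechanism supports sampling at arbitrary, irregular time points, which is exactly the setting the paper targets.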
The latent SDE dynamics then produce an observable process Xₜ as follows:
\begin{equation}X_{t}=F_{\theta}\left(O_{t} ; Z_{t}, t\right)\end{equation}
where Oₜ is a d-dimensional Ornstein–Uhlenbeck process with closed-form transition density and F_θ( · ; zₜ, t) is a normalizing flow parameterized by θ for any zₜ, t.
Because CLPF latent dynamics follow a generic, flexible SDE process, exact computation of the observational log-likelihood is generally intractable. It’s therefore necessary to use a variational approximation to compute the training gradient of a CLPF model or perform inference. To this end, the authors construct a novel evidence lower bound (ELBO):
\begin{equation}\mathbb{E}_{\omega^{(1)}, \ldots, \omega^{(n)} \sim W_{t}^{(1)} \times \ldots \times W_{t}^{(n)}}\left[\sum_{i=1}^{n} \log p\left(x_{t_{i}} \mid x_{t_{i-1}}, \tilde{z}_{t_{i}}, \tilde{z}_{t_{i-1}}, \omega^{(i)}\right)+\sum_{i=1}^{n} \log M^{(i)}\left(\omega^{(i)}\right)\right]\end{equation}
where ω⁽¹⁾, …, ω⁽ⁿ⁾ are trajectories of independent Wiener processes Wₜ⁽¹⁾, …, Wₜ⁽ⁿ⁾ that (speaking informally for brevity) construct Wₜ piecewise, and M⁽ⁱ⁾ is an importance weight between the prior and posterior latent SDE.
The authors compare CLPF with previous continuous dynamics methods for modelling irregular observations of a continuous process, as well as with a noncontinuous Variational RNN (VRNN) baseline that excels in likelihood estimation but cannot properly generate trajectories.
The authors first evaluate CLPF on synthetic data sampled from known stochastic processes, to verify its ability to capture a variety of continuous dynamics. They compare CLPF with previous continuous methods on the synthetic cases Geometric Brownian Motion, Linear SDE, Continuous AR(4) Process, and Stochastic Lorenz Curve:
| Model | GBM (λ=2) | GBM (λ=20) | LSDE (λ=2) | LSDE (λ=20) | CAR (λ=2) | CAR (λ=20) | SLC (λ=20) | SLC (λ=40) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VRNN | 0.425 | 0.650 | 0.634 | 1.665 | 1.832 | 2.675 | 2.237 | 1.753 |
| Latent ODE [33] | 1.916 | 1.796 | 0.900 | 0.847 | 4.872 | 4.765 | 9.117 | 9.115 |
| CTFP [12] | 2.940 | 0.678 | 0.471 | 1.778 | 383.593 | 51.950 | 0.489 | 0.586 |
| Latent CTFP [12] | 1.472 | 0.158 | 0.468 | 1.784 | 249.839 | 43.007 | 1.419 | 0.077 |
| Latent SDE [25] | 1.243 | 1.778 | 0.082 | 0.217 | 3.594 | 3.603 | 7.740 | 8.256 |
| CLPF (ours) | 0.444 | 0.698 | 0.831 | 1.939 | 1.322 | 0.077 | 2.620 | 3.963 |
In order to evaluate CLPF on real data, the authors generate irregular time-series data by sampling from the Mujoco-Hopper, Beijing Air-Quality Dataset (BAQD), and PTB Diagnostic Database (PTB-DB) datasets at irregular time intervals. The authors find that CLPF outperforms existing continuous-dynamics models in likelihood estimation, and nearly closes the gap with VRNN in sequential prediction:
| Model | Mujoco [33] | BAQD [37] | PTB-DB [5] |
| --- | --- | --- | --- |
| VRNN [10] | 15876 | 1.204 | 2.035 |
| Latent ODE [33] | 23551 | 2.540 | 0.533 |
| Latent SDE [25] | 3071 | 1.512 | 1.358 |
| CTFP [12] | 7598 | 0.170 | 1.281 |
| Latent CTFP [12] | 12693 | 0.480 | 1.659 |
| CLPF-ANODE (ours) | 14694 | 0.619 | 1.575 |
| CLPF-iRes (ours) | 10873 | 0.486 | 1.519 |
An ablation study on the synthetic datasets compares CLPF with variants that simplify its piecewise posterior construction, and with Latent SDE:

| Model | GBM | LSDE | CAR | SLC |
| --- | --- | --- | --- | --- |
| CLPF-Global | 0.447 | 0.821 | 1.552 | 3.304 |
| CLPF-Independent | 0.800 | 0.326 | 4.970 | 7.924 |
| CLPF-Wiener | 0.390 | 0.790 | 1.041 | 1.885 |
| Latent SDE | 1.243 | 0.082 | 3.594 | 7.740 |
| CLPF | 0.444 | 0.831 | 1.322 | 2.620 |
The corresponding sequential-prediction errors on the real-world datasets are:

| Model | Mujoco [33] | BAQD [37] | PTB-DB [5] |
| --- | --- | --- | --- |
| VRNN [10] | 1.599, [0.196, 1.221] | 0.519, [0.168, 0.681] | 0.037, [0.005, 0.032] |
| Latent ODE [33] | 13.959, [9.857, 15.673] | 1.416, [0.936, 1.731] | 0.224, [0.114, 0.322] |
| Latent SDE [25] | 7.627, [2.384, 8.381] | 0.848, [0.454, 1.042] | 0.092, [0.032, 0.111] |
| CTFP [12] | 1.969, [0.173, 1.826] | 0.694, [0.202, 10966] | 0.055, [0.006, 0.046] |
| Latent CTFP [12] | 1.983, [0.167, 1.744] | 0.680, [0.189, 0.943] | 0.065, [0.007, 0.059] |
| CLPF-ANODE (ours) | 1.629, [0.149, 1.575] | 0.542, [0.150, 0.726] | 0.048, [0.005, 0.041] |
| CLPF-iRes (ours) | 1.846, [0.177, 1.685] | 0.582, [0.183, 0.805] | 0.055, [0.006, 0.049] |
It is often desirable for a neural network to be monotonic: informally, to have an output function that is nondecreasing with respect to certain features. This paper identifies problems with existing general methods for training a neural network to be monotonic, and proposes a superior general method.
Monotonicity is a common requirement in reallife applications of prediction models. For example, in the case of models used to accept/reject job applications, we expect acceptance scores to be monotonically nondecreasing with respect to features such as a candidate’s years of experience. Such expectations of monotonicity often reflect accepted institutional best practices, or ethical or legal norms, so that predictors that fail on monotonicity are unacceptable for reallife use.
While it is possible to guarantee monotonicity by defining a ‘monotonic by construction’ model class, such model classes have limited use since they exclude many commonly used neural network architectures. More generally applicable approaches focus on finding monotonic candidates within a general class of models while performing empirical risk minimization. Since the verification of a model’s monotonicity can be extremely computationally expensive, these approaches typically do not provide a guarantee. Instead, they rely on regularization penalties that bias a learning algorithm towards monotonic predictors.
The general form of the regularization penalty is:
\begin{equation}\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{i \in M} \max \left(0,-\frac{\partial h(x)}{\partial x_{i}}\right)^{2}\right]\end{equation}
where M is the set of input dimensions with respect to which monotonicity is desired, and ∂h(x)/∂x_i denotes the partial derivative of the predictor h with respect to input dimension i ∈ M. In other words, we penalize h for behaving non-monotonically at points sampled from some distribution D.
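A minimal numerical sketch of this penalty (pure Python; the function names are illustrative, and central finite differences stand in for the automatic differentiation that would be used in practice) might look like:

```python
def monotonicity_penalty(h, xs, mono_dims, eps=1e-4):
    """Average the regularizer over sample points xs: penalize negative
    partial derivatives of h along the monotone dimensions, estimating
    each derivative with a central finite difference."""
    total = 0.0
    for x in xs:
        for i in mono_dims:
            hi, lo = list(x), list(x)
            hi[i] += eps
            lo[i] -= eps
            grad_i = (h(hi) - h(lo)) / (2 * eps)
            total += max(0.0, -grad_i) ** 2  # only non-monotone behavior is penalized
    return total / len(xs)

# h is increasing in dimension 0 but decreasing in dimension 1,
# so only dimension 1 contributes to the penalty.
h = lambda x: x[0] - x[1]
penalty = monotonicity_penalty(h, [[0.0, 0.0], [1.0, 2.0]], mono_dims=[0, 1])
```

The remaining design choice, and the subject of the paper, is which points xs to evaluate the penalty on.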
The paper’s novel contribution concerns the choice of distribution D. In previous work, the chosen D was either the empirical distribution of the training sample, or a uniform distribution over the input space. The paper demonstrates that both choices have serious shortcomings: when D is the empirical training distribution, monotonicity is only enforced close to the training data, and may fail on the test data in the case of a covariate shift. When D is the uniform distribution, the sampled points will likely lie far from the training data, thus failing to enforce monotonicity around the training data. (This is particularly likely given a high-dimensional input space, where uniformly sampled points are likely to lie close to the input space’s boundary.)
The paper’s solution is to compute the regularization penalty on points generated by mixing up training points and random points. To sample from the ‘mixup’ distribution D, the regularizer augments a minibatch of N training examples with N random points, then samples random interpolations of random pairs of points from the augmented minibatch. The authors hypothesize that mixup enforces monotonicity in parts of the input space that are neglected when one focuses only on the observed data or only on draws from an uninformed distribution such as the uniform.
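The mixup sampling step can be sketched as follows (pure Python; the function name and the uniform sampling box are illustrative assumptions, not the paper's exact configuration):

```python
import random

def mixup_points(batch, n_random, low, high):
    """Sample 'mixup' penalty points: augment a minibatch with uniform
    random points, then return random convex combinations of random
    pairs drawn from the augmented set."""
    dim = len(batch[0])
    augmented = batch + [
        [random.uniform(low, high) for _ in range(dim)] for _ in range(n_random)
    ]
    mixed = []
    for _ in range(len(augmented)):
        a, b = random.sample(augmented, 2)
        lam = random.random()  # interpolation coefficient
        mixed.append([lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)])
    return mixed

pts = mixup_points([[0.0, 0.0], [1.0, 1.0]], n_random=2, low=-1.0, high=1.0)
```

Interpolated points land between training examples and random draws, which is precisely the region neither baseline strategy covers.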
The authors first test their hypotheses on synthetic data manifolds with a covariate shift between the training/validation data and test data. The results show that applying the monotonicity regularizer to the mixup distribution enforces monotonicity on both the test data and random points, whereas applying the regularizer to the training data or at random is effective for one condition at most. In addition, the results suggest that when scaling to high dimensions, the covariate shift weakens the effect of trainingset enforcement but not of mixup enforcement.
Columns are paired by problem size |M|/|D|, with ρ_random and ρ_test reporting the fraction of non-monotonic points among random draws and test points, respectively:

| Method | ρ_random (20/100) | ρ_test (20/100) | ρ_random (40/200) | ρ_test (40/200) | ρ_random (80/400) | ρ_test (80/400) | ρ_random (100/500) | ρ_test (100/500) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Non-mon. | 99.90% | 99.99% | 97.92% | 94.96% | 98.47% | 96.56% | 93.98% | 90.01% |
| Ω_random | 0.00% | 3.49% | 0.00% | 4.62% | 0.01% | 11.36% | 0.02% | 19.90% |
| Ω_train | 1.30% | 0.36% | 4.00% | 0.58% | 9.67% | 0.25% | 9.25% | 5.57% |
| Ω_mixup | 0.00% | 0.35% | 0.00% | 0.44% | 0.00% | 0.26% | 0.00% | 0.42% |
The results generalize to real datasets, where mixup regularization achieves the best monotonicity under every evaluation condition. The authors additionally observe that successfully enforcing monotonicity has little effect on prediction performance, suggesting that monotonic predictors are viable as predictors:
| Metric | Non-mon. | Ω_random | Ω_train | Ω_mixup |
| --- | --- | --- | --- | --- |
| Validation RMSE | 0.213 ± 0.000 | 0.223 ± 0.002 | 0.222 ± 0.002 | 0.235 ± 0.001 |
| Test RMSE | 0.221 ± 0.001 | 0.230 ± 0.001 | 0.229 ± 0.002 | 0.228 ± 0.001 |
| ρ_random | 99.11% ± 1.70% | 0.00% ± 0.00% | 14.47% ± 7.55% | 0.00% ± 0.00% |
| ρ_train | 100.00% ± 0.00% | 7.23% ± 7.76% | 0.01% ± 0.01% | 0.00% ± 0.00% |
| ρ_test | 100.00% ± 0.00% | 6.94% ± 7.43% | 0.04% ± 0.03% | 0.00% ± 0.00% |
The authors note that the mixup strategy introduces no computational overhead over the existing strategies, and is therefore strictly preferable. They propose that in future work the mixup strategy could be used to improve interpretability in complex neural network models, by enforcing homogeneity between a network’s outputs and a subset of its high-level representations.
by Jiawei He
Time-series research remains a cutting-edge field in the machine learning community. It is especially important in finance applications, where we face stock price data, credit card transaction data, etc. A common assumption in training neural networks via maximum likelihood estimation is that the errors across time steps are uncorrelated, and this assumption is still heavily applied in almost all machine learning optimizations. However, in time series problems, errors can intrinsically be autocorrelated in many cases due to the temporal nature of the data, which makes such maximum likelihood estimation inaccurate. Although adjusting for autocorrelated errors in linear or nonlinear time series data has been studied extensively, especially in econometrics, those methods are applicable only when the exact (and correct) form of the underlying system is known. On the other hand, NNs for time-series-related tasks have become a popular research direction due to NNs’ effectiveness in approximating unknown, nonlinear systems.
To adjust for autocorrelated errors, this paper proposes a method to jointly learn the autocorrelation coefficient with the model parameters via gradient descent. Extensive simulations verify the effectiveness of the proposed approach for time series forecasting. Results across a wide range of real-world datasets with various state-of-the-art models show that the proposed method enhances performance in almost all cases. Based on these results, the authors suggest empirical critical values to determine the severity of autocorrelated errors. Limitations mentioned in the paper include (1) the method is not applicable to probabilistic forecasting, and (2) if the underlying time series can be modelled well by a known process, the benefit of adopting this approach diminishes. For future research directions, the authors suggest exploring more complex, higher-order autocorrelated errors with quantile regression and probabilistic forecasting.
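The core idea of adjusting for first-order autocorrelated errors can be sketched as follows (pure Python; the function name is illustrative, and in the paper the coefficient rho is learned jointly with the network parameters by gradient descent rather than fixed):

```python
def ar1_adjusted_loss(preds, targets, rho):
    """Squared-error loss on AR(1)-whitened residuals: if errors follow
    e_t = rho * e_{t-1} + eps_t, then training on (e_t - rho * e_{t-1})
    recovers uncorrelated noise eps_t."""
    loss, n = 0.0, 0
    for t in range(1, len(targets)):
        e_t = targets[t] - preds[t]
        e_prev = targets[t - 1] - preds[t - 1]
        loss += (e_t - rho * e_prev) ** 2
        n += 1
    return loss / n
```

With rho = 0 this reduces to the ordinary squared-error loss, so the adjustment strictly generalizes the standard uncorrelated-errors assumption.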
by Jiawei He
Many complex time series can be effectively subdivided into distinct regimes that exhibit persistent dynamics. Discovering the switching behaviour and the statistical patterns in these regimes is important for understanding the underlying dynamical system. State Space Models (SSMs) are a powerful tool for such tasks, especially when combined with neural networks, since they provide a principled framework for time series modelling. One of the most popular SSMs is the Linear Dynamical System (LDS), which models the dynamics of the data using a continuous latent variable, called the state, that evolves with Markovian linear transitions. The assumptions of the LDS allow for exact inference of the states; however, they are too restrictive for real-world systems, which often exhibit piecewise linear or nonlinear hidden dynamics with a finite number of operating modes or regimes.
In this paper, the Recurrent Explicit Duration Switching Dynamical System (RED-SDS) is proposed. RED-SDS is a nonlinear state space model capable of identifying both state- and time-dependent switching dynamics. State-dependent switching is enabled by a recurrent state-to-switch connection, and an explicit duration count variable is used to improve time-dependent switching behaviour. The authors also propose an efficient hybrid inference and learning algorithm that combines approximate inference for states with conditionally exact inference for switches and counts. The model is trained by maximizing a Monte Carlo lower bound of the marginal log-likelihood that can be computed efficiently as a byproduct of the inference routine. Thorough evaluation on a number of benchmark datasets for time series segmentation and forecasting demonstrates that RED-SDS can learn meaningful duration models, identify both state- and time-dependent switching patterns, and extrapolate the learned patterns consistently into the future. Future research directions include semi-supervised time series segmentation: for timesteps where the correct regime label is known, it is straightforward to condition on this additional information rather than performing inference; this may improve segmentation accuracy while providing an inductive bias that corresponds to an interpretable segmentation.
by Nazanin Mehrasa
Event sequences, a special form of time-series data, consist of discrete events in continuous time, meaning that events happen asynchronously. This type of data is prevalent in a wide variety of applications, such as social networks, the stock market, healthcare, and seismology. To analyze event sequences and perform tasks such as future prediction, it is crucial to understand the complex influences of events on each other, including excitation, inhibition, and how the strength of these influences varies with time.
In this work, the authors propose a temporal point process framework for modeling event sequences. A temporal point process (TPP) is a mathematical framework for characterizing and modeling event sequences, usually defined by specifying an intensity function that encodes the expected rate of events. To define the intensity, most previous neural point processes use recurrent neural networks and often couple all the temporal dependencies in a black box, which lacks interpretability on how events influence each other. In addition, existing work often assumes simple functional forms for modeling the influence strength, which limits the model’s expressiveness (e.g. exponential time decay of the influence strength). In this paper, the authors propose SPRITE, short for Self-adaptable Point pRocess wIth nonparametric Time dEcays, which defines the intensity by decoupling the influences between every pair of events in the history and models each influence via a nonparametric function of event types and timing. They introduce a general construction that covers all possible time-decaying functions of the influence strength, resulting in a more flexible and expressive model while providing more interpretability. The proposed model outperforms baseline models on synthetic and real-world datasets, demonstrating the effectiveness of the proposed approach.
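For context, the classic fixed-form intensity that SPRITE generalizes is the Hawkes intensity with exponential time decay. A minimal sketch (pure Python; parameter names are illustrative):

```python
from math import exp

def hawkes_intensity(t, history, mu, alpha, beta):
    """Hawkes process intensity with exponential time decay: mu is the
    base rate, and each past event at time s < t adds an excitation of
    alpha * exp(-beta * (t - s)). This fixed exponential decay is the
    kind of simple functional form SPRITE replaces with a learned,
    nonparametric decay."""
    return mu + sum(alpha * exp(-beta * (t - s)) for s in history if s < t)

# Intensity at t=2 after events at t=0 and t=1
print(hawkes_intensity(2.0, [0.0, 1.0], mu=0.1, alpha=0.5, beta=1.0))
```

Here every event's influence is excitatory and decays at the same fixed rate; SPRITE instead learns a separate, flexible decay per event-type pair, which can also capture inhibition.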
by Siqi Liu
Out-of-Distribution (OOD) detection aims to detect examples in the test data that are not from the same distribution as the training data. Detecting these anomalous instances can not only have great value on its own for applications like alerting systems, where the purpose is to discover these instances, but also help to avoid or reduce the risks of applying machine learning models, especially in risk-averse situations such as healthcare and finance. In this work, the authors study the problem of OOD detection for data generated by temporal point processes (TPPs), i.e., event sequences. They connect OOD detection with goodness-of-fit (GOF) tests in TPPs and propose a new statistic, Sum-of-Squared-Spacings (3S), for GOF tests, addressing some limitations of existing widely used methods, such as insensitivity to the total number of events. In the experiments, their method shows strong and stable performance across different types of generative processes and real-world datasets.
by Siqi Liu and Yik Chau Liu
Empirical Risk Minimization (ERM) is commonly used to train models in machine learning, but in practice, distribution shifts (or domain shifts) can cause problems for these models and result in suboptimal performance. Researchers have previously studied similar problems in several related areas such as domain adaptation, domain generalization, and meta-learning. In this work, the authors combine ideas from meta-learning and domain adaptation and propose a generic framework termed Adaptive Risk Minimization (ARM). In this framework, the model is meta-learned on the training data such that it can adapt to distribution shifts at test time with only unlabeled data. The model consists of an adaptation model and a prediction model and is optimized for post-adaptation performance. The authors develop several methods within this framework, using either a contextual or a gradient-based approach; in experiments these show better performance than previous methods that focus on either test-time domain adaptation or training-time domain generalization alone, demonstrating the benefits of combining adaptation with meta-learning.
by Ruizhi Deng
How should we fit partial observations of continuous time-series dynamics on discrete time grids? Using a probabilistic model with continuous dynamics is an intuitively promising idea: defining continuous dynamics permits us to sample trajectories over a continuous time range and perform inference at arbitrary time points. Deep learning models equipped with continuous dynamics were not actively studied until recently. Continuous Latent Process Flows (CLPF) can be viewed as an extension of two recent models: Latent SDE and Continuous Time Flow Process (CTFP). CLPF combines the expressive power of Latent SDE with the time-dependent decoding of CTFP as a better inductive bias for generating trajectories continuous in time. In addition, CLPF also proposes a flexible approach to constructing the posterior process for variational approximation in a principled, piecewise manner. CLPF demonstrates competitive performance on both synthetic and real-world data.
by Ruizhi Deng
Normalizing flows are generative models that transform a simple base distribution into a complex target distribution using an invertible mapping. The affine coupling layer is a popular choice of basic building block in normalizing flows, as the determinant of the transformation’s triangular Jacobian can be computed in linear time. Normalizing flows using affine coupling layers have also demonstrated promising success when scaled up to high-dimensional data like images. However, understanding of affine coupling flows’ theoretical properties, especially their representation power, remained limited until recently. Previous works studying the universal approximation property of affine coupling flows rely on constructions leading to ill-behaved Jacobians that are nearly singular, causing difficulties in practice. Under mild assumptions, the present work employs a different construction to show that the standard Gaussian can be transformed by affine coupling flows to approximate a target distribution arbitrarily well in the Wasserstein distance. The construction pads the target distribution with standard Gaussian noise, and the determinant of the transformation’s Jacobian is bounded above and below by constants. The proposed construction is supported by practices from previous works that improve the training of normalizing flows, and it has broader implications for the universal approximation power and training of other types of normalizing flows.
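A minimal sketch of a single affine coupling layer (pure Python on lists; function names and the positivity convention for the scales are illustrative assumptions) shows why the log-determinant is cheap:

```python
from math import log

def affine_coupling_forward(x, scale_fn, shift_fn, split):
    """One affine coupling layer: the first `split` coordinates pass
    through unchanged and parameterize an elementwise affine map of the
    remaining coordinates. The Jacobian is triangular, so its
    log-determinant is simply the sum of log-scales."""
    x1, x2 = x[:split], x[split:]
    s = scale_fn(x1)  # elementwise scales, assumed positive
    t = shift_fn(x1)
    y2 = [si * xi + ti for si, xi, ti in zip(s, x2, t)]
    log_det = sum(log(si) for si in s)
    return x1 + y2, log_det

# Toy layer on a 2-D input: scale the second coordinate by 2, shift by 1
y, log_det = affine_coupling_forward(
    [0.5, 3.0], lambda x1: [2.0], lambda x1: [1.0], split=1
)
```

The layer is invertible whenever the scales are strictly positive, and keeping them bounded above and below by constants is exactly the well-behaved-Jacobian property the paper's construction guarantees.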
by Alex Radovic
This NeurIPS, a number of papers have drawn exciting connections between diffusion, normalizing flow, and variational autoencoder based generative models. These connections, motivated by theory, are allowing for improved optimization of diffusion models, an extremely exciting and performant family of generative models. This paper specifically uses these connections to motivate a new optimization strategy that improves likelihood estimation with score-based diffusion models. Diffusion models learn a process that transforms data samples into pure noise, and a reversal of that same process, which allows them to act as powerful generative models, creating convincing data samples from noise. Score-based diffusion models are trained to minimize a weighted combination of score matching losses, and are defined by an SDE. These score-based models can be interpreted as continuous normalizing flows, allowing for exact likelihood calculations, while still being trained with a score matching loss. Training with score matching is much more efficient than training a continuous normalizing flow, which requires expensive calls to an ODE solver at every step of training. However, this objective provides no guarantee that likelihood scores improve. This paper provides a new weighted score loss that is shown to upper-bound the negative log-likelihood, analogous to the lower bound used when training variational autoencoders. This novel, theory-motivated loss is then shown to empirically improve likelihood estimation across a variety of score-based diffusion models and datasets. Broadly, this work and others at NeurIPS suggest that score-based diffusion models, with appropriate optimization choices, can provide likelihood-estimation performance competitive with continuous normalizing flows but with far more efficient training.
by Andreas Lehrmann
Neural ordinary differential equations (Neural ODEs) are a popular class of statistical models based on a continuous-depth parametrization of a hidden state’s derivative. Extensions of this idea form the basis for a variety of latent variable models for asynchronous time-series data, including models based on latent ODEs, continuously-indexed normalizing flows, and neural stochastic differential equations. The optimization of Neural ODEs (i.e., computing gradients of network parameters w.r.t. a loss) is based on the adjoint sensitivity method, which includes expensive calls to a black-box ODE solver. Neural flows circumvent this problem by directly modeling the solution curves (the flow) instead of the original ODE and, as a result, do not have to rely on ODE solvers. One technical challenge is that the architecture representing the flow must be a valid solution to an ODE (e.g., the solution curves corresponding to different initial values cannot intersect). The paper formalizes these constraints and demonstrates how popular neural network layers, such as residual layers, GRU cells, or coupling layers, can be adapted accordingly. Applications of this approach include flow versions of encoder-decoder architectures like ODE-RNNs for filtering/smoothing, as well as flow versions of normalizing flows for time-dependent density estimation (1). Comprehensive experiments show that neural flows not only outperform their ODE counterparts in terms of raw performance but, depending on the task and architecture, can also be up to an order of magnitude faster.
^{1} Note that the use of “flow” in this sentence is overloaded, with “normalizing flows” and “neural flows” referring to two completely different concepts.
by Matthew Schlegel
As more machine learning models are applied to high-stakes applications, explaining a model’s predictions is a necessary part of responsible use of these models. Reasoning about how changes in the input change a model’s prediction is known as a counterfactual explanation. This paper extends the framework of counterfactual explanations to sequences of decisions, finding optimal counterfactual policies that maximize an outcome while remaining close to the observed action sequence. The policies returned by their polynomial-time algorithm improve outcomes on a series of synthetic and real datasets. The authors posit that counterfactual policies can further elucidate complex decision-making processes and, specifically, give insight when counterfactual actions are concentrated on a few critical decision points. Looking beyond one-step decisions to multi-step action sequences is critical for explaining complex decision-making algorithms. This paper provides excellent groundwork for building counterfactual explanations along trajectories.
by Yanshuai Cao
Many have hypothesized that deep learning and causal inference could complement each other well. On the one hand, understanding cause and effect could help fix some known issues with deep learning, such as poor out-of-distribution generalization and lack of robustness to adversarial attacks. On the other hand, the power of representation learning could scale causal inference to high-dimensional problems. However, most existing works that employ causal inference with deep neural networks treat them as separate stages. For example, following Pearl’s do-calculus, a symbolic computation step is first executed for causal identification, turning a causal question into a statistical estimation problem, which can then be solved by fitting deep neural nets.
In this work, the authors combine causal inference and neural networks on a more fundamental level with the proposed neural causal models (NCMs) and perform causal identification via gradient descent in the same process as neural net parameter learning. No symbolic computation is needed, just the structural knowledge expressed through the design of the neural net, which deep learning researchers already spend lots of time engineering. The paper also has theoretical results on the expressivity and identifiability of NCMs, which follow from the universal approximation theorem for feedforward neural nets and the “Causal No Free Lunch” principle entailed by Pearl’s Causal Hierarchy.
by Peng Xu
What problem does it solve?  It leverages sparsity to make large Transformer models scale efficiently. Specifically, the goal is to perform inference faster than the standard Transformer as the model size scales up, while retaining the empirical performance on real tasks.
Why is this important?  The Transformer architecture has achieved huge successes in the field of natural language processing in recent years, and lately it has also gained great popularity in other fields. At the same time, the size of Transformer models keeps growing, and so do the costs such models incur. As a result, it is increasingly important to make them scale efficiently.
The approach taken  This paper addresses the problem by proposing Scaling Transformers, which use a separate sparse mechanism for the query, key, value, and output layers (Sparse QKV for short) and combine it with sparse feedforward blocks (Sparse FF for short) to get a fully sparse Transformer architecture.
Results:
Scaling Transformers also yield competitive results on challenging real-world tasks, like summarizing arXiv articles, compared to state-of-the-art approaches.
by Thibaut Durand
Predicting the future is a fundamental research problem with a large range of applications like demand forecasting, autonomous driving, robotics, and health care. However, this research problem is very challenging because the future is uncertain. Probabilistic generative models have shown promising results for this problem. This paper introduces the Probabilistic Transformer (ProTran) model, a state space model (SSM) based on transformer architectures for multivariate time series. Unlike existing models, it does not rely on recurrent neural networks but on the attention mechanism, which has shown promising results for modeling long-range dependencies. Compared to other transformer-based models, the Probabilistic Transformer is capable of generating diverse long-term forecasts with uncertainty estimates. It shows very good performance on several tasks like time series forecasting and human motion prediction. Probabilistic time series forecasting is an active research problem at Borealis AI. I really like the Probabilistic Transformer model because it combines the strengths of state space models and transformer architectures. I think that capturing the uncertainty inherent to the future can lead to strong time series forecasting models that will help to make better financial decisions.
by Peng Xu
In the past years, Transformers have shown great successes in multiple domains. However, the training cost of a Transformer can be expensive, in particular for the large models designed recently. This paper proposes to reduce the costs of Transformers by searching for a more efficient variant. To find Transformer alternatives, the authors designed a new search space. Then, they used the Regularized Evolution with hurdles search algorithm to find the most training-efficient architecture in the search space. They discovered a new model called Primer (PRIMitives searched transformER). The main finding is that the compute savings of Primer over Transformers increase as training cost grows, when controlling for model size and quality. The authors also found that the improvements of Primer over the Transformer can mostly be attributed to two modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. It is interesting to see that these modifications are easy to implement and can easily be added to existing Transformer codebases. The authors observed that these changes can significantly speed up training of existing Transformer models without additional tuning. Improving the training of Transformers is an active research area at Borealis AI. Making training efficient can be critical for models working on non-stationary data like time series forecasting models. I like this paper because it shows that some small architectural changes can substantially improve the training of Transformers.
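As a rough illustration of how simple the two modifications are, here is a minimal NumPy sketch of a squared ReLU and a causal depthwise 1D convolution applied to a projection's output. The function names and shapes are my own, not Primer's implementation:

```python
import numpy as np

def squared_relu(x):
    # Primer's activation: ReLU followed by squaring.
    return np.maximum(x, 0.0) ** 2

def depthwise_conv1d(x, kernels):
    """Causal depthwise convolution over the sequence axis.
    x: (seq_len, d_model); kernels: (kernel_size, d_model), one filter per channel."""
    ksize, d = kernels.shape
    pad = np.vstack([np.zeros((ksize - 1, d)), x])  # left-pad: no future leakage
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (pad[t:t + ksize] * kernels).sum(axis=0)
    return out

rng = np.random.default_rng(4)
q = rng.normal(size=(10, 16))                 # e.g. a query projection's output
q = depthwise_conv1d(q, rng.normal(size=(3, 16)))
print(squared_relu(np.array([-1.0, 2.0])))    # [0. 4.]
```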
by Amir Abdi
One of the challenges of representation learning with Variational Autoencoders (VAEs) is a model identification issue related to the rotational symmetries of the latent space, caused by the rotational invariance of the standard Gaussian prior. These rotational symmetries can cause strong correlations between latent variables, which, in turn, hinder disentanglement. Moreover, because of the lack of rotational constraints, high variation in disentanglement metrics is observed between experiments with different seeds.
In this work, and inspired by Independent Component Analysis (ICA), the authors propose the Jacobian L1 regularized VAE, an extension of Beta-VAE with an added L1 norm on the Jacobian of the generator function, to address the rotational identifiability issue. The L1 loss encourages local alignment of the axes of the latent representation with individual factors of variation. They demonstrate improvements over Beta-VAE, FactorVAE, DIP-VAE-I, DIP-VAE-II, Beta-TCVAE, and annealed VAE on extended versions of disentanglement metrics, i.e., MIG and Modularity, which focus on local disentanglement across factors of variation. This solution helps with local alignment of the factors of variation but does not address global alignment. Because the full Jacobian of the generator is calculated during training, the compute time scales linearly with the number of latent dimensions.
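To make the regularizer concrete: the penalty is the L1 norm of the generator's Jacobian at a latent code, and computing it requires one generator evaluation per latent dimension, which is why compute scales linearly with latent size. Below is a finite-difference NumPy sketch of that quantity on a toy linear "generator"; the helper name is hypothetical and the paper's implementation uses autograd rather than finite differences:

```python
import numpy as np

def jacobian_l1(generator, z, eps=1e-4):
    """Finite-difference estimate of ||J_g(z)||_1, the L1 norm of the
    generator's Jacobian at latent code z (illustrative helper only)."""
    z = np.asarray(z, dtype=float)
    total = 0.0
    for i in range(z.size):              # one pass per latent dimension
        dz = np.zeros_like(z)
        dz[i] = eps
        col = (generator(z + dz) - generator(z - dz)) / (2 * eps)
        total += np.abs(col).sum()       # L1 norm of the i-th Jacobian column
    return total

A = np.array([[2.0, 0.0], [0.0, -3.0]])
g = lambda z: A @ z                      # linear generator: Jacobian is A itself
print(jacobian_l1(g, np.zeros(2)))       # 5.0 = |2| + |-3|
```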
The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.
We offer the tools to solve any natural language processing problem that a developer might have through the use of large language models. We have an API that allows people to access, fine-tune and deploy these state-of-the-art models, giving them the ability to solve pretty much any problem they can formulate.
(NF):
We're doing something that is transformative; getting computers to understand language has really broad impacts. We know we need to respect the power of the technology and understand the ways in which it could be used for good and for bad.
As the builder of that tool, you want to enable the good things that can be done with it while, at the same time, making the bad things that can be done with it more difficult to do and less effective.
(NF):
There's no silver bullet for this, even though there's a lot of people working on it. Languages continuously change – they're living things – so there will never be a complete lock on this.
If you take an extreme view, one way to address safety concerns would be to limit access to just the handful of companies that have the resources to create their own large language models. But we think the technology's really good. We think it's really transformative. And we want people to have access to it. So limiting access as a way to improve security is obviously not ideal.
The middle ground is that you make the technology as good as you can, and as ethical and responsible as you can. You then deploy it in a way that gives as many people access to it as possible, while balancing the risk and ensuring it is deployed responsibly.
(NF):
Let’s use hateful content as an example. Prior to deploying the model, we spend a lot of time trying to reduce the likelihood of it generating hateful or identity-based hate content.
The most straightforward way is by changing the distribution of the training data. And that can be done with some really simple techniques like word-level filtration – where documents are removed from the training data if they contain a word from a pre-populated list of slurs, for example. But that obviously doesn’t catch everything.
Some techniques are much more sophisticated. For example, we recently posted a paper that described how we are using our model to self-identify words and text that should be added to the list. In other words, we are using earlier versions of our big language model to remove harmful data for the next iteration of the model.
(NF):
Not really. We haven’t seen a drop in performance when filtering out identity-based hate speech, for example. If we do see a drop in performance, the impact is generally on the model’s ability to generate identity-based hate speech. So it’s really a win-win.
(NF):
I think it’s tempting to just say that morality is subjective. To ask, “Who are we to make the decisions?” It’s easy to abdicate the decision. But I don't agree with that at all.
I think it’s far better to recognize that it’s subjective, and then to work really hard to make the right decisions based on input from as many smart people as possible. And I think founders of startups have an even greater responsibility to ensure the technologies they are building are contributing to the good in the world. We cannot just simply abdicate that responsibility to users.
(NF):
I’ll be the first to admit that I'm not an expert in ethics. That's not my background. And I know that. So it's really helpful to have a group of people who have studied that area and its intersections with technology.
We set up a Responsibility Council at Cohere. And when we’re faced with a complicated problem, we can reach out to this group of diverse people to get their input. They give us suggestions. They pay attention to how we're doing things. And they give us advice and recommendations and tell us if we're doing the right stuff.
I think in the technology sector, we often think most problems can be addressed by applying more tech. But the reality is that there are a whole bunch of complicated problems that can't be addressed with pure tech solutions. These are problems that require people who have spent a lot of time thinking about a bunch of the other domains of research that are not the hard sciences.
(NF):
We take a holistic and distributed approach to this. Alongside our Responsibility Council, we have our own internal experts who are largely dedicated to working on responsibility. We also want these concepts and ideas to be flowing across the organization and through the culture. So we try to distribute some of the responsibilities across the whole team, encouraging as many people as possible to work on it.
The point is to ensure the idea of responsibility doesn’t get stuck in siloed thinking – that people are engaged on these topics as much as possible, and you are making sure it is spread out across the organization. Responsibility can’t just exist on a slide in the organizational mission statement.
(NF):
We really need to respect the technology that we work with. Machine learning can work. It can be transformative. It can have a massive impact on people’s lives. So you need to make sure you are building something that is having a positive impact and minimizing the potential for negative impact.
At Cohere, we try to think about these issues as early as possible in the development cycle. And we are working with a bunch of really smart people to help ensure we don’t allow a blindspot to emerge down the road.
My advice would be to get as much input from as many different people as possible. And to start thinking about it from the very start. Other than that, just try to do your best.
Nick Frosst is the co-founder of Cohere. Prior to founding Cohere, Nick worked on neural network research as part of Geoffrey Hinton’s Toronto Google Brain team, focusing on capsule networks, adversarial examples, and explainability. Nick holds a BSc from the University of Toronto, with a double major in Computer Science and Cognitive Science. Nick co-founded Cohere in January 2019 with Aidan Gomez and Ivan Zhang.
At Borealis AI, we firmly believe that the development of responsible ML requires diverse views, research and talent. And we are committed to encouraging greater diversity and inclusion in our actions, our research and our collaborative partnerships.
That is why Borealis AI is proud to support the 2021 Women in Machine Learning (WiML) Workshop. This important event gives female-identified faculty, research scientists, and graduate students in the machine learning community an opportunity to meet, exchange ideas and learn from each other. In doing so, WiML is on a mission to increase gender diversity in ML, help women-identified individuals in ML succeed professionally, and increase their impact within their communities.
“At Borealis AI, we are committed to empowering and engaging female-identified researchers in the field of ML,”
noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.
“Alongside our range of other diversity and inclusion initiatives, we hope our support of the 2021 WiML Workshop at NeurIPS provides those researchers – and those aspiring to join the field of ML – with the role models, ideas and inspiration to drive their career in ML forward.”
Hosted virtually within the 2021 Conference on Neural Information Processing Systems (NeurIPS), this year’s event builds on 15 years of programs designed around substantive technical and professional conversations held within positive, supportive environments. To learn more about WiML and the WiML Workshop, click here.
The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) is the preeminent forum for collaboration around computational linguistics and natural language processing. This year’s conference is expected to attract around 4,000 attendees, both physical and virtual. But, for a wide variety of reasons, forums like this can often be difficult to access for some researchers. And that directly impacts diversity.
As a Diversity and Inclusion sponsor of EMNLP, we aim to support researchers facing various types of hardships. We are helping provide accommodations for researchers with disabilities. We are helping to subsidize attendance for those dealing with financial hardship, those with family or childcare responsibilities, and first-time attendees from underrepresented regions or groups. And we are helping to enable remote participation for researchers unable to travel to the conference.
“Borealis AI is dedicated to growing, strengthening and diversifying the global machine learning talent pool through innovative and smart partnerships like our Diversity and Inclusion sponsorship of EMNLP 2021,”
“We look forward to meeting the attendees at our virtual booth and we are excited to see what new ideas, models and technologies will emerge from the event.”
In contrast to single-task learning (STL), multitask learning (MTL) optimizes a single model to perform multiple related tasks simultaneously, aiming to improve generalization and parameter efficiency across tasks. In this case, two or more output targets are associated with the same input data. Effective multitask learning typically requires task balancing to prevent one or more tasks from dominating the optimization, to decrease negative transfer, and to avoid overfitting. Standard MTL settings usually assume a homogeneous set of tasks (for example, all tasks are classification or regression tasks) and usually non-sequential data. This scenario can greatly benefit MTL approaches with strong shared representations. In contrast, heterogeneous multitask learning is defined by multiple classes of tasks, such as classification, regression with single- or multi-label characteristics, and temporal data, being optimized simultaneously. The latter setting is more realistic but remains underexplored. In this post, we share a novel method that we recently developed for heterogeneous MTL.
Hard-parameter sharing networks [1], shown in Figure 1.b, are one of the pillars of multitask learning. These networks are composed of a shared bottom and task-specific branches. Ma et al. [2] suggested that a single shared bottom might not be enough to generalize for all tasks in an application, and proposed to use several shared bottoms, or what they call experts. The experts are combined using gate functions, and their combination is forwarded to the towers. The final architecture is called Multi-gate Mixture-of-Experts (MMoE), and is shown in Figure 1.c. MMoE generalizes better than its traditional hard-parameter sharing counterpart, but it has two weaknesses: first, it lacks a task-balancing mechanism; second, the only source of diversity among the experts is random initialization. Although the experts can indeed be diverse enough if they specialize in different tasks, there is no guarantee that this will happen in practice. We propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) (Figure 1.d), a model that induces more diversity among the experts and has a task-balancing component.
Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) takes its inspiration from ensemble learning, where diversity among learners tends to improve generalization. MMoEEx can be divided into three parts: gates, experts, and towers. Considering an application with $K$ tasks and input data $x \in \mathbb{R}^d$, the gate function $g^k()$ is defined as:
\begin{equation}
\label{eq:g}
g^k(x) = \text{softmax}(W^k x), \forall k \in \{0,...,K\} \tag{1}
\end{equation}
where $W^k \in \mathbb{R}^{E \times d}$ are learnable weights and $E$ is the number of experts, defined by the user. The gates control the contribution of each expert to each task.
The experts are $f_e(),\forall e\in \{0,...,E\}$. Our implementation is very flexible and accepts several expert architectures, which is essential for applications with different data types. For example, when working with temporal data, the experts can be LSTMs, GRUs, or RNNs; for non-temporal data, the experts can be dense layers. The experts' and gates' outputs are combined as follows:
\begin{equation}
\label{eq:f}
f^k(x) = \sum_{e=0}^Eg^k(x)f_e(x), \forall k \in \{0,...,K\} \tag{2}
\end{equation}
The $f^k()$ are the inputs to the towers, the task-specific part of the architecture. Their design depends on the data type and tasks. The towers $h^k$ output the task predictions as follows:
\begin{equation}
y^k = h^k(f^k(x)), \forall k \in \{0,...,K \} \tag{3}
\end{equation}
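To make Equations 1–3 concrete, here is a minimal NumPy sketch of a single MMoE-style forward pass with dense experts and linear towers. The shapes, weight names, and random initialization are illustrative assumptions, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d, E, K = 8, 4, 3            # input dim, number of experts, number of tasks
x = rng.normal(size=d)       # a single input example

# Experts f_e: simple dense layers here (the framework allows any architecture).
W_experts = rng.normal(size=(E, d, d))
expert_out = np.stack([We @ x for We in W_experts])      # (E, d)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Gates g^k (Eq. 1): one softmax over the E experts per task.
W_gates = rng.normal(size=(K, E, d))
f_k = []
for k in range(K):
    g = softmax(W_gates[k] @ x)                          # (E,) mixture weights
    f_k.append((g[:, None] * expert_out).sum(axis=0))    # Eq. 2: weighted sum

# Towers h^k (Eq. 3): task-specific heads, one scalar prediction per task.
W_towers = rng.normal(size=(K, 1, d))
y = np.array([(W_towers[k] @ f_k[k]).item() for k in range(K)])
print(y.shape)  # (3,): one prediction per task
```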
Previous mixture-of-experts models like [2] leverage several experts to make their final predictions; however, they rely on indirect approaches, such as random initialization, to foster diversity among the experts, and on the expectation that the gate function will learn how to combine them. Here we propose a mechanism to induce diversity among the experts, defined as $\textit{exclusivity}$.
Exclusivity: We set $\alpha E$ experts to be exclusively connected to one task. The value $\alpha\in[0,1]$ controls the proportion of experts that will be $\textit{exclusive}$. If $\alpha=1$, all experts are exclusive, and if $\alpha=0$, all experts are shared (same as MMoE). An exclusive expert is randomly assigned to one of the tasks $T_k$, but the task $T_k$ can still be associated with other exclusive experts and shared experts.
MMoEEx, similarly to MMoE, relies on the expectation that gate functions will learn how to combine the experts. Our approach induces more diversity by forcing some of these gates to be 'closed' to some experts, and the exclusivity mechanism is used to close part of the gates. The remaining nonclosed gates learn to combine the output of each expert based on the input data, according to Equation 1.
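One simple way to realize the closed gates is a binary mask over the gate logits, so that a closed gate receives exactly zero probability after the softmax. The NumPy sketch below is one possible implementation of this idea, with illustrative sizes, not our training code:

```python
import numpy as np

rng = np.random.default_rng(1)
E, K, alpha = 6, 3, 0.5

# Exclusivity mask: M[k, e] = 0 closes expert e's gate for task k.
M = np.ones((K, E))
n_exclusive = int(alpha * E)                     # alpha*E exclusive experts
exclusive = rng.choice(E, size=n_exclusive, replace=False)
for e in exclusive:
    k = rng.integers(K)                          # randomly assign expert e to task k
    M[:, e] = 0.0
    M[k, e] = 1.0                                # only task k can use expert e

# Apply the mask during the gate softmax: closed gates get probability 0.
logits = rng.normal(size=(K, E))
masked = np.where(M > 0, logits, -np.inf)
gates = np.exp(masked - masked.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)
```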
Competing task optimization is another challenge when optimizing heterogeneous tasks. The goal of the MAML-MTL optimization is to balance the tasks at the gradient level. Finn et al. [3] proposed Model-Agnostic Meta-Learning (MAML), a two-step optimization approach originally intended for transfer learning and few-shot learning due to its fast convergence. Initial attempts to apply MAML to MTL show that it can balance tasks at the gradient level and yield better results than some existing task-balancing approaches [4]. The core idea is that MAML's temporary update yields smoothed losses, which also smooth the gradients in direction and magnitude. However, unlike [4], we do not freeze task-specific layers during the intermediate (inner) update. The MAML-MTL approach is shown in Figure 2: each task loss is evaluated; each task loss is then used to temporarily update the network; the tasks are re-evaluated at the temporary parameters; and the resulting task-specific losses are aggregated into the final loss, which provides the actual network update.
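The steps above can be sketched on a toy problem. Here each task has a simple quadratic loss; every iteration performs a temporary (inner) update per task and then uses the gradients re-evaluated at the temporary parameters, a first-order approximation, for the actual update. This is an illustrative sketch under those assumptions, not our training code:

```python
import numpy as np

# Two toy tasks with quadratic losses L_k(theta) = 0.5 * ||theta - t_k||^2,
# so grad L_k(theta) = theta - t_k.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
theta = np.zeros(2)
inner_lr, outer_lr = 0.1, 0.5

def grad(theta, t):
    return theta - t

for _ in range(100):
    outer_grad = np.zeros_like(theta)
    for t in targets:
        # 1. temporary (inner) update on this task's loss
        theta_tmp = theta - inner_lr * grad(theta, t)
        # 2. re-evaluate the task at the temporary point; first-order
        #    approximation: use the gradient at theta_tmp for the real update
        outer_grad += grad(theta_tmp, t)
    # 3. aggregate the re-evaluated task gradients into one actual update
    theta = theta - outer_lr * outer_grad / len(targets)

print(theta)  # converges toward the balanced point between the two tasks
```

With symmetric quadratic tasks the balanced solution is the midpoint of the task targets, which is what the smoothed gradients drive the parameters toward.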
The Medical Information Mart for Intensive Care (MIMIC-III) database was proposed by [5] as a benchmark dataset for MTL on time-series data. It contains metrics of patients from over 40,000 intensive care unit (ICU) stays. This dataset has 4 tasks: two binary classification tasks, one temporal multi-label task, and one temporal classification task. Figure 3 shows the neural network adopted in our work and where each task is calculated.
The full set of results for the MIMIC-III dataset is presented in Table 1. We compared our approach, MMoEEx, with the multitask channel-wise LSTM (MCW-LSTM) [6], a single-task trained network, a shared-bottom network, and MMoE [2].
MMoEEx outperforms all the compared approaches except on the phenotype (Pheno) task. For both time-series tasks (LOS and Decomp), our approach outperforms all baselines. It is worth noting that for the LOS task, which is the hardest task in MIMIC-III, we obtain a relative improvement of more than $40$ percentage points over the multitask channel-wise LSTM [6] and more than $16$ percentage points over MMoE.
| Method | Pheno | LOS | Decomp | Ihm | $\Delta$ |
| --- | --- | --- | --- | --- | --- |
| MCW-LSTM [6] | 77.4 | 45.0 | 90.5 | 87.0 | +0.28% |
| Single Task [6] | 77.0 | 45.0 | 91.0 | 86.0 | – |
| Shared Bottom | 73.36 | 30.60 | 94.12 | 82.71 | -9.28% |
| MMoE | 75.09 | 54.48 | 96.20 | 90.44 | +7.36% |
| MMoEEx | 72.44 | 63.45 | 96.82 | 90.73 | +11.74% |
We measured how diverse MMoEEx experts are compared to traditional MMoE.
The diversity among experts can be scored by the distance between the experts' outputs $f_e, \forall e\in\{0,..., E\}$. Considering a pair of experts $i$ and $j$, the distance between them is defined as:
\begin{equation}
d_{i,j} = \sqrt{\sum_{n=0}^N\left(f_i(x_n) - f_j(x_n)\right)^2} \tag{4}
\end{equation}
where $N$ is the number of samples in the dataset, $d_{i,j} = d_{j,i}$, and a matrix $D \in \mathbb{R}^{E\times E}$ is used to keep all the distances. To scale the distances into $d_{i,j}\in [0,1]$, we divide the raw entries in the distance matrix $D$ by the maximum distance observed, $\text{max}(D)$. A pair of experts $i,j$ with $d_{i,j} = 0$ are considered identical, and experts distances $d_{i,j}$ close to 0 are considered very similar; analogously, experts with $d_{i,j}$ close to 1 are considered very dissimilar. To compare the overall distance between the experts of a model, we define the $\textit{diversity score}$ $\bar{d}$ as the mean entry in $D$.
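Equation 4 and the diversity score can be computed in a few lines. The NumPy sketch below uses random expert outputs in place of a trained model, purely to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(2)
E, N = 12, 100
outputs = rng.normal(size=(E, N))    # f_e(x_n) for each expert e and sample n

# Eq. 4: Euclidean distance between every pair of experts' outputs.
diff = outputs[:, None, :] - outputs[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

D = D / D.max()                      # scale distances into [0, 1]
diversity = D.mean()                 # diversity score: mean entry of D
```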
We analyze the diversity score of the MMoE and MMoEEx experts on MIMIC-III. The MMoE and MMoEEx models compared have the same neural network structure, but MMoEEx uses the MAML-MTL optimization and has diversity enforced. The MMoEEx model in Figure 4 was created with $\alpha = 0.5$ and exclusivity. In other words, half of the experts in the MMoEEx model were randomly assigned to be exclusive to one of the tasks, while in the MMoE model all experts are shared among all tasks. Figure 4 shows a heatmap of the distances $D^{MMoE}$ and $D^{MMoEEx}$ calculated on the MIMIC-III testing set with 12 experts. MMoE's heatmap has overall lighter colors, indicating smaller diversity scores, compared with MMoEEx. Quantitatively, MMoEEx produces a relative lift of $43\%$ in diversity score.
We presented a novel multitask learning approach called Multigate MixtureofExperts with Exclusivity (MMoEEx), which extends previous methods by introducing an exclusivity mechanism that induces more diversity among experts, allowing the network to learn representations that are more effective for heterogeneous MTL. We also introduce a twostep optimization approach called MAMLMTL, which balances tasks at the gradient level and enhances MMoEEx's capability to optimize imbalanced tasks.
MTL has achieved critical mass in multiple areas like natural language processing [7, 8, 9], computer vision [10, 11, 12], reinforcement learning [13, 14] and multimodal learning [15, 16]. Standard soft/hard parameter sharing approaches are a well-established technique for handling multiple tasks. While they show improvements over single-task learning for tasks with similar characteristics, how MTL can further improve heterogeneous task scenarios is not fully explored. Hybrid approaches like mixture of experts can mitigate several limitations of standard approaches and further extend their capabilities when coupled with specialized optimization methods. Optimization methods for MTL are in their infancy, and more research on meta-learning task balancing can greatly benefit MTL research. We hope this work inspires the community to further investigate multitask learning at the network architecture and optimization levels.
^{[1] } Rich Caruana. Multitask learning: A knowledgebased source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 1993.
^{[2] } Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multitask learning with multigate mixtureofexperts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
^{[3] } Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
^{[4] } Sungjae Lee and Youngdoo Son. Multitask learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020.
^{[5] } Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.
^{[6] } Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman LiWei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimiciii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
^{[7] } Victor Sanh, Thomas Wolf, and Sebastian Ruder. A hierarchical multitask approach for learning embeddings from semantic tasks. In AAAI Conference on Artificial Intelligence, volume 33, 2019.
^{[8] } Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multitask deep neural networks for natural language understanding. In Annual Meeting of the Association for Computational Linguistics, 2019.
^{[9] } Cagla Aksoy, Alper Ahmetoglu, and Tunga Güngör. Hierarchical multitask learning approach for BERT. arXiv preprint arXiv:2011.04451, 2020.
^{[10] } Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multitask learning. In Advances in Neural Information Processing Systems, 2020.
^{[11] } Shikun Liu, Edward Johns, and Andrew J Davison. Endtoend multitask learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
^{[12] } Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTINet: Multiscale task interaction networks for multitask learning. In European Conference on Computer Vision, 2020.
^{[13] } Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025, 2016.
^{[14] } Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multitask deep reinforcement learning with popart. Technical report, DeepMind, 2019.
^{[15] } Subhojeet Pramanik, Priyanka Agrawal, and Aman Hussain. OmniNet: A unified architecture for multimodal multitask learning. arXiv preprint arXiv:1907.07804, 2019.
^{[16] } Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12in1: Multitask vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
Reinforcement learning (RL) has moved from toy domains to real-world applications such as navigation [4], software engineering [2], industrial design [11], and finance [10]. Each of these applications has inherent difficulties that are longstanding fundamental challenges in RL, such as limited training time, partial observability, large action or state spaces, costly exploration, and safety considerations, among others. Similar problems occur when using RL in trading markets; however, we focus on three aspects that we consider highly relevant for financial applications: risk-awareness, variance reduction, and robustness.
Risk is such a common term that it can have many definitions in different scenarios. Our first question then is: what is risk? In the context of trading, risk is the potential that your chosen investments may fail to deliver your anticipated outcome. That could mean getting lower returns than expected, or losing your original investment – and in certain forms of trading, it can even mean a loss that exceeds your deposit.
Our second question is: how do we measure risk? Risk assessment is a cornerstone of financial applications, and a well-known approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and is commonly referred to as "volatility". In particular, the Sharpe ratio [15], a measure widely used in trading markets, considers both the profit generated by a trading strategy and the risk (variance) associated with it. The Sharpe ratio is commonly defined as the asset return divided by the standard deviation of the asset return.
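As a concrete illustration, here is a minimal Sharpe-ratio computation in Python. The function name and toy return series are ours, and practical refinements (annualization, sample-vs-population deviation) are omitted:

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Sharpe ratio: mean excess return divided by the standard
    deviation of returns (per-period; annualization omitted)."""
    excess = np.asarray(returns, dtype=float) - risk_free_rate
    return excess.mean() / excess.std()

# Two toy strategies with the same mean return; the steadier one earns
# a higher Sharpe ratio because its volatility is lower.
steady = [0.010, 0.012, 0.009, 0.011, 0.010]
volatile = [0.050, -0.030, 0.040, -0.020, 0.012]
```

Both series have the same mean return, so the ratio rewards the lower-variance strategy.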
Traditional RL aims to optimize the expected return, usually without consideration of risk. However, risk-averse RL is a recent area that proposes to optimize an objective function that takes risk into account.
Risk-averse Q-learning (RAQL): Shen et al. [16] proposed a Q-learning algorithm that is shown to converge to the optimum of a risk-sensitive objective function:
\begin{align}
\label{eq:Risk_Averse_Objective}
\tilde{J}_{\pi}= \frac{1}{\beta}\log\mathbb{E}_{\pi}\left[\exp\left(\beta\sum_{t=0}^{\infty}\gamma^t r_t\right)\right]=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + \frac{\beta}{2}\mathrm{Var}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + O(\beta^2).
\end{align}
The training scheme is the same as in Q-learning, except that in each iteration a utility function is applied to the TD error. A utility function is a monotonically increasing function: a concave utility is applied when we want to optimize a risk-averse objective, while a convex utility is applied when we want to optimize a risk-seeking objective. In short, applying a utility function to Q-learning is a concise way to incorporate risk into RL.
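The update can be sketched in a few lines of tabular Python. This is our illustrative sketch rather than Shen et al.'s exact algorithm: the exponential form below is just one possible concave utility with u(0)=0, and the hyperparameters are made up:

```python
import numpy as np

def concave_utility(x, beta=0.5):
    """One possible concave, monotonically increasing utility with
    u(0)=0: negative TD errors (losses) are penalized more strongly
    than equally sized positive ones are rewarded."""
    return (1.0 - np.exp(-beta * x)) / beta

def raql_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99,
              utility=concave_utility):
    """One tabular Q-learning step with a utility function applied to
    the TD error, in the spirit of risk-averse Q-learning."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * utility(td_error)
    return Q
```

Swapping `concave_utility` for a convex function would give the risk-seeking variant instead.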
In trading markets we care not only about the expected return, but also about how 'safe' a strategy is. In RL, one common approach to measuring 'safety' is the variance of the return [7]. Here we mention two recent works.
Averaged DQN [1]: This approach reduces training variance by averaging previously learned Q-value estimates, which leads to a more stable training procedure and improved performance by reducing the variance of the approximation error in the target values. Averaged DQN is theoretically shown to reduce the training variance, but there is no convergence guarantee.
Variance-reduced Q-learning (VQL): Wainwright [18] proposed a variance-reduced Q-learning algorithm that can be seen as a variant of the SVRG algorithm from stochastic optimization [9]. Given an algorithm that converges to $Q^*$, one of its iterates $\bar{Q}$ can be used as a proxy for $Q^*$; the ordinary Q-learning updates are then recentered by the quantity $\hat{\mathcal{T}}_k(\bar{Q}) - \mathcal{T}(\bar{Q})$, where $\hat{\mathcal{T}}_k$ is an empirical Bellman operator and $\mathcal{T}$ is the population Bellman operator, which is not computable but can be replaced by an unbiased approximation. This algorithm is shown to converge to the optimum of the expected return and enjoys minimax optimality up to a logarithmic factor.
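A toy example illustrates why recentering reduces variance. This is our own single-state, single-action sketch rather than Wainwright's full algorithm: because the same sampled reward appears in both empirical-operator terms, its noise cancels exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, q, q_bar, r_mean = 0.9, 2.0, 1.5, 1.0   # toy values, ours

def empirical_T(q_val, r_sample):
    # Empirical Bellman operator for a toy single-state, single-action
    # MDP: one sampled reward plus the discounted current value.
    return r_sample + gamma * q_val

def population_T(q_val):
    # Population Bellman operator; known here only because the toy MDP
    # is known. In VQL it is replaced by an unbiased Monte-Carlo average.
    return r_mean + gamma * q_val

plain_targets, recentered_targets = [], []
for _ in range(1000):
    r = r_mean + rng.normal()                   # noisy sampled reward
    plain_targets.append(empirical_T(q, r))
    # SVRG-style recentering: the shared sampling noise cancels.
    recentered_targets.append(
        empirical_T(q, r) - empirical_T(q_bar, r) + population_T(q_bar))
```

In this additive-noise toy the recentered target is noise-free while keeping the same mean; in the real algorithm the cancellation is only partial, which is where the variance reduction comes from.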
Novel proposed algorithm: RA2Q [6]: Since RAQL has the advantage of converging to the optimum of a risk-averse objective function, we can use it as a building block for novel risk-averse RL algorithms. The idea of training multiple Q-tables in parallel can be integrated with the utility-function technique: we train $k$ Q-tables in parallel using the RAQL update rule, estimate the variance of each action value by the sample variance across those $k$ tables, and compute a risk-averse $\hat{Q}$ table from which more stable actions are selected. We name this algorithm RA2Q; it preserves the convergence property of RAQL.
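The action-selection idea can be sketched as follows. This is a hedged illustration: the exact combination rule and weighting used in the paper may differ, and `lam` is a hypothetical risk-aversion weight we introduce for the example:

```python
import numpy as np

def ra2q_action(Q_tables, s, lam=1.0):
    """Select an action from k Q-tables trained in parallel: the sample
    variance across tables penalizes actions whose value estimates are
    unstable. `lam` is a hypothetical risk-aversion weight."""
    Q = np.stack([q[s] for q in Q_tables])        # shape (k, num_actions)
    q_hat = Q.mean(axis=0) - lam * Q.var(axis=0, ddof=1)
    return int(q_hat.argmax())
```

With `lam=0` the rule reduces to greedy selection on the averaged tables; increasing `lam` shifts preference toward actions whose estimates agree across tables.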
Novel proposed algorithm: RA2.1Q [6]: We can also combine the 'recentering' technique of VQL with the utility-function technique in a novel algorithm, RA2.1Q. For each empirical Bellman operator $\hat{\mathcal{T}}_k$, we apply a risk-averse utility function to the TD error. Although we cannot show a convergence guarantee for RA2.1Q, it empirically obtained better results than RA2Q and RAQL in a multi-agent evaluation.
What is robustness? We usually say an algorithm is robust if it remains stable under different challenging scenarios. Recent works have improved the robustness of algorithms through adversarial learning, assuming two opposing learning processes: one that aims to produce the most disruptive perturbations and another that tries to withstand them [12].
Risk-Averse Robust Adversarial Reinforcement Learning (RARL): The same concept has been adapted to neural networks in the context of deep RL [14]; in particular, RARL [13] extended this idea by combining it with Averaged DQN. RARL trains two agents, a protagonist and an adversary, in parallel; their goals are, respectively, to maximize/minimize the expected return and to minimize/maximize its variance. RARL showed good experimental results, enhancing stability and robustness, but without providing theoretical guarantees.
Novel proposed algorithm: RA3Q [6]: The idea of having a protagonist and an adversary in the same environment lends itself to multi-agent learning algorithms. In this context, Nash Q-learning [8] is a well-known multi-agent algorithm that can obtain the optimal strategy when there exists a unique Nash equilibrium in general-sum stochastic games. Our last proposal takes inspiration from multi-agent learning and adversarial RL. To achieve a robust risk-averse agent, we combine the idea of adversarial learning with RA2Q. We assume two opposing learning processes: a protagonist that aims to maximize the expected reward and minimize its variance, and an adversary that aims to disturb the protagonist by minimizing the expected reward and maximizing its variance. We name this adversarial learning algorithm RA3Q; although it does not have a convergence guarantee, RA3Q empirically shows better robustness than RA2Q.
How do we measure the superiority of RL agents in trading markets? We use game theory and treat each agent as a player in a stochastic game. In empirical game theory, a meta-game payoff table can be seen as a combination of two matrices $(N | R)$: each row $N_i$ contains a discrete distribution of $p$ players over $k$ strategies, i.e. a profile $(n_{\pi_1}, \ldots, n_{\pi_k})$ indicating exactly how many players play each strategy, with $\sum_{j}n_{\pi_j} = p$; this defines a strategy profile $\mathbf{u} = \left(\frac{n_{\pi_1}}{p}, \ldots, \frac{n_{\pi_k}}{p}\right)$. Each row $R_i$ captures the rewards corresponding to the row $N_i$. Given a meta-game payoff table, one can visualize the dominance of different strategies by plotting a directional field over the payoff table, where arrows in the strategy space indicate the direction of flow of the population composition over the strategies [17].
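The objects involved are easy to enumerate for toy sizes; the values of $p$ and $k$ below are our own illustration, not those of any experiment in the post:

```python
from itertools import product
import numpy as np

p, k = 4, 3   # toy numbers: 4 players choosing among 3 strategies

# Each row of N is one way to distribute the p players over the k
# strategies; together the rows enumerate every population composition.
rows = [np.array(c) for c in product(range(p + 1), repeat=k)
        if sum(c) == p]

# The strategy profile associated with a row is its normalized form.
u = rows[0] / p
```

The matrix $R$ would then attach one empirically measured payoff vector to each of these rows.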
In our first experiment with the open-source ABIDES [5] market simulator, our setting consisted of one non-learning agent that replays the market deterministically [3] and several learning agents: RAQL, RA2Q, and RA2.1Q. The measure we use is the Sharpe ratio, a commonly used risk-adjusted measure in financial markets. The results are shown in the Figure below.
Our second experiment tested robustness. We first trained RA2Q and RA3Q agents under the same conditions. In the testing phase we then added two types of perturbations: an adversarial agent (trained within RA3Q) or noise (a.k.a. zero-intelligence) agents in the environment. In both cases, the agents act in a perturbed environment. The results presented in Table 1 show that RA3Q obtained better results than RA2Q, highlighting its robustness.
Algorithm/Setting | Adversarial Perturbation | ZI Agents Perturbation
RA2Q | 0.5269 | 0.9538
RA3Q | 0.9347 | 1.0692
We have argued that risk-awareness, variance reduction, and robustness are relevant characteristics for RL agents, and that the techniques delivering them can serve as building blocks for new algorithms. For example, combining utility functions, parallel training of Q-tables, and adversarial learning yields the different algorithms shown in Fig. 2.
Table 2 presents a summary of the properties of the algorithms mentioned in this post; those in bold typeface are our novel algorithms [6].
Algorithm | Risk-awareness | Variance reduction | Robustness
RAQL | ● | |
Averaged DQN | | ● |
VQL | | ● |
RARL | ● | | ●
RA2Q | ● | ● |
RA2.1Q | ● | ● |
RA3Q | ● | ● | ●
^{[1] } Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pages 176–185. PMLR, 2017.
^{[2] } Mojtaba Bagherzadeh, Nafiseh Kahani, and Lionel Briand. Reinforcement learning for test case prioritization. arXiv preprint arXiv:2011.01834, 2020.
^{[3] } Tucker Hybinette Balch, Mahmoud Mahfouz, Joshua Lockhart, Maria Hybinette, and David Byrd. How to evaluate trading strategies: Single agent market replay or multiple agent interactive simulation? arXiv preprint arXiv:1906.12010, 2019.
^{[4] } Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, 2020.
^{[5] } David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity market simulation for AI research. arXiv preprint arXiv:1904.12066, 2019.
^{[6] } Yue Gao, Kry Yik Chau Lui, and Pablo Hernandez-Leal. Robust risk-sensitive reinforcement learning agents for trading markets. In Reinforcement Learning for Real Life (RL4RealLife) Workshop at ICML, 2021.
^{[7] } Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
^{[8] } Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pages 242–250, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
^{[9] } Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
^{[10] } Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
^{[11] } Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip placement with deep reinforcement learning. arXiv preprint arXiv:2004.10746, 2020.
^{[12] } Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.
^{[13] } Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.
^{[14] } Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.
^{[15] } William F Sharpe. The Sharpe ratio. Journal of Portfolio Management, 21(1):49–58, 1994.
^{[16] } Yun Shen, Michael J Tobia, Tobias Sommer, and Klaus Obermayer. Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328, 2014.
^{[17] } Karl Tuyls, Julien Perolat, Marc Lanctot, Edward Hughes, Richard Everett, Joel Z Leibo, Csaba Szepesvári, and Thore Graepel. Bounds and dynamics for empirical game theoretic analysis. Autonomous Agents and Multi-Agent Systems, 34(1):1–30, 2020.
^{[18] } Martin J Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.
Nassim Abdi (NA):
Businesses have been talking about diversity and inclusion for a long time. Unfortunately, most of the time, they were doing little more than completing a list of ‘checkbox’ items. More recently, however, we have started to see companies thinking more clearly about the business case behind diversity and inclusion. Businesses are starting to realize that they are losing high quality talent and slowing their rate of innovation and creativity simply because they do not have diverse voices at the table. You can’t create the most innovative tool if you aren’t looking at it from all angles and from different perspectives.
The data proves it. Recent research shows that companies with an inclusive workplace enjoy a 42% increase in collaboration. Those with diverse teams report a 1.7x increase in innovation. And there is emerging data suggesting that a lack of diversity leads to a US$8 billion loss in productivity. Those are numbers businesses simply can't ignore.
(NA):
I think Microsoft’s experience with Tay demonstrated that AI is only as good as the data you put into it. Recent studies suggest that facial recognition tends to be much, much more accurate when it comes to white males than other groups – particularly women of colour – largely because there is much more test data for white males than other populations. So I would argue that bias is already in the system and in the data before we even start.
That is why it is so critically important that AI developers and researchers pay attention to this issue. You must be able to bring that diversity of views or that empathetic mindset approach that only comes from understanding other peoples’ perspectives and walking in their shoes. Otherwise, you are allowing your decisions to be influenced by implicit biases.
(NA):
It really comes down to the relationship between bias and privilege. As developers, you have the privilege of being the decisionmakers – you are the ones that control whether you are creating something that will ultimately be inclusive or exclusive. And that influences the ultimate structure of power. It’s not easy. But it starts with understanding that privilege and really checking our biases.
Machine learning models love binary decisions. Yet diversity does not lend itself well to binary thinking. How can we tackle this idea of ‘intersectionality’ as we think about our models?
That is one of the big challenges when it comes to how we use and define technology. For now, it really comes down to ensuring you have real diversity and inclusion in your teams and that you define standards that are more aligned with the human world we actually live in. That means the governance of these platforms is going to become much more important.
(NA):
The first thing is just being aware that we all have implicit bias. It’s human nature; for millions of years, humans were trained to avoid things that were unfamiliar. So the big question is how we can avoid it, particularly in a world of social media echo chambers. The real challenge here is to help people walk in someone else’s shoes – to really start to understand their situation and the world from their perspective.
(NA):
The first step for executives and business leaders is to be willing to address the problem. It isn’t always easy to change the status quo. The good news is that the new generation of workers is really starting to change the conversation. They care about where they work and the vision of the organization. They want to see the bigger picture and they want to have a positive impact.
One of the more successful approaches that businesses are adopting is ‘reverse mentorship’. As a practice, it’s been around for a while. But now we are seeing a lot of success from companies using reverse mentorship to create safe spaces for conversations about diversity, equity and inclusion.
(NA):
I believe the key is in helping people walk in other peoples’ shoes and get a real understanding of their perspectives and lived experiences. And that’s really the foundation point for the company I founded – StoryBolt. Simply put, we use the power of storytelling to help organizations unpack implicit bias, gender equality, mental health and more.
As a teacher, I realized very quickly that people learn better from visual stories. Different parts of the brain start working; language comprehension, visual cues and sounds all fire up the neurons. And it creates an experience that can stay with us for the rest of our lives. Rather than just show smart documentaries that unpack an issue, we then invite the filmmaker to come into the room for a Q&A and discussion on the issues. It’s amazing how that sparks a kind of awareness in people that simply does not go away.
Nassim Abdi, Ph.D., is the CEO and Co-founder of StoryBolt. She is a storyteller and evangelist on finding the intersection of entertainment and learning in the area of diversity, equity, and inclusion. She has 12 years of academic experience in the field of intersectionalities of gender, race, and other identities as they relate to systems of discrimination or disadvantage. Nassim is also the leading actress of the Netflix-featured film Secret Ballot (by Sony Pictures). Her vision for StoryBolt was shaped by the life-changing experience of the film as it engaged her in Q&A sessions and exposed her to the power of movies and how candid human connections could change perspectives and facilitate courageous conversations in the workplace.
If you've got an idea and the passion to explore what you can do with it, we want to help you solve it.
Let’s SOLVE it is a new Borealis AI mentorship program that aims to help undergraduate students use AI to make a difference and solve real problems in their communities. You bring the idea and the team, we’ll provide the industry exposure, mentorship, contacts and training you need to make the project a reality.
We're looking for teams of 3 to 5 undergraduate students (enrolled at Canadian universities) with ideas on how AI and ML could be used to tackle a specific community problem. With this program, we hope to reach as many communities as possible, so we're encouraging teams from across the country and from every walk of life to apply.
The mentorship program is free and will be conducted virtually. It runs from October 1st to December 2nd, so it falls within the school semester, and you’d need to allocate about 10 hours per week during that time.
In return, you’ll get all the support you need to turn your idea into a proofofconcept implementation. You’ll get valuable experience, skills and ecosystem contacts that will help you consider launching your career in AI. Think of it as a ‘fasttrack’ accelerator for your idea, and your skills and capabilities.
This program is open to all undergraduate students at all Canadian universities. You don’t need to be enrolled in a Computer Science program – each team member should have some basic programming knowledge, but specific experience using AI or ML isn’t necessary. What you do need, however, is a strong sense of curiosity, a passion for solving problems using AI and a burning desire to accelerate your personal development.
Let’s SOLVE it is one of a handful of initiatives that Borealis AI and the Royal Bank support in order to encourage diversity, skills development and innovation at Canadian universities. Along with initiatives like our Borealis AI Fellowships program (aimed at postgrad students) and our Internships program (focused on Masterslevel students), our goal is to help nurture the AI leaders of tomorrow.
If you are an undergraduate student with dreams of solving real problems in your community using AI, we want to help you get there. We look forward to seeing your team’s application!
In this final part, we discuss challenges with transformer training dynamics and introduce some of the tricks that practitioners use to get transformers to converge. This discussion is aimed at researchers who already understand the transformer architecture and are interested in training transformers and similar models from scratch.
Despite their broad applications, transformers are surprisingly difficult to train from scratch. One of the contributions of the original transformer paper was the combination of four tricks that collectively allow stable training:
1. Residual connections: Each transformer layer takes the $I\times D$ data matrix $\mathbf{X}$, where $I$ is the number of inputs and $D$ the dimensionality of those inputs, and returns an object of the same size. It performs the following operations:
\begin{eqnarray}
\mathbf{X} &\leftarrow& \mathbf{X} + \bf{MhSa}[\mathbf{X}] \nonumber \\
\mathbf{X} &\leftarrow& \bf{Layernorm}[\mathbf{X}] \hspace{3cm}\nonumber\\
\mathbf{x}_{i} &\leftarrow& \mathbf{x}_{i}+\bf{mlp}[\mathbf{x}_{i}] \hspace{3.6cm}\forall\; i\in\{1\ldots I\}\nonumber\\
\mathbf{X} &\leftarrow& \bf{Layernorm}[\mathbf{X}], \tag{1}
\end{eqnarray}
which include two residual connections around the multi-head self-attention $\bf{MhSa}[\bullet]$ and multi-layer perceptron $\bf{mlp}[\bullet]$ components (figure 1).
2. Layer normalization: After each residual connection, a layer normalization procedure is applied:
\begin{equation}
\bf Layernorm[\mathbf{X}] = \gamma\cdot \frac{\mathbf{X}-\mu}{\sigma}+\beta, \tag{2}
\end{equation}
where $\mu$ and $\sigma$ are the mean and standard deviation of the elements of $\mathbf{X}$ (but are separate for each member of the batch), and $\gamma$ and $\beta$ are learned parameters.
3. Learning rate warmup: The learning rate is increased linearly from $0$ to $R$ over the first $T_{R}$ time steps, so that:
\begin{equation}
\mbox{lr}[t] = R \cdot \frac{t}{T_{R}}. \tag{3}
\end{equation}
4. Adaptive optimizers: Transformers need to be trained with adaptive optimizers like Adam, which recursively estimates the momentum and an effective learning rate separately for each parameter at each time step. In practice, relatively large batch sizes of $>1{,}000$ are usually employed.
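The layer computation of equations 1 and 2 can be sketched together in NumPy. The sub-blocks are stand-in lambdas in the smoke test below, so this only illustrates the residual wiring and the normalization, not a trained model:

```python
import numpy as np

def layernorm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Equation 2, with mu and sigma computed over all elements of X
    (one such X per batch member); gamma and beta would be learned."""
    return gamma * (X - X.mean()) / (X.std() + eps) + beta

def transformer_layer(X, mhsa, mlp):
    """The post-LN residual structure of equation 1. `mhsa` and `mlp`
    are stand-ins for the multi-head self-attention and per-token MLP
    sub-blocks."""
    X = layernorm(X + mhsa(X))                        # first residual branch
    X = layernorm(X + np.stack([mlp(x) for x in X]))  # row-wise MLP branch
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # I=5 inputs, D=8 dims
out = transformer_layer(X, mhsa=lambda Z: Z, mlp=lambda x: x)
```

After the final normalization the output has roughly zero mean and unit standard deviation, regardless of what the sub-blocks produced.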
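Equation 3 translates directly into a schedule function. Holding the rate constant after warmup is our simplification; the original paper instead decays the rate after $T_R$ steps:

```python
def lr_schedule(t, R=1e-3, T_R=4000):
    """Equation 3: linear warmup from 0 to the target rate R over the
    first T_R steps, then (in this simplified sketch) hold R."""
    if t < T_R:
        return R * t / T_R
    return R
```

A training loop would simply call `lr_schedule(step)` before each parameter update.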
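A minimal sketch of one Adam update makes the per-parameter adaptivity concrete (standard textbook form; the toy gradients are ours):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and squared
    gradient (v) yield a separate effective step size per parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by 100x nevertheless receive
# nearly identical step sizes on the first update.
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, np.array([1.0, 100.0]), m, v, t=1)
```

This normalization of step sizes is exactly what compensates for the unbalanced query/key versus value gradients discussed later in the post.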
Removing any of these tricks makes training unstable and often leads to complete training failures. However, they have been employed without a full understanding of why they are required.
As transformers are applied more widely, it is increasingly important that we have a better understanding of transformer training. To this end, a number of recent papers have been devoted to demystifying this topic and exploring better training methods. In the rest of this blog post, we will connect these separate efforts to form a comprehensive overview of this topic.
In this section we review these tricks and see that there are complex dependencies between them: some of the tricks cause problems, which are in turn resolved by others. In the subsequent section we discuss improvements to the training process that follow from this understanding.
Learning rate warmup (in which the learning rate is gradually increased during the early stages of training) is particularly puzzling, since it is not required for most deep learning architectures. However, transformer training fails if we just start with a typical learning rate. If we start with a very small learning rate, training is stable but takes an impractically long time.
Xiong et al., 2020 explored this phenomenon by conducting experiments on a machine translation task with different optimizers and learning rate schedules. Their results (figure 2) show that learning rate warmup is essential for both Adam and SGD, and that the training process is sensitive to the warmup steps.
Although learning rate warmup works, it has some obvious disadvantages: it introduces an extra hyperparameter (the number of warmup steps), and it initializes the learning rate to zero, which slows training down. Hence, it's important that we understand why it is necessary.
To help answer this question, Huang et al., 2020 visualized the gradient of the loss $\mathcal{L}$ with respect to the input embeddings $\mathbf{X}$, and the size of the Adam updates during the first 100 steps of training (figure 3). They found that without warmup, the gradients vanish very quickly, and the Adam updates also rapidly become much smaller. Diminishing gradients at lower layers in the transformer model without warmup have also been observed by Liu et al., 2020.
To understand why learning rate warmup is required, and why the gradients vanish without it, we will first need to understand the reasons for, and the consequences of using residual connections, layer normalization, and Adam.
Residual networks were developed in computer vision; they make networks easier to optimize and allow deeper networks to be trained. In computer vision, the additive residual connections are usually placed around convolutional layers and combined with batch normalization. In the transformer, they are placed around the self-attention and feed-forward networks and combined with layer normalization (figure 1). From this perspective, the transformer architecture could be considered a "deep residual self-attention network".
Zhang et al., 2019 show that the output variance of residual networks grows exponentially with depth. Hence, normalization is used to prevent gradient explosion for deep residual networks. Layer normalization is used in the transformer because the statistics of language data exhibit large fluctuations across the batch dimension, and this leads to instability in batch normalization.
Transformers also differ from convolutional networks in that stochastic gradient descent does not work well for training them (figure 2), and adaptive optimizers like Adam are required. Liu et al., 2020 observed that differentiating through the self-attention mechanism creates unbalanced gradients. In particular, the gradients for the query $\boldsymbol\Phi_{q}$ and key $\boldsymbol\Phi_{k}$ parameters are much smaller than those for the value parameters $\boldsymbol\Phi_{v}$, so the former change much more slowly. This is a direct consequence of the mathematical expression for self-attention. The Adam optimizer fixes this problem by effectively assigning a different learning rate to each parameter.
To conclude, we've seen that residual connections are needed to train deep networks. These cause gradient explosion, which is resolved by layer normalization. The self-attention computation causes unbalanced gradients, which necessitates the use of Adam (figure 4). In the next section, we'll see that layer normalization and Adam themselves cause further problems, which ultimately create the need for learning rate warmup.
Xiong et al., 2020 found that the magnitude of the gradients through layer normalization is inversely proportional to the magnitude of the input. Specifically, the gradient has the following property:
\begin{equation}
\left\lVert \frac{\partial \bf Layernorm[\mathbf{X}]}{\partial \mathbf{X}} \right\rVert=\mathcal{O}\left(\frac{\sqrt{D}}{\lVert\mathbf{X}\rVert}\right), \tag{4}
\end{equation}
where $\mathbf{X}$ is the input to layer normalization and $D$ is the embedding dimension. If the input norm $\lVert \mathbf{X} \rVert$ is larger than $\sqrt{D}$ then backpropagating through layer normalization reduces the gradient magnitude in lower layers. As this effect compounds through multiple layers, it causes the gradient to vanish at lower layers for deep models. We will term this the gradient shrinkage effect.
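The inverse dependence on $\lVert\mathbf{X}\rVert$ in equation 4 is easy to check numerically. The finite-difference sketch below is ours and uses the $\gamma=1$, $\beta=0$ version of layer normalization; since that map is scale-invariant, doubling the input norm should halve the gradient through it:

```python
import numpy as np

rng = np.random.default_rng(0)

def layernorm(X):
    # gamma=1, beta=0 version of equation 2
    return (X - X.mean()) / X.std()

def directional_grad_norm(X, V, eps=1e-6):
    """Finite-difference size of layer norm's response to a small
    perturbation of its input along the fixed direction V."""
    return np.linalg.norm(layernorm(X + eps * V) - layernorm(X)) / eps

X = rng.normal(size=(4, 16)) + 5.0        # an input with a large norm
V = rng.normal(size=X.shape)
V /= np.linalg.norm(V)

g = directional_grad_norm(X, V)
g_scaled = directional_grad_norm(2.0 * X, V)
# Doubling ||X|| halves the gradient through layer norm, matching the
# 1/||X|| dependence in equation 4.
```

The same shrinkage, compounded over many layers, is what starves the lower layers of gradient signal.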
Layer normalization also causes unbalanced dependencies between the two branches of the residual connection around the self-attention module. In other words, the output of $\bf LayerNorm[\mathbf{X}+\bf Sa[\mathbf{X}]]$ depends much more on the self-attention computation $\bf Sa[\mathbf{X}]$ than on the skip connection $\mathbf{X}$. This means that the outputs depend much more on later layers than on earlier layers. Liu et al., 2019 show that this happens empirically in practice.
Moreover, they show that this leads to amplified output perturbations; small changes to the network parameters cause large output fluctuations. More precisely, they proved that for a transformer network $\bf T_{N}[\mathbf{X},\boldsymbol\theta]$ with parameters $\boldsymbol\theta$, the output variance scales with the number of layers $N$ when we randomly perturb the parameters to $\boldsymbol\theta^{*} = \boldsymbol\theta+\boldsymbol\delta$:
\begin{equation}
\mbox{Var}\left[\bf T_{N}[\mathbf{X};\boldsymbol\theta] - \bf T_{N}[\mathbf{X};\boldsymbol\theta^{*}]\right] = \mathcal{O}(N). \tag{5}
\end{equation}
They also show that this happens empirically with both random parameter changes and Adam updates (figure 5). The result is that the output changes more and more when we update the parameters, which destabilizes transformer training.
Furthermore, using adaptive optimizers like Adam aggravates both the gradient shrinkage effect and the amplified output perturbations: Liu et al., 2019 show that the variance of the Adam updates is unbounded at the start of training, so the updates exhibit high variance in the early stages.
This can lead to problematically large updates early on, which can make the input norm $\lVert \mathbf{X} \rVert$ to each layer increase as we move through the network, and thus to increased gradient shrinkage as predicted by equation 4. Moreover, the output fluctuation, already amplified by the network structure, will be even greater for these large parameter updates.
To summarize, residual connections are required in the transformer architecture to ease optimization, which in turn requires layer normalization to avoid gradient explosion and adaptive optimizers like Adam to address the unbalanced gradients in the self-attention blocks. On the flip side, layer normalization causes the gradients to shrink in the early layers and also amplifies output perturbations. Moreover, the instability of Adam in the early stages of training exacerbates both of these effects (figure 6).
This is where learning rate warmup comes in: it effectively stabilizes the Adam updates during the early stages of training by making the parameter changes much smaller. Consequently, Adam no longer aggravates gradient shrinkage and amplification of output perturbations and training becomes relatively stable.
In the previous section, we argued that the transformer architecture and the statistics of language data require us to use layer normalization and to train with adaptive optimizers like Adam. These choices in turn cause other problems, which are resolved by learning rate warmup. In this section, we consider alternative methods for training deep transformers that don't require learning rate warmup.
We'll consider three approaches that respectively remove the normalization from the network, attempt to rebalance the dependency on the two paths of the residual networks, and reduce the variance of the optimizer updates.
Since both gradient shrinkage and unbalanced dependencies are directly connected to layer normalization, it is natural to ask whether we can train deep transformer models without this step. It has indeed been demonstrated that this is possible, and that we can achieve even better generalization without layer normalization.
Recall that the normalization mechanism is introduced to prevent gradients from exploding in deep residual networks. It follows that if we can stabilize the gradient updates $\Delta \boldsymbol\theta$, then we can remove layer normalization. Zhang et al., 2019 demonstrated that the gradient updates $\Delta \boldsymbol\theta$ can be bounded when using the SGD optimizer to train residual MLP or convolution blocks by appropriately initializing the weights. Based on this work, Huang et al., 2020 derived an analogous initialization scheme for residual self-attention blocks.
Although the theoretical derivations are for SGD updates, these results hold well for adaptive optimizers like Adam in practice. Furthermore, it follows from the Taylor expansion:
\begin{equation}\label{eq:taylor}
\Delta\mathbf{T}[\mathbf{X},\boldsymbol\theta] \approx \frac{\partial \mathbf{T}[\mathbf{X},\boldsymbol\theta]}{\partial \boldsymbol\theta} \Delta \boldsymbol\theta, \tag{6}
\end{equation}
that the output fluctuation $\Delta\mathbf{T}[\mathbf{X}]$ is also bounded by bounding the gradient updates $\Delta\boldsymbol\theta$. As a result, both the gradient vanishing and the amplified output perturbations are resolved with stable gradient updates.
The proposed initialization scheme is known as T-Fixup and is easy to implement. Consider a multi-head self-attention block where the $h^{th}$ head computes
\begin{equation}
{\bf Sa}_{h}[\mathbf{X}] ={\bf Softmax}\left[\frac{(\mathbf{X}\boldsymbol\Phi_{qh})(\mathbf{X}\boldsymbol\Phi_{kh})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{vh}, \tag{7}
\end{equation}
where $\mathbf{X}$ is the input data matrix containing the word embeddings in its rows, and $\boldsymbol\Phi_{qh}$, $\boldsymbol\Phi_{kh}$ and $\boldsymbol\Phi_{vh}$ are the weight parameters for the queries, keys, and values respectively. The outputs of these self-attention mechanisms are concatenated, and another linear transform $\boldsymbol\Phi_{c}$ is applied to combine them:
\begin{equation}
{\bf MhSa}[\mathbf{X}] = \left[{\bf Sa}_{1}[\mathbf{X}]\;{\bf Sa}_{2}[\mathbf{X}]\;\ldots\;{\bf Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}. \tag{8}
\end{equation}
The T-Fixup scheme for encoder-decoder attention is then as follows:
In practice, T-Fixup is able to train significantly deeper transformer models with improved performance on the task of machine translation. For the detailed derivation of this method, we refer the reader to the original paper.
An alternative approach is to balance the residual dependencies, which in turn will limit the output perturbations $\Delta \bf T[\mathbf{X}]$. Equation 6 shows that controlling the magnitude of the output fluctuation $\Delta \bf T[\mathbf{X}]$ also bounds the magnitude of the gradient updates $\Delta \boldsymbol\theta$, which in turn mitigates the problem of gradient vanishing. Here we'll consider three possible approaches.
Pre-LN transformers: One simple solution is to change the location of layer normalization inside the transformer layer so that it occurs inside the residual blocks, before the self-attention or MLP (figure 7). This is known as the pre-LN transformer. This simple change helps control the gradient magnitude and balance the residual dependencies.
Pre-LN transformer models can be trained without learning rate warmup. However, they also exhibit inferior empirical performance. It has been speculated that this is because the models are now restricted from depending too much on the contents of their residual layers (Liu et al., 2020).
Admin: To bridge this performance gap, adaptive model initialization or Admin aims to bound the output fluctuation $\Delta \bf T[\mathbf{X}]$ by controlling the residual dependencies while retaining the original architecture.
Admin adds a new $1\times D$ parameter vector $\boldsymbol\psi$ to each residual block. The self-attention block is then constructed as $\bf LayerNorm[\mathbf{X} \odot \boldsymbol\Psi + \bf MhSa[\mathbf{X}]]$, where $\odot$ is the elementwise product and $\boldsymbol\Psi$ is an $I\times D$ matrix in which each row is a copy of $\boldsymbol\psi$. The residual connection around the parallel MLP layer is treated in the same way (figure 8a).
The new parameters at the $n^{th}$ layer are initialized to be the output standard deviation at that layer before this intervention. This can be estimated by setting all elements of $\boldsymbol\psi$ to one and forward propagating on a few training instances.
ReZero: In a similar vein, ReZero removes layer normalization and introduces a single trainable parameter $\alpha$ per residual layer, so that the residual layer for the self-attention block becomes $\mathbf{X} + \alpha\,{\bf MhSa}[\mathbf{X}]$, where $\alpha$ is initialized to zero (figure 8b). The result is that the entire network is initialized to compute the identity function, and the contributions of the self-attention and MLP layers are introduced gradually and adaptively.
Empirically, both Admin and ReZero work well for training deeper transformer models with better generalization performance, which demonstrates the effectiveness of balancing the residual dependencies.
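The ReZero rule is simple enough to capture in a few lines. Below is a minimal numpy sketch; the `rezero_block` helper, the tanh sub-layer stand-in, and the dimensions are all illustrative, not part of the original method's code:

```python
import numpy as np

np.random.seed(0)

def rezero_block(x, sublayer, alpha):
    """ReZero residual layer: x + alpha * Sublayer(x), with no layer
    normalization; alpha is a trainable scalar initialized to zero."""
    return x + alpha * sublayer(x)

x = np.random.randn(4, 8)            # 4 tokens, embedding size 8
W = np.random.randn(8, 8)
sublayer = lambda v: np.tanh(v @ W)  # stand-in for MhSa or the MLP

out0 = rezero_block(x, sublayer, alpha=0.0)  # at init: exactly the identity
out1 = rezero_block(x, sublayer, alpha=0.5)  # later: sub-layer contributes
```

At initialization (`alpha = 0`), `out0` equals `x` exactly, which is what makes the network trainable without warmup: gradients flow through the identity path unimpeded.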
We noted before that the high variance of learning rates in the Adam optimizer at the early stages of training exacerbates the problems of amplified output perturbations and gradient vanishing. Liu et al. (2019) argue that this is due to the lack of samples in the early stages of learning. They base their argument on an experiment in which they do not change the model parameters or the momentum term of Adam for the first 2000 learning steps, but only adapt the learning rate. After this, warmup is no longer required.
Based on these observations, they propose Rectified Adam or RAdam which gradually changes the momentum term over time in a way that helps avoid high variance. One way to think of this is that we have effectively incorporated learning rate warmup into the Adam algorithm, but in a principled way.
In the previous sections, we have seen that great progress has been made towards understanding transformer training. Several solutions have been proposed that allow the training of significantly deeper transformer models with improved empirical performance.
However, they have only been applied to tasks with sufficient training data, such as machine translation and language modelling. This is possibly due to the commonly-held belief that training deep transformers from scratch requires large datasets. For small datasets, it is typical just to add shallow and simple additional layers (e.g., a classifier head) to pretrained models and then fine-tune.
So, what prevents practitioners from training deep transformers on small datasets? It turns out that the final missing piece of the puzzle is the batch size. For small datasets, it's necessary to leverage large pretrained models and then fine-tune. However, the size of these models limits the batch size, and when the batch size is small, the variance of the updates is even larger, which makes training even harder. Even if we could use a larger batch size, it usually results in poorer generalization, especially on small datasets.
In short, small datasets require pretrained models and small batch sizes to perform well, but these two requirements make training additional transformer layers challenging. To resolve the high variance of training updates with small batch sizes, the three ideas from the previous section can all be applied. However, these approaches all assume that the inputs to the transformer layers are randomly initialized embeddings, which is not true when we are adding yet-to-be-trained transformer layers on top of pretrained models (figure 9).
DT-Fixup is a data-dependent initialization strategy developed by Borealis AI. It adapts the T-Fixup method for this type of mixed setting. DT-Fixup allows significantly deeper transformers to be trained on small datasets for challenging tasks such as Text-to-SQL semantic parsing and logical reading comprehension. This demonstrates that training deep transformers with small datasets is feasible with the correct optimization procedure.
In the first two parts of this blog, we introduced the transformer, and discussed extensions and relations to other models. In this final part, we have discussed the complex topic of how to train deep transformer models effectively. We hope that this discussion will help practitioners to train deeper transformers for different applications and help identify potential directions for further improving transformer training.
In this final post, we consider probabilistic context-free grammars or PCFGs, which are a special case of WCFGs. They are featured more than WCFGs in the earlier statistical NLP literature and in most teaching materials. As the name suggests, they replace the rule weights with probabilities. We will treat these probabilities as model parameters and describe algorithms to learn them for both the supervised and the unsupervised cases. The latter is tackled by expectation maximization and leads us to develop the inside-outside algorithm, which computes the expected rule counts that are required for the EM updates.
PCFGs are the same as WCFGs except that the weights are constrained; the weights of all rules with the same nonterminal on the left-hand side must be nonnegative and sum to one:
\begin{equation}
\sum_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha] = 1. \tag{1}
\end{equation}
For example, we might have three rules with VP on the left-hand side: $\text{VP} \rightarrow \text{NP}$, $\text{VP} \rightarrow \text{NN}$ and $\text{VP} \rightarrow \text{PN}$. For a PCFG, this implies that:
\begin{equation}
\mbox{g}[\text{VP} \rightarrow \text{NP}]+\mbox{g}[\text{VP} \rightarrow \text{NN}]+\mbox{g}[\text{VP} \rightarrow \text{PN}]= 1. \tag{2}
\end{equation}
The rule weights are now probabilities and the weight $\mbox{G}[T]$ of an entire tree is the product of these probabilities. The tree weight $\mbox{G}[T]$ can hence be interpreted as the probability $Pr(T)$ of the tree:
\begin{eqnarray}\label{eq:tree_like}
Pr(T) &=& \mbox{G}[T]\nonumber \\
&=& \prod_{t\in T} \mbox{g}[T_{t}]\nonumber \\
&=& \prod_{A}\prod_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha]^{\mbox{f}_{\text{A}\rightarrow\alpha}[T]} \tag{3}
\end{eqnarray}
where the function $f_{\text{A} \rightarrow \alpha}[T]$ counts the number of times $\text{A} \rightarrow \alpha$ appears in tree $T$.
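Equations 1 and 3 can be made concrete with a few lines of Python. The toy grammar, the rule encoding as `(lhs, rhs)` tuples, and the counts below are illustrative, not taken from the text:

```python
import math

# A toy PCFG: rule probabilities sharing a left-hand side sum to one (eq. 1)
g = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("I",)):   0.6,
    ("NP", ("him",)): 0.4,
    ("VP", ("V", "NP")): 1.0,
    ("V",  ("saw",)): 1.0,
}
lhs_sums = {}
for (A, alpha), p in g.items():
    lhs_sums[A] = lhs_sums.get(A, 0.0) + p
assert all(abs(s - 1.0) < 1e-9 for s in lhs_sums.values())

# Equation 3: Pr(T) is the product of rule probabilities, each raised to the
# count f[A -> alpha] of how often the rule is used in the tree
f = {("S", ("NP", "VP")): 1, ("NP", ("I",)): 1, ("VP", ("V", "NP")): 1,
     ("V", ("saw",)): 1, ("NP", ("him",)): 1}
prob_T = math.prod(g[r] ** c for r, c in f.items())  # 0.6 * 0.4 ≈ 0.24
```

Here `prob_T` is the probability of the tree for "I saw him" under this grammar.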
PCFGs define a joint distribution $Pr(T,\mathbf{w})$ of trees $T$ and sentences $\mathbf{w}$. The probability of a sentence $Pr(\mathbf{w})$ can be computed through marginalization:
\begin{equation}\label{eq:parse_marginal}
Pr(\mathbf{w}) = \sum_{T\in \mathcal{T}[\mathbf{w}]} Pr(\mathbf{w}, T), \tag{4}
\end{equation}
where $\mathcal{T}[\mathbf{w}]$ is the set of all parse trees that are compatible with the observed sentence $\mathbf{w}$.
The conditional probability of the sentence $Pr(\mathbf{w}|T)$ given the tree is just $1$, because the tree yields $\mathbf{w}$ (i.e., the tree deterministically produces the words). It follows that the joint distribution is:
\begin{equation}\label{eq:parsing_joint}
Pr(T,\mathbf{w}) = Pr(\mathbf{w}|T) Pr(T) = Pr(T). \tag{5}
\end{equation}
However, the conditional probability $Pr(T|\mathbf{w})\neq 1$ in general. When a sentence is ambiguous, there are multiple trees that can produce the same words. For PCFGs, the weighted parsing algorithm returns the tree with the maximum conditional probability, and the inside algorithm returns the marginal probability of the sentence (equation 4).
PCFGs are a generative approach to syntactic analysis in that they represent joint distributions over sentences and parse trees. They can also be used to sample random sentences: we start by drawing a sample from all of the rules $\text{S} \rightarrow \alpha$ with the start token $\text{S}$ on the left-hand side, according to the probabilities $g[\text{S}\rightarrow \alpha]$. For example, we might draw $\text{S}\rightarrow\text{NP VP}$. Then we draw samples from rules with $\text{NP}$ and $\text{VP}$ on the left-hand side respectively. The process continues until it draws terminal symbols (i.e., words) in every branch of the tree and no nonterminals remain at the leaves.
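This sampling procedure is easy to implement recursively. The following is a minimal sketch with an illustrative toy grammar (the `rules` encoding and `sample` helper are our own, not from a library):

```python
import random

# Toy PCFG: each nonterminal maps to (right-hand sides, probabilities)
rules = {
    "S":  ([("NP", "VP")], [1.0]),
    "NP": ([("I",), ("him",)], [0.6, 0.4]),
    "VP": ([("V", "NP")], [1.0]),
    "V":  ([("saw",)], [1.0]),
}
nonterminals = set(rules)

def sample(symbol):
    """Recursively expand symbol until only terminals (words) remain."""
    if symbol not in nonterminals:
        return [symbol]
    rhs_options, probs = rules[symbol]
    rhs = random.choices(rhs_options, weights=probs)[0]
    return [word for s in rhs for word in sample(s)]

random.seed(0)
words = sample("S")  # e.g. ['I', 'saw', 'him']
```

Every derivation of this toy grammar yields a three-word sentence of the form NP "saw" NP, so the middle word is always "saw".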
We now turn our attention to learning the rule probabilities for a PCFG. In this section, we'll consider the supervised case, where we have a treebank (i.e., a set of sentences annotated with trees). We'll show that we can estimate the probabilities using a simple counting procedure. In subsequent sections, we'll consider the unsupervised case, where we must estimate the weights based on the sentences alone.
To learn the parameters of a PCFG from a treebank, we optimize the total log likelihood $L$ of the $I$ observed trees:
\begin{equation}
L = \sum_{i=1}^{I}\sum_{A} \sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}]\log[\mbox{g}[\text{A} \rightarrow \alpha]], \tag{6}
\end{equation}
with respect to the rule probabilities $g[\text{A} \rightarrow \alpha]$. The first sum is over the training examples, the second over the left-hand sides of the rules, and the third over the right-hand sides. The function $f_{\text{A} \rightarrow \alpha}[T_{i}]$ counts the number of times rule $\text{A} \rightarrow \alpha$ appears in tree $T_{i}$.
To ensure that the result of this optimization process is a valid PCFG, we must also add a set of constraints that ensure that all rules with the same left-hand side sum to one:
\begin{equation}
\sum_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha] = 1\hspace{1cm}\forall \mbox{A}\in\mathcal{V}.\tag{7}
\end{equation}
Taking just the terms with $\text{A}$ on the left-hand side, and adding a Lagrange multiplier that enforces this constraint, we have:
\begin{equation}
L_{A}^{\prime} = \sum_{i=1}^{I}\sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}]\log[\mbox{g}[\text{A} \rightarrow \alpha]]+\lambda\left(\sum_{\alpha}\mbox{g}[\text{A} \rightarrow \alpha] - 1\right). \tag{8}
\end{equation}
We then take derivatives with respect to each rule $g[\text{A} \rightarrow \alpha]$ and the Lagrange multiplier $\lambda$ and set the resulting expressions to zero. This yields a set of equations which can be rearranged to provide the maximum likelihood estimator for a given rule $\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$:
\begin{equation}\label{eq:treebank_ml}
\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] = \frac{\sum_i^I f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]}{\sum_i^I \sum_{\alpha} f_{\text{A} \rightarrow \alpha}[T_i]}, \tag{9}
\end{equation}
where the numerator counts the number of times $\text{A}$ is rewritten to $\text{B}\,\text{C}$ and the denominator counts the number of times it is rewritten to anything. See Chi and Geman (1998) for further details.
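For completeness, the omitted derivative step can be sketched as follows. Setting the derivative of the Lagrangian with respect to a single weight to zero gives

\begin{equation*}
\frac{\partial L_{A}^{\prime}}{\partial \mbox{g}[\text{A}\rightarrow\alpha]} = \frac{\sum_{i=1}^{I} f_{\text{A}\rightarrow\alpha}[T_{i}]}{\mbox{g}[\text{A}\rightarrow\alpha]} + \lambda = 0
\quad\Longrightarrow\quad
\mbox{g}[\text{A}\rightarrow\alpha] = -\frac{1}{\lambda}\sum_{i=1}^{I} f_{\text{A}\rightarrow\alpha}[T_{i}],
\end{equation*}

and substituting this back into the constraint $\sum_{\alpha}\mbox{g}[\text{A}\rightarrow\alpha]=1$ fixes $-\lambda = \sum_{i=1}^{I}\sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}]$, which recovers equation 9.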
We'll now provide some code snippets that use a treebank to estimate the parameters of a PCFG using the above method. In treebanks, the constituency trees are usually represented in a bracket notation. For example, the legendary first sentence from the Penn Treebank is "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.", with the associated tree:
( (S
(NPSBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PPCLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NPTMP (NNP Nov.) (CD 29) )))
(. .) ))
Throughout this section, we'll provide some Python code snippets that use the NLTK (natural language toolkit) package. We'll show how to estimate the rule weights from annotated data using NLTK, and then we'll take a look inside the code to see how it is implemented.
NLTK has the convenient Tree class to make exploration easier:
from nltk.tree import Tree
t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
To extract a grammar that is useful for parsing, we need to convert the trees to Chomsky normal form:
t.chomsky_normal_form()
We then extract grammar rules from the entire treebank:
productions = []
# Go over all tree strings
for line in treebank:
    tree = Tree.fromstring(line)
    tree.collapse_unary(collapsePOS=False)
    tree.chomsky_normal_form(horzMarkov=2)
    productions += tree.productions()
Finally, we learn the weights:
from nltk import Nonterminal
from nltk import induce_pcfg
S = Nonterminal('S')
grammar = induce_pcfg(S, productions)
Now let's peek into NLTK version 3.6 to see how these estimates are computed:
# pcount: the number of times a given production occurs
pcount = {}
# lcount: the number of times a given left-hand side occurs
lcount = {}
for prod in productions:
    lcount[prod.lhs()] = lcount.get(prod.lhs(), 0) + 1
    pcount[prod] = pcount.get(prod, 0) + 1
prods = [
    ProbabilisticProduction(p.lhs(), p.rhs(),
                            prob=pcount[p] / lcount[p.lhs()])
    for p in pcount
]
As expected, this just directly implements equation 9. Given the parameters of the model, we can parse a sentence under the induced grammar using the CYK weighted parsing algorithm:
from nltk.parse import ViterbiParser
sent = ['How', 'much', 'is', 'the', 'fish', '?']
parser = ViterbiParser(grammar)
parse = parser.parse(sent)
In the previous section, we showed that it is relatively easy to estimate the parameters (rule weights) of a PCFG when we are given a set of sentences with the associated trees. Now we turn our attention to the unsupervised case. Here, we are given the sentences, but not the trees. This is a chicken-and-egg problem: if we knew the rule weights, then we could perform weighted parsing to find the best fitting trees; likewise, if we knew the trees, we could estimate the rule weights using the supervised approach above.
In the next section, we'll introduce some notation and define the cost function for parameter estimation in the unsupervised case. Then we'll describe two algorithms to optimize this cost function. First, we'll introduce Viterbi training, which alternates between finding the best trees for fixed rule weights and updating the rule weights based on these trees. Then we'll introduce an expectation maximization (EM) algorithm that follows a similar approach but takes into account the ambiguity of the parsing procedure.
Our goal is to calculate the rule weights $\mbox{g}[\text{A}\rightarrow\text{BC}]$. To make the notation cleaner, we'll define the vector $\boldsymbol\theta$ to contain the log rule weights, so that the element $\theta_{A \rightarrow BC}$ contains $\log \left[\mbox{g}[A \rightarrow BC]\right]$. We aim to maximize the joint probability of the observed sentences and the associated trees with respect to these parameters:
\begin{equation}
\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right] \tag{10}
\end{equation}
where $i$ indexes $I$ training examples, each consisting of sentence $\mathbf{w}_{i}$ and associated parse tree $T_{i}$.
Unfortunately, in the unsupervised learning case, we don't know the associated parse trees. There are two possible solutions to this problem. In Viterbi training, we will simultaneously maximize the likelihood with respect to the parameters $\boldsymbol\theta$ and with respect to the choice of tree $T_{i}$ from the set of valid trees $\mathcal{T}[\mathbf{w}_{i}]$ for each training example $i$:
\begin{equation}\label{eq:parse_viterbi}
\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[ \sum_i^I
\log\left[\underset{T_{i}\in\mathcal{T}[\mathbf{w}_{i}]}{\text{max}}\left[Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\right]. \tag{11}
\end{equation}
In the expectation maximization approach, we will marginalize over all the possible trees:
\begin{equation}
\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{ T_{i}\in\mathcal{T}[\mathbf{w}_{i}]}Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]. \tag{12}
\end{equation}
To summarize, Viterbi training finds the parameters that maximize the joint likelihood of the sentences and their highest probability parses, while the expectation maximization approach searches for the parameters that maximize the marginal likelihood of the sentences.
Viterbi training optimizes the cost function in equation 11 by alternately maximizing over the fitted trees and the parameters. More precisely, we first initialize the parameters $\boldsymbol\theta$ to random values. Then we alternate between two steps: (i) with the parameters fixed, run the weighted parsing algorithm to find the highest probability tree $T_{i}$ for each sentence, and (ii) with the trees fixed, update the parameters as in the supervised case:
\begin{equation}
\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[ \frac{\sum_i^I f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]}{\sum_i^I \sum_\alpha f_{\text{A} \rightarrow \alpha}[T_i]}\right], \tag{13}
\end{equation}
where the function $f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]$ counts the number of times $\text{A} \rightarrow \text{B} \text{C}$ appears in tree $T_i$ and $\sum_\alpha f_{\text{A} \rightarrow \alpha}[T_i]$ counts the number of times that $\text{A}$ is rewritten to anything in tree $T_{i}$.
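The update in equation 13 is just counting followed by normalization. Below is a minimal sketch in which each parsed tree is represented as a dictionary of rule counts; the `update_log_weights` helper and the data encoding are illustrative:

```python
import math

def update_log_weights(trees):
    """Equation-13 style update: trees is a list of rule-count dicts, each
    mapping (lhs, rhs) -> number of occurrences of that rule in the tree."""
    rule_totals, lhs_totals = {}, {}
    for counts in trees:
        for (A, rhs), f in counts.items():
            rule_totals[(A, rhs)] = rule_totals.get((A, rhs), 0) + f
            lhs_totals[A] = lhs_totals.get(A, 0) + f
    # theta[A -> rhs] = log( count(A -> rhs) / count(A -> anything) )
    return {r: math.log(c / lhs_totals[r[0]]) for r, c in rule_totals.items()}

# Two parsed 'trees': NP rewrites to 'I' three times and to 'him' once
trees = [{("NP", ("I",)): 2, ("NP", ("him",)): 1},
         {("NP", ("I",)): 1}]
theta = update_log_weights(trees)  # exp(theta[NP -> I]) = 3/4
```

Exponentiating the resulting log weights recovers the relative-frequency estimates of equation 9.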
Now let's turn our attention to the expectation maximization (EM) approach, which is somewhat more complicated. Recall that we want to maximize the cost function:
\begin{equation}
\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{ T_{i}\in\mathcal{T}[\mathbf{w}_{i}]}Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]. \tag{14}
\end{equation}
We'll present a brief overview of the EM method in an abstract form. Then we'll map this to our use case. We'll find expressions for the E-Step and the M-Step respectively. The expression for the M-Step will contain terms that are expected rule counts. We'll show that we can find these by taking the derivative of the log partition function. This will finally lead us to the inside-outside algorithm, which computes these counts.
The EM algorithm is a general tool for maximizing cost functions of the form:
\begin{equation}
\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{h_{i}}Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right], \tag{15}
\end{equation}
where $h_{i}$ is a hidden (i.e., unobserved) variable associated with the observations $\mathbf{w}_{i}$.
The EM algorithm alternates between the expectation step (E-Step) and the maximization step (M-Step). In the E-Step, we consider the parameters $\boldsymbol\theta$ to be fixed and compute the posterior distribution $Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta)$ over the hidden variables $h_{i}$ for each of the $I$ examples using Bayes' rule:
\begin{equation}
Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta) = \frac{Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)}{\sum_{h_{i}}Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)}. \tag{16}
\end{equation}
In the M-Step, we update the parameters using the rule:
\begin{eqnarray}
\boldsymbol\theta &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \sum_{h_{i}}Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta^{old})\log\left[Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\nonumber \\
&=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{h_{i}}\left[\log\left[Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\right] \tag{17}
\end{eqnarray}
where the expectation is calculated with respect to the distribution that we computed in the E-Step, and $\boldsymbol\theta^{old}$ denotes the previous parameters that were used in the E-Step.
It is not obvious why the EM update rules maximize the cost function in equation 15, but this discussion is beyond the scope of this tutorial. For more information consult chapter 7 of this book.
Now let's map the EM algorithm back to our use case. Here, the unknown underlying parse tree $T_{i}$ is the hidden variable, and for our case we also have $Pr(\mathbf{w}_{i},T_{i}) = Pr(T_{i})$ (equation 5). This gives us the E-Step:
\begin{eqnarray}
Pr(T_{i}|\mathbf{w}_{i},\boldsymbol\theta) = \frac{Pr(T_{i}|\boldsymbol\theta)}{\sum_{T_{i}\in\mathcal{T}[\mathbf{w}_{i}]}Pr(T_{i}|\boldsymbol\theta)}, \tag{18}
\end{eqnarray}
and M-Step:
\begin{eqnarray}\label{eq:tree_m_step}
\boldsymbol\theta^{t+1} &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\log\left[Pr(T_{i}|\boldsymbol\theta)\right]\right]\right], \tag{19}
\end{eqnarray}
where this expectation is with respect to the posterior distribution $Pr(T_{i}|\mathbf{w}_{i},\boldsymbol\theta^{old})$ that was calculated in the E-Step. The maximization in the M-Step is subject to the usual PCFG constraints that the weights of all the rules mapping from a given nonterminal $\text{A}$ must sum to one:
\begin{equation}\label{eq:tree_m_step_constraint}
\sum_{\alpha}\exp\left[\theta_{A\rightarrow\alpha}\right] = 1. \tag{20}
\end{equation}
We'll now consider each of those steps in more detail.
Recall that in the E-Step we wish to compute the posterior distribution $Pr(T|\mathbf{w},\boldsymbol\theta)$ over the parse trees $T$ for each sentence $\mathbf{w}$, given the current parameters $\boldsymbol\theta$:
\begin{eqnarray}\label{eq:Estep_parsing}
Pr(T|\mathbf{w},\boldsymbol\theta) = \frac{Pr(T|\boldsymbol\theta)}{\sum_{T\in\mathcal{T}[\mathbf{w}]}Pr(T|\boldsymbol\theta)} \tag{21}
\end{eqnarray}
For a PCFG, the conditional probability in the numerator is just the product of the weights in the tree and the denominator is the partition function $Z$. Let's derive an expression for the numerator:
\begin{align}
\tag{definition of cond. prob.}
Pr(T|\boldsymbol\theta) &=\prod_{t \in T} \mbox{g}[T_t] \\
\tag{exp of log}
&= \exp \left[\sum_{t \in T} \log \left[\mbox{g}[T_t]\right] \right] \\
\tag{parameter notation}
&= \exp \left[\sum_{t \in T} \theta_{T_t} \right] \\
\tag{rule counts}
&= \exp \left[\sum_{r \in \mathcal{R}} f_r[T] \cdot \theta_{r} \right]\\
\tag{vectorized}
&= \exp \left[\mathbf{f}^{T} \boldsymbol\theta \right],
\end{align}
where $\mathbf{f}$ is an $|\mathcal{R}|\times 1$ vector in which the $r^{th}$ entry counts the number of times that rule $r$ occurs in the tree, and $\boldsymbol\theta$ is a vector of the same size in which each entry is the log probability (weight) of the corresponding rule.
Armed with this new formulation, we can now rewrite equation 21 as
\begin{eqnarray}\label{eq:E_Step_parsing_post}
Pr(T|\mathbf{w},\boldsymbol\theta) &=& \frac{\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}{\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}\nonumber \\
&=& \frac{\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}{Z}, \tag{22}
\end{eqnarray}
where $Z$ is the partition function that we can calculate using the inside algorithm.
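Equation 22 can be checked directly on a toy problem where the candidate trees are enumerated explicitly. The rule-count matrix `F` and log weights `theta` below are illustrative; in practice the sum over trees is computed by the inside algorithm rather than by enumeration:

```python
import numpy as np

# Rule-count vectors f for three candidate parses of one sentence,
# and log rule weights theta over the same three-rule inventory
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
theta = np.log(np.array([0.5, 0.3, 0.2]))

scores = np.exp(F @ theta)   # exp(f^T theta) for each candidate tree
Z = scores.sum()             # partition function over the candidates
posterior = scores / Z       # equation 22: Pr(T | w, theta)
```

Here the third tree (weight $0.5 \times 0.3 = 0.15$) receives the highest posterior probability, and the posterior sums to one by construction.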
In the M-Step, we compute:
\begin{eqnarray}
\boldsymbol\theta^{t+1} &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\log\left[Pr(T_{i}|\boldsymbol\theta)\right]\right]\right]\nonumber \\
&=& \underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\sum_{\text{A}}\sum_{\alpha} \mbox{f}_{\text{A}\rightarrow\alpha}[T_{i}]\,\theta_{\text{A} \rightarrow \alpha}\right]\right], \tag{23}
\end{eqnarray}
where, as usual, the function $\mbox{f}_{\text{A}\rightarrow\alpha}[T_{i}]$ counts how many times the rule $\text{A}\rightarrow\alpha$ is used in tree $T_{i}$.
This is now very similar to the supervised case; we maximize equation 23 subject to the constraint in equation 20 using Lagrange multipliers to yield the update rule:
\begin{eqnarray}
\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[\frac{\sum_i^I \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]}{\sum_i^I \sum_\alpha \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]]} \right], \tag{24}
\end{eqnarray}
where the expectation is computed over the posterior distribution $Pr(T_{i}|\mathbf{w}_{i})$ that we compute in the E-Step. The final expression is the same as in the supervised case (equation 9) except that we are now using the log weights $\theta_{\text{A} \rightarrow \text{B} \text{C}}$ and taking expectations of the counting functions $\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]$.
In this section, we'll derive an expression for the expected counts $\mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]$ in the M-Step. First, we substitute in the expression for the posterior distribution (equation 22) from the E-Step:
\begin{eqnarray}
\mathbb{E}_{T}\left[f_{r}\right]
&=& \sum_{T\in \mathcal{T}[\mathbf{w}]}Pr(T|\mathbf{w},\boldsymbol\theta)\cdot f_{r}\nonumber \\
&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \cdot \boldsymbol\theta \right]\cdot f_{r} \tag{25}
\end{eqnarray}
Now we perform some algebraic manipulation on the right-hand side:
\begin{eqnarray}
\mathbb{E}_{T}\left[f_{r}\right]
&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \cdot \boldsymbol\theta \right]\cdot f_{r} \nonumber \\
&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T}\boldsymbol\theta \right]\frac{\partial }{\partial \theta_{r}}\left[ \mathbf{f}^{T} \boldsymbol\theta\right] \nonumber \\
&=&
\frac{1}{Z}\frac{\partial }{\partial \theta_{r}} \sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]\nonumber\\
&=& \frac{1}{Z}\frac{\partial Z}{\partial \theta_{r}} \nonumber\\
&=& \frac{\partial \log[Z]}{\partial \theta_{r}}. \tag{26}
\end{eqnarray}
We see that the expected count for the $r^{th}$ rule is just the derivative of the log partition function $\log[Z]$ with respect to the $r^{th}$ log weight $\theta_{r}$, where the partition function $Z$ is exactly what the inside algorithm computes.
To summarize the EM algorithm, we alternate between computing the expected counts for each rule $\text{A} \rightarrow \text{B} \text{C}$ and sentence $\mathbf{w}_{i}$:
\begin{eqnarray}
\mathbb{E}_{T_i}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right] &=& \frac{\partial \log[Z_{i}]}{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}}, \tag{27}
\end{eqnarray}
and updating the parameters:
\begin{eqnarray}
\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[\frac{\sum_i^I \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]}{\sum_i^I \sum_\alpha \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]]} \right]. \tag{28}
\end{eqnarray}
To do this, we need to know how to compute the derivative of the log partition function. Since the log partition function is calculated using the inside algorithm, a really simple way to do this is just to use automatic differentiation. We treat the inside algorithm as a differentiable function, as we would a neural network, and use backpropagation to compute the derivatives using code similar to Z = inside(w, theta); log(Z).backward(). This is known as the inside-outside algorithm.
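The identity $\mathbb{E}_{T}[f_{r}] = \partial \log[Z]/\partial \theta_{r}$ can be verified numerically on a toy problem where the trees are enumerated explicitly (illustrative rule-count vectors; in practice the inside algorithm replaces the explicit sum over trees):

```python
import numpy as np

# Rule counts per candidate tree, and log rule weights
F = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
theta = np.log(np.array([0.4, 0.6]))

def log_Z(th):
    """Log partition function via explicit enumeration of the trees."""
    return np.log(np.exp(F @ th).sum())

# Expected count of rule 0 under the posterior over trees
posterior = np.exp(F @ theta) / np.exp(log_Z(theta))
expected_f0 = (posterior * F[:, 0]).sum()

# Central finite-difference derivative of log Z with respect to theta_0
eps = 1e-6
d = np.zeros(2)
d[0] = eps
fd = (log_Z(theta + d) - log_Z(theta - d)) / (2 * eps)
```

The two quantities agree to within finite-difference error, which is exactly what makes the autodiff shortcut valid.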
In the previous section, we argued that we could compute the expected counts by automatic differentiation of the inside algorithm. In this section, we'll take this one step further: we'll apply backpropagation by hand to the inside algorithm to yield the outside algorithm, which directly computes these counts.
Let's think of the inside algorithm as a differentiable program. It takes a sentence $\mathbf{w}$ and parameters $\theta_r = \log[g_{r}]$ as input, and returns the partition function $Z$ as output. The code is:
0  # Initialize data structure
1  chart[0...n, 0...n, 1...V] := 0
2
3  # Use unary rules to find possible parts of speech at preterminals
4  for k := 1 to n # word position
5    for each unary rule A -> w_k
6      chart[k-1, k, A] := g[A -> w_k]
7
8  # Sum over binary rules
9  for l := 2 to n # substring length
10   for i := 0 to n-l # start point
11     k = i + l # end point
12     for j := i+1 to k-1 # mid point
13       for each binary rule A -> B C
14         chart[i, k, A] += g[A -> B C] x chart[i, j, B] x chart[j, k, C]
15
16 return chart[0, n, S]
Notice that we have changed the indexing in the loops from our original presentation; this makes the notation less cumbersome. In this notation, the chart is indexed so that chart[i, k, A]
contains the inside weight $\alpha[\text{A}_i^k]$, which represents the total weight of all the feasible subtrees spanning the words from position $i$ to position $k$.
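The pseudocode above translates into a short runnable implementation. The following is a sketch using dictionaries for the chart and an illustrative toy grammar (the `inside` helper and the rule encodings are our own):

```python
from collections import defaultdict

def inside(words, unary, binary, start="S"):
    """Inside algorithm for a grammar in Chomsky normal form.
    unary:  {(A, word): weight}    binary: {(A, B, C): weight}
    chart[(i, k)][A] holds the inside weight alpha[A_i^k]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    # Preterminal rules: possible parts of speech for each word
    for k in range(1, n + 1):
        for (A, w), g in unary.items():
            if w == words[k - 1]:
                chart[(k - 1, k)][A] += g
    # Sum over binary rules, from short spans to long
    for l in range(2, n + 1):                  # substring length
        for i in range(0, n - l + 1):          # start point
            k = i + l                          # end point
            for j in range(i + 1, k):          # mid point
                for (A, B, C), g in binary.items():
                    chart[(i, k)][A] += g * chart[(i, j)][B] * chart[(j, k)][C]
    return chart[(0, n)][start]                # partition function Z

unary = {("NP", "I"): 0.6, ("NP", "him"): 0.4, ("V", "saw"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
Z = inside(["I", "saw", "him"], unary, binary)  # 0.6 * 1.0 * 0.4 ≈ 0.24
```

For this unambiguous toy sentence there is a single parse, so $Z$ equals the probability of that one tree.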
Now consider what happens when we call log(Z).backward(). In the original inside algorithm, we worked from the leaves of the parse tree up to the root, where the partition function was computed. When we backpropagate, we reverse this order of operations and work from the root downwards. We'll now work backwards through the inside algorithm and discuss each part in turn.
Main loop: Based on this intuition, we can already sketch the loop structure, going backwards from the largest constituent to the smallest:
1 # Main parsing loop, top-down
2 for l := n downto 2
3   for i := 0 to n-l
4     k = i + l
5     for j := i+1 to k-1
6       # do stuff
To simplify matters, let's assume that we want to take the derivative of the partition function $Z$ itself rather than $\log[Z]$ for now. The update in the inside algorithm was given by:
\begin{equation}
\alpha[\text{A}_i^k] += \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k] \tag{29}
\end{equation}
Now we use reverse mode automatic differentiation to work backwards through the computation graph implied by this update. Using the notation $\nabla x = \frac{\partial Z}{\partial x}$, we start from the base case at the root of the tree $\alpha[\text{S}_0^n]$, which holds the partition function so $\nabla \alpha[\text{S}_0^n] = \frac{\partial Z}{\partial Z} =1$. Then we recursively compute:
\begin{align}
\nabla \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] &+= \nabla \alpha[\text{A}_i^k] \left(\alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k]\right) \label{eq:outside_rule_update}\tag{30}\\
\nabla \alpha[\text{B}_i^j] &+= \nabla \alpha[\text{A}_i^k]\left(\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{C}_j^k] \right)\label{eq:outside update}\tag{31}\\
\nabla \alpha[\text{C}_j^k] &+= \nabla \alpha[\text{A}_i^k]\left(\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{B}_i^j]\right)\label{eq:outside update2} \tag{32}
\end{align}
where in each case, the first term is the accumulated derivative that is being passed back from the later nodes of the computation graph, and the second term in brackets is the derivative of the current node with respect to each quantity involved.
Preterminal loop: Working further back through the algorithm, we must also compute the derivatives for the preterminal loop in lines 4-6 of the algorithm, whose forward update was $\alpha[\text{A}_{k-1}^k] += \mbox{g}[\text{A} \rightarrow w_k]$. Backpropagating through this update gives:
\begin{equation}\label{eq:final_beta}
\nabla \mbox{g}[\text{A} \rightarrow w_k] += \nabla \alpha[\text{A}_{k-1}^k]. \tag{33}
\end{equation}
Outside weights: To ease further discussion, let us rename the partial derivatives of the partition function with respect to the inside weights as outside weights, so that $\nabla \alpha [\text{A}_{i}^{k}] = \beta [\text{A}_{i}^{k}]$. Similarly, we'll rename the partial derivatives of the partition function with respect to the rules as $\nabla g[\text{A}\rightarrow \text{B} \text{C}] = \beta [\text{A}\rightarrow \text{B} \text{C}]$.
Expected counts: So far, we have been computing the derivatives of $Z$ with respect to the rule weights $\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$. However, we need to compute the expected counts, which are the derivatives of $\log[Z]$ with respect to the parameters $\theta_{\text{A} \rightarrow \text{B} \text{C}}$. We can make this adjustment using the chain rule:
\begin{eqnarray}
\mathbb{E}_{T}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right] &=& \frac{\partial \log[Z]}{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}}\nonumber \\
&=& \frac{\partial \log[Z]}{\partial Z} \cdot \frac{\partial Z}{\partial \mbox{g}[\text{A} \rightarrow \text{B} \text{C}]} \cdot
\frac{\partial \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] }{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}} \nonumber \\
&=& \frac{1}{Z} \cdot \beta[\text{A} \rightarrow \text{B} \text{C}] \cdot \mbox{g}[\text{A} \rightarrow \text{B} \text{C}].\label{eq:count_final} \tag{34}
\end{eqnarray}
where the second term is just the definition of $\beta[\text{A} \rightarrow \text{B} \text{C}]$ and the final term follows from the definition of the parameters $\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log[\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]]$.
Now we will put this all together and write out the inside-outside algorithm that we obtained with this mechanistic procedure of differentiating the inside algorithm. Instead of a single chart, we will keep track of the intermediate values with three arrays:
in_weights: Inside weights $\alpha[\text{A}_i^k]$ (i.e., the result of the inside algorithm).
out_weights: Outside weights of anchored nonterminals $\beta[\text{A}_i^k] = \nabla \alpha[\text{A}_i^k]$.
out_rules: Outside weights of rules $\beta[\text{A} \rightarrow \text{B} \text{C}] = \nabla \mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$.
The inside-outside algorithm is now:
1 # Run the inside algorithm first
2 in_weights, Z := INSIDE(w, G)
3 out_weights[0...n, 0...n, 1...V] := 0
4 out_rules[R] := 0 for each rule R
5 # Partial derivative of Z with respect to itself
6 out_weights[0, n, S] += 1
7
8 # Top-down sweep
9 for l := n downto 2 # substring length
10 for i := 0 to n-l # start point
11 k = i + l # end point
12 for j := i+1 to k-1 # mid point
13 for each binary rule A -> B C
14 out_rules[A -> B C] += out_weights[i, k, A] x
in_weights[i, j, B] x
in_weights[j, k, C]
15 out_weights[i, j, B] += out_weights[i, k, A] x
g[A -> B C] x
in_weights[j, k, C]
16 out_weights[j, k, C] += out_weights[i, k, A] x
g[A -> B C] x
in_weights[i, j, B]
17 # Width-1 constituents
18 for k := 1 to n
19 for each unary rule A -> w_k
20 out_rules[A -> w_k] += out_weights[k-1, k, A]
21 # Expected counts
22 for each rule R
23 counts[R] = 1/Z x out_rules[R] x g[R]
24 return counts
It's easy to see from the nested for loop structure that the inside-outside algorithm has the same complexity as the inside algorithm alone.
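Putting the forward and backward passes together, here is a minimal Python sketch of the full procedure. The grammar encoding via dictionaries and the function names are our own illustrative assumptions. For a toy ambiguous grammar with S -> S S and S -> a, every parse of "a a a" uses the binary rule exactly twice and the unary rule exactly three times, so the expected counts are exactly 2 and 3, which gives a simple check.

```python
from collections import defaultdict

def inside(words, unary_rules, binary_rules, start="S"):
    """Inside pass: in_weights chart and partition function Z."""
    n = len(words)
    chart = defaultdict(float)  # chart[(i, k, A)] holds alpha[A_i^k]
    for k in range(1, n + 1):
        for (A, w), g in unary_rules.items():
            if w == words[k - 1]:
                chart[(k - 1, k, A)] = g
    for l in range(2, n + 1):
        for i in range(0, n - l + 1):
            k = i + l
            for j in range(i + 1, k):
                for (A, B, C), g in binary_rules.items():
                    chart[(i, k, A)] += g * chart[(i, j, B)] * chart[(j, k, C)]
    return chart, chart[(0, n, start)]

def expected_counts(words, unary_rules, binary_rules, start="S"):
    """Expected rule counts E[f_R] = d log Z / d theta_R via the outside pass."""
    n = len(words)
    in_w, Z = inside(words, unary_rules, binary_rules, start)
    out_w = defaultdict(float)      # beta[A_i^k] = dZ / d alpha[A_i^k]
    out_rules = defaultdict(float)  # beta[R]     = dZ / d g[R]
    out_w[(0, n, start)] = 1.0      # base case: dZ/dZ = 1

    # Top-down sweep: the inside loops in reverse
    for l in range(n, 1, -1):
        for i in range(0, n - l + 1):
            k = i + l
            for j in range(i + 1, k):
                for (A, B, C), g in binary_rules.items():
                    b = out_w[(i, k, A)]
                    out_rules[(A, B, C)] += b * in_w[(i, j, B)] * in_w[(j, k, C)]
                    out_w[(i, j, B)] += b * g * in_w[(j, k, C)]
                    out_w[(j, k, C)] += b * g * in_w[(i, j, B)]

    # Width-1 constituents
    for k in range(1, n + 1):
        for (A, w), g in unary_rules.items():
            if w == words[k - 1]:
                out_rules[(A, w)] += out_w[(k - 1, k, A)]

    # Expected counts: (1/Z) * beta[R] * g[R]  (equation 34)
    counts = {}
    for R, g in list(unary_rules.items()) + list(binary_rules.items()):
        counts[R] = out_rules[R] * g / Z
    return counts, Z
```

Note that the rule weights cancel out of neither term: the toy check works for any positive weights, since both parses of "a a a" share the same rule multiset.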
Up to this point, we have discussed quite mechanistically how to compute the outside weights. Now let us finish by discussing exactly what they mean.
As we saw in Part II, the inside weights correspond to summing over the weights of all possible inside trees (figure 1a):
\begin{equation}\label{eq:inside_update_2}
\alpha[A_i^k] = \sum_{B, C}\sum_{j=i+1}^{k} \mbox{g}[A \rightarrow B C] \times \alpha[B_i^j] \times \alpha[C_j^k]. \tag{35}
\end{equation}
The inside weight $\alpha[A_i^k]$ sums, over all possible split points $j$ and all possible nonterminals $B$ and $C$, the rule weight multiplied by the left and right child inside weights (figure 1).
From equations 31 and 32, we see that the equivalent recursive definition for the outside weights is:
\begin{eqnarray}
\beta[\text{B}_i^j] &=& \sum_{\text{A}, \text{C}} \sum_{k=j+1}^{n} \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{C}_{j}^{k}] \times \beta[\text{A}_i^k]\nonumber \\
&&+\sum_{\text{A}, \text{C}} \sum_{k=0}^{i-1} \mbox{g}[\text{A} \rightarrow \text{C} \text{B}] \times \alpha[\text{C}_{k}^{i}] \times \beta[\text{A}_{k}^j] \tag{36}
\end{eqnarray}
The outside trees of $\text{A}_i^j$ are the set of trees that are rooted in $S$ and yield $w_1, \ldots, w_{i}, \text{A}, w_{j+1}, \ldots, w_n$, where $n$ is the sentence length. These are the trees that originate from $S$ and occur around $\text{A}_i^j$ (figure 2). The recursive update computes the outside weight in terms of the parent outside weight and the sibling inside weight. It sums over all possible parents, siblings, orderings of the siblings (there is one term for each ordering in the equation), and split points $k$.
We have seen that the inside weights sum over all inside trees and the outside weights sum over all outside trees. It follows that the total weight of all the parses that contain the anchored nonterminal $\text{A}_i^k$ is simply $\alpha[\text{A}_i^k] \cdot \beta[\text{A}_i^k]$. If we normalize by the partition function $Z$, we retrieve the marginal probability of the anchored nonterminal $\text{A}_i^k$, which is given by $(\alpha[\text{A}_i^k] \cdot \beta[\text{A}_i^k])/Z$.
By combining equations 30 and 34, we see how the inside and outside weights fit together to compute the expected counts:
\begin{equation}
\mathbb{E}_{T}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right]= \frac{1}{Z} \times \sum_{0 \leq i < j < k \leq n} \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \beta[\text{A}_i^k] \times \alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k] \tag{37}
\end{equation}
This equation enumerates all possible spans where A could be rewritten as B and C (the inside part), while also accounting for every configuration that could occur around A (the outside part).
In this blog, we introduced probabilistic context-free grammars, and we discussed both supervised and unsupervised methods for learning their parameters. In the supervised case, we learn the rule weights from pairs of sentences and parse trees, and there is a simple, intuitive closed-form solution.
Then we considered the unsupervised case, where we only have access to the sentences. The first strategy we described is Viterbi training, which is fairly simple: we already know how to do weighted parsing to find the best trees for known parameters, and we also know how to maximize the likelihood of the parameters if we are given the trees. Viterbi EM initializes the parameters randomly and then parses the entire corpus. Taking these parses as pseudo-labels, it re-estimates the parameters to maximize their likelihood, and this process iterates until convergence.
Viterbi training is sometimes called Viterbi EM, hard EM, or self-EM. Variants include unsupervised parsing of different formalisms and methods to improve over vanilla supervised training. The same ideas have been adapted for image classification and segmentation.
We also discussed the soft-EM approach to unsupervised learning of rule weights in PCFGs. Rather than inferring the trees with the highest weight, we computed the expected number of times each rule would appear in the parses. Then these expected counts were used to find the maximum likelihood estimates of the parameters. To compute these expected counts, we used the inside-outside algorithm, which can be summarized by the code snippet Z = inside(w, $\theta$); log(Z).backward(). Finally, we made a big leap and discussed how to mechanistically derive the outside algorithm from the inside pass to obtain the expected rule counts.
^{1. }The Penn Treebank is not publicly available, but a free alternative is the QuestionBank, which contains English questions annotated with trees.
In this blog, we will introduce weighted context-free grammars, or WCFGs. These assign a nonnegative weight to each rule in the grammar. From here, we can assign a weight to any parse tree by multiplying the weights of its component rules together. We present two variations of the CYK algorithm that apply to WCFGs: (i) the inside algorithm computes the sum of the weights of all possible analyses (parse trees) for a sentence; (ii) the weighted parsing algorithm finds the parse tree with the highest weight.
In Part III of this tutorial, we introduce probabilistic context-free grammars. These are a special case of WCFGs where the weights of all rules with the same left-hand side sum to one. We then discuss how to learn these weights from a corpus of text. We will see that the inside algorithm is a critical part of this process.
Before we start our discussion, let's briefly review what we learned about contextfree grammars and the CYK recognition algorithm in part I of this tutorial. Recall that we defined a contextfree grammar as the tuple $\langle S, \mathcal{V}, \Sigma, \mathcal{R}\rangle$ with a start symbol $S$, nonterminals $\mathcal{V}$, terminals $\Sigma$ and finally the rules $\mathcal{R}$.
In our examples, the nonterminals are a set $\mathcal{V}=\{\mbox{VP, PP, NP, DT, NN, }\ldots\}$ containing subclauses (e.g., verb phrase $\mbox{VP}$) and parts of speech (e.g., noun $\mbox{NN}$). The terminals contain the words. We will consider grammars in Chomsky Normal Form, where each rule maps one nonterminal to either two other nonterminals (e.g., $\text{VP} \rightarrow \text{V} \; \text{NP}$) or a single terminal symbol (e.g., $\text{V} \rightarrow$ eats).
The CYK recognition algorithm takes a sentence and a grammar in Chomsky Normal Form and determines if the sentence is valid under the grammar. With minor changes, it can also return the set of valid parse trees. It constructs a chart where each position in the chart corresponds to a subsequence of words (figure 2). At each position, there is a binary array with one entry per nonterminal, where this entry is set to TRUE if that nonterminal can validly derive the associated subsequence.
The CYK algorithm works by first finding valid unary rules that map preterminals representing parts of speech to terminals representing words (e.g., DT $\rightarrow$ the). Then it considers subsequences of increasing length and identifies applicable binary nonterminal rules (e.g., $\mbox{NP}\rightarrow \mbox{DT NN}$). A rule is applicable if there are two subtrees lower down in the chart whose roots match its right-hand side. If the algorithm can place the start symbol in the top-left of the chart, then the overall sentence is valid. The pseudocode is given by:
0 # Initialize data structure
1 chart[1...n, 1...n, 1...V] := FALSE
2
3 # Use unary rules to find possible parts of speech at preterminals
4 for p := 1 to n # start position
5 for each unary rule A -> w_p
6 chart[1, p, A] := TRUE
7
8 # Main parsing loop
9 for l := 2 to n # subsequence length
10 for p := 1 to n-l+1 # start position
11 for s := 1 to l-1 # split width
12 for each binary rule A -> B C
13 chart[l, p, A] = chart[l, p, A] OR
(chart[s, p, B] AND chart[l-s, p+s, C])
14
15 return chart[n, 1, S]
For a much more detailed discussion of this algorithm, consult Part I of this blog.
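As a concrete sketch, the pseudocode above can be transcribed into Python as follows. The representation of the grammar as sets of tuples is our own assumption for illustration; the chart is a dictionary from (l, p, A) to booleans.

```python
def cyk_recognize(words, unary_rules, binary_rules, start="S"):
    """Return True if the sentence can be derived from the start symbol.

    unary_rules:  set of (A, word) pairs for rules A -> word
    binary_rules: set of (A, B, C) triples for rules A -> B C
    """
    n = len(words)
    chart = {}  # chart[(l, p, A)] = True if A derives the length-l span at p

    # Unary rules: possible parts of speech for single words
    for p in range(1, n + 1):
        for (A, w) in unary_rules:
            if w == words[p - 1]:
                chart[(1, p, A)] = True

    # Main parsing loop over increasing subsequence lengths
    for l in range(2, n + 1):          # subsequence length
        for p in range(1, n - l + 2):  # start position
            for s in range(1, l):      # split width
                for (A, B, C) in binary_rules:
                    if chart.get((s, p, B)) and chart.get((l - s, p + s, C)):
                        chart[(l, p, A)] = True

    return chart.get((n, 1, start), False)
```

For example, with the rules S -> A B, A -> a and B -> b, the string "a b" is accepted while "b a" is not.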
Weighted context-free grammars (WCFGs) are context-free grammars which have a nonnegative weight associated with each rule. More precisely, we add the function $g: \mathcal{R} \mapsto \mathbb{R}_{\geq 0}$ that maps each rule to a nonnegative number. The weight of a full derivation tree $T$ is then the product of the weights of each rule $T_t$:
\begin{equation}\label{eq:weighted_tree_from_rules}
\mbox{G}[T] = \prod_{t \in T} g[T_t]. \tag{1}
\end{equation}
Context-free grammars generate strings, whereas weighted context-free grammars generate strings with an associated weight.
We will interpret the weight $g[T_t]$ as the degree to which we favor a rule, and so, we "prefer" parse trees $T$ with higher overall weights $\mbox{G}[T]$. Ultimately, we will learn these weights in such a way that real observed sentences have high weights and ungrammatical sentences have lower weights. From this viewpoint, the weights can be viewed as parameters of the model.
Since the tree weights $G[T]$ are nonnegative, they can be interpreted as unnormalized probabilities. To create a valid probability distribution over possible parse trees, we must normalize by the total weight $Z$ of all tree derivations:
\begin{eqnarray}
Z &=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \mbox{G}[T] \nonumber \\
&=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \prod_{t \in T} \mbox{g}[T_t], \tag{2}
\end{eqnarray}
where $\mathcal{T}[\mathbf{w}]$ represents the set of all possible parse trees from which the observed words $\mathbf{w}=[w_{1},w_{2},\ldots, w_{n}]$ can be derived. We'll refer to the normalizing constant $Z$ as the partition function. The conditional distribution of a possible derivation $T$ given the observed words $\mathbf{w}$ is then:
\begin{equation}
Pr(T|\mathbf{w}) = \frac{\mbox{G}[T]}{Z}. \tag{3}
\end{equation}
We defined the partition function $Z$ as the sum of the weights of all the trees $\mathcal{T}[\mathbf{w}]$ from which the observed words $\mathbf{w}$ can be derived. However, in Part I of this tutorial we saw that the number of possible binary parse trees increases very rapidly with the sentence length.
The CYK recognition algorithm used dynamic programming to search this huge space of possible trees in polynomial time and determine whether there is at least one valid tree. To compute the partition function, we will use a similar trick to search through all possible trees and sum their weights simultaneously. This is known as the inside algorithm.
Before we present the inside algorithm, we need to introduce the semiring. This abstract algebraic structure will help us adapt the CYK algorithm to compute different quantities. A semiring is a set $\mathbb{A}$ on which we have defined two binary operators:
1. $\oplus$ is a commutative operation with identity element $0$, which behaves like addition $+$.
2. $\otimes$ is an associative operation that (right) distributes over $\oplus$, just like multiplication $\times$. It has identity element $1$ and absorbing element $0$.
Similarly to grammars we will just denote semirings as tuples: $\langle\mathbb{A}, \oplus, \otimes, 0, 1\rangle$. You can think of the semiring as generalizing the notions of addition and multiplication.^{1}
Computing the partition function $Z$ for the conditional distribution $Pr(T|\mathbf{w})$ might appear difficult, because it sums over the large space of possible derivations for the sentence $\mathbf{w}$. However, we've already seen how the CYK recognition algorithm accepts or rejects a sentence in polynomial time, while sweeping through all possible derivations. The inside algorithm uses a variation of the same trick to compute the partition function.
When used for recognition, the $\texttt{chart}$ holds values of $\texttt{TRUE}$ and $\texttt{FALSE}$, and the computation is based on the two logical operators OR and AND; we can think of these as being part of the semiring $\langle\{\texttt{TRUE}, \texttt{FALSE}\}, OR, AND, \texttt{FALSE}, \texttt{TRUE}\rangle$.
The inside algorithm replaces this semiring with the sum-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , +, \times, 0, 1\rangle$ to get the following procedure:
0 # Initialize data structure
1 chart[1...n, 1...n, 1...V] := 0
2
3 # Use unary rules to find possible parts of speech at preterminals
4 for p := 1 to n # start position
5 for each unary rule A -> w_p
6 chart[1, p, A] := g[A -> w_p]
7
8 # Main parsing loop
9 for l := 2 to n # subsequence length
10 for p := 1 to n-l+1 # start position
11 for s := 1 to l-1 # split width
12 for each binary rule A -> B C
13 chart[l, p, A] = chart[l, p, A] +
(g[A -> B C] x chart[s, p, B] x chart[l-s, p+s, C])
14
15 return chart[n, 1, S]
The differences from the recognition algorithm are that the chart is initialized with zeros rather than FALSE, the chart entries hold rule weights rather than truth values, and the OR and AND operations are replaced by addition and multiplication.
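A direct Python transcription of this procedure might look as follows. This is a sketch under the assumption that rules are stored in dictionaries mapping to their weights (our own encoding choice); it mirrors the (l, p, s) indexing of the pseudocode.

```python
from collections import defaultdict

def inside(words, unary_rules, binary_rules, start="S"):
    """Sum the weights of all parse trees: the partition function Z.

    unary_rules:  dict mapping (A, word) to the weight g[A -> word]
    binary_rules: dict mapping (A, B, C) to the weight g[A -> B C]
    """
    n = len(words)
    chart = defaultdict(float)  # chart[(l, p, A)]: total weight of subtrees
                                # rooted in A over the length-l span at p

    # Unary rules: weights of the possible parts of speech for each word
    for p in range(1, n + 1):
        for (A, w), g in unary_rules.items():
            if w == words[p - 1]:
                chart[(1, p, A)] = g

    # Main parsing loop: sum over binary rules for increasing lengths
    for l in range(2, n + 1):          # subsequence length
        for p in range(1, n - l + 2):  # start position
            for s in range(1, l):      # split width
                for (A, B, C), g in binary_rules.items():
                    chart[(l, p, A)] += g * chart[(s, p, B)] * chart[(l - s, p + s, C)]

    return chart[(n, 1, start)]
```

With S -> A B (weight 3), A -> a (0.5) and B -> b (2.0), "a b" has a single parse of weight 3, and an ambiguous grammar with S -> S S and S -> a (both weight 1) gives Z = 2 for "a a a", one unit per tree shape.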
As in the CYK recognition algorithm, each position $(p,l)$ in the $\texttt{chart}$ represents the subsequence that starts at position $p$ and is of length $l$ (figure 2). In the inside algorithm, every position in the chart holds a length $V$ vector where the $v^{th}$ entry corresponds to the $v^{th}$ nonterminal. The value held in this vector is the sum of the weights of all subtrees for which the $v^{th}$ nonterminal is the root.
The intuition for the update rule in line 13 is simple. The additional weight for adding rule $A\rightarrow BC$ into the chart is the weight $g[A\rightarrow BC]$ for this rule times the sum of weights of all possible left subtrees rooted in B times the sum of weights of all possible right subtrees rooted in C. As before, there may be multiple possible rules that place nonterminal $A$ in a position corresponding to different splits of the subsequence and here we perform this computation for each rule and sum the results together.
In figures 3 and 4 we show a worked example of the inside algorithm for the same sentence as we used for the CYK recognition algorithm. Figure 3a corresponds to lines 4-6 of the algorithm, where we are initializing the first row of the chart based on the unary rule weights. Figure 3b corresponds to the main loop in lines 9-13 for subsequence length $l=2$. Here we assign binary nonterminal rules and compute their weights as (weight of rule $\times$ weight of left branch $\times$ weight of right branch).
Figure 4a corresponds to the main loop in lines 9-13 for subsequence length $l=5$. At position (5,2), there are two possible rules that apply, both of which result in the same nonterminal. We calculate the weights for each rule as before, and add the results so that the final weight at this position sums over all subtrees. Figure 4b shows the final result of the algorithm. The weight associated with the start symbol $S$ at position (6,1) is the partition function.
Our discussion so far does not make it clear why this method for computing the partition function is known as the inside algorithm. The reason is that the $\texttt{chart}$ holds the inside weights for each anchored nonterminal. By "anchored" we mean that a nonterminal $A_i^k$ (read "A from $i$ to $k$") is anchored to a span in the sentence (i.e., a substring): it yields the string $A_i^k \Rightarrow w_i, \ldots, w_k$.
An anchored rule then has the form $A_i^k \rightarrow B_i^j C_j^k$. With this notation in hand, we can provide a recursive definition of the inside weight of an anchored nonterminal:
\begin{equation}\label{eq:inside_update}
\alpha[A_i^k] = \sum_{B, C}\sum_{j=i+1}^k \mbox{g}[A \rightarrow B C] \times \alpha[B_i^j] \times \alpha[C_j^k]. \tag{4}
\end{equation}
The inside weight $\alpha[A_i^k]$ sums, over all possible split points $j$ and all possible nonterminals $B$ and $C$, the rule weight multiplied by the left and right subtree weights (figure 5).
In the previous section, we saw that we could transform the CYK recognition algorithm into the inside algorithm by just changing the underlying semiring. With this small adjustment, we showed that we can compute the partition function (the sum of the weights of all tree derivations) in polynomial time. In this section, we apply a similar trick to weighted parsing.
Recall that the partition function $Z$ was defined as the sum of all possible derivations:
\begin{eqnarray}
Z &=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \mbox{G}[T] \nonumber \\
&=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \prod_{t \in T} \mbox{g}[T_t]. \tag{5}
\end{eqnarray}
In contrast, weighted parsing aims to find the derivation $T^{*}$ with the highest weight among all possible derivations:
\begin{eqnarray}
T^{*} &=& \underset{T \in \mathcal{T}[\mathbf{w}]}{\text{arg} \, \text{max}} \; \left[\mbox{G}[T]\right] \nonumber \\
&=& \underset{T \in \mathcal{T}[\mathbf{w}]}{\text{arg} \, \text{max}} \left[\prod_{t \in T} \mbox{g}[T_t]\right], \tag{6}
\end{eqnarray}
where $\mbox{G}[T]$ is the weight of a derivation tree which is computed by taking the product of the weights $\mbox{g}[T_t]$ of the rules.
Once again, we will modify the semiring in the CYK algorithm to perform the task. Let us replace the sum-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , +, \times, 0, 1\rangle$ with the max-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , \max[\bullet], \times, 0, 1\rangle$ to find the score of the "best" derivation. This gives us the following algorithm:
0 # Initialize data structure
1 chart[1...n, 1...n, 1...V] := 0
2
3 # Use unary rules to find possible parts of speech at preterminals
4 for p := 1 to n # start position
5 for each unary rule A -> w_p
6 chart[1, p, A] := g[A -> w_p]
7
8 # Main parsing loop
9 for l := 2 to n # subsequence length
10 for p := 1 to n-l+1 # start position
11 for s := 1 to l-1 # split width
12 for each binary rule A -> B C
13 chart[l, p, A] = max[chart[l, p, A],
g[A -> B C] x chart[s, p, B] x chart[l-s, p+s, C]]
14
15 return chart[n, 1, S]
As before, the differences from the CYK recognition algorithm are the use of weights in place of truth values; the single difference from the inside algorithm is that the sum on line 13 is replaced by a max.
Once more, each position $(p,l)$ in the $\texttt{chart}$ represents the subsequence that starts at position $p$ and is of length $l$. In the inside algorithm, each position contained a vector with one entry for each of the $V$ nonterminals, and each element of this vector contained the sum of the weights of all of the subtrees which feed into this anchored nonterminal. In this variation, each element contains the maximum weight among all the subtrees that feed into this anchored nonterminal. Position (n,1) represents the whole string, and so the value $\texttt{chart[n, 1, S]}$ is the maximum weight among all valid parse trees. If this is zero, then there is no valid derivation.
The update rule at line 13 for the weight at $\texttt{chart[l, p, A]}$ now has the following interpretation. For each rule $\texttt{A -> B C}$ and for each possible split $\texttt{s}$ of the data, we multiply the rule weight $\texttt{g[A -> B C]}$ by the two weights $\texttt{chart[s, p, B]}$ and $\texttt{chart[l-s, p+s, C]}$ associated with the two child subsequences. If the result is larger than the current highest value, then we update it. If we are interested in the parse tree itself, then we can store backpointers indicating which split yielded the maximum value at each position, and traverse backwards to retrieve the best tree.
In figure 6 we illustrate a worked example of weighted parsing. The algorithm starts by assigning weights to preterminals exactly as in figure 3a. The computation of the weights for subsequences of length $l=2$ is also exactly as in figure 3b, and the algorithm proceeds identically for $l=3$ and $l=4$.
The sole difference occurs for the subsequence of length $l=5$ at position $p=2$ (figure 6). There are two possible rules that both assign the nonterminal VP to the chart at this position. In the inside algorithm, we calculated the weights of these rules and summed them. In weighted parsing, we store the largest of these weights, and this operation corresponds to the $\mbox{max}[\bullet,\bullet]$ function on line 13 of the algorithm.
At the end of the procedure, the weight associated with the start symbol at position (6,1) corresponds to the tree with the maximum weight and so is considered the "best". By keeping track of which subtree yielded the maximum weight at each split, we can retrieve this tree which corresponds to our best guess at parsing the sentence.
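A minimal Python sketch of weighted parsing with backpointers might look as follows. The dictionary grammar encoding and the nested-tuple tree representation are our own illustrative choices, not part of the original presentation.

```python
from collections import defaultdict

def weighted_parse(words, unary_rules, binary_rules, start="S"):
    """Find the highest-weight parse with the max-product chart, storing
    backpointers so the best tree can be recovered afterwards."""
    n = len(words)
    chart = defaultdict(float)
    back = {}  # back[(l, p, A)] = (s, B, C) for the best-scoring split

    # Unary rules: weights of the possible parts of speech for each word
    for p in range(1, n + 1):
        for (A, w), g in unary_rules.items():
            if w == words[p - 1]:
                chart[(1, p, A)] = g

    # Main parsing loop: keep the maximum instead of the sum
    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for (A, B, C), g in binary_rules.items():
                    weight = g * chart[(s, p, B)] * chart[(l - s, p + s, C)]
                    if weight > chart[(l, p, A)]:
                        chart[(l, p, A)] = weight
                        back[(l, p, A)] = (s, B, C)

    def build(l, p, A):
        """Follow backpointers to reconstruct the best tree as nested tuples."""
        if l == 1:
            return (A, words[p - 1])
        s, B, C = back[(l, p, A)]
        return (A, build(s, p, B), build(l - s, p + s, C))

    best = chart[(n, 1, start)]
    return (best, build(n, 1, start)) if best > 0 else (0.0, None)
```

In the usage below, two rules compete to derive the whole string; the higher-weight analysis (through the preterminal A) wins and is recovered from the backpointers.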
We've seen that we can add weights to CFGs and replace the $OR, AND$ semiring with $+, \times$ to find the total weight of all possible derivations (i.e., compute the partition function with the inside algorithm). Furthermore, we can use $\max, \times$ instead to find the parse tree with the highest weight.
The semirings allow us to unify the CYK recognition, inside, and weighted parsing algorithms by recursively defining the chart entries as:
\begin{equation}
\texttt{chart}[A_i^k] = \bigoplus_{B, C, j} \mbox{g}[A \rightarrow B C] \otimes \texttt{chart}[B_i^j] \otimes \texttt{chart}[C_j^k], \tag{7}
\end{equation}
where for recognition $\mbox{g}[A \rightarrow B C]$ just returns $\texttt{TRUE}$ for all existing rules.
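This unified view can be sketched in Python by passing the semiring operations in as arguments. The function and the grammar encoding below are our own illustrative assumptions; instantiating the semiring recovers each of the three algorithms.

```python
def chart_parse(words, unary_rules, binary_rules, plus, times, zero, start="S"):
    """Generic CYK-style chart parser over a semiring (plus, times, zero).

    Passing (OR, AND, FALSE), (+, *, 0.0) or (max, *, 0.0) recovers the
    recognition, inside, and weighted-parsing algorithms respectively.
    """
    n = len(words)
    chart = {}

    def get(key):
        return chart.get(key, zero)  # unfilled cells hold the semiring zero

    for p in range(1, n + 1):
        for (A, w), g in unary_rules.items():
            if w == words[p - 1]:
                chart[(1, p, A)] = g

    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for (A, B, C), g in binary_rules.items():
                    new = times(g, times(get((s, p, B)), get((l - s, p + s, C))))
                    chart[(l, p, A)] = plus(get((l, p, A)), new)

    return get((n, 1, start))
```

For the boolean instantiation, any truthy weight plays the role of TRUE.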
Readers familiar with graphical models will no doubt have noticed the similarity between these methods and sum-product and max-product belief propagation. Indeed, we could alternatively have presented this entire argument in terms of graphical models, but the semiring formulation is more concise.
In the final part of this blog, we will consider probabilistic context-free grammars, which are a special case of weighted context-free grammars. We'll develop algorithms to learn the weights from (i) a corpus of sentences with known parse trees and (ii) just the sentences. The latter case will lead to a discussion of the famous inside-outside algorithm.
^{1. }If you are wondering why it is "semi", it's because the magnificent rings also have an additive inverse for each element: $x \oplus (-x) = 0$.
In the first section, we'll discuss position embeddings. The transformer operates on unordered sets of embeddings, but often we are processing ordered sequences (e.g., words in NLP). We will describe the ways that the architecture has been adapted to take into account the position of each element in the sequence. In the second section, we'll discuss efficiency. The attention computation grows quadratically with the sequence length, and in practice this limits the maximum length we can use. We'll describe work that allows the transformer to work efficiently with longer sequences. We will conclude by describing how the self-attention mechanism relates to other models, including RNNs, graph neural networks, capsule networks, Hopfield networks, CNNs, gating networks, and hypernetworks.
In Part I, we discussed how the core component of the transformer is dot-product self-attention $\bf Sa[\mathbf{X}]$. In this section, we'll provide a brief review of this mechanism. Self-attention takes a set of vectors $\{\mathbf{x}_{i}\}_{i=1}^{I}$ (which form the $I$ rows of $\mathbf{X}$) and modifies them based on the degree to which they attend to each other:
\begin{equation}
{\bf Sa}[\mathbf{X}] =\bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}, \tag{1}
\end{equation}
where the function $\bf Softmax[\bullet]$ performs a separate softmax operation on each row of the input. The terms $\boldsymbol\Phi_{q}, \boldsymbol\Phi_{k}$ and $\boldsymbol\Phi_{v}$ are known as the query, key, and value matrices respectively, and when applied to the data they form the queries $\mathbf{X}\boldsymbol\Phi_{q}$, keys $\mathbf{X}\boldsymbol\Phi_{k}$, and values $\mathbf{X}\boldsymbol\Phi_{v}$.
In simple terms, for each input $\mathbf{x}_{i}$ the self attention mechanism returns a weighted sum of the values for every input $\mathbf{x}_{j}$, where the weight depends on the dot product similarity between the query for $\mathbf{x}_{i}$ and the key for $\mathbf{x}_{j}$. These similarities are normalized by the softmax function so that they are positive and sum to one and after normalization are referred to as attention. The term $\bf Softmax\left[(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}/\sqrt{d_{q}}\right]$ is of size $I\times I$ and is known as the attention matrix.
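For concreteness, equation 1 can be sketched in NumPy as follows. The function name is our own; the row-wise softmax is stabilized by subtracting each row's maximum, a standard numerical trick not spelled out in the text.

```python
import numpy as np

def self_attention(X, Phi_q, Phi_k, Phi_v):
    """Dot-product self-attention Sa[X] as in equation 1.

    X is I x D (one row per input); Phi_q and Phi_k are D x d_q; Phi_v is D x d_v.
    """
    queries = X @ Phi_q
    keys = X @ Phi_k
    values = X @ Phi_v
    d_q = Phi_q.shape[1]
    scores = queries @ keys.T / np.sqrt(d_q)  # I x I similarity matrix
    # Softmax applied separately to each row (stabilized by the row maximum)
    scores = scores - scores.max(axis=1, keepdims=True)
    attention = np.exp(scores)
    attention = attention / attention.sum(axis=1, keepdims=True)
    return attention @ values  # each output row is a weighted sum of the values
```

The output has one row per input, each a convex combination of the value vectors.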
The selfattention mechanism is equivariant to permutations of the input. In other words, if we apply a permutation matrix $\mathbf{P}$ to the rows of the matrix $\mathbf{X}$, the output will also be permuted, but will otherwise stay the same:
\begin{eqnarray}
{\bf Sa}[\mathbf{P}\mathbf{X}] &=&\bf Softmax\left[\frac{(\mathbf{P}\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{P}\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{P}\mathbf{X}\boldsymbol\Phi_{v}\nonumber\\
&=&\mathbf{P}\cdot \bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{P}^{T}\mathbf{P}\mathbf{X}\boldsymbol\Phi_{v}\nonumber \\
&=&\mathbf{P}\cdot\bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}\nonumber \\
&=&\mathbf{P}\cdot {\bf Sa}[\mathbf{X}] . \tag{2}
\end{eqnarray}
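This equivariance is easy to verify numerically. The snippet below is a self-contained sketch (with our own minimal re-implementation of equation 1) that applies a random permutation matrix to the inputs and compares the two sides of equation 2.

```python
import numpy as np

def softmax_rows(S):
    """Softmax applied separately to each row."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, Phi_q, Phi_k, Phi_v):
    """Dot-product self-attention Sa[X] (equation 1)."""
    d_q = Phi_q.shape[1]
    A = softmax_rows((X @ Phi_q) @ (X @ Phi_k).T / np.sqrt(d_q))
    return A @ (X @ Phi_v)

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 8))
Phi_q, Phi_k, Phi_v = [rng.standard_normal((8, 4)) for _ in range(3)]

# A random permutation matrix P
P = np.eye(6)[rng.permutation(6)]

# Equation 2: Sa[P X] should equal P Sa[X]
lhs = self_attention(P @ X, Phi_q, Phi_k, Phi_v)
rhs = P @ self_attention(X, Phi_q, Phi_k, Phi_v)
```

Up to floating-point error, `lhs` and `rhs` agree exactly, confirming the algebra above.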
This is not desirable when the vectors $\mathbf{x}_{i}$ represent words in a sentence, as the order of the inputs is important; the sentences The man ate the fish and The fish ate the man have different meanings, and we hope that any neural processing will take this into account.
Before discussing how to encode positional information, it is worth thinking about what properties we would like this encoding to have. First, we need to know the relative position of two words rather than their absolute position. Transformers are trained with spans of text that may contain multiple sentences, and the start of the span may be midway through the sentence. Consequently, the absolute position does not contain much useful information.
Second, word embeddings that are far from one another in the sequence might be expected to interact with one another less than those that are closer. For example, when we disambiguate a pronoun (e.g., understanding who he is in a sentence like He ate the sandwich), it's likely that the answer is close at hand, not several thousand words away. Finally, we might expect that we need the relative position with less and less accuracy as the distance between tokens increases. For small distances, the relative word position directly affects the meaning of the sentence, but for larger distances the words are probably in different sentences and the exact distance between them matters much less.
In the original transformer paper, position was encoded by adding a predetermined matrix $\boldsymbol\Pi$ to the input embedding matrix $\mathbf{X}$ where the position embeddings are predefined as:
\begin{eqnarray}
\Pi_{i, 2f} &=& \sin[\omega_f i] \nonumber\\
\Pi_{i, 2f+1} &=& \cos[\omega_f i] \tag{3}
\end{eqnarray}
where $i$ indexes the position in the sequence and $f$ indexes pairs of adjacent embedding dimensions. The angular frequencies $\omega_f$ of adjacent dimensions $d = 2f$ and $d+1 = 2f+1$ are the same and take the value $\omega_f = 1/10000^{2f/D}$ (figure 1).
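To make this concrete, here is a minimal sketch of constructing the matrix $\boldsymbol\Pi$ of equation 3 with numpy; the function name and the assumption that $D$ is even are ours, but the sine/cosine layout follows the equation directly.

```python
import numpy as np

def sinusoidal_encoding(I, D):
    """Build the I x D position-embedding matrix Pi of equation 3.
    Adjacent dimension pairs (2f, 2f+1) share the angular frequency
    omega_f = 1/10000^(2f/D). Assumes D is even."""
    f = np.arange(D // 2)
    omega = 1.0 / (10000.0 ** (2.0 * f / D))   # frequencies decay with dimension
    i = np.arange(I)[:, None]
    Pi = np.empty((I, D))
    Pi[:, 0::2] = np.sin(omega * i)            # even dimensions: sine
    Pi[:, 1::2] = np.cos(omega * i)            # odd dimensions: cosine
    return Pi

Pi = sinusoidal_encoding(I=512, D=64)
```

At position $i=0$ the sine dimensions are all zero and the cosine dimensions are all one, which is a quick sanity check on the layout.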
One way to think about adding the matrix $\boldsymbol\Pi$ is that we are adding a different vector to the embedding $\mathbf{x}_{i}$ where this vector encodes the absolute position $i$. So if the same word occurs at different positions in the sequence, it would have two different embeddings. For this reason, this sinusoidal encoding is considered an absolute position embedding.
This scheme is worth examining closely. In the self-attention mechanism, we apply linear transformations $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ to $\mathbf{X}+\boldsymbol\Pi$ and then compute dot products between every pair of rows in the resulting matrices. We'll now consider several interesting properties that emerge when we apply linear transformations to this sinusoidal embedding and take dot products.
Separating position and word embeddings: At first sight, adding the position embeddings to the data seems like a bad idea; we presumably need both the word embedding and the position embedding without having them hopelessly entangled. However, this is not necessarily a problem. Since the embedding dimension $D$ is usually greater than the maximum sequence length $I$ (e.g., BERT used D=1024, I=512), it is possible for the system to learn word embeddings that lie outside the subspace of the position embeddings. If this were the case, the system could recover the word embeddings by learning linear transformations $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ whose nullspace contains the position embeddings. Similarly, the system could recover the position embeddings by learning transformations whose nullspace contains the word embeddings.
Down-weighting distant elements: The dot product between the position encodings $\boldsymbol\pi_{i}$ and $\boldsymbol\pi_{j}$ at different positions $i$ and $j$ (i.e., rows of $\boldsymbol\Pi$) gets smaller as the relative position $i-j$ increases (figure 2). So if the system were to retrieve the position embeddings using a linear transform as described above, it could create an attention matrix that increasingly down-weights attention between elements as they become more distant when it computes the dot products.
Relative vs. absolute positions: We have added a unique embedding $\boldsymbol\pi_{i}$ at each absolute position $i$. However, it's possible to transform the embedding $\boldsymbol\pi_{i}$ at position $i$ to the embedding at position $i+j$ using a linear operation. To see this, consider the embedding $\left(\sin[\omega_{f}i]\;\;\cos[\omega_{f} i]\right)^{T}$ at word position $i$ for two adjacent dimensions $d$ and $d+1$. Applying the following linear transformation, we get:
\begin{eqnarray}
\begin{pmatrix}\cos[\omega_{f} j]&\sin[\omega_{f} j]\\-\sin[\omega_{f} j]&\cos[\omega_{f} j]\end{pmatrix}
\begin{pmatrix}\sin[\omega_{f} i]\\\cos[\omega_{f} i]\end{pmatrix} &=&\begin{pmatrix}
\cos[\omega_{f} j]\sin[\omega_{f} i]+ \sin[\omega_{f} j]\cos[\omega_{f} i]\\
-\sin[\omega_{f} j]\sin[\omega_{f} i]+\cos[\omega_{f} j]\cos[\omega_{f} i]\end{pmatrix}\nonumber \\ &=&
\begin{pmatrix}\sin[\omega_{f} (i+j)]\\\cos[\omega_{f} (i+j)]\end{pmatrix} \tag{4}
\end{eqnarray}
where we have used the trigonometric addition identities. So by applying the appropriate linear transformation, the system can transform the position encoding at position $i$ to that at position $i+j$. If it did this for just the queries, then the dot products between position vectors would take a maximum value at a relative offset of $j$ rather than 0.
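This shift property is easy to verify numerically. The following sketch applies the standard $2\times 2$ rotation matrix to a sinusoidal pair; the particular values of $\omega$, $i$, and $j$ are arbitrary.

```python
import numpy as np

# Check that a fixed 2x2 rotation maps the sinusoidal pair at position i
# to the pair at position i + j (trigonometric addition identities).
omega, i, j = 0.3, 5, 7

pair_i = np.array([np.sin(omega * i), np.cos(omega * i)])
rotation = np.array([[np.cos(omega * j), np.sin(omega * j)],
                     [-np.sin(omega * j), np.cos(omega * j)]])
pair_shifted = rotation @ pair_i
pair_target = np.array([np.sin(omega * (i + j)), np.cos(omega * (i + j))])

assert np.allclose(pair_shifted, pair_target)
```

Because the rotation depends only on the offset $j$ (not on $i$), a single linear map shifts every position encoding by the same amount at once.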
Note that all of the above is supposition; the trained network does not necessarily do any of these things. The point is that these capabilities are available to it if it chooses to use them.
We've seen that it's possible to use sinusoidal embeddings for which the linear projections and dot products have useful properties. An obvious next step is to learn the position embedding matrix $\boldsymbol\Pi$ during training. This approach was also tried in the original transformer paper and adopted by subsequent models like BERT and GPT-2.
The advantage of learning the position embeddings is that we can potentially capture more complex properties. The disadvantage is that it adds a lot of extra parameters to the model, and once learned, the model cannot be extended to longer sequence lengths.
It's interesting, however, to test whether the learned position embeddings capture the desirable properties of the sinusoidal embeddings. Wang and Chen (2020) compared the cosine similarities (closely related to dot products) between embeddings at different relative distances (figure 3). For GPT-2, the similarity of the embeddings decreases as a function of distance for small distances, with a periodic component at larger distances. For BERT, the results are noisier and more complicated.
They also examined whether it is possible to predict the absolute positions by applying linear regression to the learned embeddings. For the BERT embeddings, the error in these predictions is large; for the GPT-2 embeddings, it is very small; and for the sinusoidal embeddings, it is zero. The same experiment can be done by regressing pairs of position embeddings to predict relative position. Here, the error is again greatest for the BERT embeddings, but this time, the GPT-2 embeddings outperform the predefined sinusoidal embeddings.
Adding position embeddings modifies the selfattention calculation to:
\begin{equation}
\bf Sa [\mathbf{X}] = \bf Softmax\left[\frac{((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right](\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{v}. \tag{5}
\end{equation}
The position matrix modifies both the attention matrix (the softmax term) and the computation of the values. There have been a number of studies in which the latter modification is dropped so that just the attention matrix is changed:
\begin{equation}
\bf Sa [\mathbf{X}] = \bf Softmax\left[\frac{((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}. \tag{6}
\end{equation}
In these circumstances, the position information is usually added at every layer as it is only represented very implicitly in the output of the computation.
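A minimal sketch of the form in equation 6, where the position matrix modifies only the attention matrix and not the value path; function and variable names are illustrative, and the toy dimensions are ours.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sa_with_positions(X, Pi, Phi_q, Phi_k, Phi_v):
    """Self-attention as in equation 6: Pi is added before the query/key
    projections, but the values are computed from X alone."""
    Q = (X + Pi) @ Phi_q
    K = (X + Pi) @ Phi_k
    A = softmax(Q @ K.T / np.sqrt(Phi_q.shape[1]))   # I x I attention matrix
    return A @ (X @ Phi_v)                           # positions absent here

rng = np.random.default_rng(0)
I, D = 6, 8
X, Pi = rng.normal(size=(I, D)), rng.normal(size=(I, D))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(D, D)) for _ in range(3))
out = sa_with_positions(X, Pi, Phi_q, Phi_k, Phi_v)
```

Switching to the equation 5 form would simply replace the final `X` with `X + Pi` in the value projection.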
Let's consider the unnormalized, pre-softmax attention matrix:
\begin{equation}
\tilde{\mathbf{A}} = ((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}, \tag{7}
\end{equation}
which has elements:
\begin{eqnarray}
\tilde{a}_{i,j} &=& ((\mathbf{x}_{i}+\boldsymbol\pi_{i})\boldsymbol\Phi_{q})((\mathbf{x}_{j}+\boldsymbol\pi_{j})\boldsymbol\Phi_{k})^{T}\nonumber \\
&=& \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{content-position}}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_{\text{position-content}}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{position-position}},\label{eq:attention_breakdown} \tag{8}
\end{eqnarray}
where we can see that each element has four contributions in which the position embedding $\boldsymbol\pi$ and the content vector $\mathbf{x}$ interact differently. This expression has been modified in various ways.
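The four-term decomposition of equation 8 follows from bilinearity, and is easy to check numerically; a small sketch with random toy vectors (names illustrative):

```python
import numpy as np

# Check that the pre-softmax attention element splits into the
# content-content, content-position, position-content and
# position-position terms of equation 8.
rng = np.random.default_rng(1)
D = 8
x_i, x_j, pi_i, pi_j = rng.normal(size=(4, D))
Phi_q, Phi_k = rng.normal(size=(D, D)), rng.normal(size=(D, D))

M = Phi_q @ Phi_k.T
full = (x_i + pi_i) @ M @ (x_j + pi_j)
four_terms = (x_i @ M @ x_j      # content-content
              + x_i @ M @ pi_j   # content-position
              + pi_i @ M @ x_j   # position-content
              + pi_i @ M @ pi_j) # position-position

assert np.isclose(full, four_terms)
```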
Untied embeddings: One simple modification is to decouple or untie the content and position components rather than add them together before projection. A simple way to do this is to remove the terms where they interact and to use a separate linear transform for each to give:
\begin{equation}
\tilde{a}_{i,j} = \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Psi_q\boldsymbol\Psi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{position-position}}. \tag{9}
\end{equation}
Relative embeddings: Another modification is to directly inject information about the relative position. For example, Shaw et al. (2018) add a term $\boldsymbol\pi_{i-j}$ which depends on the position difference:
\begin{equation}\label{eq:rel_pos_shaw}
\tilde{a}_{i,j} = \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\pi_{i-j}^{T}}_{\text{content-position}}, \tag{10}
\end{equation}
where a different position vector $\boldsymbol\pi_{i-j}$ is learned for each signed position offset $i-j$, and this offset is usually clipped so that beyond a certain distance, all terms are the same. Note that this position vector is defined directly in the space of the keys rather than projected into it$^{1}$.
Raffel et al. (2019) simplified this further by simply adding a learnable scalar $\pi_{i-j}$ to the attention matrix:
\begin{equation}
\tilde{a}_{i,j} = \underbrace{\left(\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}\right)}_\text{content-content} + \pi_{i-j}, \tag{11}
\end{equation}
where $\pi_{i-j}$ is a different scalar for each signed offset $i-j$. Relative position information has also been incorporated in various other ways, such as simply multiplying the attention values by a modifying factor $\pi_{i-j}$:
\begin{equation}
\tilde{a}_{i,j} = \underbrace{\left(\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}\right)}_\text{content-content}\cdot \pi_{i-j}, \tag{12}
\end{equation}
where $\pi_{i-j}$ is a different scalar for each absolute offset $|i-j|$.
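To illustrate the scalar bias of equation 11, here is a sketch of assembling the $I\times I$ additive bias from a vector of per-offset scalars; the helper name, the clipping, and the indexing layout are our assumptions, not the paper's API.

```python
import numpy as np

def relative_bias(I, pi, clip):
    """Build the I x I additive bias of equation 11: one learned scalar per
    signed offset i - j, clipped beyond +/- clip. The vector pi holds the
    2 * clip + 1 scalars, indexed from offset -clip (assumed layout)."""
    offsets = np.arange(I)[:, None] - np.arange(I)[None, :]   # matrix of i - j
    offsets = np.clip(offsets, -clip, clip)                   # clip distant offsets
    return pi[offsets + clip]                                 # look up scalars

pi = np.arange(9, dtype=float)     # stand-in for the learned scalars, clip = 4
B = relative_bias(6, pi, clip=4)   # B[i, j] depends only on (clipped) i - j
```

In practice `B` would simply be added to the content-content term before the softmax.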
Finally, we note that predefined sinusoidal embeddings have also been used in a system based on equation 10 (where $\boldsymbol\pi_{i-j}$ now contains sinusoidal terms in the relative position $i-j$) and also in more complex ways.
Combining ideas: Many schemes have proposed position embeddings that combine the ideas of (i) only retaining certain terms from equation 8, (ii) using different projection matrices for the content and position embeddings, and (iii) using relative embeddings. For example, DeBERTa uses:
\begin{equation}
\tilde{a}_{i,j} =
\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+
\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Psi_{k}^{T}\boldsymbol\pi_{i-j}^{T}}_{\text{content-position}}+
\underbrace{\boldsymbol\pi_{j-i}\boldsymbol\Psi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_{\text{position-content}}, \tag{13}
\end{equation}
where they drop the position-position term and have a different relative embedding $\boldsymbol\pi_{i-j}$ for each signed offset $i-j$ between the positions.
In this section we have provided a brief overview of how position information is added into transformers. At the time of writing, it is not clear which of these position embedding schemes is empirically superior. For downstream tasks on BERT, relative position embeddings generally perform better than absolute position embeddings, but there does not seem to be much difference between sinusoidal embeddings and learned embeddings. To learn more about position embeddings, consult this survey paper.
In the second part of this blog, we consider modifications to the self-attention mechanism that make it more efficient as the sequence length increases. The self-attention mechanism takes $I$ inputs $\mathbf{x}_{i}$ and returns $I$ modified outputs. In this process, the inputs all interact with one another; each output is a weighted sum of the values corresponding to every input, where the weights depend on how much the input attends to every other input. As such, the transformer naturally has quadratic complexity in the size $I$ of the input sequence.
However, there are some situations in which we might expect this input set to be extremely large. In NLP, we may wish to summarize long documents or answer questions about a body of documents. In other modalities like vision or audio processing, the data can also be of extremely high dimension. In these circumstances, the quadratic complexity of the attention mechanism can become the limiting factor and a subfield has emerged that tries to address this bottleneck.
In this section, we review three lines of work. First, we discuss methods that aim to reduce the size of the attention matrix. Second, we review approaches that introduce sparsity into the attention matrix. Finally, we present methods that treat the self-attention computation as a kernel function and try to approximate it to create algorithms with linear complexity in the sequence length.
One simple idea for making self-attention more efficient is to reduce the size of the attention matrix. In memory-compressed attention, a strided convolution is applied to the keys and values, so the self-attention operation becomes:
\begin{equation}
\bf Sa[\mathbf{X}] = \bf Softmax\left[\mathbf{X}\boldsymbol\Phi_{q}(\boldsymbol\theta_{k}\circledast\mathbf{X}\boldsymbol\Phi_{k})^{T} \right](\boldsymbol\theta_{v}\circledast\mathbf{X}\boldsymbol\Phi_{v}), \tag{14}
\end{equation}
where $\boldsymbol\theta_{k}$ and $\boldsymbol\theta_{v}$ are the convolution kernels. If the stride $s$ is the same as the kernel size, then the effect is to take a learned weighted average of nearby key/value vectors and the resulting attention matrix reduces to size $I\times I/s$ (figure 5).
The Linformer applies a very similar trick, motivated by the observation that the self-attention matrix is often low-rank in practice. Consequently, we can reduce the complexity of the calculation by projecting the keys and values into a learned subspace:
\begin{equation}
\bf Sa[\mathbf{X}] = \bf Softmax\left[\mathbf{X}\boldsymbol\Phi_{q}(\boldsymbol\Psi_{k}\mathbf{X}\boldsymbol\Phi_{k})^{T} \right](\boldsymbol\Psi_{v}\mathbf{X}\boldsymbol\Phi_{v}), \tag{15}
\end{equation}
where $\boldsymbol\Psi_{k}$ and $\boldsymbol\Psi_{v}$ are the $I/s\times I$ projection matrices for the keys and values respectively.
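A sketch of the Linformer-style computation of equation 15 with toy dimensions; the function name, the random projection matrices, and the $\sqrt{d_q}$ scaling (as in equation 2, which equation 15 omits for brevity) are our choices.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def linformer_sa(X, Phi_q, Phi_k, Phi_v, Psi_k, Psi_v):
    """Equation 15: keys and values are projected from length I down to
    length I/s before attention, so the attention matrix is I x (I/s)."""
    Q = X @ Phi_q                                 # I x D queries
    K = Psi_k @ (X @ Phi_k)                       # (I/s) x D compressed keys
    V = Psi_v @ (X @ Phi_v)                       # (I/s) x D compressed values
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))    # I x (I/s) attention
    return A @ V                                  # back to I x D outputs

rng = np.random.default_rng(2)
I, s, D = 16, 4, 8
X = rng.normal(size=(I, D))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(D, D)) for _ in range(3))
Psi_k, Psi_v = rng.normal(size=(I // s, I)), rng.normal(size=(I // s, I))
out = linformer_sa(X, Phi_q, Phi_k, Phi_v, Psi_k, Psi_v)
```

The cost of the attention matrix drops from $O(I^{2})$ to $O(I^{2}/s)$ at the price of the fixed projections $\boldsymbol\Psi_{k}$, $\boldsymbol\Psi_{v}$.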
Another approach to making attention more computationally efficient is to constrain the attention computation so that every input does not attend to every other input. In local attention, the inputs are divided into disjoint groups of neighbours, and each block is passed through a separate self-attention mechanism before recombining (figure 6). In this way, inputs within the same block only attend to one another. Of course, this has the disadvantage that elements that are far from each other in the sequence never interact with one another, but alternating transformer layers that use local and full attention solves this problem.
Local attention can be visualized by plotting a matrix showing the interactions between the queries and keys (figure 6). Note that for the decoder version, we also employ masked self-attention, so each query can only attend to keys with an index less than or equal to its own, and there are no interactions in the upper-triangular portion.
Visualizing attention in this way leads naturally to the idea of using a convolutional structure (figure 7), in which each input only interacts with the nearest few inputs (or the nearest preceding inputs for decoders). When used alone, this means it may take many layers for information to propagate along the sequence. Again, this drawback can be remedied by alternating layers with the convolutional attention pattern and layers with full attention. Indeed, this is what is done in GPT-3. A different approach that maintains the overall sparsity is to use dilated convolutions with different dilation rates in different layers (figure 7b-c), or to introduce layers where a few of the queries interact with every key (figure 7d). Collectively, these methods are referred to as sparse transformers.
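These sparsity patterns can be represented as 0/1 masks on the attention matrix. Here is a sketch of the convolutional (banded) pattern; the helper name `banded_mask` and its parameters are illustrative, not from any library.

```python
import numpy as np

def banded_mask(I, w, causal=False):
    """0/1 mask in which query i attends only to keys within w positions
    (the convolutional sparsity pattern); with causal=True, only to
    preceding positions, as in a decoder."""
    i = np.arange(I)[:, None]
    j = np.arange(I)[None, :]
    mask = np.abs(i - j) <= w          # band of width w around the diagonal
    if causal:
        mask &= j <= i                 # zero out the upper triangle
    return mask.astype(int)

M = banded_mask(6, 1)
```

In practice, disallowed entries are set to $-\infty$ before the softmax so that they receive zero attention weight.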
The Longformer also used a convolutional structure, which is sometimes dilated, but simultaneously allowed some keys and queries to interact with all of the others (figure 9a). This was referred to as global attention; the global positions correspond to special tokens such as the $<$cls$>$ token in BERT or the special tokens in question-answering tasks that delimit the question and answer. Note that global attention can only be used in encoder models, since these elements attend to every other element and hence see ahead in the sequence.
A natural extension of this method is to define new content embeddings that attend to all of the keys and queries but do not themselves correspond to any individual tokens in the input (figure 9). This is known as the extended transformer construction (ETC). These additional global content embeddings act as a kind of memory, which can both receive and broadcast information from all of the elements, and they are combined with a sparse convolutional pattern that ensures strong interactions between nearby inputs. The BigBird model took this idea one step further by also adding sparse random connections between elements to help ensure the rapid mixing of information from different parts of the sequence.
One notable complication of using global content embeddings occurs when they are combined with relative attention; there is no meaningful relative offset between the global and regular elements, and so special relative position embeddings are learned for mapping to, from, and between the global content embeddings.
In this section, we have reviewed approaches that make self-attention more efficient by limiting the interaction between different inputs. Note that all of these methods use predefined sparsity patterns. There is also another line of research that attempts to learn the sparsity pattern; this includes the routing transformer, the reformer, and the Sinkhorn transformer.
A third approach to making self-attention more efficient is to approximate the attention computation using kernel methods. The premise is that the dot-product attention for the $i^{th}$ query can be thought of as a special case of the following computation:
\begin{equation}
\mathbf{x}_{i}^{\prime} = \frac{\sum_{j=1}^{I}\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{I}\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]} \tag{16}
\end{equation}
where $\mbox{sim}[\bullet,\bullet]$ returns a measure of similarity between its two arguments. For dot-product self-attention, this is defined as $\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}] = \exp[\mathbf{x}_{i}\boldsymbol\Phi_{q}(\mathbf{x}_{j}\boldsymbol\Phi_{k})^{T}]$.
We now treat this similarity as a kernel function, and as such, it can be expressed as the dot product of nonlinear transformations $\bf z[\bullet]$ of the inputs:
\begin{equation}
\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}] = \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}, \tag{17}
\end{equation}
which means that the output becomes:
\begin{eqnarray}
\mathbf{x}_{i}^{\prime} &=& \frac{\sum_{j=1}^{I}\bf z [\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z [\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{I}\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}\nonumber \\
&=&\frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{I}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{I}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}, \tag{18}
\end{eqnarray}
where we have used the associativity property of matrix multiplication between the first and second lines.
If we could find $\bf z[\bullet]$ such that $\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T} = \exp[\mathbf{x}_{i}\boldsymbol\Phi_{q}(\mathbf{x}_{j}\boldsymbol\Phi_{k})^{T}]$, then this would be much more efficient: we compute the terms in the sums once and then compute each output $\mathbf{x}_{i}^{\prime}$ separately with a matrix multiplication. It turns out that such a nonlinear transformation $\bf z[\bullet]$ does indeed exist, but unfortunately, it maps its argument to an infinite-dimensional space. From a computational viewpoint, this is not very helpful!
We'll describe two approaches that sidestep this problem. First, the linear transformer implicitly uses a different measure of similarity $\bf sim[\mathbf{a},\mathbf{b}] = \bf z[\mathbf{a}]\bf z[\mathbf{b}]^{T}$ by defining a function $\bf z[\bullet]$ that is more tractable. In particular, it uses $\bf z[\mathbf{a}] = \bf elu[\mathbf{a}]+1$, where $\bf elu[\bullet]$ is the exponential linear unit, a pointwise nonlinearity. Second, the performer attempts to approximate the standard dot-product similarity using a finite-dimensional mapping $\bf z[\bullet]$. The latter approach is empirically more successful, but this may be because the tricks for training transformers (see part III of this blog) do not transfer effectively when a different similarity measure is used.
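A sketch of the linear-transformer computation of equation 18 with the elu$+1$ feature map; names and toy dimensions are ours. The key point is that the sums over $j$ are computed once and shared across all queries.

```python
import numpy as np

def z(a):
    """Feature map used by the linear transformer: elu(a) + 1, pointwise."""
    return np.where(a > 0, a + 1.0, np.exp(a))

def linear_attention(X, Phi_q, Phi_k, Phi_v):
    """Equation 18 with sim[a, b] = z[a] z[b]^T: key/value sums are shared
    across queries, giving cost linear in the sequence length."""
    Q, K, V = z(X @ Phi_q), z(X @ Phi_k), X @ Phi_v
    S = K.T @ V                          # sum_j z[x_j Phi_k]^T x_j Phi_v
    n = K.sum(axis=0)                    # sum_j z[x_j Phi_k]^T
    return (Q @ S) / (Q @ n)[:, None]    # one matrix multiply per query

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = linear_attention(X, Phi_q, Phi_k, Phi_v)
```

Because $\bf z[\bullet]$ is strictly positive, the normalizing denominator is always positive, so the division is safe.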
These approaches can also be adapted to decoders. Here, when we calculate the output corresponding to input $\mathbf{x}_{i}$, we only use the partial sums up to index $i$:
\begin{eqnarray}
\mathbf{x}_{i}^{\prime} &=& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{i}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{i}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}} \nonumber \\
&=& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{A}_{i}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{b}_i}, \tag{19}
\end{eqnarray}
where $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$ represent the partial sums in the numerator and denominator respectively. If we initialize $\mathbf{A}_{0}$ and $\mathbf{b}_{0}$ to zero, then we can compute all the terms efficiently by iterating:
\begin{eqnarray}\label{eq:transformer_rnn}
\mathbf{A}_{i}&\leftarrow&\mathbf{A}_{i-1}+ \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{i}\boldsymbol\Phi_{v}\nonumber \\
\mathbf{b}_{i}&\leftarrow&\mathbf{b}_{i-1}+ \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{k}]^{T}\nonumber \\
\mathbf{x}_{i}^{\prime}&\leftarrow& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{A}_{i}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{b}_i}. \tag{20}
\end{eqnarray}
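A sketch of this recurrence, again using the elu$+1$ feature map of the linear transformer (our choice; any positive feature map would do). The running sums $\mathbf{A}$ and $\mathbf{b}$ act like an RNN hidden state.

```python
import numpy as np

def z(a):
    return np.where(a > 0, a + 1.0, np.exp(a))   # elu(a) + 1, one tractable choice

def recurrent_decoder_attention(X, Phi_q, Phi_k, Phi_v):
    """Masked linear attention computed via the recurrence of equation 20."""
    Q, K, V = z(X @ Phi_q), z(X @ Phi_k), X @ Phi_v
    A = np.zeros((K.shape[1], V.shape[1]))       # running numerator sum
    b = np.zeros(K.shape[1])                     # running denominator sum
    out = np.empty_like(V)
    for i in range(X.shape[0]):
        A = A + np.outer(K[i], V[i])             # A_i = A_{i-1} + z[x_i Phi_k]^T x_i Phi_v
        b = b + K[i]                             # b_i = b_{i-1} + z[x_i Phi_k]^T
        out[i] = (Q[i] @ A) / (Q[i] @ b)
    return out

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = recurrent_decoder_attention(X, Phi_q, Phi_k, Phi_v)
```

The output matches the batch computation with a causal (lower-triangular) mask, but the state carried between steps has fixed size, independent of the sequence length.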
In conclusion, if we consider the interaction between the queries and keys to be a kernel function, we can replace this by the dot product of nonlinear functions of the key and query. This leads naturally to a very efficient implementation for both encoder and decoder architectures.
In this section, we have reviewed three families of modifications that allow the self-attention mechanism to be extended to longer sequences without a quadratic increase in computation. To learn more about this area, consult this review paper.
In the previous sections, we have addressed the questions of how to encode position, and how to extend the transformer to longer sequence lengths. In this section, we shift gears and consider the relationship between the selfattention mechanism and other models. We'll also consider alternatives to the selfattention mechanism.
The first connection that we will draw is between the self-attention decoder and recurrent neural networks (RNNs). In the final part of the previous section, we reinterpreted the dot-product self-attention mechanism in terms of a kernel function $\mbox{k}[\bullet, \bullet]$:
\begin{equation}
\mathbf{x}_{i}^{\prime} = \frac{\sum_{j=1}^{i}\mbox{k}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{i}\mbox{k}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]} = \frac{\sum_{j=1}^{i} \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{i} \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}. \tag{21}
\end{equation}
This means that the kernel function can be replaced by the dot product of nonlinear functions $\bf z[\bullet]$ of the queries and keys, which led to the iterative computation in equation 20.
Viewed in this light, the decoder has an obvious mapping to an RNN: each element is processed sequentially, and the quantities $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$ from equation 20 form the hidden state (figure 10). However, it turns out that exactly replicating dot-product self-attention requires the function $\bf z[\bullet]$ to map its arguments to an infinite-dimensional space. Hence, it is perhaps unsurprising that the transformer architecture outperforms the RNN in practice.
A hypernetwork is a network that predicts the parameters of a second network that then performs the main task at hand. In part I of this tutorial, we already saw that the attention matrix can be interpreted as forming the weights of a network that maps the values to the outputs (figure 11). These weights are (i) non-negative, (ii) sparse (there is no interaction between the different dimensions of the values), and (iii) shared (the same weight is used for every dimension of the interaction between the $i^{th}$ value and the $j^{th}$ output). As such, they form a hypernetwork with a particular structure.
Viewed from this perspective, we might consider mechanisms other than dot-product self-attention for creating these weights (figure 12). The synthesizer uses a multilayer perceptron $\bf MLP[\bullet]$ to create each row of the $I\times I$ matrix from the input $\mathbf{x}_{i}$. Each row is then passed through the softmax function to create the attention weights:
\begin{eqnarray}
\mbox{Synthesizer}\left[\mathbf{X} \right] &=&\bf Softmax\left[\bf MLP[\mathbf{X}]\right] \mathbf{X}\boldsymbol\Phi_{v}\nonumber \\
&=&\bf Softmax\left[\bf Relu[\mathbf{X}\boldsymbol\Phi_{1}]\boldsymbol\Phi_{2}\right] \mathbf{X}\boldsymbol\Phi_{v}\nonumber
\end{eqnarray}
This is interesting since the rows of the attention matrix are no longer computed from similarities between pairs of tokens, but from each individual token alone. Surprisingly, it seems to work comparably well to the original dot-product self-attention mechanism.
A similar idea can be used to generate an attention matrix with convolutional structure. This belongs to the family of dynamic convolutions in which the convolution weights are themselves determined by the data. Part of the network block in the paper Pay less attention uses this approach. One advantage of this scheme is that there is no need for a position encoding; the convolution weights are determined by all of the inputs, and if we permute them, the result will be different.
Finally, it should be noted that linear transformers are also closely related to fast weight memory systems which are intellectual forerunners of hypernetworks.
A different way to think about self-attention is as a routing network. The attention matrix distributes (routes) each of the $I$ computed value vectors to the $I$ outputs. From this viewpoint, there is a connection between self-attention and capsule networks. Roughly speaking, a capsule network is intended to capture hierarchical relations in images, so lower network levels might detect facial parts (noses, mouths), which are then combined (routed) in higher-level capsules that represent a face. One major difference is that capsule networks use routing by agreement. In self-attention, the elements $\mathbf{x}_{i}$ compete with each other for how much they contribute to output $j$ (via the softmax operation). In capsule networks, the higher levels of the network compete with each other for inputs from the lower levels.
Once we consider self-attention as a routing network, we can ask whether it is necessary to make this routing dynamic (i.e., dependent on the data). Another variant of the synthesizer removed the dependence of the attention matrix on the inputs entirely and either used predetermined random values or learned values (figure 13a). This performed surprisingly well across a variety of tasks.
Graph convolutional networks consider each input vector $\mathbf{x}_{i}$ to be associated with a node of a known graph and process these nodes through a series of layers in which each node interacts with its neighbours. As such, they have a close relationship to self-attention; they can be viewed as routing networks, but here the routing is determined by the adjacency matrix of the graph (figure 13b) and not by the data.
Graph attention networks (figure 13c) combine both mechanisms; the routing depends both on the data (although using additive attention rather than dot-product attention) and on the graph structure (which is used to mask the attention matrix, in a similar way to masked self-attention in decoders).
Returning to the original self-attention mechanism, it is now clear that it can be viewed as a graph neural network on the complete graph, where the query tokens are the destination nodes and the key and value tokens are the source nodes.
Linear convolutions of the neighboring inputs in the sequence can be considered a special case of multi-head dot-product self-attention with relative position embeddings. For example, consider using additive position embeddings so that the overall self-attention mechanism is given by:
\begin{equation}
{\bf Sa}[\mathbf{X}] =\bf Softmax\left[(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}+\boldsymbol\Pi\right]\mathbf{X}\boldsymbol\Phi_{v}, \tag{22}
\end{equation}
where the matrix $\boldsymbol\Pi$ has a different learned value $\pi_{i-j}$ for each offset $i-j$. Now consider setting $\boldsymbol\Phi_{q}=\boldsymbol\Phi_k = \mathbf{0}$ and $\boldsymbol\Phi_{v}=\mathbf{I}$ to yield:
\begin{equation}
{\bf Sa}[\mathbf{X}] =\bf Softmax\left[\boldsymbol\Pi\right]\mathbf{X}\nonumber.
\end{equation}
If we now choose the relative position contributions $\pi_{i-j}$ to be very large for a single offset and small for all of the others, the overall effect will be to create an attention matrix with zeros everywhere except along a single diagonal displaced from the center by that offset, where the values will be one. When applied to the data $\mathbf{X}$, this has the effect of shifting the rows of the value matrix by the offset. In a multi-head attention context, each head could learn a different offset. When the outputs of these heads are recombined using:
\begin{equation}
{\bf MhSa}[\mathbf{X}] = \left[{\bf Sa}_{1}[\mathbf{X}]\;{\bf Sa}_{2}[\mathbf{X}]\;\ldots\;{\bf Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}, \tag{23}
\end{equation}
it is possible to choose $\boldsymbol\Phi_{c}$ so that all of the outputs from the $h^{th}$ self-attention mechanism receive the same weight, and so we have effectively performed a convolution on the rows of $\mathbf{X}$.
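The shift argument above can be checked numerically. In the following sketch the query/key contributions are zeroed out and a single diagonal of $\boldsymbol\Pi$ is given a large value; the specific constants are arbitrary.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With Phi_q = Phi_k = 0 and Phi_v = I, attention reduces to Softmax[Pi] X.
# A very large value on one diagonal of Pi turns the softmax rows into
# one-hot vectors, so the output is X with its rows shifted: a convolution
# with a one-hot kernel. (Rows with no entry on that diagonal fall back to
# a uniform average over all positions.)
I, offset, big = 6, 2, 1e4
i, j = np.arange(I)[:, None], np.arange(I)[None, :]
Pi = np.where(i - j == offset, big, 0.0)   # favour the key `offset` places back
A = softmax(Pi)
X = np.arange(I, dtype=float)[:, None]     # a toy value matrix
shifted = A @ X
```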
To summarize, it is possible for multi-head self-attention with relative position embeddings to simulate convolution. This is particularly interesting when the transformer is applied to vision problems, where convolutional networks are the standard. Indeed, there is some evidence that this is exactly what transformers are doing in vision tasks.
A notable characteristic of the self-attention mechanism and related models is that the processing divides into two paths, one of which is later used to modify the other. In attention, this modification takes the form of pre-multiplication by the attention matrix. However, there is another family of models that uses one path simply to modulate the magnitude of the other.
The gated linear unit (figure 14a) is an example of such a gating mechanism. A linear transformation $\boldsymbol\Phi_{u1}$ is applied to the input $\mathbf{X}$, and the result is passed through a pointwise sigmoid function $\bf Sig[\bullet]$. This maps the results to values between zero and one so that they can be used to modulate the magnitude of the data $\mathbf{X}\boldsymbol\Phi_{u2}$ flowing down the other path, which has been subject to a different linear transformation. The whole function is hence:
\begin{equation}
\bf GLU[\mathbf{X}] = \bf Sig[\mathbf{X}\boldsymbol\Phi_{u1}]\odot \mathbf{X}\boldsymbol\Phi_{u2}. \tag{24}
\end{equation}
Although the architecture is superficially similar, this is not really equivalent to a transformer, as each input $\mathbf{x}_{i}$ (row of $\mathbf{X}$) is treated independently. The gated MLP addresses this by modifying the architecture to incorporate a learned linear transformation $\boldsymbol\Psi$ that combines together the different inputs:
\begin{equation}
\bf GMLP[\mathbf{X}] = (\bf Sig[\mathbf{X}\boldsymbol\Phi_{u1}]\odot \boldsymbol\Psi\mathbf{X}\boldsymbol\Phi_{u2})\boldsymbol\Phi_{v}. \tag{25}
\end{equation}
There is also a final linear transform $\boldsymbol\Phi_{v}$ that remaps the result to the original dimensionality. This model again has the advantage that it does not need a position encoding; the inputs are mixed using $\boldsymbol\Psi$, and if we permute their order, the output will not just be a permutation of the original output.
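A small NumPy sketch of both models may make the distinction concrete (the dimensions and random parameters are illustrative only). The key difference is whether the rows interact: permuting the inputs to the GLU just permutes its outputs, whereas the $\boldsymbol\Psi$ term in the gated MLP mixes rows together.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 8
X = rng.normal(size=(N, D))
Phi_u1 = rng.normal(size=(D, D))
Phi_u2 = rng.normal(size=(D, D))
Psi = rng.normal(size=(N, N))      # mixes the rows (inputs) together
Phi_v = rng.normal(size=(D, D))

def sig(z):
    # Pointwise sigmoid, mapping values to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def glu(X):
    # Equation 24: one path gates the magnitude of the other.
    return sig(X @ Phi_u1) * (X @ Phi_u2)

def gmlp(X):
    # Equation 25: Psi couples the rows, so inputs are no longer independent.
    return (sig(X @ Phi_u1) * (Psi @ X @ Phi_u2)) @ Phi_v
```

Reversing the row order of `X` simply reverses the rows of `glu(X)`, but not of `gmlp(X)`, which is why the gated MLP does not need a position encoding.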
Finally, we'll consider the relationship between Hopfield networks and the attention mechanism. A Hopfield network can retrieve a stored memory based on a query via an iterative procedure in which the query is updated after interaction with the system. Hopfield networks were originally defined for binary vectors, but the modern Hopfield network extends the idea to continuous values.
Ramsauer et al. (2020) show that for a carefully defined Hopfield energy function, the update rule is equivalent to the self-attention mechanism. The most natural way to think of this is in terms of encoder-decoder attention. The decoder queries memories from the encoder network. If viewed as a Hopfield network, the query-key attention computes a single iteration of the memory retrieval. To complete the process, the output of the attention network should be fed back in as a new query until a stable state is reached (figure 15).
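The retrieval iteration can be sketched in a few lines. This is a simplification of the continuous Hopfield update, with an inverse temperature `beta` standing in for the scaling in the energy function; the parameter values here are illustrative.

```python
import numpy as np

def hopfield_retrieve(query, memories, beta=8.0, iters=20):
    # Iterate softmax attention of the query against the stored patterns
    # (rows of `memories`) until the retrieved vector stabilizes.
    x = np.asarray(query, dtype=float)
    for _ in range(iters):
        scores = beta * memories @ x            # similarity to each memory
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                      # softmax over memories
        x = memories.T @ attn                   # attention-weighted readout
    return x

memories = np.array([[1.0, 0.0], [0.0, 1.0]])
retrieved = hopfield_retrieve([0.8, 0.2], memories)
```

With a sufficiently peaked softmax, the iteration converges toward the stored pattern closest to the initial query.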
In this blog, we have discussed extensions to the basic self-attention mechanism. First, we discussed how to incorporate positional information, and then how to extend the self-attention mechanism to longer sequences. Finally, we discussed the relationship between self-attention and a number of other models, including RNNs, CNNs, graph convolutional networks, and Hopfield networks. We note that some caution is required here: recent work has suggested that many of the variations on the original model do not necessarily yield consistent performance benefits.
In part III of this blog, we discuss how to train transformers in practice. To make training stable, a number of tricks are required, including unusual learning rate schedules, various forms of normalization, and careful initialization.
^{1} In fact, they also modified the value terms in a similar way, although their ablation study suggested that this did not contribute much.
In the technical demo, users get to see this interactive system at work.
The value proposition of a project like this is about democratizing data-driven insights by enabling non-technical users to interact with structured data, using natural language.
"Today, a lot of potentially useful knowledge and insights is trapped in databases, and only technical users can access that information, typically by using SQL. Turing by Borealis AI’s database interface unlocks these insights for non-technical users, who can query the multitude of databases using natural language and get the results and insights they need."
- Yanshuai Cao, Senior Research Team Lead at Borealis AI
Turing by Borealis AI comes closer than most technology available today, achieving and holding state-of-the-art performance levels while reducing accuracy issues. Cross-domain text-to-SQL semantic parsers generally have serious accuracy and usability problems, making practical applications a challenge. Unlike in online search, where approximate answers can be good enough, when users query relational databases to glean specific insights, a high degree of accuracy is needed to provide value. With Turing by Borealis AI’s technology, a user can look at multiple hypotheses and, with the help of the explanations Turing by Borealis AI provides, figure out which of the SQL queries comes closest to the search intent.
Here’s a sample use case: let's say a non-technical user is in the business of delivering supplies to gas stations. The user wants to query available databases and find out which stations to contact next in order to grow the business. How would the user get these business insights without relying on SQL to do the search across available databases? With Turing by Borealis AI, users can start the search by picking the ‘gas station’ domain and ask: "What are the locations with gas stations owned by companies making over 100 billion in sales?" Under the hood, there is a deep learning model that treats the text-to-SQL problem as a graph-to-tree mapping and produces a SQL query, executing it against the database to return the results.
Turing by Borealis AI generates SQL and uses a synchronous context-free grammar system to provide a high-precision explanation, so that users can make sure the results are trustworthy and match the intent.
Learn more about cross-database text-to-SQL in this blog, with further details on Turing by Borealis AI in this paper and here.
The team is presenting Turing by Borealis AI and related works: two main conference papers, one demo paper, and one workshop paper at the joint conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) on August 1-6, 2021. The team is also aiming to release the core of its semantic parsing system at that time.
The current dominant paradigm in natural language processing is to build enormous language models based on the transformer architecture. Models such as GPT-3 contain billions of parameters, which collectively describe joint statistics of spans of text, and have been extremely successful over a wide range of tasks.
However, these models do not explicitly take advantage of the structure of language; native speakers understand that a sentence can be syntactically valid even if it is meaningless. Consider how Colorless green ideas sleep furiously feels like valid English, whereas Furiously sleep ideas green colorless does not^{1}. This structure is formally described by a grammar, which is a set of rules that can generate an infinite number of sentences, all of which sound right, even if they mean nothing.
In this blog, we review earlier work that models grammatical structure. We introduce the CYK algorithm which finds the underlying syntactic structure of sentences and forms the basis of many algorithms for linguistic analysis. The algorithms are elegant and interesting for their own sake. However, we also believe that this topic remains important in the age of large transformers. We hypothesize that the future of NLP will consist of merging flexible transformers with linguistically informed algorithms to achieve systematic and compositional generalization in language processing.
Our discussion will focus on context-free grammars, or CFGs. These provide a mathematically precise framework in which sentences are constructed by recursively combining smaller phrases, usually referred to as constituents.^{2} Sentences under a CFG are analyzed through a tree-structured derivation in which the sentence is recursively generated phrase by phrase (figure 1).
The problem of recovering the underlying structure of a sentence is known as parsing. Unfortunately, natural language is ambiguous, and so there may not be a single possible meaning; consider the sentence I saw him with the binoculars. Here, it is unclear whether the subject or the object of the sentence holds the binoculars (figure 2). To cope with this ambiguity, we will need weighted and probabilistic extensions to the context-free grammar (referred to as WCFGs and PCFGs respectively). These allow us to compute a number that indicates how "good" each possible interpretation of a sentence is.
In Part I of this series of three blogs, we introduce the notion of a context-free grammar and consider how to parse sentences under this grammar. We then describe the CYK recognition algorithm, which identifies whether a sentence can be parsed under a given grammar. In Part II, we introduce the aforementioned weighted context-free grammars and show how the CYK algorithm can be adapted to compute different quantities, including the most likely sentence structure. In Part III, we introduce probabilistic context-free grammars, and we present the inside-outside algorithm, which efficiently computes the expected counts of the rules in the grammar over all possible analyses of a sentence. These expected counts are used in the E-step of an expectation-maximization procedure for learning the rule weights.
Before tackling these problems, we'll first discuss the properties of a parse tree (figure 3). The root of the tree is labelled as "sentence" or "start". The leaves or terminals of the tree contain the words of the sentence. The parents of these leaves are called preterminals and contain the part-of-speech (POS) categories of the words (e.g., verb, noun, adjective, preposition). Words are considered to be from the same category if a sentence is still syntactically valid when they are substituted. For example: The {sad, happy, excited, bored} person in the coffee shop. This is known as the substitution test. Above the preterminals, the word categories are collected together into phrases.
There are three more important things to notice. First, the verb phrase highlighted in magenta has three children. However, there is no theoretical limit to this number. We could easily add the prepositional phrases in the garden and under a tree and so on. The complexity of the sentence is limited in practice by human memory and not by the grammar itself.
Second, the grammatical structure allows for recursion. In this example, a verb phrase is embedded within a second verb phrase, which itself is embedded in a third verb phrase. Finally, we note that the parse tree disambiguates the meaning of the sentence. From a grammatical point of view, it could be that it was the bone that was enjoying every moment. However, it is clear that this is not the case, since the verb phrase corresponding to enjoying is attached to the verb phrase corresponding to eating and not the bone (see also figure 2).
In this section, we present a more formal treatment of contextfree grammars. In the following section, we'll elucidate the main ideas with an example.
A language is a set of strings. Each string is a sequence of terminal symbols. In figure 3 these correspond to individual words, but more generally they may be abstract tokens. The set of terminals $\Sigma=\{\mbox{a,b,c},\ldots\}$ is called an alphabet or lexicon. There is also a set $\mathcal{V}=\{\mbox{A,B,C},\ldots\}$ of nonterminals, one of which is the special start symbol $S$.
Finally, there is a set $\mathcal{R}$ of production or rewrite rules. These relate the nonterminal symbols to each other and to the terminals. Formally, the grammar rules form a finite relation $\mathcal{R}\subseteq \mathcal{V} \times (\Sigma \cup \mathcal{V})^*$, where $*$ denotes the Kleene star. Informally, this means that each grammar rule is an ordered pair where the first element is a nonterminal from $\mathcal{V}$ and the second is any possible string containing terminals from $\Sigma$ and nonterminals from $\mathcal{V}$. For example, B$\rightarrow$ab, C$\rightarrow$Baa and A$\rightarrow$AbCa are all production rules.
A context-free grammar is the tuple $G=\{\mathcal{V}, \Sigma, \mathcal{R}, S\}$ consisting of the nonterminals $\mathcal{V}$, terminals $\Sigma$, production rules $\mathcal{R}$, and start symbol $S$. The associated context-free language consists of all possible strings of terminals that are derivable from the grammar.
Informally, the term context-free means that the left-hand side of each production rule is a single nonterminal symbol. Context-free grammars are part of the Chomsky hierarchy of languages, which contains (in order of increasing expressiveness) regular, context-free, context-sensitive, and recursively enumerable grammars. Each differs in the family of production rules that are permitted and in the complexity of the associated parsing algorithms (table 1). As we shall see, context-free languages can be parsed in $O(n^{3})$ time, where $n$ is the number of observed terminals. Parsing the more expressive grammars in the Chomsky hierarchy is drastically more expensive. In fact, context-free grammars are not considered to be expressive enough to model real languages. Many other types of grammar have been invented that are both more expressive and parseable in polynomial time, but these are beyond the scope of this post.
| Language | Recognizer | Parsing complexity |
|---|---|---|
| Recursively enumerable | Turing machine | undecidable |
| Context-sensitive | Linear-bounded automaton | PSPACE |
| Context-free | Pushdown automaton | $O(n^3)$ |
| Regular | Finite-state automaton | $O(n)$ |
Table 1. The Chomsky hierarchy of languages. As the grammar type becomes simpler, the required computational model (recognizer) becomes less general and the parsing complexity decreases.
Consider the context-free grammar that generated the example in figure 4. Here, the set of nonterminals $\mathcal{V}=\{\mbox{VP, PP, NP, DT, NN, VBZ, IN,}\ldots\}$ contains the start symbol, phrases, and preterminals. The set of terminals $\Sigma=\{$The, dog, is, in, the, garden, $\ldots \}$ contains the words. The production rules in the grammar associated with this example include:
Of course, a full model of English grammar contains many more nonterminals, terminals, and rules than we observed in this single example. The main point is that the tree structure in figure 4 can be created by the repeated application of a finite set of rules.
Later on, we will describe the CYK recognition algorithm. This takes a sentence and a context-free grammar and determines whether there is a valid parse tree that can explain the sentence in terms of the production rules of the CFG. However, the CYK algorithm assumes that the context-free grammar is in Chomsky Normal Form (CNF). A grammar is in CNF if it only contains the following types of rules:
\begin{align}
\tag{binary nonterminal}
\text{A} &\rightarrow \text{B} \; \text{C} \\
\tag{unary terminal}
\text{A} &\rightarrow \text{a} \\
\tag{delete sentence}
\text{S} &\rightarrow \epsilon
\end{align}
where A,B, and C are nonterminals, a is a token, S is the start symbol and $\epsilon$ represents the empty string.
The binary nonterminal rule means that a nonterminal can create exactly two other nonterminals. An example is the rule $S \rightarrow \text{NP} \; \text{VP}$ in figure 4. The unary terminal rule means that a nonterminal can create a single terminal; the rule $\text{NN} \rightarrow \text{dog}$ in figure 4 is an example. The delete sentence rule allows the grammar to create empty strings, but in practice we avoid $\epsilon$-productions.
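These three rule shapes are easy to check mechanically. As a small illustration (the tuple-based rule encoding here is our own convention, not from the blog):

```python
def is_cnf(rules, nonterminals, start="S"):
    """Return True if every rule matches one of the three CNF shapes.

    Each rule is a pair (lhs, rhs), where rhs is a tuple of symbols;
    the empty tuple () encodes the empty string epsilon.
    """
    for lhs, rhs in rules:
        binary_nonterminal = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        unary_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
        delete_sentence = lhs == start and rhs == ()
        if not (binary_nonterminal or unary_terminal or delete_sentence):
            return False
    return True
```

For example, the rules $S \rightarrow \text{NP}\;\text{VP}$ and $\text{NN} \rightarrow \text{dog}$ pass the check, while a ternary rule does not.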
Notice that the parse tree in figure 3 is not in Chomsky Normal Form because it contains the rule $\text{VP} \rightarrow \text{VBG} \; \text{NP} \; \text{VP}$. For the case of natural language processing, there are two main tasks to convert a grammar to CNF: (i) binarizing rules whose right-hand side has more than two symbols by introducing new intermediate nonterminals, and (ii) removing unary nonterminal rules of the form $\text{A} \rightarrow \text{B}$ by collapsing them into the rules for their children.
Both of these operations introduce new nonterminals into the grammar. Indeed, in the former case, we may introduce different numbers of new nonterminals depending on which children we choose to combine. It can be shown that in the worst-case scenario, converting CFGs into an equivalent grammar in Chomsky Normal Form results in a quadratic increase in the number of rules. Note also that although the CNF transformation is the most popular, it is not the only, or even the most efficient, option.
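The binarization of rules with more than two children can be sketched in a few lines (an illustrative sketch; the "A|B_C" naming for the new nonterminals is one common convention, chosen here for clarity):

```python
def binarize(rules):
    """Binarize rules whose right-hand side has more than two symbols.

    Each rule is a pair (lhs, rhs_tuple). New nonterminals are named
    "lhs|sym1_sym2...", combining the remaining children left to right.
    """
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            rest = rhs[1:]
            new_nt = lhs + "|" + "_".join(rest)
            out.append((lhs, (rhs[0], new_nt)))  # peel off the first child
            lhs, rhs = new_nt, rest
        out.append((lhs, rhs))
    return out
```

Applied to the ternary rule from figure 3, $\text{VP} \rightarrow \text{VBG}\;\text{NP}\;\text{VP}$ becomes two binary rules via a new nonterminal.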
Given a grammar in Chomsky Normal Form, we can turn our attention to parsing a sentence. The parsing algorithm will return a valid parse tree like the one in figure 6 if the sentence has a valid analysis, or indicate that there is no such valid parse tree.
It follows that one way to characterize a parsing algorithm is that it searches over the set of all possible parse trees. A naive approach might be to exhaustively search through these trees until we find one that obeys all of the rules in the grammar and yields the sentence. In the next section, we'll consider the size of this search space, find that it is very large, and draw the conclusion that this bruteforce approach is intractable.
The parse tree of a sentence of length $n$ consists of a binary tree with $n-1$ internal nodes, plus another $n$ nodes connecting the preterminals to the terminals. The number of binary trees with $n$ internal nodes can be calculated via the recursion:
\begin{equation}
C_{n} = \sum_{i=0}^{n-1}C_{n-i-1}C_{i}. \tag{1}
\end{equation}
The intuition for this recursion is illustrated in figure 7. This series of integers is known as the Catalan numbers and can be written out explicitly as:
\begin{equation}
C_n = \frac{(2n)!}{(n+1)!n!}. \tag{2}
\end{equation}
Needless to say the series grows extremely fast:
\begin{equation}
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, \ldots \tag{3}
\end{equation}
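As a quick sanity check, both the recursion and the closed form are easy to verify numerically with a short Python sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def catalan_rec(n):
    # Equation 1: C_0 = 1 and C_n = sum_i C_{n-i-1} * C_i.
    if n == 0:
        return 1
    return sum(catalan_rec(n - i - 1) * catalan_rec(i) for i in range(n))

def catalan_closed(n):
    # Equation 2: C_n = (2n)! / ((n+1)! n!), computed via a binomial coefficient.
    return comb(2 * n, n) // (n + 1)
```

Both functions reproduce the series above, and the closed form makes it easy to evaluate large cases directly.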
Consider the example sentence I saw him with the binoculars. Here there are only $C_5=42$ possible trees, but these must be combined with the nonterminals in the grammar (figure 8). In this example, for each of the 42 trees, each of the six leaves must contain one of four possible parts of speech (DT, NN, P, VBD) and each of the five non-leaf nodes must contain one of four possible clause types (S, NP, VP, PP), so there are $42 \times 4^6 \times 4^5 = 176{,}160{,}768$ possible parse trees.
Even this minimal example had a very large number of possible explanations. Now consider that (i) the average sentence length written by Charles Dickens was 20 words, with an associated $C_{20}=6,564,120,420$ possible binary trees and (ii) that there are many more parts of speech and clause types in a realistic model of the English language. It's clear that there are an enormous number of possible parses and it is not practical to employ exhaustive search to find the valid ones.
The CYK algorithm (named after inventors John Cocke, Daniel Younger, and Tadao Kasami) was the first polynomial time parsing algorithm that could be applied to ambiguous CFGs (i.e., CFGs that allow multiple derivations for the same string). In its simplest form, the CYK algorithm solves the recognition problem; it determines whether a string $\mathbf{w}$ can be derived from a grammar $G$. In other words, the algorithm takes a sentence and a contextfree grammar and returns TRUE if there is a valid parse tree or FALSE otherwise.
This algorithm sidesteps the need to try every possible tree by exploiting the fact that a complete sentence is made by combining subclauses, or equivalently, a parse tree is made by combining subtrees. A tree is only valid if its subtrees are also valid. The algorithm works from the bottom of the tree upwards, storing possible valid subtrees as it goes and building larger subtrees from these components without the need to recalculate them. As such, CYK is a dynamic programming algorithm.
The CYK algorithm is just a few lines of pseudocode:
0  # Initialize data structure
1  chart[1...n, 1...n, 1...V] := FALSE
2
3  # Use unary rules to find possible parts of speech at preterminals
4  for p := 1 to n                  # start position
5    for each unary rule A -> w_p
6      chart[1, p, A] := TRUE
7
8  # Main parsing loop
9  for l := 2 to n                  # substring length
10   for p := 1 to n-l+1            # start position
11     for s := 1 to l-1            # split width
12       for each binary rule A -> B C
13         chart[l, p, A] := chart[l, p, A] OR
                             (chart[s, p, B] AND chart[l-s, p+s, C])
14
15 return chart[n, 1, S]
The algorithm is simple, but is hard to understand from the code alone. In the next section, we will present a worked example which makes this much easier to comprehend. Before we do that though, let's make some high level observations. The algorithm consists of four sections: initializing the data structure (lines 0-1), applying the unary rules at the preterminals (lines 3-6), the main parsing loop (lines 8-13), and the final check for the start symbol (line 15).
The complexity of the algorithm is easy to discern. Lines 9-13 contain three for loops depending on the sentence length $n$ (lines 9-11) and one more depending on the number of grammar rules $R$ (line 12). This gives us a complexity of $\mathcal{O}(n^3 \cdot R)$.
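For readers who prefer running code, here is a direct Python translation of the pseudocode (the dictionary-based grammar encoding is our own choice, and the toy rules below are merely sufficient to parse the running example, not the exact grammar of figure 8):

```python
from collections import defaultdict

def cyk(words, unary, binary, start="S"):
    # unary:  word -> set of preterminals A with a rule A -> word
    # binary: (B, C) -> set of nonterminals A with a rule A -> B C
    n = len(words)
    chart = defaultdict(set)            # chart[(l, p)] = derivable nonterminals
    for p in range(1, n + 1):           # substrings of length 1
        chart[(1, p)] = set(unary.get(words[p - 1], ()))
    for l in range(2, n + 1):           # substring length
        for p in range(1, n - l + 2):   # start position
            for s in range(1, l):       # split width
                for B in chart[(s, p)]:
                    for C in chart[(l - s, p + s)]:
                        chart[(l, p)] |= binary.get((B, C), set())
    return start in chart[(n, 1)]

unary = {"I": {"NP"}, "saw": {"VBD", "NN"}, "him": {"NP"},
         "with": {"P"}, "the": {"DT"}, "binoculars": {"NN"}}
binary = {("NP", "VP"): {"S"}, ("VBD", "NP"): {"VP"}, ("VP", "PP"): {"VP"},
          ("P", "NP"): {"PP"}, ("DT", "NN"): {"NP"}, ("NP", "PP"): {"NP"}}
```

Calling `cyk` on the example sentence returns True, while an ungrammatical word sequence returns False.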
To make the CYK algorithm easier to understand, we'll use the worked example of parsing the sentence I saw him with the binoculars. We already saw in figure 2 that this sentence has two possible meanings. We'll assume the minimal grammar from figure 8 that is sufficient to parse the sentence. In the next four subsections we'll consider the four parts of the algorithm in turn.
Figure 9 shows the chart for our example sentence, which is itself shown in an extra row under the chart. Each element in the chart corresponds to a substring of the sentence. The first index of the chart $l$ represents the length of that substring and the second index $p$ is the starting position. So, the element of the chart at position (4,2) represents the substring that is length four and starts at word two which is saw him with the. We do not use the upper triangular portion of the chart.
The CYK algorithm runs through each of the elements of the chart, starting with strings of length 1 and working through each position, then moving to strings of length 2, and so on, until we finally consider the whole sentence. This explains the loops in lines 9 and 10. The third loop considers possible binary splits of the strings and is indexed by $s$. For position (4,2), the string can be split into saw | him with the ($s=1$, blue boxes), saw him | with the ($s=2$, green boxes), or saw him with | the ($s=3$, red boxes).
Now that we understand the meaning of the chart and how it is indexed, let's run through the algorithm step by step. First, we deal with strings of length $l=1$ (i.e., the individual words). We run through each unary rule $A \rightarrow w_p$ in the grammar and set these elements to TRUE in the chart (figure 10). There is only one ambiguity here, which is the word saw, which could be a past tense verb or a noun. This process corresponds to lines 5-6 of the algorithm.
In the main loop, we consider substrings of increasing length starting with pairs of words and working up to the full length of the sentence. For each substring, we determine if there is a rule of the form $\text{A}\rightarrow \text{B}\;\text{C}$ that can derive it.
We start with strings of length $l=2$. These can obviously only be split in one way. For each position, we note in the chart all the nonterminals A that can be expanded to generate the parts of speech B and C in the boxes corresponding to the individual words (figure 11).
In the next outer loop, we consider substrings of length $l=3$ (figure 12). For each position, we search for a rule that can derive the three words. However, now we must also consider two possible ways to split the length-3 substring. For example, for position $(3,2)$ we attempt to derive the substring saw him with. This can be split as saw him | with, corresponding to positions (2,2) and (1,4), which contain VP and P respectively. However, there is no rule of the form $\text{A}\rightarrow\text{VP}\;\text{P}$. Likewise, there is no rule that can derive the split saw | him with, since there was no rule that could derive him with. Consequently, we leave position $(3,2)$ empty. However, at position $(3,4)$, the rule $\text{PP}\rightarrow \text{P}\;\text{NP}$ can be applied, as discussed in the legend of figure 12.
We continue this process, working upwards through the chart for longer and longer substrings (figure 13). For each substring length, we consider each position and each possible split and add nonterminals to the chart where we find an applicable rule. We note that position $(5,2)$ in figure 13b corresponding to the substring saw him with the binoculars is particularly interesting. Here there are two possible rules $\text{VP}\rightarrow\text{VP}\;\text{PP}$ and $\text{VP}\rightarrow\text{VBD}\;\text{NP}$ that both come to the conclusion that the substring can be derived by the nonterminal VP. This corresponds to the original ambiguity in the sentence.
When we reach the topmost row of the chart ($l=6$), we are considering the whole sentence. At this point, we discover whether the start symbol $S$ can be used to derive the entire string. If there is such a rule, the sentence is valid under the grammar, and if there isn't, then it is not. This corresponds to the final line of the CYK algorithm pseudocode. For this example, we use the rule $S\rightarrow \text{NP}\;\text{VP}$ to explain the entire string with the noun phrase I and the verb phrase saw him with the binoculars, and conclude that the sentence is valid under this context-free grammar.
The basic CYK algorithm just returns a binary variable indicating whether the sentence can be parsed or not under a grammar $G$. Often we are interested in retrieving the parse tree(s). Figure 14 superimposes the paths that led to the start symbol in the top left from figures 11-13. These paths form a shared parse forest; two trees share the black paths, but the red paths are only in the first tree and the blue paths are only in the second tree. These two trees correspond to the two possible meanings of the sentence (figure 15).
These two figures show that it is trivial to reconstruct the parse tree once we have run the CYK algorithm, as long as we cache the inputs to each position in the chart. We simply start from the start symbol at position (6,1) and work back down through the tree. At any point where there are two inputs into a cell, there is an ambiguity, and we must enumerate all combinations of these ambiguities to find all the valid parses. This technique is similar to other dynamic programming problems (e.g., the canonical implementation of the longest common subsequence algorithm computes only the size of the subsequence, but backpointers allow for retrieving the subsequence itself).
The previous example was relatively unambiguous. For a bit of fun, we'll also show the results on the famously difficult-to-understand sentence Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. Surprisingly, this is a valid English sentence. To comprehend it, you need to know that (i) buffalo is a plural noun describing animals that are also known as bison, (ii) Buffalo is a city, and (iii) buffalo is a verb that means "to intimidate". The meaning of the sentence is thus:
Bison from the city Buffalo that are intimidated by other bison from the city Buffalo, themselves intimidate yet other bison from the city Buffalo.
To make things even harder, we'll assume that the text is written in all lower case, and so each instance of buffalo could correspond to any of the three meanings. Could you come up with a grammar that assigns an intuitive analysis to this sentence? In Figure 16 we provide a minimal, but sufficient grammar that allows the CYK algorithm to find a single and reasonable parse tree for this strange sentence.
In this part of the blog, we have described the CYK algorithm for the recognition problem; the algorithm determines whether a string can be generated by a given grammar. It is a classic example of a dynamic programming algorithm that explores an exponential search space in polynomial time by storing intermediate results. Another way of thinking about the CYK algorithm, from a less procedural and more declarative perspective, is that it performs logical deduction. The axioms are the grammar rules, and we are presented with facts, which are the words. For a given substring length, we deduce new facts by applying the rules of the grammar $G$ to facts (or axioms) that we had previously deduced about shorter substrings. We keep applying the rules to derive new facts about which substrings are derivable by $G$, with the goal of proving that $S$ derives the sentence.
Note that we have used an unconventional indexing for the chart in our description. For a more typical presentation, consult these slides.
In part II, we will consider assigning probabilities to the production rules, so when the parse is ambiguous, we can assign probabilities to the different meanings. We will also consider the insideoutside algorithm which helps learn these probabilities.
^{1} This famous example was used in Syntactic Structures by Noam Chomsky in 1957 to motivate the independence of syntax and semantics.
^{2} The idea that sentences are recursively built up from smaller coherent parts dates back at least to a Sanskrit sutra of around 4000 verses known as the Aṣṭādhyāyī, written by Pāṇini probably around the 6th-4th century BC.
People communicate in natural language, which is flexible but often vague, whereas computer languages have no room for ambiguity. For a computer to respond to users' questions or commands in natural language, it needs to extract meaning, resolve ambiguity, and translate to executable programs. This is the task of semantic parsing (SP), whose applications include voice assistants, code generation, natural language interfaces to databases (NLDB), and many more. Our Turing by Borealis AI system is an NLDB, a software system enabling users to interact with databases in natural language, as illustrated in Figure 1.
The semantic parsing model powering an NLDB needs to be trained with questions and their corresponding SQL queries. If the model only generalizes to new questions on the training domain, the NLDB cannot be quickly adapted to new databases, so it would not be very useful in practice. Hence, the model somehow needs to generalize to new databases with unseen schemas and unseen questions. This is cross-domain or cross-database text-to-SQL semantic parsing.
The goal of this blog post is to glimpse into how models (like our Turing by Borealis AI system) for this task work without popping the hood. It is suitable for any reader with basic knowledge of machine learning and natural language processing.
We will first give a brief review of SQL that readers can skip if already familiar, then introduce two running examples of text-to-SQL prediction. The examples will illustrate some challenges involved in cross-domain semantic parsing and show why simple methods would not succeed. Afterwards, we will describe a high-level framework that treats cross-database text-to-SQL as a graph-to-tree mapping. We will use the two running examples to show how the framework tackles the challenges that we identified. Finally, we will provide some pointers for interested readers to learn more, including our recent ACL papers (Xu et al., 2021a,b; Norouzi et al., 2021) that respectively set the new state-of-the-art accuracy on the Spider text-to-SQL benchmark and on some code generation problems.
Before showing the examples, let us review some SQL basics. SQL stands for Structured Query Language and is used for storing, manipulating and retrieving data in relational databases. We will just focus on the retrieval here.
Relational databases store information records in tables. The schema of a database describes the structure of the domain: which tables exist, which columns each table contains, the data type of each column, and the special roles that some columns play. The first type of special role is a primary key: a column or combination of columns whose value has to be unique for each data record. The second type of special role is a foreign key: a column or combination of columns whose values match the primary key records of another table. Foreign key relations link tables together.
SELECT Query
A basic SQL query looks like the following: `SELECT * FROM my_table`, where `*` is a reserved token meaning "all columns". This query returns all rows of the table `my_table`. The star can be replaced by one or more column names, in which case the query only returns the mentioned attributes of each row. Slightly more advanced queries involve a filtering condition, expressed using a `WHERE` clause: `SELECT * FROM my_table WHERE condition`. This query only returns records for which the `condition` holds true. The SQL syntax for the actual condition is generally self-explanatory.
GROUP BY and HAVING
Sometimes columns correspond to categorical attributes like "sector". Here, an interesting class of questions involves aggregating some property associated with each categorical value of the column. For this purpose, we need the `GROUP BY` clause: `SELECT MAX(salary), sector FROM my_table GROUP BY sector`, which finds the highest salary in each sector. If we want to filter the categories, for example based on their associated statistics, we can use the `HAVING` clause. `HAVING` is similar to `WHERE`, but operates on the grouped categories instead of individual rows. For example: `SELECT MAX(income), sector FROM my_table GROUP BY sector HAVING AVG(salary) < 50000`.
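Continuing the sketch with `sqlite3` (the table and its salary/income values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (name TEXT, salary INTEGER, income INTEGER, sector TEXT)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", [
    ("Ann", 70000, 72000, "tech"),
    ("Bob", 30000, 31000, "retail"),
    ("Cat", 50000, 52000, "tech"),
    ("Dan", 40000, 45000, "retail"),
])

# Highest salary per sector.
per_sector = conn.execute(
    "SELECT MAX(salary), sector FROM my_table GROUP BY sector"
).fetchall()

# Keep only sectors whose average salary is below 50000.
# retail: avg salary 35000 (kept), tech: avg salary 60000 (filtered out).
low_avg = conn.execute(
    "SELECT MAX(income), sector FROM my_table "
    "GROUP BY sector HAVING AVG(salary) < 50000"
).fetchall()
```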
JOIN
Last but not least, the concept of `JOIN` needs some explanation. As SQL databases store records in tables, sometimes we need to "merge" corresponding rows of two or more tables, either as the final result or as an intermediate step to compute something else. This requires joining the tables with the syntax: `SELECT * FROM table1 JOIN table2 ON table1.col_fkey = table2.col_pkey`. The `ON` part introduces a condition, usually an equality relation between a foreign key column and a primary key column as in this example, though it can also involve other columns. This query returns the combinations of rows in `table1` and rows in `table2` whose value in the column `col_fkey` of `table1` equals the value of `col_pkey` of `table2`.
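A minimal sketch of the same join syntax, again in `sqlite3`, with invented contents for `table1` and `table2`:

```python
import sqlite3

# Two invented tables linked by a foreign key, for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table2 (col_pkey INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE table1 (col_fkey INTEGER, item TEXT);
INSERT INTO table2 VALUES (1, 'first'), (2, 'second');
INSERT INTO table1 VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Each row of table1 is merged with the table2 row whose primary key
# matches its foreign key.
rows = conn.execute(
    "SELECT table1.item, table2.label FROM table1 "
    "JOIN table2 ON table1.col_fkey = table2.col_pkey"
).fetchall()
```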
To predict the correct SQL from a natural language question, the model needs to correctly interpret each input word in the context of both the sentence and the schema. Furthermore, it needs to generate a syntactically correct SQL query as the output, otherwise the database cannot execute it. To illustrate the challenges more concretely, let's consider two examples for the "Employee_hire_evaluation" database of the Spider benchmark. This database is in the development set, so models would not have seen it during training.
The database has four tables: `employee`, `shop`, `hiring`, and `evaluation`. Each table has a number of columns:

- `employee`: `employee_id`, `name`, `age`, `city`, with `employee_id` being the primary key.
- `shop`: `shop_id`, `name`, `location`, `district`, `number_products`, `manager_name`, with `shop_id` being the primary key.
- `hiring`: `shop_id`, `employee_id`, `start_from`, `is_full_time`, with `employee_id` being the primary key and also a foreign key to the `employee` table's `employee_id`, and `shop_id` being a foreign key to the `shop` table's `shop_id`.
- `evaluation`: `employee_id`, `year_awarded`, `bonus`, with `employee_id` and `year_awarded` together forming the primary key, and `employee_id` being a foreign key referencing the `employee` table's `employee_id`.

Question: Which cities have more than one employee under 30?
Correct SQL:
SELECT employee.city
FROM employee
WHERE employee.age < 30
GROUP BY employee.city
HAVING COUNT (*) > 1
Analysis: Besides the general logic of the SQL query, a model needs to infer two conditions from the question: `employee.age < 30` and `COUNT (*) > 1`. The entities involved in the conditions (tables, columns, or the star) are not explicitly mentioned in the text and have to be inferred. The model needs to deduce that "employee under 30" refers to the column `age`, by leveraging two pieces of information. First, it can have some prior common-sense knowledge that the expression "employee under [NUMBER]" refers to employee age rather than some other attribute. Second, it can exclude other columns because the value "30" is too different from those columns' values in type or range. For the second condition, the model needs to infer from the entire phrase "Which cities have more than one employee ..." that the condition is on the number of employees in each city, hence requiring `GROUP BY` [...] `HAVING` [...]. Finally, it needs to piece the two conditions together, along with the rest of the query, using the correct syntax.
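The gold query above can be checked end-to-end on invented data; the employee names, ages, and cities below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY,
                       name TEXT, age INTEGER, city TEXT);
INSERT INTO employee VALUES
  (1, 'Ann', 25, 'Toronto'),
  (2, 'Bob', 28, 'Toronto'),
  (3, 'Cat', 45, 'Toronto'),
  (4, 'Dan', 26, 'Montreal');
""")

rows = conn.execute("""
  SELECT employee.city
  FROM employee
  WHERE employee.age < 30
  GROUP BY employee.city
  HAVING COUNT(*) > 1
""").fetchall()
# Only Toronto has more than one employee under 30.
```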
Question: What's the average age in each shop?
Correct SQL:
SELECT AVG(employee.age), shop.shop_id
FROM employee
JOIN hiring ON employee.employee_id = hiring.employee_id
JOIN shop ON hiring.shop_id = shop.shop_id
GROUP BY shop.shop_id
Analysis: To correctly predict this SQL, not only does the model need to infer from "in each shop" that the output contains `GROUP BY shop.shop_id`, it also needs to infer the involvement of the tables `employee` and `hiring`, which, unlike `shop`, are not explicitly mentioned. The table `employee` can be inferred from the need for its `age` column. The `hiring` table, on the other hand, can only be inferred from the need to link `employee.age` and `shop.shop_id`.
You might wonder whether some generic or simple approach could already solve this cross-database text-to-SQL problem. For example, let's consider the sequence-to-sequence model often used in machine translation. Text-to-SQL semantic parsing bears some similarity to machine translation if we view SQL as a foreign language to translate into. However, some crucial differences exist. First, typical training datasets for machine translation are larger than those for SQL semantic parsing by two orders of magnitude or more. Second, in machine translation, partially correct results can still provide partial utility, but for an NLDB, any small mistake in the predicted SQL query can invalidate the result. Third, as we have seen from the examples, the database schema is crucial for correct translation to SQL, and sequence-to-sequence machine translation models do not consider it. For these reasons, typical neural sequence-to-sequence models do not work well.
Another baseline is shallow semantic parsing, in which we simplify the problem by assuming there are a fixed number of user intents. An intent classifier first selects the template that best corresponds to the user question from a predefined list; then a model extracts the relevant information from the question to fill in the template slots. For instance, we can turn the first example into a template whose SQL has slots to be filled:
SELECT employee.city
FROM employee
WHERE employee.age [COMP_A] [A]
GROUP BY employee.city
HAVING COUNT (*) [COMP_C] [C]
Given enough training examples of questions tagged with their corresponding template IDs and slot values, a model could potentially answer questions like "show me the cities with less than 5 employees over twenty five." by identifying this template out of many, then predicting `COMP_A` := `>`, `A` := `25`, `COMP_C` := `<`, `C` := `5`. This approach is commonly used in voice-assistant and task-oriented dialogue systems. The main drawback is that the templates need to be predefined, so the system cannot generalize to new queries on the fly. Hence this approach is also unsuitable for cross-database NLDB in general.
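As a sketch, suppose the intent classifier has already picked this template and a slot tagger has produced the slot values for the question above; the final SQL is then a simple string fill. The slot dictionary here is hand-written for illustration, standing in for a trained tagger's output.

```python
# Template for the first running example, with four slots.
TEMPLATE = (
    "SELECT employee.city FROM employee "
    "WHERE employee.age {COMP_A} {A} "
    "GROUP BY employee.city HAVING COUNT(*) {COMP_C} {C}"
)

# Hypothetical tagger output for:
# "show me the cities with less than 5 employees over twenty five."
slots = {"COMP_A": ">", "A": 25, "COMP_C": "<", "C": 5}

sql = TEMPLATE.format(**slots)
print(sql)
```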
As shown by the two running examples, successful cross-database SQL semantic parsing requires the model to reason with at least three sets of knowledge: common-sense knowledge about language (e.g., that "employee under 30" refers to age), the database schema with its primary and foreign key relations, and the grammar of SQL itself.
We now describe a general framework for cross-database text-to-SQL that leverages all of this knowledge. The backbone of the overall system is a neural network with an encoder-decoder architecture, adapted in various ways to leverage explicit symbolic knowledge.
Motivated by the examples, we see that the model needs to jointly encode the question and schema, considering how words relate to each other within and across the question and the schema. So the input for cross-database semantic parsing has an inherent graph structure: the nodes are the tokens in the question and schema, linked by different types of edges. On the output side, to produce grammatically correct SQL and leverage a programming-language-specific inductive prior, we treat the prediction problem as generating the abstract syntax tree (AST) of the program. Hence, we can characterize this task as a graph-to-tree mapping.
Figure 2 illustrates the overall framework for Example One: an encoder consumes the input graph, and a decoder produces the output AST. Joint modelling of question and schema as a graph was popularized by the relation-aware transformer (RAT) work (Wang et al., 2019), while TranX (Yin and Neubig, 2018) provides a unified framework for modelling output programs as ASTs. Our Turing by Borealis AI system also follows this overall approach, with many additional innovations that we will not cover here.
As mentioned above, we view each token in the question and schema as a node in a graph. The most basic edge type is a generic link between any pair of tokens, reflecting the assumption that a priori any token could provide relevant context to any other token, so no link can be ruled out. This essentially yields a fully connected graph. For visual simplicity, we omit these edges from Figure 2.
However, other types of relations carry special meanings and are sparse. These include (i) foreign key relations that link a column in one table to the primary key of another table, (ii) exact string match and partial string match between words in the questions and words in column or table names and (iii) implicit links between a table and its columns. Some of these edges are illustrated in different colours on the input side in Figure 2. Because there can be more than one type of edge between two tokens to be modelled, this input is technically a multigraph.
How do these edges help predict the correct SQL? Let's return to the examples.
In Example One (Figure 2), the word "employee" in the question exactly matches the table name `employee`, so a special exact-match edge is created in the input graph during preprocessing. For a graph neural network or relation-aware transformer that encodes a graph by propagating information along edges, this link creates a pathway for information from the representations of the columns of table `employee` (`employee_id`, `name`, `age`, `city`) to contextualize the representation of the question token "employee", and vice versa. This makes these columns more likely to be selected than columns in the other tables when predicting the column for the condition "employee under 30".
The second example is more interesting. The question mentions the table name `shop` explicitly, while the table `employee` can be easily inferred from the column mention `age`. For `hiring`, however, there is no textual evidence from the question, direct or indirect, that the SQL query should involve it. The only way to infer it is through the foreign key links and the fact that otherwise `shop` and `employee` would be disconnected and could not be joined. This potential reasoning process is illustrated in Figure 3.
Now that we understand how the (multi)graph structure helps the semantic parsing model, let's formalize what the encoder does at a high level. Let $\mathcal{S}=\{s_1,\dots, s_{\lvert \mathcal{S} \rvert}\}$ denote the schema elements, consisting of tables and their columns, and use $Q=q_1\dots q_{\lvert Q \rvert}$ to denote the sequence of words in the question. Let $\mathcal{G}=\langle\mathcal{V}, \mathcal{E}\rangle$ denote the multigraph with edge sets $\mathcal{E}$. The encoder, $f_{\text{enc}}$, maps $\mathcal{G}$ to a joint representation $\mathcal{H} = \{\phi^q_1, \ldots,\phi^q_{\lvert Q \rvert} \} \cup \{\phi^s_1, \ldots,\phi^s_{\lvert \mathcal{S} \rvert} \}$. The fully connected portion of the multigraph can be modelled well by a transformer (see [link] for our blog series on transformers). Indeed, one can flatten the schema into a linear string, with the tokens belonging to different column or table names separated by a special token like "[SEP]", and concatenate this string with the question string before feeding it into a pretrained model such as BERT. The use of a pretrained BERT (or other variant) is how implicit common-sense knowledge enters the semantic parser. To model information propagation along the special sparse edges of the multigraph, we can then feed the BERT output embeddings into a relation-aware transformer (Wang et al., 2019). There are a few subtle details omitted here, for which we give some pointers at the end of this article.
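The flattening step can be sketched as follows; the separator convention and the schema dictionary format are simplifying assumptions for illustration, not the exact Turing preprocessing.

```python
def flatten_for_bert(question, schema):
    """Concatenate the question with a flattened schema, separating the
    tokens of different table/column names with [SEP], so the fully
    connected part of the graph can be fed to a pretrained transformer."""
    parts = ["[CLS]", question, "[SEP]"]
    for table, columns in schema.items():
        parts.append(table)
        for col in columns:
            parts.append("[SEP]")
            parts.append(col)
        parts.append("[SEP]")
    return " ".join(parts)

schema = {"employee": ["employee_id", "name", "age", "city"]}
flat = flatten_for_bert(
    "Which cities have more than one employee under 30?", schema
)
```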
If we model SQL queries as linear sequences of text tokens on the output side, it is not easy to leverage knowledge of the SQL grammar. During inference, one could use a grammar validator to check whether a generated sequence is legal; however, the neural network still would not use this information during training for better generalization. Furthermore, the grammar captures not only what is illegal but also how SQL expressions can be composed; leveraging this prior knowledge significantly improves learning efficiency from a small number of examples. Therefore, we cast the problem as generating the abstract syntax tree of the SQL query.
A common approach to predicting an abstract syntax tree (AST) is to use a grammar-based transition system like TranX (Yin and Neubig, 2018), which decomposes the generation of the AST into a sequence of actions. The neural model learns to predict the action sequence, and the transition system then constructs the AST from the predicted actions. Finally, another deterministic routine maps the AST into the linear string format of SQL, a.k.a. the surface code (Figure 4).
Figure 5 shows a snippet of the SQL grammar for TranX used by our Turing by Borealis AI system. It is specified in an abstract syntax description language (ASDL). ASDL is similar to a context-free grammar but more powerful: each production rule's right-hand side is a function-call signature with strongly-typed arguments. The type names are nonterminal symbols of the grammar, for which there are further production rules. The grammar is specific to the programming language of interest, or to a subset of features of a programming language, and needs to be developed by a human expert.
stmt = Intersect(query_expr lbody, query_expr rbody) 
The transition system converts between an AST and its AST-constructing action sequence, leveraging a grammar like the one in Figure 5. It starts at the root of the AST and derives the action sequence by a top-down, left-to-right depth-first traversal of the tree. At each step, it generates one of the possible parametrized action types.
For cross-domain text-to-SQL parsing, the action types can include: (1) ApplyRule[$r$], which applies a production rule $r$ of the grammar to the latest generated node in the AST; (2) Reduce, which marks the complete generation of a subtree corresponding to a function call (in the ASDL grammar); (3-4) SelectTable[$t$] and SelectColumn[$c$], which respectively choose a table $t$ and a column $c$ from the database schema $\mathcal{S}$; (5) CopyToken[$k$], which copies a token $q_k$ from the user question $Q$; (6) GenToken[$l$], which generates a token $w_l$ from a vocabulary. In practice, with careful design, it is possible to avoid SelectTable and GenToken altogether, which is part of the technical novelty of our Turing by Borealis AI system.
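The traversal that turns an AST into an action sequence can be sketched in a few lines; the tuple-based node encoding and the tiny example tree below are invented for illustration and are not TranX's actual data structures.

```python
def linearize(node, actions=None):
    """Pre-order (top-down, left-to-right, depth-first) traversal of a toy
    AST: ApplyRule for interior nodes, SelectColumn for column leaves, and
    a Reduce marking each completed function-call subtree."""
    if actions is None:
        actions = []
    if isinstance(node, tuple):                 # (rule_name, children)
        rule, children = node
        actions.append(("ApplyRule", rule))
        for child in children:
            linearize(child, actions)
        actions.append(("Reduce", None))
    else:                                       # leaf: a schema column
        actions.append(("SelectColumn", node))
    return actions

ast = ("Select", [("Col", ["employee.city"]),
                  ("Where", [("Lt", ["employee.age"])])])
actions = linearize(ast)
```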
Before training, the TranX system first converts the surface SQL code into its AST representation using a deterministic domain-specific routine. Then, leveraging the grammar, it converts the AST into an action sequence (Figure 4). The actual training is standard maximum likelihood with teacher forcing, which you can read about in this tutorial. At each step, the model predicts the correct action conditioned on the ground-truth partial action sequence up to that point, as well as the encoder representation $\mathcal{H}$ of the question and schema. Most action types are parameterized by an argument, for example, production rule $r$ for ApplyRule or column $c$ for SelectColumn. The model first predicts the action type, then, conditioned on the ground-truth action type (regardless of the predicted one), predicts the argument.
The inference process builds upon beam search, which you can learn more about in this tutorial. The difference here is that the beam search is guided by the grammar and the transition system. This grammar-guided beam-search decoding sounds complex, and indeed has many tedious implementation details, but it is conceptually simple: at each step of decoding, for each partial sequence in the beam, the transition system tracks all action types and arguments that are legal according to the grammar, and the neural net can only select from those options. Once beam search produces multiple action sequences, the transition system converts them to ASTs, which are then converted to surface SQL code strings using the domain-specific postprocessing routine, as illustrated in Figure 5.
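One decoding step of this constrained search can be sketched as masking followed by renormalization; the action names and the `legal_actions` interface are invented for illustration, standing in for whatever the transition system reports as legal.

```python
import math

def constrained_step(logits, legal_actions):
    """One step of grammar-guided decoding: drop actions the transition
    system deems illegal, then renormalize a softmax over the survivors."""
    masked = {a: s for a, s in logits.items() if a in legal_actions}
    z = sum(math.exp(s) for s in masked.values())
    return {a: math.exp(s) / z for a, s in masked.items()}

# The model prefers Reduce, but the grammar forbids it at this step.
logits = {"ApplyRule[Select]": 2.0, "Reduce": 3.5, "SelectColumn[age]": 1.0}
probs = constrained_step(logits, {"ApplyRule[Select]", "SelectColumn[age]"})
```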
Besides the neural attention over the encoder representation, some weak reasoning with the grammar also happens during beam-search inference. By tracking multiple partial trees (implicitly, via partial action sequences), a hypothesis scored highly at the beginning can drop sharply because its high-probability continuations violate the grammar. As a result, another partial tree that was less likely at first can become more plausible and eventually be the top prediction.
Inferring and encoding special edges in the multigraph: we saw some examples of special edges between a question token and a schema word, but there can be other types of links. For example, suppose a question word happens to match a database value in some column. In that case, this is evidence that the question word has an implicit relationship to the corresponding column. More generally, these edges are inferred using heuristic preprocessing rules, in a process known as schema linking. The relation-aware transformer layers can learn to deal with some degree of noise in the links. For more details, please see the original RAT paper (Wang et al., 2019).
We also discussed using a pretrained transformer to encode the implicit fully connected part of the multigraph, in conjunction with RAT-based modelling of the sparse special edges. But the pretrained transformer builds contextualized representations for subword tokens, whereas table and column names are usually phrases. The Appendix of Xu et al. (2021a) contains more information about how these models can be pieced together.
Modelling tables implicitly through columns: as mentioned previously, it is possible to drop the SelectTable action altogether. The idea is to identify columns globally and uniquely rather than by their column names only. On the input encoding side, we add the table representation to all of its column representations before feeding them into the RAT layers. On the output side, we give each column a globally unique ID for SelectColumn; the table can then be inferred deterministically from the predicted columns during postprocessing. This design choice simplifies relation learning for encoding and makes the output action sequences shorter. On some rare occasions it becomes an oversimplification and causes failures for complex queries, for instance when there are multiple self-joins. Please see Xu et al. (2021b) for more details.
TranX transition system and leveraging tree structure in the neural decoder: so far, we have only shown how TranX works at a high level; readers interested in using the framework for semantic parsing should consult Yin and Neubig (2018) for more details. In particular, the TranX transition system exposes the topology of the AST to the linear action-sequence decoding process via something called the parent frontier field. The parent does not always correspond to the immediately preceding step in the action sequence, yet it is important to condition directly on its representation during decoding, which is known as parent feeding.
Handling values in the question: in Example One, the value $30$ from the question is exactly the token needed in the condition part of the SQL statement, so it can be just copied over. However, in general, this might not always be the case. Most models use a combination of generation and copy attention. But as mentioned earlier, Turing (Xu et al., 2021b) simplifies away the generation and only performs the copy action. The idea is that during training, the model learns to identify the question text span providing evidence for the value, which significantly simplifies the learning problem and reduces overfitting. A heuristic searchbased postprocessor is responsible for producing the actual value to be used in the SQL at inference time.
Training and generalization when the model is deep and the dataset small: using relation-aware transformer layers on top of pretrained transformers like BERT or RoBERTa quickly makes the overall model very deep and hard to train. The usual rules of thumb for optimizing transformers are to use a large batch size, make the model shallower, or both. However, our recent work finds a way to train ultra-deep transformers (48 layers) using a small batch size, and this improves model generalization, especially for hard cases. This technique allowed us to place No. 1 on the Spider leaderboard (Exact Set Match without Values)$^{1}$.
Beyond teacher-forcing maximum likelihood: other sequence learning methods could also be used in theory, such as scheduled sampling or beam-search optimization (BSO). See our work on training a globally normalized semantic parsing model using a method similar to BSO (Huang et al., 2021), which works on some simple datasets, but not yet on complex ones like Spider.
Other approaches for semantic parsing: there are other promising approaches that do not follow the framework presented in this blog. For cross-database semantic parsing, Rubin and Berant (2021) abandon autoregressive decoding and instead perform semi-autoregressive bottom-up semantic parsing. The advantage is that at each step of decoding, the model both conditions on and predicts semantically meaningful subprograms, instead of semantically vacuous partial trees. The method performs competitively on Spider, which is impressive; moreover, it potentially has better compositional or out-of-distribution generalization. On the other end of the spectrum, if our goal is not cross-domain text-to-SQL but generic code generation, then our recent ACL work (Norouzi et al., 2021) shows that leveraging a large monolingual corpus of programming-language source code enables a simple transformer-based seq-to-seq baseline to perform competitively. Note that this does not contradict our earlier discussion of simple seq-to-seq baselines being unable to perform well in cross-database semantic parsing.
Explaining the queries: an essential feature of Turing by Borealis AI is the ability to explain the predicted queries to non-technical users. This allows people to use their own judgment to pick out which of the top hypotheses is most likely to be correct. Please check out our paper (Xu et al., 2021b) for more information about the explanation system.
$^{1}$ As of June 2, 2021, the time of publication of this blog. Our entry is "DT-Fixup SQL-SP + RoBERTa (DB content used) Borealis AI".
Each blog in this series of three focuses on a different aspect of the transformer. In Part I, we introduce self-attention, the core mechanism that underpins the transformer architecture. We then describe transformers themselves and how they can be used as encoders, decoders, or encoder-decoders, using well-known examples such as BERT and GPT-3. This discussion will be suitable for someone who knows machine learning but is not yet familiar with the transformer.
Part II considers how to adapt the transformer to cope with longer sequences, different methods for encoding the positions of elements in the sequence, and other modifications to the basic architecture. We also discuss the relationship between the transformer and other models. This will be suitable for a reader who knows the basics about transformers and wants to learn more.
Transformer models are difficult to train from scratch in practice. Part III details the tricks that are required to ensure that training does not fail. We conclude with a discussion of our recent work on how to modify the training procedure to fine-tune deep transformers when only sparse training data is available. This discussion will be suitable for practitioners who want to learn how to work effectively with transformers.
To motivate the transformer, consider the following passage:
The restaurant refused to serve me a ham sandwich, because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambience was just as good as the food and service.
We would like to build a network that can process this passage into a representation that is suitable for downstream tasks. For example, we might want to classify the review as positive or negative, or answer questions such as "Does the restaurant serve steak?". Two problems immediately present themselves:
First, the input representation will be large. Typically, we might describe each of the 37 words with an embedding vector of length 1024, so the network input will be of length $37 \times 1024 = 37888$ even for this small passage. A more realistically sized input might have hundreds or even thousands of words. It's not clear that a standard fully connected network would be practical here; it would need a very large number of parameters, and it's not obvious how to adapt such a network to inputs containing different numbers of words. This suggests the need for some kind of parameter sharing, analogous to the use of convolutions in image processing.
Second, language is fundamentally ambiguous; it is not clear from the syntax alone that the pronoun it refers to the restaurant and not the ham sandwich. To fully understand the text, the word it should somehow be connected to the word restaurant. In the parlance of transformers, the former word should pay attention to the latter. This implies that there must be connections between the words, and that the strength of these connections will depend on the words themselves. Moreover, these connections need to extend across large spans of the text; the word their in the last sentence also refers to the restaurant.
In conclusion, we have argued that a model that can process real world text (i) will use parameter sharing so that it can cope with long input passages of differing lengths, and (ii) will contain connections between word representations that depend on the words themselves. The transformer acquires both of these properties by using dotproduct selfattention.
A standard neural network layer $\bf nn[\bullet]$, takes a $D\times 1$ input $\mathbf{x}$, applies a linear transformation followed by a static nonlinearity like a rectified linear unit (ReLU)
\begin{equation}
\bf nn[\mathbf{x}] = \bf ReLU[\boldsymbol\Phi\tilde{\mathbf{x}}], \tag{1}
\end{equation}
to return a modified output vector. Here, the notation $\tilde{\mathbf{x}}$ indicates that we have appended the constant value 1 to the end of $\mathbf{x}$ so that the parameter matrix $\boldsymbol\Phi$ can also represent the offsets in the linear transformation. For simplicity, we'll assume that we use this trick every time we apply a linear transformation and just write $\boldsymbol\Phi\mathbf{x}$ from now on.
In contrast, a selfattention block $\bf sa[\bullet]$ takes $I$ inputs $\mathbf{x}_{i}$, each of dimension $D\times 1$ and returns $I$ output vectors. In the context of NLP, each of the inputs $\mathbf{x}_{i}$ will represent a word or part of a word. For input $\mathbf{x}_{i}$, the selfattention block returns the weighted sum:
\begin{equation}
\mbox{sa}[\mathbf{x}_{i}] = \sum_{j=1}^{I}a[\mathbf{x}_{i}, \mathbf{x}_{j}]\boldsymbol\Phi_v \mathbf{x}_{j}. \tag{2}
\end{equation}
The sum is over all of the inputs $\{\mathbf{x}_{i}\}_{i=1}^{I}$ after applying the same linear transformation $\boldsymbol\Phi_{v}$ to each. We will refer to the parameters $\boldsymbol\Phi_{v}$ as value weights and the product $\boldsymbol\Phi_v \mathbf{x}_{i}$ as computing the values for the $i^{th}$ input. These values are weighted by the terms $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ which are scalars that represent the attention of input $\mathbf{x}_{i}$ to input $\mathbf{x}_{j}$.
In the following sections, we will look at this in more detail by breaking the computation down into two parts. First, we'll consider the computation of the values and their subsequent weighting, as described in equation 2. Then we'll describe how to compute the attention weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$.
The same value weights $\boldsymbol\Phi_{v}$ are applied to every input $\mathbf{x}_{i}$, and because of this parameter sharing, far fewer parameters are required than if we had used a fully connected network (figure 1). Moreover, this part of the computation is easy to extend to different sequence lengths.
The attention weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ combine the values from different inputs. They are also sparse in a sense, since there is only one weight for each ordered pair of inputs $(\mathbf{x}_{i},\mathbf{x}_{j})$, regardless of the size of these inputs. It follows that the number of attention weights increases with the square of the sequence length $I$, but is independent of the length $D$ of each input $\mathbf{x}_{i}$.
In the previous section, we saw that the outputs are the result of two chained linear transformations; the values $\boldsymbol\Phi_{v}\mathbf{x}_{i}$ are computed independently for each input $\mathbf{x}_{i}$ and these vectors are combined linearly by the attention weights $a[\mathbf{x}_{i},\mathbf{x}_{j}]$. However, the overall selfattention computation is nonlinear because the attention weights are themselves nonlinear functions of the input.
More specifically, the attention weight $a[\mathbf{x}_{i},\mathbf{x}_{j}]$ depends on the dot-product $(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j}$ between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ after each has been transformed by a different linear transformation, $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ respectively. To complete the computation of the attention weight, these dot-product similarities are passed through a softmax function:
\begin{eqnarray}\label{eq:sattention2}
a[\mathbf{x}_{i},\mathbf{x}_{j}] &=& \mbox{softmax}_{j}\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j} \right]\nonumber\\
&=& \frac{\exp\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j} \right]}{\sum_{j'=1}^{I}\exp\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j'} \right]} \tag{3}
\end{eqnarray}
and so for each $\mathbf{x}_{i}$ they are positive and sum to one (figure 2). For obvious reasons, this is known as dot-product self-attention.
The vectors $\boldsymbol\Phi_{q}\mathbf{x}_{i}$ and $\boldsymbol\Phi_{k}\mathbf{x}_{i}$ are known as the queries and keys respectively. These names were inherited from the field of information retrieval and have the following interpretation: the output for input $\mathbf{x}_{i}$ is a weighted sum of the values $\boldsymbol\Phi_v \mathbf{x}_{j}$, where the weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ depend on the similarity between the query vector $\boldsymbol\Phi_q \mathbf{x}_{i}$ and the key vector $\boldsymbol\Phi_k \mathbf{x}_{j}$.
To summarize, we see that for input $\mathbf{x}_{i}$, the output is a weighted sum of the same linear transformation $\boldsymbol\Phi_{v}$ of all of the inputs, where these weights are positive and sum to one. The weights depend on a measure of similarity between input $\mathbf{x}_{i}$ and the other inputs. The computation as a whole is nonlinear due to the dot-product and softmax operations used to compute these weights. Consequently, there is no need for a pointwise nonlinearity like a ReLU.
Note that this mechanism fulfils the requirements that we laid out earlier. First, there is a single shared set of parameters $\boldsymbol\Phi_{v},\boldsymbol\Phi_{q},\boldsymbol\Phi_{k}$, independent of the number of inputs $I$, so the network can be applied to different sequence lengths. Second, the connections between the inputs (words) depend on the input representations themselves via the computed attention weights.
The above computation can be written in a more compact form if we assume that the $I$ inputs $\mathbf{x}_{i}$ form the rows of the $I\times D$ matrix $\mathbf{X}$:
\begin{equation}
\mbox{Sa}[\mathbf{X}] = \mbox{Softmax}[(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}]\mathbf{X}\boldsymbol\Phi_{v}, \tag{4}
\end{equation}
where the function $\mbox{Softmax}[\bullet]$ takes a matrix and performs the softmax operation independently on each of its rows (figure 3). Note that here the matrices $\boldsymbol\Phi_{v}, \boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ are the transposes of those in the original formulation.
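To make equation 4 concrete, here is a minimal numpy sketch of this matrix form; the parameter matrices here are random stand-ins for values that would normally be learned:

```python
import numpy as np

def softmax_rows(A):
    # Subtract each row's max for numerical stability, then normalize the row.
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(X, Phi_q, Phi_k, Phi_v):
    # Equation 4: Sa[X] = Softmax[(X Phi_q)(X Phi_k)^T] X Phi_v.
    A = softmax_rows((X @ Phi_q) @ (X @ Phi_k).T)  # I x I attention weights
    return A @ (X @ Phi_v)                         # one output row per input

# Toy example: I = 5 inputs of dimension D = 8, queries/keys/values of size 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Phi_q, Phi_k, Phi_v = (rng.standard_normal((8, 4)) for _ in range(3))
print(self_attention(X, Phi_q, Phi_k, Phi_v).shape)  # (5, 4)
```

Each row of the attention matrix is positive and sums to one, matching equation 3.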
In the previous section, we described the dot-product self-attention mechanism. Here, we introduce three extensions that are all almost always used in practice.
Observant readers will have noticed that the above mechanism loses some important information; the computation will be the same, regardless of the order of the inputs $\mathbf{x}_{i}$. However, if the inputs correspond to the words in a sentence, it's clear that the order matters. To incorporate information about position, we add a matrix $\boldsymbol\Pi$ which is the same size as the input matrix that encodes this information.
The position matrix $\boldsymbol\Pi$ may either be chosen manually or learned. It may be added to the initial word embeddings only or it may be added at every layer of the network. Sometimes it is only added to $\mathbf{X}$ in the computation of the queries and keys. The contents of this matrix and other variations will be discussed in detail in part II of this blog; however, the main idea is that there is a unique vector added to each input $\mathbf{x}_{i}$ that lets the system know its position in the sequence.
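As one concrete illustration (the details are deferred to part II), here is a numpy sketch of the fixed sinusoidal position matrix used in the original transformer paper; learned position embeddings are a common alternative:

```python
import numpy as np

def sinusoidal_positions(I, D):
    # Fixed (not learned) position matrix Pi of the same size as the input
    # matrix: sines and cosines of the position at geometrically spaced
    # frequencies, so every row is a unique vector.  Assumes D is even.
    pos = np.arange(I)[:, None]            # position index i
    dim = np.arange(0, D, 2)[None, :]      # even embedding dimensions
    angles = pos / (10000 ** (dim / D))
    Pi = np.zeros((I, D))
    Pi[:, 0::2] = np.sin(angles)
    Pi[:, 1::2] = np.cos(angles)
    return Pi

# One unique row per input position, added to the corresponding x_i.
Pi = sinusoidal_positions(10, 16)
print(Pi.shape)  # (10, 16)
```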
The dot products in the attention computation may have very large magnitudes. This can move the arguments to the softmax function into a region where the largest value dominates completely; the associated gradients then become very small and the model becomes hard to train. To resolve this issue, it is typical to scale the dot products by the square root of the dimension $d_{q}$ of the queries and keys (i.e., the number of columns in $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$, which must be the same). This gives:
\begin{equation}
\mbox{Sa}[\mathbf{X}] =\mbox{Softmax}\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}. \tag{5}
\end{equation}
This is known as scaled dot-product self-attention.
Practitioners usually apply multiple self-attention mechanisms in parallel, and this is known as multi-head self-attention. The $h^{th}$ self-attention mechanism or head can be written as:
\begin{equation}
\mbox{Sa}_{h}[\mathbf{X}] =\mbox{Softmax}\left[\frac{(\mathbf{X}\boldsymbol\Phi_{qh})(\mathbf{X}\boldsymbol\Phi_{kh})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{vh}, \tag{6}
\end{equation}
where we have different parameters $\boldsymbol\Phi_{qh}$, $\boldsymbol\Phi_{kh}$ and $\boldsymbol\Phi_{vh}$ for each head. The outputs of these selfattention mechanisms are concatenated and another linear transform $\boldsymbol\Phi_{c}$ is applied to combine them (figure 4):
\begin{equation}
\mbox{MhSa}[\mathbf{X}] = \left[\mbox{Sa}_{1}[\mathbf{X}]\;\mbox{Sa}_{2}[\mathbf{X}]\;\ldots\;\mbox{Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}. \tag{7}
\end{equation}
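Equations 6 and 7 can be sketched in numpy as follows, with randomly initialized stand-ins for the learned per-head parameters:

```python
import numpy as np

def softmax_rows(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multihead_self_attention(X, heads, Phi_c):
    # heads: list of (Phi_q, Phi_k, Phi_v) triples, one per head (equation 6);
    # the head outputs are concatenated and combined by Phi_c (equation 7).
    outputs = []
    for Phi_q, Phi_k, Phi_v in heads:
        d_q = Phi_q.shape[1]
        A = softmax_rows((X @ Phi_q) @ (X @ Phi_k).T / np.sqrt(d_q))
        outputs.append(A @ (X @ Phi_v))
    return np.concatenate(outputs, axis=1) @ Phi_c

rng = np.random.default_rng(0)
I, D, H, d = 5, 16, 4, 4  # 4 heads, each with d = D/H = 4 dimensions
heads = [tuple(rng.standard_normal((D, d)) for _ in range(3)) for _ in range(H)]
Phi_c = rng.standard_normal((H * d, D))
out = multihead_self_attention(rng.standard_normal((I, D)), heads, Phi_c)
print(out.shape)  # (5, 16)
```

Note that setting the per-head dimension to $D/H$ (as in BERT and GPT-3) keeps the total computation comparable to a single full-width head.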
This appears to be necessary to make the transformer work well in practice. It has been speculated that multiple heads make the selfattention network more robust to bad initializations. The fact that trained models only seem to depend on a subset of the heads lends credence to this speculation.
Self-attention is just one part of a larger transformer layer. This layer consists of a multi-head self-attention unit (which allows the word representations to interact with each other) followed by a fully connected network $\mbox{mlp}[\mathbf{x}_{i}]$ (that operates separately on each word representation). Both of these units are residual networks (i.e., their output is added back to the original input). In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks. The complete layer can be described by the following series of operations:
\begin{eqnarray}
\mathbf{X} &\leftarrow& \mathbf{X} + \mbox{MhSa}[\mathbf{X}] \nonumber \\
\mathbf{X} &\leftarrow& \mbox{Layernorm}[\mathbf{X}] \hspace{3cm}\nonumber\\
\mathbf{x}_{i} &\leftarrow& \mathbf{x}_{i}+\mbox{mlp}[\mathbf{x}_{i}] \hspace{3.6cm}\forall\; i\in\{1\ldots I\}\nonumber\\
\mathbf{X} &\leftarrow& \mbox{Layernorm}[\mathbf{X}], \tag{8}
\end{eqnarray}
where the column vectors $\mathbf{x}_{i}$ are transposed and form the rows of the full data matrix $\mathbf{X}$ in the first stage. In a real system, the data would pass through a series of these layers.
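The series of operations in equation 8 can be sketched as follows; the `mhsa` and `mlp` callables here are toy stand-ins for the learned sub-networks, and the learned scale and offset parameters of LayerNorm are omitted for brevity:

```python
import numpy as np

def layernorm(X, eps=1e-5):
    # Normalize each row (token embedding) to zero mean and unit variance;
    # the learned scale and offset parameters are omitted for brevity.
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / (sd + eps)

def transformer_layer(X, mhsa, mlp):
    # Equation 8: residual self-attention, LayerNorm, residual per-token
    # mlp, LayerNorm.  `mhsa` and `mlp` stand in for the learned
    # sub-networks; `mlp` must act independently on each row.
    X = layernorm(X + mhsa(X))
    X = layernorm(X + mlp(X))
    return X

# Toy stand-ins for the learned sub-networks.
rng = np.random.default_rng(0)
D = 8
Wa = rng.standard_normal((D, D))
W1, W2 = rng.standard_normal((D, 4 * D)), rng.standard_normal((4 * D, D))
mhsa = lambda X: X @ Wa                       # placeholder for attention
mlp = lambda X: np.maximum(X @ W1, 0.0) @ W2  # row-wise ReLU network
print(transformer_layer(rng.standard_normal((6, D)), mhsa, mlp).shape)  # (6, 8)
```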
Now that we have a good understanding of selfattention and the transformer layer, let's walk through a typical modern NLP processing pipeline.
A text processing pipeline begins with a tokenizer. This splits the text into a vocabulary of smaller constituent units (tokens) that can be processed by the subsequent network. In the discussion above, we have implied that these are words, but there are several difficulties with this.
One approach would be just to use letters and punctuation marks as the vocabulary, but this would mean splitting text into a large number of very small parts and requiring the subsequent network to relearn the relations between them.
In practice, a compromise between using letters and full words is used, and the final vocabulary will include both common words and short parts of words from which larger and less frequent words can be composed. The vocabulary is computed using a method such as byte pair encoding that uses ideas from text compression methods; essentially it greedily merges commonly-occurring substrings based on their frequency. This type of approach is known as a subword tokenizer.
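The greedy merging idea can be sketched in a few lines of Python; this is a simplified illustration rather than a full byte pair encoding implementation (it ignores word counts in the corpus, byte-level details, and end-of-word markers):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Greedy byte-pair-style merging: repeatedly fuse the most frequent
    # adjacent pair of symbols into a new vocabulary token.
    words = [list(w) for w in corpus]    # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for w in words:                   # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
```

After two merges on this toy corpus, the frequent substrings of "low" have become single vocabulary tokens.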
Each different token within the vocabulary is mapped to a word embedding. Importantly, the same token always maps to the same embedding. These embeddings are learned along with the rest of the network's parameters. A typical embedding size is 1024 and a typical total vocabulary size is 30,000, and so even before the main network, there are a lot of parameters to learn.
These embeddings are then collected to form the rows of the input matrix $\mathbf{X}$ and the positional encoding $\boldsymbol\Pi$ may be added at this stage.
Finally, the input embedding matrix $\mathbf{X}$ is passed to a series of transformer layers, which we'll refer to as a transformer network from now on. There are three main types of transformer network. First, a transformer network can be used as an encoder. Here, the goal is to transform the text into a representation that can support a variety of language tasks, such as sentiment analysis or question answering. An example of an encoder model is BERT.
Second, a transformer network can be used as a decoder. Here, the goal of the network is to generate a new token that continues the input text. An example of a decoder model is GPT-3.
Finally, transformer networks can be used to build encoder-decoder models. These are used in sequence-to-sequence tasks, which take one text string and convert it to another text string. For example, in machine translation, an input sentence in English might be processed by the encoder. The decoder then generates the translated sentence in French. An encoder-decoder model was used in the paper where transformers were first introduced.
We'll now consider each of these three variations in turn.
BERT is an encoder model that uses a vocabulary of 30,000 tokens. The tokens are converted to 1024 dimensional word embeddings and passed through 24 transformer layers. In each of these is a self-attention layer with 16 heads, and for each head the queries, keys, and values are of dimension 64 (i.e., the matrices $\boldsymbol\Phi_{vh},\boldsymbol\Phi_{qh},\boldsymbol\Phi_{kh}$ are of size $1024\times 64$). The dimension of the hidden layer in the neural network layer of the transformer is 4096. The total number of parameters is $\sim 340$ million. This sounds like a lot, but is tiny by modern standards.
Encoder models are trained in two stages. During pre-training, the parameters of the transformer architecture are learned using self-supervision from a large corpus of text. The goal here is for the model to learn general information about the statistics of language. In the fine-tuning stage, the resulting network is adapted to solve a particular task, using a smaller body of supervised training data. We'll now discuss each of these stages in turn for the BERT model.
In the pre-training stage, the network is trained using self-supervision. This allows the use of enormous amounts of data, without the need for manual labels. For BERT, the self-supervision task consists of predicting missing words from sentences from a large internet corpus (figure 7)^{1}. During training, the maximum input length is 512 tokens and the batch size is 256. The system is trained for 1,000,000 steps, which is roughly 50 epochs of the 3.3 billion word corpus.
Trying to predict missing words forces the transformer network to understand something of the syntax of the language. For example, it might learn that the adjective red is often found before nouns like house or car but never before a verb like shout. It also allows the model to learn some superficial common sense about the world. For example, after training, the model will assign a higher probability to the missing word train in the sentence The <mask> pulled into the station, than it would to the word peanut. However, there are persuasive arguments that the degree of "understanding" that this type of model can ever have is limited.
In the fine-tuning stage, the parameters of the model are adjusted to specialize it to a particular task. This usually involves adding an extra layer on top of the transformer network, to convert the collection of vectors $\mathbf{x}_{1},\ldots \mathbf{x}_{I}$ associated with the input tokens to the desired format of the output. Examples include:
Text classification: In BERT, there is a special token known as the $<$cls$>$ token (short for classification token) that is placed at the start of each string during pre-training. For text classification tasks like sentiment analysis, the vector associated with this token is mapped to a single number and passed through a logistic sigmoid. This creates a number between 0 and 1 that can be interpreted as the probability that the sentiment is positive, and the system is fine-tuned to maximize the probability of the correct label (figure 8a).
Word classification: In named entity recognition, the goal is to classify each individual word as an entity type (e.g., person, place, organization, or no-entity). To this end, the vector $\mathbf{x}_{i}$ associated with each token in the input sequence is mapped to a $K\times 1$ vector, where $K$ is the number of entity types, and passed through a softmax function to produce class probabilities; the system is fine-tuned to maximize the probabilities of the correct entity types (figure 8b).
Text span prediction: In the SQuAD 1.1 question answering task, both the question and a passage from Wikipedia containing the answer are input into the system. BERT is then used to predict the text span in the passage that contains the answer. Each token associated with the Wikipedia passage maps to two numbers that indicate how likely it is that the text span begins and ends at this location. The resulting two sets of numbers are put through two softmax functions, and the probability of any text span being the answer can then be derived by combining the probability of starting and ending at the appropriate places.
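A sketch of how the two sets of numbers might be combined to pick the best span; the variable names and the maximum span length here are illustrative assumptions, not details from the BERT paper:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=15):
    # Convert per-token start/end scores to probabilities with two softmaxes,
    # then pick the span (s <= e) maximizing P(start = s) * P(end = e).
    p_start = np.exp(start_logits - start_logits.max())
    p_start /= p_start.sum()
    p_end = np.exp(end_logits - end_logits.max())
    p_end /= p_end.sum()
    best, best_p = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            if p_start[s] * p_end[e] > best_p:
                best, best_p = (s, e), p_start[s] * p_end[e]
    return best

# Toy logits over a 6-token passage: the model is most confident that the
# answer starts at token 2 and ends at token 4.
print(best_span(np.array([0., 1., 5., 0., 1., 0.]),
                np.array([0., 0., 1., 2., 6., 1.])))  # (2, 4)
```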
In this section, we present a high-level description of GPT-3, which is an example of a transformer decoder model. The basic architecture is extremely similar to the encoder model in that it consists of a series of transformer layers that operate on learned word embeddings. However, the goal is different. The encoder aimed to build a representation of the text that could be fine-tuned to solve a more specific NLP task. However, the decoder has one purpose, which is to generate the next token in a provided sequence. By iterating this procedure, the model can produce a body of coherent text.
More specifically, GPT-3 constructs a language model. For any sentence it aims to model the joint probability $Pr(t_1,t_2,\ldots t_{N})$ of the $N$ observed tokens and it does this by factorizing this joint probability into an autoregressive sequence:
\begin{equation}
Pr(t_{1},t_{2},\ldots t_{N}) = \prod_{n=1}^{N}Pr(t_{n}|t_{1},\ldots t_{n-1}). \tag{9}
\end{equation}
This is easiest to understand with a concrete example. Consider the sentence It takes great personal courage to let yourself appear weak. For simplicity, let's assume that the tokens are the full words. The probability of the full sentence is:
$Pr$(It takes great personal courage to let yourself appear weak) $=$
$Pr$(It) $\cdot$ $Pr$(takes$|$It) $\cdot$ $Pr$(great$|$It takes) $\cdot$ $Pr$(personal$|$It takes great) $\cdot$
$Pr$(courage$|$It takes great personal) $\cdot$ $Pr$(to$|$It takes great personal courage) $\cdot$
$Pr$(let$|$It takes great personal courage to) $\cdot$
$Pr$(yourself$|$It takes great personal courage to let) $\cdot$
$Pr$(appear$|$It takes great personal courage to let yourself) $\cdot$
$Pr$(weak$|$It takes great personal courage to let yourself appear). (10)
This demonstrates the connection between the probabilistic formulation of the cost function and the next token prediction task.
When we train a decoder model, we aim to maximize the log-probability of the input text under the autoregressive language model. Ideally, we would like to pass in the whole sentence and compute all of the log probabilities and their gradients simultaneously. However, this poses a problem; if we pass in the full sentence, then the term computing $\log[Pr($great$|$It takes$)]$ will have access to both the answer great and also the right context personal courage to let yourself appear weak.
To see how to avoid this problem, recall that in a transformer network, the tokens only interact in the self-attention layers. This implies that the problem can be resolved by ensuring that the attention to the answer and the right context are zero. This can be achieved by setting the appropriate dot products to negative infinity before they are passed through the $\mbox{softmax}[\bullet]$ function. This idea is known as masked self-attention.
The overall decoder transformer network operates as follows. The input text is tokenized and the tokens are converted to embeddings. The embeddings are passed into the transformer network, but now the transformer layers use masked self-attention so that they can only attend to the current and previous tokens. You can think of each of the output embeddings as representing a partial sentence, and for each the goal is to predict the next token in the sequence. Consequently, after the transformer layers, a linear layer maps each word embedding to the size of the vocabulary, followed by a $\mbox{softmax}[\bullet]$ function that converts these values to probabilities. We aim to maximize the sum of the log probabilities of the next token in the ground truth sequence at every position (figure 9).
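A minimal numpy sketch of masked self-attention, following equation 4 but with the dot products for future positions set to negative infinity before the softmax:

```python
import numpy as np

def masked_self_attention(X, Phi_q, Phi_k, Phi_v):
    # As in equation 4, but dot products for future positions are set to
    # -inf before the softmax, so position i only attends to positions <= i.
    S = (X @ Phi_q) @ (X @ Phi_k).T
    S = np.where(np.tril(np.ones_like(S)) == 1, S, -np.inf)
    e = np.exp(S - S.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)    # lower-triangular weights
    return A @ (X @ Phi_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Phi_q, Phi_k, Phi_v = (rng.standard_normal((8, 3)) for _ in range(3))
print(masked_self_attention(X, Phi_q, Phi_k, Phi_v).shape)  # (4, 3)
```

A useful sanity check is that perturbing a later token leaves the outputs for all earlier tokens unchanged.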
To generate from the model, we start with an input sequence of text (which might be just the special $<$start$>$ token) and feed this into the network, which then outputs the probability of the next token. We can then either pick the most likely token or sample from this probability distribution. The new extended sequence can be fed back into the decoder network, which outputs the probability distribution over the next token, and in this way, we can generate large bodies of text. The computation can be made quite efficient as prior embeddings do not interact with subsequent ones due to the masked self-attention, and so a lot of the earlier computation can be recycled as we generate subsequent tokens.
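The generation loop can be sketched as follows; the toy next-token model here is a stand-in for the trained decoder network:

```python
import numpy as np

def generate(next_token_probs, prompt, steps, rng):
    # Repeatedly query the model for Pr(next token | sequence so far),
    # sample a token from that distribution, and append it to the sequence.
    seq = list(prompt)
    for _ in range(steps):
        p = next_token_probs(seq)
        seq.append(int(rng.choice(len(p), p=p)))
    return seq

# Toy "model" over a 4-token vocabulary: deterministically predicts
# token (last + 1) mod 4, standing in for the trained decoder network.
def toy_model(seq):
    p = np.zeros(4)
    p[(seq[-1] + 1) % 4] = 1.0
    return p

print(generate(toy_model, [0], 5, np.random.default_rng(0)))  # [0, 1, 2, 3, 0, 1]
```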
In practice, there are many strategies such as beam search and top-K sampling that can be added to help make the output text more coherent. These are discussed in detail in our previous blog on natural language generation. Here's an example of completing text from the GPT-2 model:
Borealis AI is a great place to work because there are a lot of people there that are passionate about this kind of technology, like me. There are some challenges for developers but it also gives us a great opportunity to work on different problems.
where the text provided to the model is in green and the generated text is in blue.
GPT-3 applies these ideas on a massive scale. The sequence lengths are 2048 tokens long and since multiple spans of 2048 tokens are processed at once, the total batch size is 3.2 million tokens. There are 96 transformer layers (some of which implement a sparse version of attention), each of which processes a word embedding of size 12288. There are 96 heads in the self-attention layers and the value, query and key dimension is 128. It is trained with 300 billion tokens and learns a total of 175 billion parameters. It is interesting to note that since text is extremely compressible, the model actually contains more than enough capacity to memorize the entire data set.
One surprising property of learning a model on this scale is that it can perform many tasks without the need for fine-tuning. If we provide several examples of correct question-answer pairs, and then another question, it can often answer the final question correctly just by completing the sequence. One example of this is correcting English grammar:
Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I’d appreciate it.
Good English output: Thank you for choosing me as your designer. I appreciate it.
Poor English input: The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications.
Good English output: The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications.
Poor English input: I’d be more than happy to work with you in another project.
Good English output: I'd be more than happy to work with you on another project.
The text containing the paired examples in green was provided as context for GPT-3 and the system then generated the correct answer in blue. This phenomenon extends to many situations including generating code snippets based on natural language descriptions, simple arithmetic, translating between languages, answering questions about text passages and many more. Consequently, it is argued that enormous language models are few-shot learners; they can learn to do novel tasks based on just a few examples. However, in practice the performance is erratic and it is not clear the extent to which it is extrapolating from learned examples rather than merely interpolating, or even copying verbatim.
The original transformer paper focused on translation between languages, which is an example of a sequence-to-sequence task. Their original architecture was an encoder-decoder model that (as the name suggests) combines both encoder and decoder models.
Consider the example of translating from English to French. The encoder receives the sentence in English and processes it through a series of transformer layers to create an output representation for each token. The decoder receives the sentence in French and processes it through a series of transformer layers that use masked self-attention. However, these transformer layers also attend to the output of the encoder. Consequently, each French output word is conditioned not only on the previous output words, but also on the entire English sentence that it is translating (figure 10).
In practice this is achieved by modifying the transformer layer. The original transformer layer in the decoder (figure 5) consisted of a masked selfattention layer followed by a multilayer perceptron applied individually to each embedding. In between these we now introduce a second attention layer, in which the embeddings attend to the output embeddings from the encoder. This uses a version of selfattention where the queries $\mathbf{X}_{d}\boldsymbol\Phi_{q}$ are computed from the decoder embeddings $\mathbf{X}_{d}$, and the keys $\mathbf{X}_{e}\boldsymbol\Phi_{k}$ and values $\mathbf{X}_{e}\boldsymbol\Phi_{v}$ are generated from the encoder embeddings $\mathbf{X}_{e}$:
\begin{equation}
\mbox{Sa}[\mathbf{X}_{d},\mathbf{X}_{e}] = \mbox{Softmax}[(\mathbf{X}_{d}\boldsymbol\Phi_{q})(\mathbf{X}_{e}\boldsymbol\Phi_{k})^{T}]\mathbf{X}_{e}\boldsymbol\Phi_{v}. \tag{11}
\end{equation}
This is known as encoder-decoder attention (figure 11).
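Equation 11 can be sketched in numpy as follows, with random stand-ins for the learned parameters; note that the attention matrix is now rectangular, with one row per decoder token and one column per encoder token:

```python
import numpy as np

def softmax_rows(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def encoder_decoder_attention(X_d, X_e, Phi_q, Phi_k, Phi_v):
    # Equation 11: queries come from the decoder embeddings X_d, while keys
    # and values come from the encoder embeddings X_e.
    A = softmax_rows((X_d @ Phi_q) @ (X_e @ Phi_k).T)  # I_d x I_e weights
    return A @ (X_e @ Phi_v)            # one output row per decoder token

rng = np.random.default_rng(0)
X_d = rng.standard_normal((3, 8))   # 3 decoder (e.g. French) tokens
X_e = rng.standard_normal((5, 8))   # 5 encoder (e.g. English) tokens
Phi_q, Phi_k, Phi_v = (rng.standard_normal((8, 4)) for _ in range(3))
print(encoder_decoder_attention(X_d, X_e, Phi_q, Phi_k, Phi_v).shape)  # (3, 4)
```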
In this blog, we introduced the idea of self-attention and then described how this fits into the transformer architecture. We then presented the encoder, decoder, and encoder-decoder versions of this architecture. We've seen that the transformer operates on sets of high-dimensional embeddings. It has a low computational complexity per layer and much of the computation can be performed in parallel, using the matrix form. Since every input embedding interacts with every other, it can describe long-range dependencies in text. It is these characteristics that have allowed transformers to be applied in massive systems like GPT-3.
In the second part of the blog we will discuss extensions of the basic transformer model. In particular, we will expand on methods to encode the position of tokens and methods to extend transformers to process very long sequences. We'll also discuss how the transformer architecture relates to other models. Finally, in the third part of this series, we will discuss the details of how to train transformer models successfully.
^{1} BERT also used a secondary task which involved predicting whether two sentences were originally adjacent in the text or not, but this only marginally improved performance.
For many of the government and industry partners in attendance at the AI4Good Lab’s Industry Night organised by CIFAR this year, the plan was to provide support, mentorship and guidance to the participants of the AI4Good Lab – a Canadian AI training initiative for women-identified STEM students. Industry experts would spend time with participants, exploring their fields of study, their goals and ambitions, and future career opportunities in the field.
While COVID-19 may have prevented attendees from being together in person, the organizers ensured everyone felt relaxed and comfortable in the virtual booths. The Borealis AI/RBC Amplify booth, for example, featured a ‘virtually comfortable’ L-shaped couch, two square stools and a rectangular coffee table. Non-virtual drinks were, sadly, ‘BYOB.’
The two hostesses of the Borealis AI/RBC Amplify booth, Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI and Rachael Rishworth from RBC Amplify, talked with the AI4Good Lab participants about the wide range of learning and career opportunities available as students continue their journey of lifelong learning – from the Borealis AI Fellowships (which support AI researchers at Canadian Universities) and Borealis AI Internships through to the RBC Amplify program, which provides interns with handson prototype development opportunities at the bank.
The AI4Good Lab participants certainly seemed enthusiastic to learn and share. The booth was full for the entire 2 hours – a testament to the quality of the discussions (and, perhaps, the comfort of the virtual furniture?).
Alongside other big names in AI such as CIFAR, AMII, Vector Institute, Google Canada, DeepMind, Accenture and Manulife, the Borealis AI/RBC Amplify team offered participants a view into the wide array of initiatives and opportunities available in the AI space. They also spent time answering the participants’ questions about careers in the field of AI.
Yet it wasn’t just the students of the AI4Good Lab that were learning that night; so, too, were the industry partners and booth hosts and hostesses. Virtual networking lounges were placed in between booths, creating unique spaces that encouraged fruitful discussions among all participants – students, partners and organizers. Hosts and hostesses also visited each other’s booths to talk with ecosystem partners; in fact, rumor has it that Eirene was spotted on one of the ‘virtually hipster’ stools at the Vector Institute booth, taking a few minutes to chat with good friends and their guests at the end of the event.
More importantly, perhaps, the event highlighted the future impact the participants can make in the field and in the world around them. Since the start of the AI4Good Lab program in early May, the female-identified students participating in the Lab’s two cohorts in Montreal and Edmonton have been building their AI skills and capabilities, in order to conceptualize, design and develop a prototype of an AI application for social good. It is their ideas, research and development that will shape the debate around the value and ethics of AI in the future.
Ultimately, the AI4Good Industry Night demonstrated that learning is a lifelong and collaborative journey. Industry participants shared their experience and insights; the students and the organizers of the AI4Good Lab shared theirs. Everybody left the event with a renewed sense of optimism, new ideas and new network connections.
On behalf of the attendees of the AI4Good Lab Industry Night, we would like to thank Maya MarcusSells and Yosra Kazemi for organizing a fantastic event in the face of the continued disruption of COVID19.
Below are just a few photos of the event. We are confident the ideas generated there will emerge into view over the coming months and years.