Borealis AI has recently published a comprehensive set of research tutorials on Large Language Models and Transformers. The collection ranges from high-level introductions to in-depth technical material, making it a valuable resource for anyone who wants to build a deeper understanding of these topics.

As part of our goal of creating real-world impact through scientific pursuit, we have developed these resources for the AI community. You can find the collection below.

A High-level Overview of Large Language Models

This blog provides a high-level overview of language models targeted at a general audience with no background in this area.

Transformers I: Introduction

Modern language models are built on the transformer architecture, and this blog provides an introduction to it. We describe the self-attention mechanism, including variants such as scaled self-attention, multi-head self-attention, masked self-attention, and cross-attention. We discuss how position encodings are added, and we describe encoder, decoder, and encoder-decoder transformer models using the examples of BERT, GPT-3, and automatic translation.
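
To make the scaled self-attention computation concrete, here is a minimal NumPy sketch of a single attention head with an optional causal mask; the weight matrices and dimensions are illustrative assumptions rather than code from the tutorial.

```python
import numpy as np

def scaled_self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    mask:       optional (seq_len, seq_len) boolean matrix; True = blocked
                (used for masked self-attention in decoders)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every query to every key, scaled
    if mask is not None:
        scores = np.where(mask, -1e9, scores)   # masked positions get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                          # each output is a weighted mix of values

# Toy usage: 5 tokens, model width 8, head width 4, causal (masked) self-attention.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
causal_mask = np.triu(np.ones((5, 5), dtype=bool), k=1)  # block attention to future tokens
print(scaled_self_attention(X, Wq, Wk, Wv, mask=causal_mask).shape)  # (5, 4)
```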

Transformers II: Extensions

This blog discusses (i) different ways to incorporate position information into the transformer, (ii) how to extend the self-attention mechanism to longer sequence lengths, and (iii) how transformers relate to other architectures, including RNNs, convolutional networks, gating networks, and hypernetworks.
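
As one concrete example of incorporating position information, below is a small sketch of the sinusoidal absolute position encodings from the original transformer paper; the sequence length and width are arbitrary, and other schemes discussed in the blog work differently.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Absolute sinusoidal position encodings (Vaswani et al., 2017).

    Returns a (seq_len, d_model) matrix that is added to the token embeddings,
    giving each position a unique pattern of sines and cosines at different
    frequencies.
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))   # higher dims -> lower frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: add position information to a sequence of embeddings.
embeddings = np.random.default_rng(0).normal(size=(16, 64))   # 16 tokens, width 64
embeddings = embeddings + sinusoidal_position_encoding(16, 64)
```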

Transformers III: Training

This blog discusses the subtleties of training transformers, which can be trickier than training other architectures and sometimes requires tricks like learning rate warm-up. We discuss how the self-attention mechanism, layer normalization, and residual links affect the activation variance and make training challenging.
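
To illustrate the warm-up trick mentioned above, here is a sketch of the learning rate schedule from the original transformer paper (linear warm-up followed by inverse-square-root decay); the default values of d_model and warmup_steps are the paper's, used here purely for illustration.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from 'Attention Is All You Need'.

    The rate increases linearly for the first `warmup_steps` updates, then
    decays proportionally to 1/sqrt(step). The warm-up keeps early updates
    small, which helps stabilize training of deep transformer stacks.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the learning rate peaks at step == warmup_steps, then decays.
for s in (1, 1000, 4000, 20000):
    print(s, round(transformer_lr(s), 6))
```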

Featured image of "Tutorial #6: neural natural language generation – decoding algorithms"

Neural Natural Language Generation

These blogs tackle the rarely discussed topic of decoding from neural models such as transformers: that is, how we choose the sequence of tokens that forms the final output. Methods discussed include top-k sampling, nucleus sampling, beam search, and diverse beam search.
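
As a rough illustration of two of these decoding methods, the sketch below applies top-k sampling and nucleus (top-p) sampling to a single next-token distribution; the toy vocabulary and probabilities are made up for the example.

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Sample the next token from only the k most probable tokens."""
    top = np.argsort(probs)[-k:]              # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()         # renormalize over the truncated set
    return rng.choice(top, p=p)

def nucleus_sample(probs, p_threshold, rng):
    """Sample from the smallest set of tokens whose total probability exceeds p_threshold."""
    order = np.argsort(probs)[::-1]           # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p_threshold) + 1
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=p)

# Toy usage with a 10-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=10)
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
print(top_k_sample(probs, k=3, rng=rng), nucleus_sample(probs, p_threshold=0.9, rng=rng))
```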

Training and Fine-tuning Large Language Models

This blog discusses training and fine-tuning large language models for use in chatbots like ChatGPT. It covers topics such as model pre-training, few-shot learning, supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
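
To give a flavour of one of these techniques, here is a minimal sketch of the direct preference optimization loss for a single preference pair; the log-probability inputs and the value of beta are illustrative assumptions, not numbers from the blog.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization (DPO) loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being fine-tuned (logp_*) or under the frozen reference model
    (ref_logp_*). The loss pushes the policy to prefer the chosen response
    over the rejected one, relative to the reference model, with strength beta.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Toy usage: the policy already slightly prefers the chosen response (illustrative values).
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```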

Speeding up Inference in Transformers

This blog discusses how to speed up inference in transformers. Each token generated by a transformer depends on all the preceding ones, so inference slows down as the output gets longer. We then cover variations of the self-attention mechanism that make inference more efficient, including attention-free transformers, RWKV, linear transformers, Performers, and the retentive network.
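
To show why such variants help, the sketch below implements causal linear attention as a recurrence, in the spirit of Katharopoulos et al.'s linear transformer: each new token updates a fixed-size state, so the per-token cost of generation does not grow with the length of the context. The feature map and dimensions are illustrative choices, not code from the blog.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a simple positive feature map used by linear transformers."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_generate(queries, keys, values):
    """Causal linear attention run as a recurrence.

    Instead of re-attending to all previous tokens at each step, the model keeps
    a fixed-size running state (S, z), so the cost per generated token is
    constant rather than growing with the sequence length.
    """
    d_head = keys.shape[-1]
    S = np.zeros((d_head, values.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(d_head)                       # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        S += np.outer(phi_k, v)                # fold the new token into the state
        z += phi_k
        outputs.append((phi_q @ S) / (phi_q @ z + 1e-6))
    return np.stack(outputs)

# Toy usage: 6 tokens with head width 4.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(linear_attention_generate(q, k, v).shape)   # (6, 4)
```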