Borealis AI has recently published a comprehensive set of research tutorials on Large Language Models and Transformers. The collection ranges from high-level introductions to in-depth technical material, making it a valuable resource for anyone who wants to build a deeper understanding of these topics.

As part of our goal of creating real-world impact through scientific pursuit, we have developed these resources for the AI community. You can find the collection below.

A High-level Overview of Large Language Models

This blog provides a high-level overview of language models targeted at a general audience with no background in this area.

Transformers I: Introduction

Modern language models are built on the transformer architecture, and this blog provides an introduction to it. We describe the self-attention mechanism, including variants such as scaled self-attention, multi-head self-attention, masked self-attention, and cross-attention. We discuss how position encodings are added, and we describe encoder, decoder, and encoder-decoder transformer models using the examples of BERT, GPT-3, and automatic translation.
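
To make the scaled self-attention computation concrete, here is a minimal NumPy sketch of a single attention head with an optional causal mask; the weight matrices and dimensions are illustrative assumptions rather than code from the tutorial.

```python
import numpy as np

def scaled_self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    mask:       optional (seq_len, seq_len) boolean matrix; True = blocked
                (used for masked self-attention in decoders)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every query to every key, scaled
    if mask is not None:
        scores = np.where(mask, -1e9, scores)   # masked positions get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                          # each output is a weighted mix of values

# Toy usage: 5 tokens, model width 8, head width 4, causal (masked) self-attention.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
causal_mask = np.triu(np.ones((5, 5), dtype=bool), k=1)  # block attention to future tokens
print(scaled_self_attention(X, Wq, Wk, Wv, mask=causal_mask).shape)  # (5, 4)
```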

Transformers II: Extensions

This blog discusses (i) different ways to incorporate position information into the transformer, (ii) how to extend the self-attention mechanism to longer sequence lengths, and (iii) how transformers relate to other architectures, including RNNs, convolutional networks, gating networks, and hypernetworks.
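
As one concrete example of incorporating position information, below is a small sketch of the sinusoidal absolute position encodings from the original transformer paper; the sequence length and width are arbitrary, and other schemes discussed in the blog work differently.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Absolute sinusoidal position encodings (Vaswani et al., 2017).

    Returns a (seq_len, d_model) matrix that is added to the token embeddings,
    giving each position a unique pattern of sines and cosines at different
    frequencies.
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))   # higher dims -> lower frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: add position information to a sequence of embeddings.
embeddings = np.random.default_rng(0).normal(size=(16, 64))   # 16 tokens, width 64
embeddings = embeddings + sinusoidal_position_encoding(16, 64)
```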

Transformers III: Training

This blog discusses the subtleties of training transformers, which can be trickier than training other architectures and sometimes requires tricks like learning rate warm-up. We discuss how the self-attention mechanism, layer normalization, and residual links affect the activation variance and make training challenging.
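
To illustrate the warm-up trick mentioned above, here is a sketch of the learning rate schedule from the original transformer paper (linear warm-up followed by inverse-square-root decay); the default values of d_model and warmup_steps are the paper's, used here purely for illustration.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from 'Attention Is All You Need'.

    The rate increases linearly for the first `warmup_steps` updates, then
    decays proportionally to 1/sqrt(step). The warm-up keeps early updates
    small, which helps stabilize training of deep transformer stacks.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the learning rate peaks at step == warmup_steps, then decays.
for s in (1, 1000, 4000, 20000):
    print(s, round(transformer_lr(s), 6))
```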

Featured image of "Tutorial #6: neural natural language generation – decoding algorithms"

Neural Natural Language Generation

These blogs tackle the rarely discussed topic of decoding from neural models such as transformers: that is, how we choose the sequence of tokens that forms the final output. Methods discussed include top-k sampling, nucleus sampling, beam search, and diverse beam search.
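
As a rough illustration of two of these decoding methods, the sketch below applies top-k sampling and nucleus (top-p) sampling to a single next-token distribution; the toy vocabulary and probabilities are made up for the example.

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Sample the next token from only the k most probable tokens."""
    top = np.argsort(probs)[-k:]              # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()         # renormalize over the truncated set
    return rng.choice(top, p=p)

def nucleus_sample(probs, p_threshold, rng):
    """Sample from the smallest set of tokens whose total probability exceeds p_threshold."""
    order = np.argsort(probs)[::-1]           # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p_threshold) + 1
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=p)

# Toy usage with a 10-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=10)
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
print(top_k_sample(probs, k=3, rng=rng), nucleus_sample(probs, p_threshold=0.9, rng=rng))
```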

Training and Fine-tuning Large Language Models

This blog discusses training and fine-tuning large language models for use in chatbots like ChatGPT. It covers topics such as model pre-training, few-shot learning, supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
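
To give a flavour of one of these techniques, here is a minimal sketch of the direct preference optimization loss for a single preference pair; the log-probability inputs and the value of beta are illustrative assumptions, not numbers from the blog.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization (DPO) loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being fine-tuned (logp_*) or under the frozen reference model
    (ref_logp_*). The loss pushes the policy to prefer the chosen response
    over the rejected one, relative to the reference model, with strength beta.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Toy usage: the policy already slightly prefers the chosen response (illustrative values).
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```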

Speeding up Inference in Transformers

This blog discusses how to speed up inference in transformers. Each token generated by a transformer depends on all the preceding ones, so inference slows down as the output gets longer. We then cover variations of the self-attention mechanism that make inference more efficient, including attention-free transformers, RWKV, linear transformers, Performers, and the retentive network.
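
To show why such variants help, the sketch below implements causal linear attention as a recurrence, in the spirit of Katharopoulos et al.'s linear transformer: each new token updates a fixed-size state, so the per-token cost of generation does not grow with the length of the context. The feature map and dimensions are illustrative choices, not code from the blog.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a simple positive feature map used by linear transformers."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_generate(queries, keys, values):
    """Causal linear attention run as a recurrence.

    Instead of re-attending to all previous tokens at each step, the model keeps
    a fixed-size running state (S, z), so the cost per generated token is
    constant rather than growing with the sequence length.
    """
    d_head = keys.shape[-1]
    S = np.zeros((d_head, values.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(d_head)                       # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        S += np.outer(phi_k, v)                # fold the new token into the state
        z += phi_k
        outputs.append((phi_q @ S) / (phi_q @ z + 1e-6))
    return np.stack(outputs)

# Toy usage: 6 tokens with head width 4.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(linear_attention_generate(q, k, v).shape)   # (6, 4)
```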