Lecture 20-LLMs from a Probabilistic Perspective 1-Implementing a GPT from Scratch

How we evolved from a transformer to a GPT

From Transformer to GPT

The Original Transformer

The original Transformer architecture (Vaswani et al., 2017) was designed for sequence-to-sequence tasks and uses an encoder-decoder framework:

\[P(Y \mid X) = \prod_t P(Y_t \mid Y_{<t}, X)\]

Evolving to a GPT

GPT simplifies the Transformer by dropping the encoder and modeling input sequences autoregressively with a decoder-only architecture:

\[P(X) = \prod_t P(X_t \mid X_{<t})\]

The parameters \(\theta\) are learned by maximizing the likelihood of the training sequences, indexed by \(i\):

\[\max_{\theta} \prod_i \prod_t P_{\theta}(X_{i,t} \mid X_{i,<t})\]

This enforces causal structure over the input, forming a directed probabilistic graphical model.
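In code, this causal structure corresponds to a lower-triangular attention mask: position t may attend only to positions up to and including t. A minimal sketch of the masking step (variable names and shapes here are illustrative):

import torch

T = 4                                              # sequence length
scores = torch.randn(T, T)                         # raw attention scores
mask = torch.tril(torch.ones(T, T))                # 1s on and below the diagonal
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)            # row t only mixes positions <= t
print(weights)                                     # the upper triangle is exactly zero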


Writing a GPT

Model Configuration

import torch
torch.manual_seed(1337)

# Training hyperparameters
batch_size = 16
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
eval_iters = 200

# Model hyperparameters
from gpt_config import GPTConfig
config = GPTConfig(
    block_size = 8,
    device = 'cuda' if torch.cuda.is_available() else 'cpu',
    n_embd = 64,
    n_head = 4,
    n_layer = 4,
    dropout = 0.0
)
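The gpt_config module is part of the course code and is not reproduced here; a minimal stand-in with the fields used in this lecture could look like the following dataclass (the actual module may differ):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int           # maximum context length
    device: str               # 'cuda' or 'cpu'
    n_embd: int               # embedding dimension
    n_head: int               # attention heads per block
    n_layer: int              # number of transformer blocks
    dropout: float            # dropout probability
    vocab_size: int = None    # filled in after reading the dataset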

Load and Encode Dataset

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text)))
config.vocab_size = len(chars)

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
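A quick round-trip check of the character-level tokenizer, assuming the sample string only uses characters that appear in input.txt (the exact integer ids depend on the dataset):

sample = "hello"
ids = encode(sample)
print(ids)               # a list of 5 integer ids
print(decode(ids))       # 'hello'
assert decode(ids) == sample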

Train/Test Split and Block Sampling

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

x = train_data[:config.block_size]
y = train_data[1:config.block_size+1]
for t in range(config.block_size):
    context = x[:t+1]
    target = y[t]
    print(f"For input {context}, target is: {target}")

Helper Functions

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - config.block_size, (batch_size,))
    x = torch.stack([data[i:i+config.block_size] for i in ix])
    y = torch.stack([data[i+1:i+config.block_size+1] for i in ix])
    return x.to(config.device), y.to(config.device)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
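With batch_size = 16 and block_size = 8, each batch is a pair of (16, 8) integer tensors, where each row of y is the corresponding row of x shifted one position ahead in the text:

xb, yb = get_batch('train')
print(xb.shape, yb.shape)    # torch.Size([16, 8]) torch.Size([16, 8])
print(xb[0])                 # 8 consecutive token ids
print(yb[0])                 # the next token for each position in xb[0]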

Training the Model

from gpt_zero import GPT
model = GPT(config)
m = model.to(config.device)
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.zeros((1, 1), dtype=torch.long, device=config.device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
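The GPT class itself is imported from gpt_zero and is not shown in the lecture. A minimal sketch of a decoder-only model compatible with the training loop above, assuming the interface forward(idx, targets) -> (logits, loss) and generate(idx, max_new_tokens) (illustrative only, not the course implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn = nn.MultiheadAttention(config.n_embd, config.n_head,
                                          dropout=config.dropout, batch_first=True)
        # boolean mask: True above the diagonal = position not allowed to attend
        mask = torch.triu(torch.ones(config.block_size, config.block_size), diagonal=1).bool()
        self.register_buffer('mask', mask)

    def forward(self, x):
        T = x.size(1)
        out, _ = self.attn(x, x, x, attn_mask=self.mask[:T, :T], need_weights=False)
        return out

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm residual attention
        x = x + self.mlp(self.ln2(x))    # pre-norm residual feed-forward
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Embedding(config.block_size, config.n_embd)
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)      # (B, T, n_embd)
        x = self.ln_f(self.blocks(x))
        logits = self.lm_head(x)                       # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.config.block_size:]   # crop to the context window
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)   # P(X_t | X_{<t})
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx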

Training Output

WARWICK:
Yeart, their to you's my 'tcknow your turrothose...
ROMEO:
Wwell-Bethal,
Be lords!   

Scaling from “GPT” to GPT-4

Overview

This section summarizes the evolution of the GPT family, from a simple, scratch-built model to GPT-4. Each stage reflects major advances in model size, dataset scale and quality, training techniques, and inference strategies. As the architecture scaled, so did the model’s capabilities—enabling GPT to move from character-level toy outputs to world-class language generation and multimodal reasoning.

Glossary of Key Terms

From “GPT” to GPT-1

From GPT-1 to GPT-2

From GPT-2 to GPT-3

From GPT-3 to GPT-4

Mixture of Experts (MoE)

Instead of using a single feed-forward network (FFN) at each transformer block, MoE introduces multiple "expert" FFNs and uses a routing mechanism to dynamically select a subset of them for each input.


![image](assets/img/notes/lecture-20/moe.png)


![image](assets/img/notes/lecture-20/trans_decode.png)
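A minimal sketch of such a routed MoE feed-forward layer in PyTorch, assuming top-k routing with a softmax gate (the class name, expert count, and looping strategy are illustrative; production systems add load-balancing losses and fused kernels):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, n_embd, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embd, n_experts)   # gating network g(x)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                          nn.Linear(4 * n_embd, n_embd))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (B, T, n_embd)
        gate_logits = self.router(x)             # (B, T, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e           # tokens routed to expert e at slot k
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * expert(x[sel])
        return out

Only top_k experts run per token, so the layer's parameter count grows with the number of experts while the per-token compute stays roughly constant.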

Mixture of Experts: Probabilistic View

Model the output as a weighted sum of predictions from multiple expert networks.

MoE: A Unifying Framework for Ensembles

Let the predictive distribution be modeled as \(P(Y \mid X) = \sum_{m} g_m(X)\, P_m(Y \mid X)\), where \(P_m\) is the predictive distribution of expert \(m\) and the gating weights satisfy \(g_m(X) \ge 0\), \(\sum_m g_m(X) = 1\).

MoE: Error Analysis

Let: \(P(Y \mid X) = \sum_{m} g_m(X)\, P_m(Y \mid X)\)

Define the expected prediction (mean function) of the ensemble: \(\bar{f}(x) := \mathbb{E}[Y \mid X = x] = \sum_{m} g_m(x)\, f_m(x)\), where \(f_m(x)\) is the mean prediction of expert \(m\).

Compare two types of errors for a target value \(y\):

- Ensemble error: \(e_{\mathrm{ens}}(x) = \big(\bar{f}(x) - y\big)^2\)
- Average expert error: \(\bar{e}(x) = \sum_{m} g_m(x)\,\big(f_m(x) - y\big)^2\)

Will minimizing the ensemble error \(e_{\mathrm{ens}}(x)\) also minimize the average expert error \(\bar{e}(x)\)?
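The two errors are related by the ambiguity decomposition, \(e_{\mathrm{ens}}(x) = \bar{e}(x) - \sum_{m} g_m(x)\big(f_m(x) - \bar{f}(x)\big)^2\): the ensemble error equals the average expert error minus a diversity term, so minimizing one does not imply minimizing the other. A small numeric check of this identity (the gating weights and expert predictions below are made-up numbers):

import torch

g = torch.tensor([0.5, 0.3, 0.2])         # gating weights, sum to 1
f = torch.tensor([1.0, 2.0, 4.0])         # expert mean predictions f_m(x)
y = 2.5                                   # target value

f_bar = (g * f).sum()                     # ensemble prediction
e_ens = (f_bar - y) ** 2                  # ensemble error
e_avg = (g * (f - y) ** 2).sum()          # average expert error
diversity = (g * (f - f_bar) ** 2).sum()  # spread of experts around the ensemble

print(e_ens, e_avg - diversity)           # identical: e_ens = e_avg - diversity

Holding the average expert error fixed, more diverse experts give a lower ensemble error.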

MoE: Diversity vs. Error


![image](assets/img/notes/lecture-20/three_lines.png)

MoE in Large Language Models (LLMs)


![image](assets/img/notes/lecture-20/gpt_process.png)

Summary Tables

From Transformer to GPT

| Aspect | Transformer (original) | GPT |
|---|---|---|
| Architecture | Encoder-decoder (full) | Decoder-only |
| Attention | Full self-attention | Masked (causal) self-attention |
| Positional encoding | Sinusoidal | Learned positional embeddings |
| Output | Task-specific | Next-token prediction |
| Training objective | Flexible (e.g., translation) | Language modeling (autoregressive) |
| Inference | Depends on task | Greedy / sampling for text generation |

From GPT-1 to GPT-4

- Context length: 512 → 128,000
- Layers: 12 → >96
- Attention heads: 12 → >96
- Embedding dimension: 768 → >12,288
- Vocabulary size: 40k → >50k tokens