Lecture 20: LLMs from a Probabilistic Perspective 1: Implementing a GPT from Scratch
How we evolved from a transformer to a GPT
From Transformer to GPT
The Original Transformer
The original Transformer architecture (Vaswani et al., 2017) was designed for sequence-to-sequence tasks and uses an encoder-decoder framework.
- Encoder: Maps an input sequence to a contextualized representation.
- Decoder: Produces outputs token-by-token using the encoder output and previously generated tokens.
- Self-attention: Mechanism for learning dependencies within a sequence.
- "Attention is all you need": Fully relies on self-attention in both encoder and decoder for translation tasks.
Evolving to a GPT
GPT simplifies the Transformer by dropping the encoder and using a decoder-only model of the input sequence:
\[P(X) = \prod_t P(X_t \mid X_{<t})\]
\[\max_{\theta} \sum_i \sum_t \log P_{\theta}(X_{i,t} \mid X_{i,<t})\]
This enforces a causal structure over the input, forming a directed probabilistic graphical model.
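In code, this objective is simply the cross-entropy loss on shifted targets. A minimal sketch (shapes and variable names here are illustrative, not taken from the lecture code):

import torch
import torch.nn.functional as F

B, T, V = 2, 8, 65                      # batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)           # model scores for P_theta(X_t | X_<t) at every position
targets = torch.randint(0, V, (B, T))   # the next-token targets X_t

# Average negative log-likelihood of the autoregressive factorization:
# -(1/BT) * sum_i sum_t log P_theta(X_{i,t} | X_{i,<t})
nll = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(nll)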

Writing a GPT
Model Configuration
import torch
torch.manual_seed(1337)
# Training hyperparameters
batch_size = 16
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
eval_iters = 200
# Model hyperparameters
from gpt_config import GPTConfig
config = GPTConfig(
    block_size = 8,
    device = 'cuda' if torch.cuda.is_available() else 'cpu',
    n_embd = 64,
    n_head = 4,
    n_layer = 4,
    dropout = 0.0
)
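The gpt_config module itself is not reproduced in these notes; a minimal dataclass along these lines (an assumption, not the lecture's exact file) would support the fields used above:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 8      # maximum context length
    device: str = 'cpu'      # 'cuda' when available
    n_embd: int = 64         # embedding dimension
    n_head: int = 4          # attention heads per block
    n_layer: int = 4         # number of transformer blocks
    dropout: float = 0.0     # dropout probability
    vocab_size: int = 0      # filled in after reading the dataset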
Load and Encode Dataset
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text)))
config.vocab_size = len(chars)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
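A quick round-trip check of the character-level codec (the exact indices depend on the vocabulary of input.txt):

print(encode("hii there"))          # a list of character indices
print(decode(encode("hii there")))  # 'hii there'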
Train/Test Split and Block Sampling
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
x = train_data[:config.block_size]
y = train_data[1:config.block_size+1]
for t in range(config.block_size):
    context = x[:t+1]
    target = y[t]
    print(f"For input {context}, target is: {target}")
Helper Functions
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - config.block_size, (batch_size,))
    x = torch.stack([data[i:i+config.block_size] for i in ix])
    y = torch.stack([data[i+1:i+config.block_size+1] for i in ix])
    return x.to(config.device), y.to(config.device)
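A quick sanity check of the sampled batch shapes (an extra check, not part of the lecture script):

xb, yb = get_batch('train')
print(xb.shape, yb.shape)  # both (batch_size, block_size), i.e. torch.Size([16, 8])
print(xb.device)           # matches config.device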
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
Training the Model
from gpt_zero import GPT
model = GPT(config)
m = model.to(config.device)
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for iter in range(max_iters):
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
context = torch.zeros((1, 1), dtype=torch.long, device=config.device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
Training Output
WARWICK:
Yeart, their to you's my 'tcknow your turrothose...
ROMEO:
Wwell-Bethal,
Be lords!
Scaling from “GPT” to GPT-4
Overview
This section summarizes the evolution of the GPT family, from a simple, scratch-built model to GPT-4. Each stage reflects major increases in model size, dataset quality, training techniques, and inference strategies. As the architecture scaled, so did the model’s capabilities—enabling GPT to move from character-level toy outputs to world-class language generation and multimodal reasoning.
Glossary of Key Terms
- Tokenizer: Breaks input text into smaller units. GPT evolved from character-level to Byte-Pair Encoding (BPE), and later to multimodal tokenization (see the illustration after this list).
- Activation function: Determines how neurons in the model fire. GELU is smoother than ReLU and generally performs better in transformers.
- Embedding dimension: Size of the vector used to represent each token.
- Attention heads: Parallel attention mechanisms in each transformer block that allow the model to focus on different parts of the input simultaneously.
- Context length: Maximum number of tokens the model can consider at once.
- Parameter count: Total number of learnable weights in the model. Higher parameter counts generally allow for more expressive power.
- Training dataset: Text corpus used to train the model. Increased in size and diversity across versions.
- Optimizer: Algorithm used to adjust the model weights during training. Adam is widely used, often with modifications like learning rate warmup and weight decay.
- Sampling strategy: Method for generating output text. Greedy selects the highest-probability token each time, while top-$k$ introduces diversity.
- Mixture of Experts (MoE): A sparsely activated architecture where only subsets of the model are used per input, improving scalability.
- RLHF: Reinforcement Learning from Human Feedback, a training method that aligns model outputs with human preferences.
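To make the character-level vs. BPE contrast concrete, here is a small illustration using the tiktoken library's GPT-2 encoding (an external dependency, not part of the lecture code):

import tiktoken

sample = "To be, or not to be"
enc = tiktoken.get_encoding("gpt2")   # GPT-2's byte-pair encoding

char_ids = [ord(c) for c in sample]   # character-level: one id per character
bpe_ids = enc.encode(sample)          # BPE: one id per subword chunk

print(len(char_ids), "character tokens vs.", len(bpe_ids), "BPE tokens")
print(enc.decode(bpe_ids))            # decodes back to the original string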
From “GPT” to GPT-1
- Architecture:
  - Tokenization: Characters → Byte-Pair Encoding (BPE)
  - Activation: ReLU → GELU
  - Embeddings: Tied input/output embeddings
- Scale (117M params):
  - Layers: 4 → 12
  - Attention heads: 4 → 12
  - Context length: 32 → 512
  - Vocabulary: 65 → 40,000 tokens
  - Embedding dim: 64 → 768
- Training:
  - Dataset: TinyShakespeare (1MB) → BookCorpus (5GB)
  - Initialization/normalization: Default → Tuned
  - Optimizer: Adam → Adam + warmup + weight decay
- Inference:
  - Sampling: Greedy → Top-$k$ (see the sketch below)
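The greedy → top-$k$ change at inference time can be sketched as follows (an illustrative helper, not part of the lecture code):

import torch
import torch.nn.functional as F

def sample_next_token(logits, k=None):
    # logits: (vocab_size,) unnormalized scores for the next token
    if k is None:
        return torch.argmax(logits).item()            # greedy: always pick the top token
    topk_vals, topk_idx = torch.topk(logits, k)       # keep only the k highest-scoring tokens
    probs = F.softmax(topk_vals, dim=-1)              # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)  # sample one of them
    return topk_idx[choice].item()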
From GPT-1 to GPT-2
- Scale:
  - Layers: 12 → 48
  - Heads: 12 → 25
  - Embedding dim: 768 → 1600
  - Context length: 512 → 1024
  - Vocab size: 40k → 50k tokens
- Dataset: BookCorpus (5GB) → WebText (40GB)
From GPT-2 to GPT-3
- Scale:
  - Layers: 48 → 96
  - Heads: 25 → 96
  - Embedding dim: 1600 → 12,288
  - Context length: 1024 → 2048
- Dataset: WebText (40GB) → Common Crawl + books, code, etc. (~570GB)
From GPT-3 to GPT-4
- Architecture:
  - Likely incorporates Mixture-of-Experts (MoE)
  - Tokenizer expanded for multimodal inputs (e.g., images)
- Scale:
  - Parameters: 175B → estimated >1T
  - Context length: 2048 → 128,000 tokens
- Training:
  - Dataset: WebText + books, Wikipedia, code, etc. (~570GB) → much larger, proprietary dataset (details undisclosed)
  - Estimated ~13 trillion tokens (~50TB)
  - Alignment: Reinforcement learning from human feedback (RLHF) + system-level safety mechanisms
Mixture of Experts (MoE)
Instead of using a single feedforward network (FFN) at each transformer block, MoE introduces multiple "expert" FFNs and uses a routing mechanism to dynamically select a subset for each input.
- A router (gating network) analyzes each input token and assigns it to the most relevant expert(s).
- Only a small subset of experts (e.g., 1 or 2 out of 4) is activated per token.
- This allows for sparse, conditional computation, reducing cost while increasing model capacity.
[Figure: a standard transformer block with a single FFN (left) vs. an MoE transformer block with a router and multiple expert FFNs (right)]
- In a standard transformer (left), each token passes through a single FFN after self-attention.
- In an MoE transformer (right), this FFN is replaced with multiple parallel FFNs (the experts).
- The router selects a few experts to process the token, and their outputs are combined.
- Result: higher model capacity with the same or lower computational cost at inference time (a code sketch follows below).
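A minimal sketch of this idea in PyTorch, replacing the single FFN with a router over several expert FFNs (illustrative code; the dimensions, top-2 routing, and looping implementation are assumptions for clarity, not the design of any particular production model):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, n_embd, n_experts=4, top_k=2):
        super().__init__()
        # Each expert is an ordinary transformer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(n_embd, n_experts)  # routing logits per token
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, tokens, n_embd)
        gate_logits = self.router(x)             # (batch, tokens, n_experts)
        weights, idx = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # combine the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

Only the experts actually selected for some token are evaluated, which is where the compute savings come from.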
Mixture of Experts: Probabilistic View
Model the output as a weighted sum of predictions from multiple expert networks.
- Let \(P(Y \mid X) = \sum_{m} g_m(X)\, P_m(Y \mid X)\) (a numerical sketch follows this list), where:
  - \(P_m(Y \mid X)\): prediction from expert \(m\)
  - \(g_m(X)\): gating function output (i.e., weight) for expert \(m\)
- Subject to the constraints \(\sum_m g_m(X) = 1\) and \(g_m(X) \ge 0\ \forall m, X\), ensuring a valid probability distribution over experts.
- A gating network computes \(g_m(X)\), deciding how much each expert contributes based on the input.
- The router uses these weights to sample or activate experts probabilistically.
- This framework can be trained via the Expectation-Maximization (EM) algorithm, where:
  - the E-step estimates expert responsibilities \(g_m(X)\)
  - the M-step updates expert and gating parameters
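A small numerical sketch of the gated mixture likelihood (Gaussian experts and random tensors are arbitrary choices for illustration):

import torch
import torch.nn.functional as F

M = 3                                        # number of experts
y = torch.randn(5, 1)                        # a batch of 5 targets

gate_logits = torch.randn(5, M)              # in practice: a gating network applied to X
log_g = F.log_softmax(gate_logits, dim=-1)   # enforces sum_m g_m(X) = 1 and g_m(X) >= 0

mu = torch.randn(5, M)                       # each expert's mean prediction for its input
log_p = torch.distributions.Normal(mu, 1.0).log_prob(y)  # log P_m(Y | X), unit-variance experts

# log P(Y | X) = logsumexp_m [ log g_m(X) + log P_m(Y | X) ]
log_mixture = torch.logsumexp(log_g + log_p, dim=-1)
print(log_mixture)                           # one log-likelihood per example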
MoE: A Unifying Framework for Ensembles
Let the predictive distribution be modeled as: \(P(Y \mid X) = \sum_{m} g_m(X)\, P_m(Y \mid X)\)
- $g_m(X)$ is a general, input-dependent gating function: the standard mixture-of-experts case.
- $g_m(X) = \frac{1}{M}$ is a uniform average across experts.
- $g_m(X) = \pi_m$ is a fixed mixture weight, constant across inputs.
MoE: Error Analysis
Let: \(P(Y \mid X) = \sum_{m} g_m(X)\, P_m(Y \mid X)\)
Define the expected prediction (mean function) of the ensemble: \(\bar{f}(x) := \mathbb{E}[Y \mid X = x] = \sum_{m} g_m(x) f_m(x)\)
Compare two types of errors:
- Ensemble error: \[E(x) := (Y - \bar{f}(x))^2\]
- Average expert error: \[\bar{E}(x) := \frac{1}{M} \sum_m (Y - f_m(x))^2\]
Will minimizing the ensemble error $E(x)$ also minimize the average expert error $\bar{E}(x)$?
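Not necessarily. For uniform weights \(g_m(x) = 1/M\), expanding the average expert error gives the classical ambiguity decomposition (the cross term vanishes because the deviations \(f_m(x) - \bar{f}(x)\) average to zero):
\[\bar{E}(x) = \frac{1}{M}\sum_m \big(Y - f_m(x)\big)^2 = \underbrace{\big(Y - \bar{f}(x)\big)^2}_{E(x)} + \underbrace{\frac{1}{M}\sum_m \big(f_m(x) - \bar{f}(x)\big)^2}_{\text{expert disagreement}}\]
So \(E(x) = \bar{E}(x) - \text{disagreement}\): the ensemble error can keep falling as experts become more diverse, even while their average individual error grows, which is exactly the behavior shown in the next section.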
MoE: Diversity vs. Error
- Ensemble error: \(E(x) = (Y - \bar{f}(x))^2\)
- Average expert error: \(\bar{E}(x) = \frac{1}{M} \sum_m (Y - f_m(x))^2\)
- As expert diversity increases (x-axis = % unique training data per expert):
  - \(E(x)\) (solid line): ensemble error is minimized at moderate diversity.
  - \(\bar{E}(x)\) (dotted line): average expert error increases with diversity.
  - Dashed line: represents disagreement (variance) between the experts' outputs.
[Figure: ensemble error (solid), average expert error (dotted), and expert disagreement (dashed) as a function of diversity]
- Expert predictions can individually overfit, but if their errors are uncorrelated, the ensemble prediction can be robust and accurate.
- Diverse (even slightly overfitted) experts cancel out each other’s errors when averaged.
- Diversity, or purposeful variation among experts, improves ensemble generalization.
MoE in Large Language Models (LLMs)
- The input token is passed through a self-attention block, which outputs the activations \(x\) fed to the MoE layer.
- The router computes \(G(x)\), representing how much to weight each expert.
- A small number of experts (e.g., top-1 or top-2) are activated based on \(G(x)\).
- Only the selected expert FFNs are evaluated; their outputs \(E(x)\) are weighted by \(G(x)\) and aggregated into the final output \(y\).
[Figure: an MoE layer in an LLM: the router computes \(G(x)\) and activates a few expert FFNs]
- Sparse activation means only a few experts need to be loaded and executed per token, reducing compute and memory.
- Enables the use of massive models without needing to evaluate all parameters at once.
- Without proper regularization, some experts dominate while others are underused.
- As shown in the plot, more experts (e.g., 128) can reduce training loss but may hurt validation performance due to overfitting or imbalance.
- Validation loss increases for larger expert counts, indicating a generalization gap.
- Solution approaches may include load balancing, expert dropout, or routing noise (a sketch of a load-balancing loss follows below).
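One common remedy (used, for example, in the Switch Transformer family) is an auxiliary load-balancing loss that pushes the router toward a uniform allocation of tokens across experts; a rough sketch, not the exact formulation of any specific system:

import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits):
    # gate_logits: (num_tokens, num_experts) router scores for a batch of tokens
    probs = F.softmax(gate_logits, dim=-1)             # soft router probabilities
    assigned = probs.argmax(dim=-1)                    # hard top-1 assignment per token
    num_experts = gate_logits.shape[-1]

    # f[e]: fraction of tokens routed to expert e; P[e]: mean router probability for expert e.
    f = torch.bincount(assigned, minlength=num_experts).float() / gate_logits.shape[0]
    P = probs.mean(dim=0)

    # Equals 1 under perfectly uniform routing and grows as routing concentrates on few experts.
    return num_experts * torch.sum(f * P)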
Summary Tables
From Transformer to GPT
| Aspect | Transformer | GPT |
|---|---|---|
| Architecture | Encoder-decoder (full) | Decoder-only |
| Attention | Full self-attention | Masked (causal) self-attention |
| Positional encoding | Sinusoidal (original) | Learned positional embeddings |
| Output | Task-specific | Next-token prediction |
| Training objective | Flexible (e.g., translation) | Language modeling (autoregressive) |
| Inference | Depends on task | Greedy / sampling for text generation |
From GPT-1 to GPT-4
- Parameters: broad range; the largest grew from 1.5B → >1T (estimated)
- Context length: 512 → 128,000
- Layers: 12 → >96
- Attention heads: 12 → >96
- Embedding dimension: 768 → >12,288
- Vocabulary size: 40k → >50k tokens
- Tokenizer: supports multimodal inputs (e.g., images)
- Architecture: likely includes Mixture-of-Experts (MoE)
- Dataset: BookCorpus (5GB) → private corpus of ~13T tokens (~50TB)
- Alignment: reinforcement learning from human feedback (RLHF)