How LLMs Work

Attention

Why your inference costs scale quadratically

Attention is the mechanism that lets each token look at every other token to understand context. It's why 'bank' means different things in 'river bank' vs 'bank account' — and why doubling your context length quadruples your compute cost.
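The all-pairs lookup can be sketched as plain scaled dot-product self-attention in NumPy. This is a minimal illustration, not a production implementation; the (n, n) score matrix is where the quadratic cost comes from:

```python
import numpy as np

def attention(Q, K, V):
    # Every token scores against every other token -> an (n, n) matrix.
    # That n x n matrix is the quadratic term: doubling n quadruples it.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (n, n)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (n, d) mixed output

n, d = 8, 4  # toy sizes: 8 tokens, 4-dim embeddings
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
out = attention(x, x, x)  # self-attention: Q = K = V = token embeddings
```

At n = 8 the score matrix has 64 entries; at n = 16 it has 256. That 4x growth for a 2x longer context is the quadratic scaling the subtitle refers to.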

Why this matters

Attention is the bottleneck. It's why long-context models are expensive and why context window management is an engineering discipline, not an afterthought.