Why This Paper Matters
Before Transformers, sequence models like RNNs and LSTMs processed tokens one at a time. This made training slow and caused long-range dependencies to fade. This paper introduced a fully attention-based architecture that processes all tokens in parallel.
Every modern LLM - GPT, BERT, LLaMA, Gemini - is built on this foundation.
1. Core Idea
The paper proposes self-attention as a replacement for recurrence. Instead of processing sequentially, every token can attend to every other token simultaneously.
"The cat sat on the mat because it was tired"
When processing "it", the model learns to attend most strongly to "cat" - capturing the reference without sequential processing.
2. Architecture
The Transformer has two main components:
Encoder
Reads the input sequence and produces contextual representations. Uses multi-head self-attention and feed-forward layers.
Decoder
Generates output tokens one at a time. Uses masked self-attention (can only look at previous tokens) plus cross-attention to the encoder output.
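The masking the decoder relies on can be sketched as follows. This is an illustrative NumPy snippet (names and shapes are my own, not the paper's code): future positions get a score of -inf, so softmax gives them zero weight.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular entries (future positions) are set to -inf so
    # that softmax assigns them zero attention weight.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))            # dummy attention scores for 4 tokens
masked = scores + causal_mask(4)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights[0])  # the first token can attend only to itself: [1. 0. 0. 0.]
```

Because the mask is applied to the scores before softmax rather than to the output, each row still sums to 1 over the positions that remain visible.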
Multi-Head Attention
Instead of one attention function, the model runs multiple attention heads in parallel. Each head can learn different patterns - one might track syntax, another might track semantics.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
- Q (Query) - What am I looking for?
- K (Key) - What do I contain?
- V (Value) - What information do I provide?
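The formula above can be written out directly. A minimal NumPy sketch (shapes and variable names are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query tokens, d_k = 8
K = rng.normal(size=(5, 8))   # 5 key tokens
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

The division by √d_k is the "scaled" part: without it, dot products grow with dimension and push softmax into regions with vanishing gradients.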
Positional Encoding
Since there's no recurrence, the model has no sense of word order. The paper adds sinusoidal positional encodings to the input embeddings so the model knows where each token sits in the sequence.
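The sinusoidal scheme from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched like this (NumPy, illustrative names):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Each position gets a vector of sines and cosines at geometrically
    # spaced frequencies; even dims use sin, odd dims use cos.
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: [0. 1. 0. 1.] since sin(0)=0, cos(0)=1
```

These vectors are simply added to the token embeddings, so the same word at different positions produces different inputs to attention.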
3. Key Results
- Machine Translation - Achieved new SOTA on WMT 2014 English-to-German (28.4 BLEU) and English-to-French (41.8 BLEU)
- Training Speed - The big model trained in 3.5 days on 8 P100 GPUs, significantly faster than comparable RNN-based models
- Parallelization - Self-attention enables full parallelism during training, unlike sequential RNNs
4. What I Learned
- Attention replaces recurrence - You don't need sequential processing for sequence tasks. Attention alone is enough.
- Scaling works - The architecture is simple enough to scale to billions of parameters. This simplicity is what made GPT possible.
- Positional encoding is a design choice - The original paper used fixed sinusoidal encodings, but later work (RoPE, ALiBi) showed that rotary and relative position schemes generalize better to longer sequences at scale.
- Multi-head attention = multiple perspectives - Running parallel attention heads lets the model capture different types of relationships simultaneously.
5. Improvements Since
- BERT (2018) - Encoder-only Transformer for understanding tasks
- GPT (2018-2024) - Decoder-only Transformer for generation
- RoPE - Rotary positional embeddings for better length generalization
- FlashAttention - Hardware-aware attention for faster training
- KV Cache - Optimized inference by caching key-value pairs
- Grouped Query Attention - Reduces memory by sharing KV heads
Sources: