Why This Paper Matters
Before Transformers, sequence models like RNNs and LSTMs processed tokens one at a time. This made training slow and caused long-range dependencies to fade. This paper introduced a fully attention-based architecture that processes all tokens in parallel.
Every modern LLM - GPT, BERT, LLaMA, Gemini - is built on this foundation.
1. Core Idea
The paper proposes self-attention as a replacement for recurrence. Instead of processing sequentially, every token can attend to every other token simultaneously.
"The cat sat on the mat because it was tired"
When processing "it", the model learns to attend most strongly to "cat" - capturing the reference without sequential processing.
2. Architecture
The Transformer has two main components:
Encoder
Reads the input sequence and produces contextual representations. Uses multi-head self-attention and feed-forward layers.
Decoder
Generates output tokens one at a time. Uses masked self-attention (can only look at previous tokens) plus cross-attention to the encoder output.
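The masking the decoder relies on can be sketched as follows. This is an illustrative NumPy snippet (names and shapes are my own, not the paper's code): future positions get a score of -inf, so softmax gives them zero weight.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular entries (future positions) are set to -inf so
    # that softmax assigns them zero attention weight.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))            # dummy attention scores for 4 tokens
masked = scores + causal_mask(4)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights[0])  # the first token can attend only to itself: [1. 0. 0. 0.]
```

Because the mask is applied to the scores before softmax rather than to the output, each row still sums to 1 over the positions that remain visible.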
Multi-Head Attention
Instead of one attention function, the model runs multiple attention heads in parallel. Each head can learn different patterns - one might track syntax, another might track semantics.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
- Q (Query) - What am I looking for?
- K (Key) - What do I contain?
- V (Value) - What information do I provide?
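The formula above can be written out directly. A minimal NumPy sketch (shapes and variable names are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query tokens, d_k = 8
K = rng.normal(size=(5, 8))   # 5 key tokens
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

The division by √d_k is the "scaled" part: without it, dot products grow with dimension and push softmax into regions with vanishing gradients.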
Positional Encoding
Since there's no recurrence, the model has no sense of word order. The paper adds sinusoidal positional encodings to the input embeddings so the model knows where each token sits in the sequence.
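The sinusoidal scheme from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched like this (NumPy, illustrative names):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Each position gets a vector of sines and cosines at geometrically
    # spaced frequencies; even dims use sin, odd dims use cos.
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: [0. 1. 0. 1.] since sin(0)=0, cos(0)=1
```

These vectors are simply added to the token embeddings, so the same word at different positions produces different inputs to attention.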
3. Key Results
- Machine Translation - Achieved new SOTA on WMT 2014 English-to-German (28.4 BLEU) and English-to-French (41.8 BLEU)
- Training Speed - The big model trained in 3.5 days on 8 P100 GPUs, significantly faster than comparable RNN-based models
- Parallelization - Self-attention enables full parallelism during training, unlike sequential RNNs
4. What I Learned
- Attention replaces recurrence - You don't need sequential processing for sequence tasks. Attention alone is enough.
- Scaling works - The architecture is simple enough to scale to billions of parameters. This simplicity is what made GPT possible.
- Positional encoding is a design choice - The original paper used fixed sinusoidal encodings, but later work (RoPE, ALiBi) showed that rotary and relative position schemes generalize better to longer sequences at scale.
- Multi-head attention = multiple perspectives - Running parallel attention heads lets the model capture different types of relationships simultaneously.
5. Improvements Since
- BERT (2018) - Encoder-only Transformer for understanding tasks
- GPT (2018-2024) - Decoder-only Transformer for generation
- RoPE - Rotary positional embeddings for better length generalization
- FlashAttention - Hardware-aware attention for faster training
- KV Cache - Optimized inference by caching key-value pairs
- Grouped Query Attention - Reduces memory by sharing KV heads
Sources: