The Transformer changed everything. Published in 2017, it became the foundation of every major language model today.
Let's break it down from the ground up.
What is NLP?
Natural Language Processing (NLP) is about making computers understand and work with text.
Types of NLP Tasks
Classification
Input: text → Output: single prediction
- Sentiment analysis (positive/negative/neutral)
- Intent detection ("set an alarm")
- Language detection
- Topic modeling
Token Classification
Input: text → Output: one prediction per token
- Named Entity Recognition (labeling words as person, location, time)
- Part-of-speech tagging
Generation
Input: text → Output: text
- Machine translation
- Question answering (like ChatGPT)
- Summarization
- Code generation
1. Tokenization
Models work with numbers, not words. Tokenization converts text into tokens, each mapped to an integer ID.
Methods
Word Level
Split by words: ["a", "cute", "cat"]
Problem: "bear" and "bears" become different tokens. Hard to leverage root meanings.
Character Level
Split by characters: ["a", " ", "c", "u", "t", "e"]
Problem: Sequences become very long, and each token carries little meaning on its own. Slow processing.
Subword Level (Most Common)
Split smartly: ["un", "happi", "ness"]
Best of both worlds. Handles unknown words. Leverages word roots.
Vocabulary size: ~50,000 tokens for English models, ~100,000+ for multilingual.
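Here's a minimal sketch of subword tokenization in practice, assuming the Hugging Face transformers package is installed. GPT-2's tokenizer is just one convenient example; the exact splits depend on its learned vocabulary.

```python
# Sketch: subword tokenization with a pretrained BPE tokenizer.
# Assumes the Hugging Face `transformers` package is installed;
# the exact splits depend on the tokenizer's learned merges.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # ~50k-token BPE vocabulary

tokens = tokenizer.tokenize("unhappiness")
ids = tokenizer.encode("unhappiness")

print(tokens)  # a list of subword pieces (rare word split into known chunks)
print(ids)     # the integer IDs the model actually sees
```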
2. Word Embeddings
Tokens need numerical representations.
One-Hot Encoding
Each word gets a unique vector with a single 1.
"cat" → [1, 0, 0, 0, ...]
"dog" → [0, 1, 0, 0, ...]
Problem: All vectors are orthogonal. No semantic similarity captured.
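A quick numerical sketch of the problem (NumPy only, toy vocabulary):

```python
# Sketch: one-hot vectors carry no notion of similarity.
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]   # toy vocabulary
one_hot = np.eye(len(vocab))           # each row is a word's one-hot vector

cat, dog = one_hot[0], one_hot[1]
print(np.dot(cat, dog))   # 0.0 -- "cat" and "dog" look completely unrelated
print(np.dot(cat, cat))   # 1.0 -- every word is only similar to itself
```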
Learned Embeddings
We want: similar words → similar vectors.
Word2Vec (2013)
Train a shallow network to predict a word from the words around it (its context). The weights learned for each word become its embedding.
"king" - "man" + "woman" ≈ "queen"
This works because embeddings capture meaning, not just identity.
How It Works
- Take a word, convert to one-hot
- Pass through a small neural network
- Output: probabilities of the surrounding (context) words
- Train on lots of text
- The learned weights for each word = its embedding
Embedding size is typically 256-1024 dimensions. Trade-off between richness and speed.
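Here's a minimal sketch of that vector arithmetic with pretrained embeddings, assuming the gensim package is available. The GloVe vectors used here are just one convenient option; the first run downloads them.

```python
# Sketch: word-vector arithmetic with pretrained embeddings.
# Assumes `gensim` is installed; downloads the vectors on first run.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # 50-dimensional pretrained embeddings

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] -- the geometry encodes the analogy
```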
3. Recurrent Neural Networks (RNNs)
Word embeddings alone don't capture sentence meaning. Order matters.
"Dog bites man" ≠ "Man bites dog"
How RNNs Work
Process text one word at a time. Maintain a hidden state that updates with each word.
h₀ → [word₁] → h₁ → [word₂] → h₂ → ... → hₙ
The final hidden state represents the entire sentence.
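A minimal PyTorch sketch of that loop. The sizes are arbitrary, and random tensors stand in for real word embeddings.

```python
# Sketch: an RNN reads embeddings one step at a time and updates a hidden state.
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 32, 64, 5     # illustrative sizes
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

sentence = torch.randn(1, seq_len, embed_dim)  # stand-in for 5 word embeddings
outputs, h_n = rnn(sentence)                   # h_n: the final hidden state

print(outputs.shape)  # (1, 5, 64) -- one hidden state per word
print(h_n.shape)      # (1, 1, 64) -- the sentence representation
```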
Problems with RNNs
Vanishing Gradients
Gradients shrink as they flow back through many steps, so long sequences lose information from early words.
Sequential Processing
Can't parallelize. Each step depends on the previous one. Slow training.
4. The Transformer
Published in 2017: "Attention Is All You Need"
The key innovation: Self-Attention
Self-Attention
Instead of processing sequentially, look at all words at once.
Each word asks: "Which other words should I pay attention to?"
"The cat sat on the mat because it was tired"
When processing "it", self-attention learns that "it" refers to "cat", not "mat".
Architecture
Encoder
Reads the input. Creates rich representations.
Decoder
Generates the output token by token. Attends to the encoder's output.
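A minimal sketch of that encoder-decoder interface using PyTorch's built-in module, with arbitrary sizes and random tensors in place of real token embeddings and positional encodings:

```python
# Sketch: the encoder-decoder interface of a Transformer.
# Assumes a reasonably recent PyTorch (for batch_first); sizes are illustrative.
import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, d_model)  # encoder input: 10 source-token embeddings
tgt = torch.randn(1, 7, d_model)   # decoder input: 7 target-token embeddings so far

out = model(src, tgt)              # decoder attends to the encoder's representations
print(out.shape)                   # (1, 7, 64) -- one vector per target position
```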
Why It Works
- Parallel processing - All positions computed at once
- Long-range dependencies - Any word can attend to any other word
- Scalable - More layers, more parameters, better performance
Brief History
- 1980s-1990s - RNNs invented; LSTMs follow in 1997
- 2013 - Word2Vec shows meaningful embeddings
- 2017 - Transformer published
- 2018 - BERT, GPT-1
- 2020s - GPT-3, ChatGPT, scaling laws
The Transformer enabled the LLM revolution.
Evaluation Metrics
For Classification
- Accuracy - % correct
- Precision - Of predicted positives, how many are actually positive?
- Recall - Of actual positives, how many did we find?
- F1 Score - Harmonic mean of precision and recall
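A quick sketch with scikit-learn on made-up labels:

```python
# Sketch: classification metrics on toy predictions (1 = positive, 0 = negative).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))   # fraction of all predictions that are correct
print(precision_score(y_true, y_pred))  # of predicted positives, how many are truly positive
print(recall_score(y_true, y_pred))     # of actual positives, how many we found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```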
For Generation
- BLEU - n-gram overlap with a reference translation (precision-oriented)
- ROUGE - n-gram overlap with a reference, usually a summary (recall-oriented)
- Perplexity - How "surprised" is the model by its predictions? Lower = better.
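Perplexity is just the exponential of the average negative log-probability the model assigns to the correct tokens. A toy sketch with made-up probabilities:

```python
# Sketch: perplexity = exp(average negative log-probability of the correct tokens).
import math

# Made-up probabilities each model assigned to the correct next token.
probs_good_model = [0.6, 0.5, 0.7, 0.4]
probs_bad_model = [0.05, 0.1, 0.02, 0.08]

def perplexity(probs):
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

print(perplexity(probs_good_model))  # low perplexity: the model is rarely "surprised"
print(perplexity(probs_bad_model))   # high perplexity: the model is often "surprised"
```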
Key Takeaways
- Tokenization = Converting text to numbers (subword is best)
- Embeddings = Dense vector representations of words
- RNNs = Sequential processing (slow, loses long-range info)
- Transformer = Parallel attention-based processing
- Self-Attention = Each word attends to all other words