The Transformer changed everything. Published in 2017, it became the foundation of every major language model today.
Let's break it down from the ground up.
What is NLP?
Natural Language Processing (NLP) is about making computers understand and work with text.
Types of NLP Tasks
Classification
Input: text → Output: single prediction
- Sentiment analysis (positive/negative/neutral)
- Intent detection ("set an alarm")
- Language detection
- Topic modeling
Token Classification
Input: text → Output: one prediction per token
- Named Entity Recognition (labeling words as person, location, time)
- Part-of-speech tagging
Generation
Input: text → Output: text
- Machine translation
- Question answering (like ChatGPT)
- Summarization
- Code generation
1. Tokenization
Models work with numbers, not words. Tokenization converts text into tokens, each mapped to an integer ID.
Methods
Word Level
Split by words: ["a", "cute", "cat"]
Problem: "bear" and "bears" become different tokens. Hard to leverage root meanings.
Character Level
Split by characters: ["a", " ", "c", "u", "t", "e"]
Problem: Sequences become very long, and each token carries little meaning on its own. Slow processing.
Subword Level (Most Common)
Split smartly: ["un", "happi", "ness"]
Best of both worlds. Handles unknown words. Leverages word roots.
Vocabulary size: ~50,000 tokens for English models, ~100,000+ for multilingual.
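Here's a minimal sketch of subword tokenization in practice, assuming the Hugging Face transformers package is installed. GPT-2's tokenizer is just one convenient example; the exact splits depend on its learned vocabulary.

```python
# Sketch: subword tokenization with a pretrained BPE tokenizer.
# Assumes the Hugging Face `transformers` package is installed;
# the exact splits depend on the tokenizer's learned merges.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # ~50k-token BPE vocabulary

tokens = tokenizer.tokenize("unhappiness")
ids = tokenizer.encode("unhappiness")

print(tokens)  # a list of subword pieces (rare word split into known chunks)
print(ids)     # the integer IDs the model actually sees
```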
2. Word Embeddings
Tokens need numerical representations.
One-Hot Encoding
Each word gets a unique vector with a single 1.
"cat" → [1, 0, 0, 0, ...]
"dog" → [0, 1, 0, 0, ...]
Problem: All vectors are orthogonal. No semantic similarity captured.
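A quick numerical sketch of the problem (NumPy only, toy vocabulary):

```python
# Sketch: one-hot vectors carry no notion of similarity.
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]   # toy vocabulary
one_hot = np.eye(len(vocab))           # each row is a word's one-hot vector

cat, dog = one_hot[0], one_hot[1]
print(np.dot(cat, dog))   # 0.0 -- "cat" and "dog" look completely unrelated
print(np.dot(cat, cat))   # 1.0 -- every word is only similar to itself
```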
Learned Embeddings
We want: similar words → similar vectors.
Word2Vec (2013)
Train a shallow network to predict a word from the words around it (its context). The weights learned for each word become its embedding.
"king" - "man" + "woman" ≈ "queen"
This works because embeddings capture meaning, not just identity.
How It Works
- Take a word, convert to one-hot
- Pass through a small neural network
- Output: probabilities of the surrounding (context) words
- Train on lots of text
- The learned weights for each word = its embedding
Embedding size is typically 256-1024 dimensions. Trade-off between richness and speed.
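Here's a minimal sketch of that vector arithmetic with pretrained embeddings, assuming the gensim package is available. The GloVe vectors used here are just one convenient option; the first run downloads them.

```python
# Sketch: word-vector arithmetic with pretrained embeddings.
# Assumes `gensim` is installed; downloads the vectors on first run.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # 50-dimensional pretrained embeddings

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] -- the geometry encodes the analogy
```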
3. Recurrent Neural Networks (RNNs)
Word embeddings alone don't capture sentence meaning. Order matters.
"Dog bites man" ≠ "Man bites dog"
How RNNs Work
Process text one word at a time. Maintain a hidden state that updates with each word.
h₀ → [word₁] → h₁ → [word₂] → h₂ → ... → hₙ
The final hidden state represents the entire sentence.
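A minimal PyTorch sketch of that loop. The sizes are arbitrary, and random tensors stand in for real word embeddings.

```python
# Sketch: an RNN reads embeddings one step at a time and updates a hidden state.
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 32, 64, 5     # illustrative sizes
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

sentence = torch.randn(1, seq_len, embed_dim)  # stand-in for 5 word embeddings
outputs, h_n = rnn(sentence)                   # h_n: the final hidden state

print(outputs.shape)  # (1, 5, 64) -- one hidden state per word
print(h_n.shape)      # (1, 1, 64) -- the sentence representation
```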
Problems with RNNs
Vanishing Gradients
Gradients shrink as they flow back through many steps, so long sequences lose information from early words.
Sequential Processing
Can't parallelize. Each step depends on the previous one. Slow training.
4. The Transformer
Published in 2017: "Attention Is All You Need"
The key innovation: Self-Attention
Self-Attention
Instead of processing sequentially, look at all words at once.
Each word asks: "Which other words should I pay attention to?"
"The cat sat on the mat because it was tired"
When processing "it", self-attention learns that "it" refers to "cat", not "mat".
Architecture
Encoder
Reads the input. Creates rich representations.
Decoder
Generates the output token by token. Attends to the encoder's output.
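A minimal sketch of that encoder-decoder interface using PyTorch's built-in module, with arbitrary sizes and random tensors in place of real token embeddings and positional encodings:

```python
# Sketch: the encoder-decoder interface of a Transformer.
# Assumes a reasonably recent PyTorch (for batch_first); sizes are illustrative.
import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, d_model)  # encoder input: 10 source-token embeddings
tgt = torch.randn(1, 7, d_model)   # decoder input: 7 target-token embeddings so far

out = model(src, tgt)              # decoder attends to the encoder's representations
print(out.shape)                   # (1, 7, 64) -- one vector per target position
```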
Why It Works
- Parallel processing - All positions computed at once
- Long-range dependencies - Any word can attend to any other word
- Scalable - More layers, more parameters, better performance
Brief History
- 1980s-1990s - RNNs invented; LSTMs follow in 1997
- 2013 - Word2Vec shows meaningful embeddings
- 2017 - Transformer published
- 2018 - BERT, GPT-1
- 2020s - GPT-3, ChatGPT, scaling laws
The Transformer enabled the LLM revolution.
Evaluation Metrics
For Classification
- Accuracy - % correct
- Precision - Of predicted positives, how many are actually positive?
- Recall - Of actual positives, how many did we find?
- F1 Score - Harmonic mean of precision and recall
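A quick sketch with scikit-learn on made-up labels:

```python
# Sketch: classification metrics on toy predictions (1 = positive, 0 = negative).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))   # fraction of all predictions that are correct
print(precision_score(y_true, y_pred))  # of predicted positives, how many are truly positive
print(recall_score(y_true, y_pred))     # of actual positives, how many we found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```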
For Generation
- BLEU - n-gram overlap with a reference translation (precision-oriented)
- ROUGE - n-gram overlap with a reference, usually a summary (recall-oriented)
- Perplexity - How "surprised" is the model by its predictions? Lower = better.
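Perplexity is just the exponential of the average negative log-probability the model assigns to the correct tokens. A toy sketch with made-up probabilities:

```python
# Sketch: perplexity = exp(average negative log-probability of the correct tokens).
import math

# Made-up probabilities each model assigned to the correct next token.
probs_good_model = [0.6, 0.5, 0.7, 0.4]
probs_bad_model = [0.05, 0.1, 0.02, 0.08]

def perplexity(probs):
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

print(perplexity(probs_good_model))  # low perplexity: the model is rarely "surprised"
print(perplexity(probs_bad_model))   # high perplexity: the model is often "surprised"
```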
Key Takeaways
- Tokenization = Converting text to numbers (subword is best)
- Embeddings = Dense vector representations of words
- RNNs = Sequential processing (slow, loses long-range info)
- Transformer = Parallel attention-based processing
- Self-Attention = Each word attends to all other words