Transformer

January 30, 2026

Transformer

The Transformer changed everything. Published in 2017, it became the foundation of every major language model today.

Let's break it down from the ground up.

What is NLP?

Natural Language Processing (NLP) is about making computers understand and work with text.

Types of NLP Tasks

Classification
Input: text → Output: single prediction

  • Sentiment analysis (positive/negative/neutral)
  • Intent detection ("set an alarm")
  • Language detection
  • Topic classification

Token Classification (Sequence Labeling)
Input: text → Output: one prediction per token

  • Named Entity Recognition (labeling words as person, location, time)
  • Part-of-speech tagging

Generation
Input: text → Output: text

  • Machine translation
  • Question answering (like ChatGPT)
  • Summarization
  • Code generation

1. Tokenization

Models understand numbers, not words. We need to convert text into tokens.

Methods

Word Level
Split by words: ["a", "cute", "cat"]

Problem: "bear" and "bears" become different tokens. Hard to leverage root meanings.

Character Level
Split by characters: ["a", " ", "c", "u", "t", "e"]

Problem: Sequences become very long. Slow processing.

Subword Level (Most Common)
Split smartly: ["un", "happi", "ness"]

Best of both worlds. Handles unknown words. Leverages word roots.

Vocabulary size: ~50,000 tokens for English models, ~100,000+ for multilingual.
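To make this concrete, here is a toy sketch of subword splitting in Python, using greedy longest-match against a tiny made-up vocabulary (real tokenizers such as BPE or WordPiece learn their vocabulary from data; this one is hand-picked for illustration):

# Greedy longest-match subword tokenizer over a toy, hand-picked vocabulary.
VOCAB = {"un", "happi", "ness", "cat", "s", "a", "cute"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest piece starting at i, shrinking until one is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no vocabulary piece matched this character
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("cats"))         # ['cat', 's']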

2. Word Embeddings

Tokens need numerical representations.

One-Hot Encoding

Each word gets a unique vector with a single 1.

"cat"  [1, 0, 0, 0, ...]
"dog"  [0, 1, 0, 0, ...]

Problem: All vectors are orthogonal. No semantic similarity captured.
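A quick sketch of the problem (assuming NumPy): every pair of distinct one-hot vectors has dot product 0, so "cat" looks exactly as unrelated to "kitten" as it does to "car".

import numpy as np

# One-hot vectors for a tiny vocabulary: rows of the identity matrix.
vocab = ["cat", "dog", "kitten", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["cat"])                      # [1. 0. 0. 0.]
print(one_hot["cat"] @ one_hot["kitten"])  # 0.0
print(one_hot["cat"] @ one_hot["car"])     # 0.0 -- no notion of similarity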

Learned Embeddings

We want: similar words → similar vectors.

Word2Vec (2013)
Train a shallow network to predict a word's surrounding context words (or the word from its context). The hidden layer becomes the word embedding.

"king" - "man" + "woman"  "queen"

This works because embeddings capture meaning, not just identity.

How It Works

  1. Take a word, convert to one-hot
  2. Pass through a small neural network
  3. Output: probabilities of the surrounding context words
  4. Train on lots of text
  5. The hidden layer = word embedding

Embedding size is typically 256-1024 dimensions. Trade-off between richness and speed.
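Below is a minimal sketch of that training loop, assuming NumPy and a skip-gram-style objective (predict nearby words from the center word, with a plain softmax; no negative sampling). The corpus, dimensions, and hyperparameters are toy values, so the learned similarities are only illustrative.

import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                        # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # hidden-layer weights = the embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output projection over the vocabulary

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr, window = 0.05, 2
for epoch in range(200):
    for i, word in enumerate(corpus):
        center = word2id[word]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            h = W_in[center]                # "pass through the network": embedding lookup
            p = softmax(h @ W_out)          # predicted distribution over the vocabulary
            grad = p.copy()
            grad[word2id[corpus[j]]] -= 1   # cross-entropy gradient w.r.t. the logits
            grad_in = W_out @ grad
            W_out -= lr * np.outer(h, grad)
            W_in[center] -= lr * grad_in

# Words used in similar contexts drift toward similar vectors.
def most_similar(word, k=3):
    v = W_in[word2id[word]]
    sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v))
    return sorted(zip(vocab, sims), key=lambda t: -t[1])[:k]

print(most_similar("cat"))  # top match is the word itself, then its nearest neighbors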

3. Recurrent Neural Networks (RNNs)

Word embeddings alone don't capture sentence meaning. Order matters.

"Dog bites man" ≠ "Man bites dog"

How RNNs Work

Process text one word at a time. Maintain a hidden state that updates with each word.

h₀ → [word₁] → h₁ → [word₂] → h₂ → ... → hₙ

The final hidden state represents the entire sentence.
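A minimal sketch of that update rule (assuming NumPy; the weights are random and untrained, just to show the shape of the computation):

import numpy as np

# Vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
D_in, D_h = 8, 16                          # embedding size, hidden-state size
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(D_h, D_in))
W_h = rng.normal(scale=0.1, size=(D_h, D_h))
b = np.zeros(D_h)

def encode(embeddings):
    h = np.zeros(D_h)                       # h0: empty memory
    for x in embeddings:                    # one word at a time, in order
        h = np.tanh(W_x @ x + W_h @ h + b)  # fold this word into the memory
    return h                                # final hidden state = sentence summary

sentence = rng.normal(size=(5, D_in))       # five fake word embeddings
print(encode(sentence).shape)               # (16,)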

Problems with RNNs

Vanishing Gradients
Gradients shrink as they flow back through many steps, so in long sequences the model struggles to retain and learn from information in the early words.

Sequential Processing
Can't parallelize. Each step depends on the previous one. Slow training.

4. The Transformer

Published in 2017: "Attention Is All You Need" (Vaswani et al.)

The key innovation: Self-Attention

Self-Attention

Instead of processing sequentially, look at all words at once.

Each word asks: "Which other words should I pay attention to?"

"The cat sat on the mat because it was tired"

When processing "it", self-attention learns that "it" refers to "cat", not "mat".
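Here is a minimal sketch of single-head scaled dot-product self-attention (assuming NumPy; the projection weights are random and untrained, so the attention pattern is not yet meaningful, but the shapes and data flow are the real mechanism):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each word should look at each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per word
    return weights @ V                               # each output mixes information from all positions

rng = np.random.default_rng(0)
n_words, d = 10, 16                        # e.g. "The cat sat on the mat because it was tired"
X = rng.normal(size=(n_words, d))          # word embeddings (plus positions, in a real model)
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

print(self_attention(X, W_q, W_k, W_v).shape)  # (10, 16): one updated vector per word, computed in parallel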

Architecture

Encoder
Reads the input. Creates rich representations.

Decoder
Generates output. Attends to encoder's output.
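As a sketch of how the two halves fit together, PyTorch ships an encoder-decoder Transformer module; the sizes below are toy values (not the paper's configuration), and the random tensors stand in for already-embedded tokens:

import torch

model = torch.nn.Transformer(d_model=64, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)

src = torch.randn(12, 1, 64)   # encoder input: 12 source tokens, already embedded
tgt = torch.randn(7, 1, 64)    # decoder input: the 7 target tokens produced so far
out = model(src, tgt)          # the decoder attends to the encoder's representations
print(out.shape)               # torch.Size([7, 1, 64]): one vector per target position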

Why It Works

  • Parallel processing - All positions computed at once
  • Long-range dependencies - Any word can attend to any other word
  • Scalable - More layers, more parameters, better performance

Brief History

  • 1980s-1990s - RNNs developed; LSTMs introduced in 1997
  • 2013 - Word2Vec shows meaningful embeddings
  • 2017 - Transformer published
  • 2018 - BERT, GPT-1
  • 2020s - GPT-3, ChatGPT, scaling laws

The Transformer enabled the LLM revolution.

Evaluation Metrics

For Classification

  • Accuracy - % correct
  • Precision - Of predicted positives, how many are actually positive?
  • Recall - Of actual positives, how many did we find?
  • F1 Score - Harmonic mean of precision and recall
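A minimal sketch of these three metrics for a binary classifier (the labels below are made up):

def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))   # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # ≈ (0.67, 0.67, 0.67)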

For Generation

  • BLEU - N-gram precision overlap between output and reference; standard for translation
  • ROUGE - N-gram and longest-common-subsequence recall overlap with a reference; standard for summarization
  • Perplexity - How "surprised" is the model by its predictions? Lower = better.
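Concretely, perplexity is the exponential of the average negative log-probability the model assigned to the tokens it actually saw; a short sketch with made-up per-token probabilities:

import math

token_probs = [0.2, 0.5, 0.1, 0.4]   # model's probability for each observed token (made up)
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(nll))   # ~3.98; assigning 0.25 to every token would give perplexity exactly 4.0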

Key Takeaways

  • Tokenization = Converting text to numbers (subword is best)
  • Embeddings = Dense vector representations of words
  • RNNs = Sequential processing (slow, loses long-range info)
  • Transformer = Parallel attention-based processing
  • Self-Attention = Each word attends to all other words
