Tokenization and Byte-Pair Encoding
Tokenization is the process of converting raw text into sequences of discrete tokens (integer IDs) that neural networks can process. The choice of tokenization strategy significantly impacts model performance, vocabulary size, and computational efficiency.
Why Tokenization?
Neural networks operate on numeric tensors, not text. Tokenization bridges this gap by:
- Converting text to numbers: Map strings to integer IDs
- Creating a vocabulary: Define the set of valid tokens
- Enabling learning: Let the model learn embeddings for each token
- Handling unknown inputs: Deal with words/characters not seen during training
Tokenization Strategies
Character-Level Tokenization
Treat each character as a token.
Example: "Hello" → ['H', 'e', 'l', 'l', 'o'] → [72, 101, 108, 108, 111]
Pros:
- Very small vocabulary (letters, digits, and punctuation: roughly 100 tokens for English)
- No out-of-vocabulary (OOV) problems
- Works for any language
Cons:
- Very long sequences (one token per character)
- Harder to learn semantic relationships
- Computationally expensive (quadratic attention cost)
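A character-level tokenizer for plain ASCII text needs nothing beyond Python built-ins; a minimal sketch of the encode/decode round trip:
text = "Hello"
tokens = [ord(c) for c in text]             # encode: characters → Unicode code points
print(tokens)                               # [72, 101, 108, 108, 111]
decoded = "".join(chr(t) for t in tokens)   # decode: code points → characters
print(decoded)                              # Hello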
Word-Level Tokenization
Treat each word as a token.
Example: "Hello, world!" → ['Hello', ',', 'world', '!'] → [5492, 11, 8495, 0]
Pros:
- Natural semantic units
- Shorter sequences than character-level
- Intuitive and interpretable
Cons:
- Huge vocabulary (100K+ words for English)
- Out-of-vocabulary problems for rare words
- Different forms of the same word become separate tokens (run, running, ran)
- Poor multilingual support
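A rough word-level tokenizer can be sketched with a regular expression and a vocabulary built from a training corpus. The corpus and the resulting IDs below are arbitrary illustrations, not the IDs of any real model; note how an unseen word immediately falls back to an unknown token:
import re

corpus = "Hello , world ! Hello again"
words = re.findall(r"\w+|[^\w\s]", corpus)      # split into words and punctuation
vocab = {"<unk>": 0}                            # reserve ID 0 for unknown words
for w in words:
    vocab.setdefault(w, len(vocab))

def encode(text):
    return [vocab.get(w, vocab["<unk>"]) for w in re.findall(r"\w+|[^\w\s]", text)]

print(encode("Hello, world!"))    # [1, 2, 3, 4]
print(encode("Goodbye, world!"))  # [0, 2, 3, 4]  ← "Goodbye" is out-of-vocabulary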
Subword Tokenization (BPE)
Best of both worlds: Frequent words become single tokens, rare words split into subwords.
Example:
"Hello"→['Hello'](common word, single token)"tokenization"→['token', 'ization'](less common, split)"antidisestablishmentarianism"→['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
Pros:
- Balanced vocabulary size (32K-50K tokens typical)
- No OOV issues (can always fall back to characters)
- Learns frequent patterns (like suffixes, prefixes)
- Good multilingual support
- Used, in BPE or closely related forms (WordPiece, SentencePiece), by GPT, BERT, T5, and most modern LLMs
Cons:
- More complex implementation
- Token boundaries may not align with linguistic units
Byte-Pair Encoding (BPE) Algorithm
BPE is a compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens, starting from individual characters.
Algorithm Steps
- Initialize: Start with vocabulary of individual characters
- Count pairs: Find the most frequent adjacent pair of tokens
- Merge pair: Create new token by combining the pair
- Repeat: Continue until vocabulary reaches desired size
Example Walkthrough
Corpus: "low low low lower lowest"
Step 0 (characters only):
Tokens: ['l', 'o', 'w', ' ', 'l', 'o', 'w', ' ', 'l', 'o', 'w', ' ',
'l', 'o', 'w', 'e', 'r', ' ', 'l', 'o', 'w', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't']
Step 1 (merge most frequent pair 'l', 'o'):
Tokens: ['lo', 'w', ' ', 'lo', 'w', ' ', 'lo', 'w', ' ',
'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo']
Step 2 (merge 'lo', 'w'):
Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low']
Step 3 (merge 'low', 'e'; merges are applied within words, so pairs spanning the space are not considered):
Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'lowe', 'r', ' ', 'lowe', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low', 'lowe']
And so on, until the vocabulary reaches the target size.
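The walkthrough can be reproduced with a short plain-Python sketch of the classic word-level BPE trainer: pairs are counted within whitespace-separated words and weighted by word frequency. This is illustrative only, not how production tokenizers are implemented:
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols, weighted by its frequency
    word_freqs = Counter(corpus.split())
    words = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = words[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Apply the merge to every word
        for word, symbols in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[word] = tuple(merged)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=3)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(words)   # {'low': ('low',), 'lower': ('lowe', 'r'), 'lowest': ('lowe', 's', 't')}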
Implementation with TikToken
Modern implementations use efficient Rust-backed libraries such as OpenAI's tiktoken:
import tiktoken
# Load GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")
# Encode text to token IDs
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens) # [15496, 11, 995, 0]
# Decode token IDs back to text
decoded = enc.decode(tokens)
print(decoded) # "Hello, world!"
# See individual tokens
for token_id in tokens:
print(f"{token_id}: {enc.decode([token_id])!r}")
# 15496: 'Hello'
# 11: ','
# 995: ' world'
# 0: '!'
Special Tokens
Most tokenizers include special tokens:
# GPT-2 special token
enc.encode("<|endoftext|>") # [50256]
# Used for:
# - End of sequence marker
# - Separating different documents during training
# - Padding (in some implementations)
Tokenization Properties
Vocabulary Size
Typical sizes for modern models:
- GPT-2: 50,257 tokens
- GPT-3: 50,257 tokens (same as GPT-2)
- GPT-4: ~100,000 tokens (improved multilingual support)
- BERT: 30,522 tokens (WordPiece, similar to BPE)
Trade-offs:
- Smaller vocab: Longer token sequences for the same text, so more of the context window is consumed
- Larger vocab: Shorter sequences, but a larger embedding matrix and rarer tokens that each receive less training signal
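The vocabulary size directly sets the size of the token embedding matrix. A back-of-the-envelope calculation using GPT-2's vocabulary and GPT-2 small's 768-dimensional embeddings (float32 assumed):
vocab_size = 50257      # GPT-2 vocabulary
d_model = 768           # GPT-2 small embedding dimension
bytes_per_param = 4     # float32
embedding_mb = vocab_size * d_model * bytes_per_param / 1e6
print(f"{embedding_mb:.0f} MB")  # ≈ 154 MB for the token embedding table alone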
Compression Ratio
How many characters become one token:
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
compression_ratio = len(text) / len(tokens)
# English text typically: 3-4 characters per token
Implications:
- Model with 2048 token context ≈ 6000-8000 characters of English text
- Compression ratio varies by language (worse for non-Latin scripts)
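Combining the typical ratio with a context length gives a rough character budget (the exact figure depends on the text and tokenizer):
context_tokens = 2048
for chars_per_token in (3.0, 4.0):           # typical range for English BPE
    print(context_tokens * chars_per_token)  # 6144.0, then 8192.0 characters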
Out-of-Vocabulary Handling
BPE can tokenize any string by falling back to bytes or characters:
# Extremely rare word
text = "supercalifragilisticexpialidocious"
tokens = enc.encode(text)
# Splits into subwords: ['super', 'cal', 'ifr', 'ag', 'ilistic', 'exp', 'ial', 'id', 'oc', 'ious']
# No token is "unknown"
Domain-Specific Tokenization
Training Custom Tokenizers
For specialized domains (medical, legal, code), you can train custom BPE:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
# Initialize BPE model
tokenizer = Tokenizer(models.BPE())
# Split on whitespace/punctuation so merges stay within words
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# Train on domain-specific corpus
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(medical_texts_iterator, trainer=trainer)
# Save for reuse
tokenizer.save("medical_tokenizer.json")
Benefits:
- Better compression for domain-specific terms
- Preserves important domain vocabulary as single tokens
- Example: Medical tokenizer keeps “diabetes” as one token, not split into subwords
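Once trained, the tokenizer can be reloaded from the saved file and used directly; the input sentence below is just a placeholder:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("medical_tokenizer.json")
encoding = tokenizer.encode("Patient presents with type 2 diabetes")
print(encoding.tokens)  # domain terms like "diabetes" should stay intact
print(encoding.ids)     # integer IDs for the model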
Structured Data Tokenization
For non-text sequences (event logs, medical codes, DNA):
# Treat each discrete event as a "word"
events = ["LOGIN", "VIEW_PAGE", "ADD_TO_CART", "CHECKOUT", "LOGOUT"]
event_vocab = {event: idx for idx, event in enumerate(dict.fromkeys(events))}  # first-seen order (set() would be unordered)
event_ids = [event_vocab[e] for e in events]
# [0, 1, 2, 3, 4]
# Can apply BPE to learn frequent event patterns
# "LOGIN" + "VIEW_PAGE" might become single token if frequentCommon Issues and Solutions
Common Issues and Solutions
Tokenization Artifacts
Problem: Inconsistent tokenization can confuse models.
# Space before word affects tokenization
enc.encode("Hello") # [15496]
enc.encode(" Hello") # [18435] # Different token!Solution: Normalize input text before tokenization.
Long Tokens
Problem: Some tokens are very long subwords.
enc.encode("antidisestablishmentarianism") # May be 5-6 tokensImpact: Models see these as atomic units, may miss internal structure.
Multilingual Challenges
Problem: BPE trained on English over-represents English.
# English: ~3-4 chars/token
enc.encode("Hello world") # 2 tokens, 11 chars → 5.5 chars/token
# Arabic: worse compression
enc.encode("مرحبا بالعالم") # 6+ tokens, 12 chars → ~2 chars/tokenSolution: Train on balanced multilingual corpus (like GPT-4 tokenizer).
Key Insights
- Subword tokenization is standard: BPE balances vocabulary size and sequence length
- Tokenizer affects model behavior: Different tokenizers create different token boundaries
- No true OOV: BPE can encode any text by falling back to bytes/characters
- Domain matters: Custom tokenizers improve performance on specialized text
- Multilingual is hard: Need balanced training data for fair cross-lingual compression
Related Concepts
- GPT Architecture - Uses token embeddings as input
- Text Generation - Generates token IDs that are decoded to text
- Language Model Training - Trains on tokenized sequences
Learning Resources
Papers
- Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) - Original BPE for NLP paper
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo & Richardson, 2018) - Language-independent subword tokenizer supporting BPE and unigram models
Implementations
- OpenAI tiktoken - Fast BPE tokenizer used by GPT models
- Hugging Face Tokenizers - Rust-based fast tokenizers library
- SentencePiece - Google’s language-independent tokenizer
Articles
- The Illustrated Word2Vec - Jay Alammar on word embeddings
- Hugging Face Tokenizers Course - Comprehensive tokenization guide
- Let’s build the GPT Tokenizer - Andrej Karpathy builds BPE from scratch
Interactive Tools
- OpenAI Tokenizer - Visualize how GPT tokenizes text
- TikToken Playground - Interactive BPE tokenization explorer