Tokenization and Byte-Pair Encoding
Tokenization is the process of converting raw text into sequences of discrete tokens (integer IDs) that neural networks can process. The choice of tokenization strategy significantly impacts model performance, vocabulary size, and computational efficiency.
Why Tokenization?
Neural networks operate on numeric tensors, not text. Tokenization bridges this gap by:
- Converting text to numbers: Map strings to integer IDs
- Creating a vocabulary: Define the set of valid tokens
- Enabling learning: Let the model learn embeddings for each token
- Handling unknown inputs: Deal with words/characters not seen during training
Tokenization Strategies
Character-Level Tokenization
Treat each character as a token.
Example: "Hello" → ['H', 'e', 'l', 'l', 'o'] → [72, 101, 108, 108, 111]
Pros:
- Very small vocabulary (letters, digits, and punctuation: roughly 100 tokens for English)
- No out-of-vocabulary (OOV) problems
- Works for any language
Cons:
- Very long sequences (one token per character)
- Harder to learn semantic relationships
- Computationally expensive (quadratic attention cost)
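A character-level tokenizer for plain ASCII text needs nothing beyond Python built-ins; a minimal sketch of the encode/decode round trip:
text = "Hello"
tokens = [ord(c) for c in text]             # encode: characters → Unicode code points
print(tokens)                               # [72, 101, 108, 108, 111]
decoded = "".join(chr(t) for t in tokens)   # decode: code points → characters
print(decoded)                              # Hello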
Word-Level Tokenization
Treat each word as a token.
Example: "Hello, world!" → ['Hello', ',', 'world', '!'] → [5492, 11, 8495, 0]
Pros:
- Natural semantic units
- Shorter sequences than character-level
- Intuitive and interpretable
Cons:
- Huge vocabulary (100K+ words for English)
- Out-of-vocabulary problems for rare words
- Different forms of the same word become separate tokens (run, running, ran)
- Poor multilingual support
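A rough word-level tokenizer can be sketched with a regular expression and a vocabulary built from a training corpus. The corpus and the resulting IDs below are arbitrary illustrations, not the IDs of any real model; note how an unseen word immediately falls back to an unknown token:
import re

corpus = "Hello , world ! Hello again"
words = re.findall(r"\w+|[^\w\s]", corpus)      # split into words and punctuation
vocab = {"<unk>": 0}                            # reserve ID 0 for unknown words
for w in words:
    vocab.setdefault(w, len(vocab))

def encode(text):
    return [vocab.get(w, vocab["<unk>"]) for w in re.findall(r"\w+|[^\w\s]", text)]

print(encode("Hello, world!"))    # [1, 2, 3, 4]
print(encode("Goodbye, world!"))  # [0, 2, 3, 4]  ← "Goodbye" is out-of-vocabulary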
Subword Tokenization (BPE)
Best of both worlds: Frequent words become single tokens, rare words split into subwords.
Example:
"Hello"→['Hello'](common word, single token)"tokenization"→['token', 'ization'](less common, split)"antidisestablishmentarianism"→['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
Pros:
- Balanced vocabulary size (32K-50K tokens typical)
- No OOV issues (can always fall back to characters)
- Learns frequent patterns (like suffixes, prefixes)
- Good multilingual support
- Used, in BPE or closely related forms (WordPiece, SentencePiece), by GPT, BERT, T5, and most modern LLMs
Cons:
- More complex implementation
- Token boundaries may not align with linguistic units
Byte-Pair Encoding (BPE) Algorithm
BPE is a compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens, starting from individual characters.
Algorithm Steps
- Initialize: Start with vocabulary of individual characters
- Count pairs: Find the most frequent adjacent pair of tokens
- Merge pair: Create new token by combining the pair
- Repeat: Continue until vocabulary reaches desired size
Example Walkthrough
Corpus: "low low low lower lowest"
Step 0 (characters only):
Tokens: ['l', 'o', 'w', ' ', 'l', 'o', 'w', ' ', 'l', 'o', 'w', ' ',
'l', 'o', 'w', 'e', 'r', ' ', 'l', 'o', 'w', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't']
Step 1 (merge most frequent pair 'l', 'o'):
Tokens: ['lo', 'w', ' ', 'lo', 'w', ' ', 'lo', 'w', ' ',
'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo']
Step 2 (merge 'lo', 'w'):
Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low']
Step 3 (merge 'low', 'e'; merges are applied within words, so pairs spanning the space are not considered):
Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'lowe', 'r', ' ', 'lowe', 's', 't']
Vocab: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low', 'lowe']
And so on, until the vocabulary reaches the target size.
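The walkthrough can be reproduced with a short plain-Python sketch of the classic word-level BPE trainer: pairs are counted within whitespace-separated words and weighted by word frequency. This is illustrative only, not how production tokenizers are implemented:
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols, weighted by its frequency
    word_freqs = Counter(corpus.split())
    words = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = words[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Apply the merge to every word
        for word, symbols in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[word] = tuple(merged)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=3)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(words)   # {'low': ('low',), 'lower': ('lowe', 'r'), 'lowest': ('lowe', 's', 't')}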
Implementation with TikToken
Modern implementations use efficient Rust-backed libraries such as OpenAI's tiktoken:
import tiktoken
# Load GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")
# Encode text to token IDs
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens) # [15496, 11, 995, 0]
# Decode token IDs back to text
decoded = enc.decode(tokens)
print(decoded) # "Hello, world!"
# See individual tokens
for token_id in tokens:
print(f"{token_id}: {enc.decode([token_id])!r}")
# 15496: 'Hello'
# 11: ','
# 995: ' world'
# 0: '!'
Special Tokens
Most tokenizers include special tokens:
# GPT-2 special token
enc.encode("<|endoftext|>") # [50256]
# Used for:
# - End of sequence marker
# - Separating different documents during training
# - Padding (in some implementations)
Tokenization Properties
Vocabulary Size
Typical sizes for modern models:
- GPT-2: 50,257 tokens
- GPT-3: 50,257 tokens (same as GPT-2)
- GPT-4: ~100,000 tokens (improved multilingual support)
- BERT: 30,522 tokens (WordPiece, similar to BPE)
Trade-offs:
- Smaller vocab: Longer token sequences for the same text, so more of the context window is consumed
- Larger vocab: Shorter sequences, but a larger embedding matrix and rarer tokens that each receive less training signal
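The vocabulary size directly sets the size of the token embedding matrix. A back-of-the-envelope calculation using GPT-2's vocabulary and GPT-2 small's 768-dimensional embeddings (float32 assumed):
vocab_size = 50257      # GPT-2 vocabulary
d_model = 768           # GPT-2 small embedding dimension
bytes_per_param = 4     # float32
embedding_mb = vocab_size * d_model * bytes_per_param / 1e6
print(f"{embedding_mb:.0f} MB")  # ≈ 154 MB for the token embedding table alone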
Compression Ratio
How many characters become one token:
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
compression_ratio = len(text) / len(tokens)
# English text typically: 3-4 characters per token
Implications:
- Model with 2048 token context ≈ 6000-8000 characters of English text
- Compression ratio varies by language (worse for non-Latin scripts)
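Combining the typical ratio with a context length gives a rough character budget (the exact figure depends on the text and tokenizer):
context_tokens = 2048
for chars_per_token in (3.0, 4.0):           # typical range for English BPE
    print(context_tokens * chars_per_token)  # 6144.0, then 8192.0 characters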
Out-of-Vocabulary Handling
BPE can tokenize any string by falling back to bytes or characters:
# Extremely rare word
text = "supercalifragilisticexpialidocious"
tokens = enc.encode(text)
# Splits into subwords: ['super', 'cal', 'ifr', 'ag', 'ilistic', 'exp', 'ial', 'id', 'oc', 'ious']
# No token is "unknown"
Domain-Specific Tokenization
Training Custom Tokenizers
For specialized domains (medical, legal, code), you can train custom BPE:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
# Initialize BPE model
tokenizer = Tokenizer(models.BPE())
# Split on whitespace/punctuation so merges stay within words
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# Train on domain-specific corpus
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(medical_texts_iterator, trainer=trainer)
# Save for reuse
tokenizer.save("medical_tokenizer.json")
Benefits:
- Better compression for domain-specific terms
- Preserves important domain vocabulary as single tokens
- Example: Medical tokenizer keeps “diabetes” as one token, not split into subwords
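Once trained, the tokenizer can be reloaded from the saved file and used directly; the input sentence below is just a placeholder:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("medical_tokenizer.json")
encoding = tokenizer.encode("Patient presents with type 2 diabetes")
print(encoding.tokens)  # domain terms like "diabetes" should stay intact
print(encoding.ids)     # integer IDs for the model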
Structured Data Tokenization
For non-text sequences (event logs, medical codes, DNA):
# Treat each discrete event as a "word"
events = ["LOGIN", "VIEW_PAGE", "ADD_TO_CART", "CHECKOUT", "LOGOUT"]
event_vocab = {event: idx for idx, event in enumerate(dict.fromkeys(events))}  # first-seen order (set() would be unordered)
event_ids = [event_vocab[e] for e in events]
# [0, 1, 2, 3, 4]
# Can apply BPE to learn frequent event patterns
# "LOGIN" + "VIEW_PAGE" might become single token if frequentCommon Issues and Solutions
Common Issues and Solutions
Tokenization Artifacts
Problem: Inconsistent tokenization can confuse models.
# Space before word affects tokenization
enc.encode("Hello") # [15496]
enc.encode(" Hello") # [18435] # Different token!Solution: Normalize input text before tokenization.
Long Tokens
Problem: Some tokens are very long subwords.
enc.encode("antidisestablishmentarianism") # May be 5-6 tokensImpact: Models see these as atomic units, may miss internal structure.
Multilingual Challenges
Problem: BPE trained on English over-represents English.
# English: ~3-4 chars/token
enc.encode("Hello world") # 2 tokens, 11 chars → 5.5 chars/token
# Arabic: worse compression
enc.encode("مرحبا بالعالم") # 6+ tokens, 12 chars → ~2 chars/tokenSolution: Train on balanced multilingual corpus (like GPT-4 tokenizer).
Key Insights
- Subword tokenization is standard: BPE balances vocabulary size and sequence length
- Tokenizer affects model behavior: Different tokenizers create different token boundaries
- No true OOV: BPE can encode any text by falling back to bytes/characters
- Domain matters: Custom tokenizers improve performance on specialized text
- Multilingual is hard: Need balanced training data for fair cross-lingual compression
Related Concepts
- GPT Architecture - Uses token embeddings as input
- Text Generation - Generates token IDs that are decoded to text
- Language Model Training - Trains on tokenized sequences
Learning Resources
Papers
- Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) - Original BPE for NLP paper
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo & Richardson, 2018) - Language-independent subword tokenizer supporting BPE and unigram models
Implementations
- OpenAI tiktoken - Fast BPE tokenizer used by GPT models
- Hugging Face Tokenizers - Rust-based fast tokenizers library
- SentencePiece - Google’s language-independent tokenizer
Articles
- The Illustrated Word2Vec - Jay Alammar on word embeddings
- Hugging Face Tokenizers Course - Comprehensive tokenization guide
- Let’s build the GPT Tokenizer - Andrej Karpathy builds BPE from scratch
Interactive Tools
- OpenAI Tokenizer - Visualize how GPT tokenizes text
- TikToken Playground - Interactive BPE tokenization explorer