Tokenization and Byte-Pair Encoding

Tokenization is the process of converting raw text into sequences of discrete tokens (integer IDs) that neural networks can process. The choice of tokenization strategy significantly impacts model performance, vocabulary size, and computational efficiency.

Why Tokenization?

Neural networks operate on numeric tensors, not text. Tokenization bridges this gap by:

  1. Converting text to numbers: Map strings to integer IDs
  2. Creating a vocabulary: Define the set of valid tokens
  3. Enabling learning: Let the model learn embeddings for each token
  4. Handling unknown inputs: Deal with words/characters not seen during training
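
A minimal sketch of this pipeline in Python (the vocabulary, text, and embedding table below are toy placeholders, purely for illustration):

    import numpy as np

    # Minimal sketch: text -> token IDs -> learned embedding vectors
    vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}        # toy vocabulary
    ids = [vocab.get(w, vocab["<unk>"]) for w in "the cat sat".split()]

    embedding_table = np.random.randn(len(vocab), 8)          # one 8-dim vector per token
    vectors = embedding_table[ids]                            # shape (3, 8), the model's actual input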

Tokenization Strategies

Character-Level Tokenization

Treat each character as a token.

Example: "Hello"['H', 'e', 'l', 'l', 'o'][72, 101, 108, 108, 111]

Pros:

  • Very small vocabulary (26 letters + punctuation ≈ 100 tokens)
  • No out-of-vocabulary (OOV) problems
  • Works for any language

Cons:

  • Very long sequences (one token per character)
  • Harder to learn semantic relationships
  • Computationally expensive (attention cost grows quadratically with sequence length)

Word-Level Tokenization

Treat each word as a token.

Example: "Hello, world!"['Hello', ',', 'world', '!'][5492, 11, 8495, 0]

Pros:

  • Natural semantic units
  • Shorter sequences than character-level
  • Intuitive and interpretable

Cons:

  • Huge vocabulary (100K+ words for English)
  • Out-of-vocabulary problems for rare words
  • Different forms of same word are separate tokens (run, running, ran)
  • Poor multilingual support

Subword Tokenization (BPE)

Best of both worlds: frequent words become single tokens, while rare words are split into subwords.

Example:

  • "Hello"['Hello'] (common word, single token)
  • "tokenization"['token', 'ization'] (less common, split)
  • "antidisestablishmentarianism"['anti', 'dis', 'establish', 'ment', 'arian', 'ism']

Pros:

  • Balanced vocabulary size (32K-50K tokens typical)
  • No OOV issues (can always fall back to characters)
  • Learns frequent patterns (like suffixes, prefixes)
  • Good multilingual support
  • Used by GPT, BERT, T5, and most modern LLMs

Cons:

  • More complex implementation
  • Token boundaries may not align with linguistic units

Byte-Pair Encoding (BPE) Algorithm

BPE is a compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent adjacent token pairs, starting from individual characters.

Algorithm Steps

  1. Initialize: Start with vocabulary of individual characters
  2. Count pairs: Find the most frequent adjacent pair of tokens (counted within words, so merges never cross word boundaries)
  3. Merge pair: Create new token by combining the pair
  4. Repeat: Continue until vocabulary reaches desired size
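
A minimal, unoptimized sketch of this loop (the corpus is pre-split on whitespace so merges stay within words; learn_bpe_merges is an illustrative helper, not a library function):

    from collections import Counter

    def learn_bpe_merges(words, num_merges):
        """Learn BPE merges from a whitespace pre-tokenized corpus."""
        word_freqs = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs across all words, weighted by word frequency
            pair_counts = Counter()
            for symbols, freq in word_freqs.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_counts[pair] += freq
            if not pair_counts:
                break
            best = max(pair_counts, key=pair_counts.get)
            merges.append(best)
            # Apply the new merge to every word
            new_freqs = Counter()
            for symbols, freq in word_freqs.items():
                out, i = [], 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_freqs[tuple(out)] += freq
            word_freqs = new_freqs
        return merges

    print(learn_bpe_merges("low low low lower lowest".split(), 3))
    # [('l', 'o'), ('lo', 'w'), ('low', 'e')]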

Example Walkthrough

Corpus: "low low low lower lowest"

Step 0 (characters only):

    Tokens: ['l', 'o', 'w', ' ', 'l', 'o', 'w', ' ', 'l', 'o', 'w', ' ', 'l', 'o', 'w', 'e', 'r', ' ', 'l', 'o', 'w', 'e', 's', 't']
    Vocab:  ['l', 'o', 'w', ' ', 'e', 'r', 's', 't']

Step 1 (merge most frequent pair 'l', 'o'):

    Tokens: ['lo', 'w', ' ', 'lo', 'w', ' ', 'lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't']
    Vocab:  ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo']

Step 2 (merge 'lo', 'w'):

    Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
    Vocab:  ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low']

Step 3 (merge 'low', 'e'; pairs containing the space are skipped because standard BPE never merges across word boundaries):

    Tokens: ['low', ' ', 'low', ' ', 'low', ' ', 'lowe', 'r', ' ', 'lowe', 's', 't']
    Vocab:  ['l', 'o', 'w', ' ', 'e', 'r', 's', 't', 'lo', 'low', 'lowe']

And so on, until vocabulary reaches target size.
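
At encoding time, the learned merges are replayed in order on each new word. A short sketch continuing the walkthrough above (bpe_encode is an illustrative helper):

    def bpe_encode(word, merges):
        """Segment a single word by applying learned merges in order."""
        symbols = list(word)
        for a, b in merges:
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols

    merges = [('l', 'o'), ('lo', 'w'), ('low', 'e')]
    print(bpe_encode("lowest", merges))  # ['lowe', 's', 't']
    print(bpe_encode("slower", merges))  # ['s', 'lowe', 'r']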

Implementation with tiktoken

Modern implementations use efficient compiled libraries such as OpenAI's tiktoken (its core is written in Rust):

    import tiktoken

    # Load the GPT-2 tokenizer
    enc = tiktoken.get_encoding("gpt2")

    # Encode text to token IDs
    text = "Hello, world!"
    tokens = enc.encode(text)
    print(tokens)  # [15496, 11, 995, 0]

    # Decode token IDs back to text
    decoded = enc.decode(tokens)
    print(decoded)  # "Hello, world!"

    # See individual tokens
    for token_id in tokens:
        print(f"{token_id}: {enc.decode([token_id])!r}")
    # 15496: 'Hello'
    # 11: ','
    # 995: ' world'
    # 0: '!'

Special Tokens

Most tokenizers include special tokens:

    # GPT-2's end-of-text special token
    # (tiktoken raises on special tokens unless they are explicitly allowed)
    enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})  # [50256]

    # Used for:
    # - End-of-sequence marker
    # - Separating different documents during training
    # - Padding (in some implementations)

Tokenization Properties

Vocabulary Size

Typical sizes for modern models:

  • GPT-2: 50,257 tokens
  • GPT-3: 50,257 tokens (same as GPT-2)
  • GPT-4: ~100,000 tokens (improved multilingual support)
  • BERT: 30,522 tokens (WordPiece, similar to BPE)

Trade-offs:

  • Smaller vocab: Longer sequences, more context needed
  • Larger vocab: Shorter sequences, sparser embeddings, more memory
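
One way to see this trade-off is to encode the same text with vocabularies of different sizes, e.g. GPT-2's ~50K vocabulary versus the ~100K cl100k_base vocabulary; a quick sketch (exact counts depend on the input text):

    import tiktoken

    text = "Tokenization trade-offs: larger vocabularies usually give shorter sequences."
    for name in ["gpt2", "cl100k_base"]:
        enc = tiktoken.get_encoding(name)
        n = len(enc.encode(text))
        print(f"{name}: {n} tokens, {len(text) / n:.1f} chars/token")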

Compression Ratio

How many characters become one token:

text = "The quick brown fox jumps over the lazy dog" tokens = enc.encode(text) compression_ratio = len(text) / len(tokens) # English text typically: 3-4 characters per token

Implications:

  • Model with 2048 token context ≈ 6000-8000 characters of English text
  • Compression ratio varies by language (worse for non-Latin scripts)

Out-of-Vocabulary Handling

BPE can tokenize any string by falling back to characters:

    # Extremely rare word
    text = "supercalifragilisticexpialidocious"
    tokens = enc.encode(text)
    # Splits into subwords:
    # ['super', 'cal', 'ifr', 'ag', 'ilistic', 'exp', 'ial', 'id', 'oc', 'ious']
    # No token is "unknown"

Domain-Specific Tokenization

Training Custom Tokenizers

For specialized domains (medical, legal, code), you can train custom BPE:

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Initialize a BPE model that splits on whitespace before merging
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    # Train on a domain-specific corpus (medical_texts_iterator yields strings)
    trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
    tokenizer.train_from_iterator(medical_texts_iterator, trainer=trainer)

    # Save for reuse
    tokenizer.save("medical_tokenizer.json")
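
Once saved, the tokenizer can be reloaded and applied directly; a brief usage sketch (the file name and sample sentence simply continue the example above):

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("medical_tokenizer.json")
    encoding = tokenizer.encode("Patient presents with type 2 diabetes.")
    print(encoding.tokens)  # subword strings
    print(encoding.ids)     # integer token IDs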

Benefits:

  • Better compression for domain-specific terms
  • Preserves important domain vocabulary as single tokens
  • Example: Medical tokenizer keeps “diabetes” as one token, not split into subwords

Structured Data Tokenization

For non-text sequences (event logs, medical codes, DNA):

    # Treat each discrete event as a "word"
    events = ["LOGIN", "VIEW_PAGE", "ADD_TO_CART", "CHECKOUT", "LOGOUT"]

    # Build the vocabulary in first-seen order (a bare set() would assign arbitrary IDs)
    event_vocab = {event: idx for idx, event in enumerate(dict.fromkeys(events))}
    event_ids = [event_vocab[e] for e in events]  # [0, 1, 2, 3, 4]

    # BPE can then be applied to learn frequent event patterns:
    # "LOGIN" followed by "VIEW_PAGE" might become a single token if frequent

Common Issues and Solutions

Tokenization Artifacts

Problem: Inconsistent tokenization can confuse models.

    # A leading space changes the token
    enc.encode("Hello")   # [15496]
    enc.encode(" Hello")  # [18435] - a different token!

Solution: Normalize input text (especially whitespace) consistently before tokenization.

Long Tokens

Problem: Some subword tokens are long character strings.

    enc.encode("antidisestablishmentarianism")  # may be 5-6 tokens, several of them long subwords

Impact: The model sees each such token as an atomic unit and may miss its internal character structure.

Multilingual Challenges

Problem: A BPE vocabulary trained mostly on English compresses English far better than other languages.

    # English: ~3-4 chars/token
    enc.encode("Hello world")    # 2 tokens, 11 chars → 5.5 chars/token

    # Arabic: much worse compression with an English-heavy vocabulary
    enc.encode("مرحبا بالعالم")  # 6+ tokens, 13 chars → ~2 chars/token

Solution: Train on balanced multilingual corpus (like GPT-4 tokenizer).

Key Insights

  1. Subword tokenization is standard: BPE balances vocabulary size and sequence length
  2. Tokenizer affects model behavior: Different tokenizers create different token boundaries
  3. No true OOV: BPE can encode any text by falling back to bytes/characters
  4. Domain matters: Custom tokenizers improve performance on specialized text
  5. Multilingual is hard: Need balanced training data for fair cross-lingual compression

Learning Resources

Papers

  • Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) - Original BPE for NLP paper
  • SentencePiece: A simple and language independent approach to subword tokenization (Kudo & Richardson, 2018) - Alternative BPE implementation
