Text Generation Strategies
Text generation is the process of using a trained language model to produce new text. Different strategies trade off quality, diversity, and computational cost, and the choice of strategy significantly affects the character of the generated text.
Autoregressive Generation Process
All GPT-family models generate text autoregressively (one token at a time):
- Start: Begin with a prompt (context tokens)
- Forward pass: Feed context through model to get next token distribution
- Selection: Sample or select the next token using a strategy
- Append: Add selected token to context
- Repeat: Continue until max length or stop token reached
Pseudocode:

```python
generated = prompt_tokens
for step in range(max_new_tokens):
    logits = model(generated)          # Forward pass
    next_token = select(logits[-1])    # Selection strategy
    generated.append(next_token)       # Append to sequence
    if next_token == end_token:        # Stop token reached
        break
return generated
```

Greedy Decoding
Always select the token with highest probability.
Implementation
```python
import torch
import torch.nn.functional as F  # F is used by the sampling variants below

def generate_greedy(model, idx, max_new_tokens, block_size):
    """Generate text by always choosing the most likely next token."""
    for _ in range(max_new_tokens):
        # Crop context to block_size if needed
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        # Get predictions
        logits = model(idx_cond)  # (B, T, vocab_size)
        # Focus only on last position
        logits = logits[:, -1, :]  # (B, vocab_size)
        # Select most likely token
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (B, 1)
        # Append to sequence
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Deterministic: same input always produces same output
- Fast and simple
- Good for tasks requiring consistency
Disadvantages:
- Often produces repetitive text
- Gets stuck in loops (“very very very very…”)
- Not how humans write (humans don’t always pick the most likely word)
- Ignores probability distribution beyond argmax
When to use: Deterministic tasks, short completions, when reproducibility is critical.
Sampling from Distribution
Instead of always picking max probability, sample from the full distribution.
Implementation
```python
def generate_sample(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Generate text by sampling from the distribution."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Apply temperature
        # Convert logits to probabilities
        probs = F.softmax(logits, dim=-1)
        # Sample from distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Stochastic: different outputs each time
- More diverse and creative
- Explores the probability space
- More human-like text
Disadvantages:
- Can produce incoherent text
- May select very unlikely tokens
- Non-deterministic (harder to debug/reproduce)
When to use: Creative writing, diverse outputs, exploration.
Temperature Scaling
Temperature controls the randomness of the probability distribution.
Formula
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ are the logits and $T$ is the temperature.
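As a quick illustration, here is a minimal sketch (with made-up logit values) showing how dividing the logits by T sharpens or flattens the resulting distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits
for T in (0.5, 1.0, 1.5):
    probs = F.softmax(logits / T, dim=-1)
    # Lower T concentrates probability on the top token; higher T spreads it out
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

Running this, the top token's probability rises from about 0.61 at T=1.0 to about 0.84 at T=0.5, and falls to about 0.50 at T=1.5.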
Effects of Temperature
Low Temperature (T < 1, e.g., 0.5):
- Sharpens distribution (increases probability of likely tokens)
- More deterministic behavior
- More coherent and focused text
- Less diversity
- Good for: factual text, code generation, question answering
Temperature = 1:
- Uses original model distribution
- Balanced behavior
- Default/baseline
High Temperature (T > 1, e.g., 1.5):
- Flattens distribution (gives more probability to unlikely tokens)
- More random behavior
- More diverse and creative text
- Less coherent
- Good for: creative writing, brainstorming, exploration
Examples
Prompt: "The future of AI is"
T=0.5 (conservative):
"The future of AI is the development of artificial intelligence
systems that can perform complex tasks..."
T=1.0 (balanced):
"The future of AI is to create intelligent systems that augment
human capabilities and solve global challenges..."
T=1.5 (creative):
"The future of AI is somewhere between neural networks dancing
with quantum butterflies and recursive dreams..."

Extreme Cases

- T → 0: Approaches greedy decoding (always picks the max-probability token)
- T → ∞: Approaches a uniform distribution (all tokens equally likely)
Top-k Sampling
Sample from only the top-k most likely tokens, setting all others to zero probability.
Implementation
```python
def generate_topk(model, idx, max_new_tokens, block_size, k=40, temperature=1.0):
    """Generate text by sampling from top-k tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Keep only top k logits
        top_logits, top_indices = torch.topk(logits, k, dim=-1)
        # Sample from top k
        probs = F.softmax(top_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(top_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Filters out very low-probability tokens
- Prevents sampling nonsensical words
- Good balance between diversity and quality
- Simple to understand and implement
Disadvantages:
- Fixed k may be too restrictive or permissive
- Doesn’t adapt to model’s confidence
- May cut off reasonable alternatives
Typical values: k=40-50 for general text generation
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds a threshold p.
Implementation
```python
def generate_topp(model, idx, max_new_tokens, block_size, p=0.9, temperature=1.0):
    """Generate text using nucleus (top-p) sampling."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Sort logits in descending order
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        # Compute cumulative probabilities
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > p
        # Shift right to keep the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = False
        # Set removed tokens to -inf
        sorted_logits[sorted_indices_to_remove] = float('-inf')
        # Sample from filtered distribution
        probs = F.softmax(sorted_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(sorted_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Adaptive: Number of tokens considered varies by model confidence
- When model is confident (sharp distribution), consider few tokens
- When model is uncertain (flat distribution), consider more tokens
- Better than fixed top-k in many scenarios
Disadvantages:
- Slightly more complex to implement
- May still occasionally sample poor tokens
Typical values: p=0.9 or p=0.95
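The adaptivity is easy to check numerically. The sketch below (made-up logits; `nucleus_size` is a small helper written for this illustration, not part of the code above) counts how many tokens fall inside the p=0.9 nucleus for a confident versus an uncertain distribution:

```python
import torch
import torch.nn.functional as F

def nucleus_size(logits, p=0.9):
    """Size of the smallest token set whose cumulative probability exceeds p."""
    sorted_probs = F.softmax(torch.sort(logits, descending=True).values, dim=-1)
    below = (torch.cumsum(sorted_probs, dim=-1) < p).sum().item()
    return int(below) + 1  # +1 for the token that crosses the threshold

sharp = torch.tensor([8.0, 2.0, 1.0, 0.5, 0.1])  # confident model: peaked distribution
flat = torch.tensor([1.2, 1.1, 1.0, 0.9, 0.8])   # uncertain model: flat distribution
print(nucleus_size(sharp), nucleus_size(flat))   # prints: 1 5
```

A fixed top-k would consider the same number of candidates in both cases; the nucleus shrinks to a single token when the model is confident and expands when it is not.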
Example
Logits: [5.0, 4.5, 2.0, 1.0, 0.5, ...]
Probs: [0.60, 0.30, 0.05, 0.02, 0.01, ...]
Cumulative: [0.60, 0.90, 0.95, 0.97, 0.98, ...]
With p=0.9:
- Include tokens 1 and 2 (cumulative = 0.90)
- Renormalize: [0.67, 0.33]
- Sample from these two tokens only

Beam Search
Maintain multiple candidate sequences (beams), exploring the most promising paths.
Implementation
```python
def beam_search(model, idx, max_new_tokens, block_size, beam_width=5):
    """Generate text using beam search (assumes batch size 1)."""
    # Start with initial sequence
    beams = [(idx, 0.0)]  # (sequence, log_probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            # Crop to block size
            seq_cond = seq if seq.size(1) <= block_size else seq[:, -block_size:]
            # Get predictions
            logits = model(seq_cond)
            logits = logits[:, -1, :]
            log_probs = F.log_softmax(logits, dim=-1)
            # Get top beam_width tokens
            top_log_probs, top_indices = torch.topk(log_probs, beam_width, dim=-1)
            # Create new candidate sequences
            for i in range(beam_width):
                new_seq = torch.cat([seq, top_indices[:, i:i+1]], dim=1)
                new_score = score + top_log_probs[0, i].item()
                candidates.append((new_seq, new_score))
        # Keep top beam_width candidates
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    # Return best sequence
    return beams[0][0]
```

Properties
Advantages:
- Explores multiple paths simultaneously
- Often produces more coherent text than sampling
- Good for tasks requiring consistency (translation, summarization)
- Can find globally better sequences
Disadvantages:
- Roughly B times slower than greedy (for beam width B)
- Higher memory usage
- Can produce generic, “safe” text
- May miss creative alternatives
When to use: Translation, summarization, code generation, when quality > diversity.
Comparison of Methods
| Method | Diversity | Coherence | Speed | Deterministic | Use Case |
|---|---|---|---|---|---|
| Greedy | Very Low | High | Fastest | Yes | Deterministic tasks, debugging |
| Sampling | Very High | Low-Medium | Fast | No | Creative writing, exploration |
| Top-k | Medium | Medium-High | Fast | No | General text generation |
| Top-p | Medium | Medium-High | Fast | No | General text generation (recommended) |
| Beam Search | Low | Very High | Slow | Yes | Translation, summarization |
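In practice these controls are usually combined in a single sampling step: apply temperature first, then top-k and/or top-p filtering, then sample. Below is a minimal sketch of such a combined filter (the function name and defaults are illustrative; the top-p masking follows the same shift-and-scatter pattern as the nucleus sampling code above):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, optional top-k and top-p filtering, then sample.

    logits: (B, vocab_size) scores for the next token.
    """
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Mask everything below the k-th largest logit
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    if top_p is not None:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumulative > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # keep the token that crosses p
        remove[..., 0] = False
        # Scatter the mask back to the original vocabulary order
        remove = remove.scatter(-1, sorted_indices, remove)
        logits = logits.masked_fill(remove, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1)
```

For example, `sample_next_token(logits[:, -1, :], temperature=0.8, top_p=0.95)` corresponds to the general-purpose settings recommended later in this section.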
KV Cache Optimization
During generation, cache key and value matrices to avoid recomputing attention for previous tokens.
Implementation
```python
def generate_with_kv_cache(model, idx, max_new_tokens):
    """Generate text with KV caching for efficiency.

    Assumes the model exposes a cache interface (use_cache / past_kv),
    as in Hugging Face-style GPT implementations.
    """
    past_kvs = None
    for _ in range(max_new_tokens):
        if past_kvs is None:
            # First iteration: process full context
            logits, past_kvs = model(idx, use_cache=True)
        else:
            # Subsequent iterations: only process the new token
            logits, past_kvs = model(idx[:, -1:], past_kv=past_kvs, use_cache=True)
        logits = logits[:, -1, :]
        # Sample next token
        idx_next = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Benefits
- Typically 2-3× faster generation
- Each step processes only the new token instead of recomputing attention over the whole sequence (the cache itself grows linearly with context length)
- Essential for long-context generation
- Standard in production LLMs (GPT-4, Claude, Llama)
How It Works
Without cache: Recompute attention for entire sequence each step
- Step 1: Compute attention for tokens [0]
- Step 2: Compute attention for tokens [0, 1]
- Step 3: Compute attention for tokens [0, 1, 2]
- Total: O(n³) operations to generate n tokens (step t costs O(t²))
With cache: Only compute attention for new token
- Step 1: Compute K, V for token [0], cache them
- Step 2: Compute K, V for token [1], concatenate with cache
- Step 3: Compute K, V for token [2], concatenate with cache
- Total: O(n²) operations to generate n tokens (step t costs O(t))
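To make the cached step concrete, here is a minimal single-head sketch (shapes and the past_kv convention are assumptions for illustration, not the exact interface used above) of attention that appends new keys and values to a running cache:

```python
import math
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, past_kv=None):
    """Attention for newly generated position(s) against a running KV cache.

    q, k_new, v_new: (B, T_new, d) projections for the new token(s).
    past_kv: optional tuple of cached (K, V), each of shape (B, T_past, d).
    Returns the attention output and the updated (K, V) cache.
    """
    if past_kv is not None:
        k = torch.cat([past_kv[0], k_new], dim=1)  # (B, T_past + T_new, d)
        v = torch.cat([past_kv[1], v_new], dim=1)
    else:
        k, v = k_new, v_new
    # Only the new queries attend over the cache: O(T_new * T_total) work per step,
    # instead of recomputing the full T_total x T_total attention matrix.
    # (A causal mask is still needed when T_new > 1, e.g. while processing the prompt.)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T_new, T_total)
    out = F.softmax(scores, dim=-1) @ v                       # (B, T_new, d)
    return out, (k, v)
```

In a full model, each transformer layer keeps its own (K, V) pair; that collection is what past_kvs carries between steps in the generation loop above.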
Practical Recommendations
General Text Generation
```python
# Recommended: top-p with moderate temperature
temperature = 0.8
top_p = 0.95
```

Creative Writing

```python
# Higher temperature, top-p for diversity
temperature = 1.2
top_p = 0.95
```

Code Generation

```python
# Lower temperature, top-k for focus
temperature = 0.2
top_k = 10
```

Factual Question Answering

```python
# Very low temperature, greedy or small top-k
temperature = 0.1
top_k = 5
# Or just greedy
```

Conversational AI

```python
# Moderate settings for natural conversation
temperature = 0.9
top_p = 0.9
```

Key Insights
- No free lunch: Trade-off between diversity and quality
- Temperature is powerful: Single parameter dramatically changes behavior
- Top-p > top-k: Adaptive nucleus sampling often works better than fixed top-k
- Context matters: Best strategy depends on use case
- KV cache is essential: Must-have for production deployments
- Repetition penalty: Many systems add penalties for repeated tokens (not covered here)
Related Concepts
- GPT Architecture - The model used for generation
- Causal Attention - Enables autoregressive generation
- Language Model Training - How models learn to generate
- Tokenization - Converting tokens back to text
Learning Resources
Papers
- The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) - Introduces nucleus (top-p) sampling
- Hierarchical Neural Story Generation (Fan et al., 2018) - Analysis of generation strategies
- Beam Search Strategies for Neural Machine Translation (Freitag & Al-Onaizan, 2017)
Implementation Guides
- Hugging Face Generation Guide - Comprehensive generation strategies
- nanoGPT - See the `generate()` method for a clean implementation
- The Illustrated GPT-2 - Visual guide to generation
Video Tutorials
- Andrej Karpathy - Let’s Build GPT - Implements generation from scratch
- Stanford CS224N - Text Generation - Lecture on generation strategies
Interactive Tools
- OpenAI Playground - Experiment with temperature and other parameters
- Transformer Explainer - Visualize generation process