Text Generation Strategies

Text generation is the process of using a trained language model to produce new text. Different generation strategies trade off quality, diversity, and computational cost, and the choice of strategy significantly shapes the character of the generated text.

Autoregressive Generation Process

All GPT-family models generate text autoregressively (one token at a time):

  1. Start: Begin with a prompt (context tokens)
  2. Forward pass: Feed context through model to get next token distribution
  3. Selection: Sample or select the next token using a strategy
  4. Append: Add selected token to context
  5. Repeat: Continue until max length or stop token reached

Pseudocode:

generated = prompt_tokens
for step in range(max_new_tokens):
    logits = model(generated)          # Forward pass
    next_token = select(logits[-1])    # Selection strategy
    generated.append(next_token)       # Append to sequence
    if next_token == end_token:
        break
return generated

Greedy Decoding

Always select the token with highest probability.

Implementation

import torch

def generate_greedy(model, idx, max_new_tokens, block_size):
    """Generate text by always choosing the most likely next token."""
    for _ in range(max_new_tokens):
        # Crop context to block_size if needed
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        # Get predictions
        logits = model(idx_cond)  # (B, T, vocab_size)
        # Focus only on last position
        logits = logits[:, -1, :]  # (B, vocab_size)
        # Select most likely token
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (B, 1)
        # Append to sequence
        idx = torch.cat([idx, idx_next], dim=1)
    return idx

Properties

Advantages:

  • Deterministic: same input always produces same output
  • Fast and simple
  • Good for tasks requiring consistency

Disadvantages:

  • Often produces repetitive text
  • Gets stuck in loops (“very very very very…”)
  • Not how humans write (humans don’t always pick the most likely word)
  • Ignores probability distribution beyond argmax

When to use: Deterministic tasks, short completions, when reproducibility is critical.

Sampling from Distribution

Instead of always picking max probability, sample from the full distribution.

Implementation

import torch
import torch.nn.functional as F

def generate_sample(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Generate text by sampling from the distribution."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Apply temperature
        # Convert logits to probabilities
        probs = F.softmax(logits, dim=-1)
        # Sample from distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx

Properties

Advantages:

  • Stochastic: different outputs each time
  • More diverse and creative
  • Explores the probability space
  • More human-like text

Disadvantages:

  • Can produce incoherent text
  • May select very unlikely tokens
  • Non-deterministic (harder to debug/reproduce)

When to use: Creative writing, diverse outputs, exploration.

Temperature Scaling

Temperature T controls the randomness of the probability distribution.

Formula

P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

where z_i are the logits.

Effects of Temperature

Low Temperature (T < 1, e.g., 0.5):

  • Sharpens distribution (increases probability of likely tokens)
  • More deterministic behavior
  • More coherent and focused text
  • Less diversity
  • Good for: factual text, code generation, question answering

Temperature = 1:

  • Uses original model distribution
  • Balanced behavior
  • Default/baseline

High Temperature (T > 1, e.g., 1.5):

  • Flattens distribution (gives more probability to unlikely tokens)
  • More random behavior
  • More diverse and creative text
  • Less coherent
  • Good for: creative writing, brainstorming, exploration

Examples

Prompt: "The future of AI is" T=0.5 (conservative): "The future of AI is the development of artificial intelligence systems that can perform complex tasks..." T=1.0 (balanced): "The future of AI is to create intelligent systems that augment human capabilities and solve global challenges..." T=1.5 (creative): "The future of AI is somewhere between neural networks dancing with quantum butterflies and recursive dreams..."

Extreme Cases

  • T → 0: Approaches greedy decoding (always picks the most likely token)
  • T → ∞: Approaches a uniform distribution (all tokens equally likely)
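
To make these limits concrete, here is a minimal numeric sketch (using PyTorch and made-up logits, not values from the model above) of how dividing by T reshapes the softmax distribution:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # illustrative logits for four tokens

for T in (0.1, 0.5, 1.0, 1.5, 100.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# Small T concentrates probability mass on the top token (approaching greedy),
# while very large T pushes the distribution toward uniform.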

Top-k Sampling

Sample from only the top-k most likely tokens, setting all others to zero probability.

Implementation

import torch
import torch.nn.functional as F

def generate_topk(model, idx, max_new_tokens, block_size, k=40, temperature=1.0):
    """Generate text by sampling from top-k tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Keep only top k logits
        top_logits, top_indices = torch.topk(logits, k, dim=-1)
        # Sample from top k
        probs = F.softmax(top_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(top_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx

Properties

Advantages:

  • Filters out very low-probability tokens
  • Prevents sampling nonsensical words
  • Good balance between diversity and quality
  • Simple to understand and implement

Disadvantages:

  • Fixed k may be too restrictive or permissive
  • Doesn’t adapt to model’s confidence
  • May cut off reasonable alternatives

Typical values: k=40-50 for general text generation

Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p.

Implementation

import torch
import torch.nn.functional as F

def generate_topp(model, idx, max_new_tokens, block_size, p=0.9, temperature=1.0):
    """Generate text using nucleus (top-p) sampling."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Sort logits in descending order
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        # Compute cumulative probabilities
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > p
        # Shift right to keep first token above threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = False
        # Set removed tokens to -inf
        sorted_logits[sorted_indices_to_remove] = float('-inf')
        # Sample from filtered distribution
        probs = F.softmax(sorted_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(sorted_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx

Properties

Advantages:

  • Adaptive: Number of tokens considered varies by model confidence
  • When model is confident (sharp distribution), consider few tokens
  • When model is uncertain (flat distribution), consider more tokens
  • Better than fixed top-k in many scenarios

Disadvantages:

  • Slightly more complex to implement
  • May still occasionally sample poor tokens

Typical values: p=0.9 or p=0.95

Example

Logits:     [5.0, 4.5, 2.0, 1.0, 0.5, ...]
Probs:      [0.60, 0.30, 0.05, 0.02, 0.01, ...]
Cumulative: [0.60, 0.90, 0.95, 0.97, 0.98, ...]

With p=0.9:
- Include tokens 1 and 2 (cumulative = 0.90)
- Renormalize: [0.67, 0.33]
- Sample from these two tokens only

Beam Search

Maintain multiple candidate sequences (beams), exploring the most promising paths at each step.

Implementation

import torch
import torch.nn.functional as F

def beam_search(model, idx, max_new_tokens, block_size, beam_width=5):
    """Generate text using beam search."""
    # Start with initial sequence
    beams = [(idx, 0.0)]  # (sequence, log_probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            # Crop to block size
            seq_cond = seq if seq.size(1) <= block_size else seq[:, -block_size:]
            # Get predictions
            logits = model(seq_cond)
            logits = logits[:, -1, :]
            log_probs = F.log_softmax(logits, dim=-1)
            # Get top beam_width tokens
            top_log_probs, top_indices = torch.topk(log_probs, beam_width, dim=-1)
            # Create new candidate sequences
            for i in range(beam_width):
                new_seq = torch.cat([seq, top_indices[:, i:i+1]], dim=1)
                new_score = score + top_log_probs[0, i].item()
                candidates.append((new_seq, new_score))
        # Keep top beam_width candidates
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    # Return best sequence
    return beams[0][0]

Properties

Advantages:

  • Explores multiple paths simultaneously
  • Often produces more coherent text than sampling
  • Good for tasks requiring consistency (translation, summarization)
  • Can find globally better sequences

Disadvantages:

  • k times slower than greedy (for beam width k)
  • Higher memory usage
  • Can produce generic, “safe” text
  • May miss creative alternatives

When to use: Translation, summarization, code generation, when quality > diversity.

Comparison of Methods

Method      | Diversity | Coherence   | Speed   | Deterministic | Use Case
Greedy      | Very Low  | High        | Fastest | Yes           | Deterministic tasks, debugging
Sampling    | Very High | Low-Medium  | Fast    | No            | Creative writing, exploration
Top-k       | Medium    | Medium-High | Fast    | No            | General text generation
Top-p       | Medium    | Medium-High | Fast    | No            | General text generation (recommended)
Beam Search | Low       | Very High   | Slow    | Yes           | Translation, summarization
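
For a side-by-side comparison, a hypothetical harness like the one below can run the functions from the previous sections on the same prompt; `model`, `block_size`, and `prompt_ids` (a (1, T) tensor of prompt token IDs) are assumed to already exist, as in those sections.

# Assumes model, block_size, and prompt_ids are defined, along with the
# generate_* / beam_search functions implemented above.
strategies = {
    "greedy":   lambda: generate_greedy(model, prompt_ids, 50, block_size),
    "sampling": lambda: generate_sample(model, prompt_ids, 50, block_size, temperature=0.8),
    "top-k":    lambda: generate_topk(model, prompt_ids, 50, block_size, k=40),
    "top-p":    lambda: generate_topp(model, prompt_ids, 50, block_size, p=0.9),
    "beam":     lambda: beam_search(model, prompt_ids, 50, block_size, beam_width=5),
}
for name, run in strategies.items():
    out = run()
    print(name, out[0].tolist())  # token IDs of prompt + continuation for each strategy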

KV Cache Optimization

During generation, cache key and value matrices to avoid recomputing attention for previous tokens.

Implementation

import torch
import torch.nn.functional as F

def generate_with_kv_cache(model, idx, max_new_tokens):
    """Generate text with KV caching for efficiency."""
    past_kvs = None
    for _ in range(max_new_tokens):
        if past_kvs is None:
            # First iteration: process full context
            logits, past_kvs = model(idx, use_cache=True)
            logits = logits[:, -1, :]
        else:
            # Subsequent iterations: only process new token
            logits, past_kvs = model(idx[:, -1:], past_kv=past_kvs, use_cache=True)
            logits = logits[:, -1, :]
        # Sample next token
        idx_next = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx

Benefits

  • 2-3× faster generation
  • Roughly constant compute per step: only the new token is processed, instead of reprocessing a growing sequence (the cache itself grows linearly with context length)
  • Essential for long-context generation
  • Standard in production LLMs (GPT-4, Claude, Llama)

How It Works

Without cache: Recompute attention for entire sequence each step

  • Step 1: Compute attention for tokens [0]
  • Step 2: Compute attention for tokens [0, 1]
  • Step 3: Compute attention for tokens [0, 1, 2]
  • Total: O(n²) operations

With cache: Only compute attention for new token

  • Step 1: Compute K, V for token [0], cache them
  • Step 2: Compute K, V for token [1], concatenate with cache
  • Step 3: Compute K, V for token [2], concatenate with cache
  • Total: O(n) operations
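
As a rough illustration of the caching step itself, the sketch below shows a single cached attention step for one new token in a single-head layer. The function name, weight matrices, and cache layout are illustrative assumptions, not the model API used in the implementation above.

import torch
import torch.nn.functional as F

def cached_attention_step(x_new, W_q, W_k, W_v, cache=None):
    """One decoding step of single-head attention with a KV cache.
    x_new: (B, 1, d) embedding of only the newest token.
    cache: dict with previously computed 'k' and 'v' of shape (B, T_prev, d), or None."""
    q = x_new @ W_q        # query for the new token: (B, 1, d)
    k_new = x_new @ W_k    # key for the new token only
    v_new = x_new @ W_v    # value for the new token only
    if cache is None:
        k, v = k_new, v_new
    else:
        # Reuse cached keys/values instead of recomputing them for old tokens
        k = torch.cat([cache["k"], k_new], dim=1)  # (B, T_prev + 1, d)
        v = torch.cat([cache["v"], v_new], dim=1)
    att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # (B, 1, T)
    out = F.softmax(att, dim=-1) @ v                        # (B, 1, d)
    return out, {"k": k, "v": v}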

Practical Recommendations

General Text Generation

# Recommended: top-p with moderate temperature
temperature = 0.8
top_p = 0.95

Creative Writing

# Higher temperature, top-p for diversity
temperature = 1.2
top_p = 0.95

Code Generation

# Lower temperature, top-k for focus
temperature = 0.2
top_k = 10

Factual Question Answering

# Very low temperature, greedy or small top-k
temperature = 0.1
top_k = 5  # Or just greedy

Conversational AI

# Moderate settings for natural conversation
temperature = 0.9
top_p = 0.9
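
As one way to apply presets like these in practice, the sketch below passes them to the Hugging Face transformers generate API (the "gpt2" checkpoint and prompt are placeholders); the same parameters map directly onto the sampling functions implemented earlier on this page.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The future of AI is", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,     # sampling instead of greedy decoding
    temperature=0.8,    # "general text generation" preset from above
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))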

Key Insights

  1. No free lunch: Trade-off between diversity and quality
  2. Temperature is powerful: Single parameter dramatically changes behavior
  3. Top-p > top-k: Adaptive nucleus sampling often works better than fixed top-k
  4. Context matters: Best strategy depends on use case
  5. KV cache is essential: Must-have for production deployments
  6. Repetition penalty: Many systems add penalties for repeated tokens (not covered here)

Learning Resources

Papers

  • The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) - Introduces nucleus (top-p) sampling
  • Hierarchical Neural Story Generation (Fan et al., 2018) - Analysis of generation strategies
  • Beam Search Strategies for Neural Machine Translation (Freitag & Al-Onaizan, 2017)
