Text Generation Strategies
Text generation is the process of using a trained language model to produce new text. Different strategies trade off quality, diversity, and computational cost, and the choice of strategy significantly affects the character of the generated text.
Autoregressive Generation Process
All GPT-family models generate text autoregressively (one token at a time):
- Start: Begin with a prompt (context tokens)
- Forward pass: Feed context through model to get next token distribution
- Selection: Sample or select the next token using a strategy
- Append: Add selected token to context
- Repeat: Continue until max length or stop token reached
Pseudocode:

```python
generated = prompt_tokens
for step in range(max_new_tokens):
    logits = model(generated)          # Forward pass
    next_token = select(logits[-1])    # Selection strategy
    generated.append(next_token)       # Append to sequence
    if next_token == end_token:        # Stop token reached
        break
return generated
```

Greedy Decoding
Always select the token with highest probability.
Implementation
```python
import torch
import torch.nn.functional as F  # F is used by the sampling variants below

def generate_greedy(model, idx, max_new_tokens, block_size):
    """Generate text by always choosing the most likely next token."""
    for _ in range(max_new_tokens):
        # Crop context to block_size if needed
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        # Get predictions
        logits = model(idx_cond)  # (B, T, vocab_size)
        # Focus only on last position
        logits = logits[:, -1, :]  # (B, vocab_size)
        # Select most likely token
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (B, 1)
        # Append to sequence
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Deterministic: same input always produces same output
- Fast and simple
- Good for tasks requiring consistency
Disadvantages:
- Often produces repetitive text
- Gets stuck in loops (“very very very very…”)
- Not how humans write (humans don’t always pick the most likely word)
- Ignores probability distribution beyond argmax
When to use: Deterministic tasks, short completions, when reproducibility is critical.
Sampling from Distribution
Instead of always picking max probability, sample from the full distribution.
Implementation
```python
def generate_sample(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Generate text by sampling from the distribution."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Apply temperature
        # Convert logits to probabilities
        probs = F.softmax(logits, dim=-1)
        # Sample from distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Stochastic: different outputs each time
- More diverse and creative
- Explores the probability space
- More human-like text
Disadvantages:
- Can produce incoherent text
- May select very unlikely tokens
- Non-deterministic (harder to debug/reproduce)
When to use: Creative writing, diverse outputs, exploration.
Temperature Scaling
Temperature controls the randomness of the probability distribution.
Formula
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ are the logits and $T$ is the temperature.
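As a quick illustration, here is a minimal sketch (with made-up logit values) showing how dividing the logits by T sharpens or flattens the resulting distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits
for T in (0.5, 1.0, 1.5):
    probs = F.softmax(logits / T, dim=-1)
    # Lower T concentrates probability on the top token; higher T spreads it out
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

Running this, the top token's probability rises from about 0.61 at T=1.0 to about 0.84 at T=0.5, and falls to about 0.50 at T=1.5.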
Effects of Temperature
Low Temperature (T < 1, e.g., 0.5):
- Sharpens distribution (increases probability of likely tokens)
- More deterministic behavior
- More coherent and focused text
- Less diversity
- Good for: factual text, code generation, question answering
Temperature = 1:
- Uses original model distribution
- Balanced behavior
- Default/baseline
High Temperature (T > 1, e.g., 1.5):
- Flattens distribution (gives more probability to unlikely tokens)
- More random behavior
- More diverse and creative text
- Less coherent
- Good for: creative writing, brainstorming, exploration
Examples
Prompt: "The future of AI is"
T=0.5 (conservative):
"The future of AI is the development of artificial intelligence
systems that can perform complex tasks..."
T=1.0 (balanced):
"The future of AI is to create intelligent systems that augment
human capabilities and solve global challenges..."
T=1.5 (creative):
"The future of AI is somewhere between neural networks dancing
with quantum butterflies and recursive dreams..."

Extreme Cases

- T → 0: Approaches greedy decoding (always picks the max-probability token)
- T → ∞: Approaches a uniform distribution (all tokens equally likely)
Top-k Sampling
Sample from only the top-k most likely tokens, setting all others to zero probability.
Implementation
```python
def generate_topk(model, idx, max_new_tokens, block_size, k=40, temperature=1.0):
    """Generate text by sampling from top-k tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Keep only top k logits
        top_logits, top_indices = torch.topk(logits, k, dim=-1)
        # Sample from top k
        probs = F.softmax(top_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(top_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Filters out very low-probability tokens
- Prevents sampling nonsensical words
- Good balance between diversity and quality
- Simple to understand and implement
Disadvantages:
- Fixed k may be too restrictive or permissive
- Doesn’t adapt to model’s confidence
- May cut off reasonable alternatives
Typical values: k=40-50 for general text generation
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds a threshold p.
Implementation
```python
def generate_topp(model, idx, max_new_tokens, block_size, p=0.9, temperature=1.0):
    """Generate text using nucleus (top-p) sampling."""
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Sort logits in descending order
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        # Compute cumulative probabilities
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > p
        # Shift right to keep the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = False
        # Set removed tokens to -inf
        sorted_logits[sorted_indices_to_remove] = float('-inf')
        # Sample from filtered distribution
        probs = F.softmax(sorted_logits, dim=-1)
        idx_next_local = torch.multinomial(probs, num_samples=1)
        # Map back to vocabulary indices
        idx_next = torch.gather(sorted_indices, -1, idx_next_local)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Properties
Advantages:
- Adaptive: Number of tokens considered varies by model confidence
- When model is confident (sharp distribution), consider few tokens
- When model is uncertain (flat distribution), consider more tokens
- Better than fixed top-k in many scenarios
Disadvantages:
- Slightly more complex to implement
- May still occasionally sample poor tokens
Typical values: p=0.9 or p=0.95
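The adaptivity is easy to check numerically. The sketch below (made-up logits; `nucleus_size` is a small helper written for this illustration, not part of the code above) counts how many tokens fall inside the p=0.9 nucleus for a confident versus an uncertain distribution:

```python
import torch
import torch.nn.functional as F

def nucleus_size(logits, p=0.9):
    """Size of the smallest token set whose cumulative probability exceeds p."""
    sorted_probs = F.softmax(torch.sort(logits, descending=True).values, dim=-1)
    below = (torch.cumsum(sorted_probs, dim=-1) < p).sum().item()
    return int(below) + 1  # +1 for the token that crosses the threshold

sharp = torch.tensor([8.0, 2.0, 1.0, 0.5, 0.1])  # confident model: peaked distribution
flat = torch.tensor([1.2, 1.1, 1.0, 0.9, 0.8])   # uncertain model: flat distribution
print(nucleus_size(sharp), nucleus_size(flat))   # prints: 1 5
```

A fixed top-k would consider the same number of candidates in both cases; the nucleus shrinks to a single token when the model is confident and expands when it is not.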
Example
Logits: [5.0, 4.5, 2.0, 1.0, 0.5, ...]
Probs: [0.60, 0.30, 0.05, 0.02, 0.01, ...]
Cumulative: [0.60, 0.90, 0.95, 0.97, 0.98, ...]
With p=0.9:
- Include tokens 1 and 2 (cumulative = 0.90)
- Renormalize: [0.67, 0.33]
- Sample from these two tokens only

Beam Search
Maintain multiple candidate sequences (beams), exploring the most promising paths.
Implementation
```python
def beam_search(model, idx, max_new_tokens, block_size, beam_width=5):
    """Generate text using beam search (assumes batch size 1)."""
    # Start with initial sequence
    beams = [(idx, 0.0)]  # (sequence, log_probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            # Crop to block size
            seq_cond = seq if seq.size(1) <= block_size else seq[:, -block_size:]
            # Get predictions
            logits = model(seq_cond)
            logits = logits[:, -1, :]
            log_probs = F.log_softmax(logits, dim=-1)
            # Get top beam_width tokens
            top_log_probs, top_indices = torch.topk(log_probs, beam_width, dim=-1)
            # Create new candidate sequences
            for i in range(beam_width):
                new_seq = torch.cat([seq, top_indices[:, i:i+1]], dim=1)
                new_score = score + top_log_probs[0, i].item()
                candidates.append((new_seq, new_score))
        # Keep top beam_width candidates
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    # Return best sequence
    return beams[0][0]
```

Properties
Advantages:
- Explores multiple paths simultaneously
- Often produces more coherent text than sampling
- Good for tasks requiring consistency (translation, summarization)
- Can find globally better sequences
Disadvantages:
- Roughly B times slower than greedy (for beam width B)
- Higher memory usage
- Can produce generic, “safe” text
- May miss creative alternatives
When to use: Translation, summarization, code generation, when quality > diversity.
Comparison of Methods
| Method | Diversity | Coherence | Speed | Deterministic | Use Case |
|---|---|---|---|---|---|
| Greedy | Very Low | High | Fastest | Yes | Deterministic tasks, debugging |
| Sampling | Very High | Low-Medium | Fast | No | Creative writing, exploration |
| Top-k | Medium | Medium-High | Fast | No | General text generation |
| Top-p | Medium | Medium-High | Fast | No | General text generation (recommended) |
| Beam Search | Low | Very High | Slow | Yes | Translation, summarization |
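In practice these controls are usually combined in a single sampling step: apply temperature first, then top-k and/or top-p filtering, then sample. Below is a minimal sketch of such a combined filter (the function name and defaults are illustrative; the top-p masking follows the same shift-and-scatter pattern as the nucleus sampling code above):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, optional top-k and top-p filtering, then sample.

    logits: (B, vocab_size) scores for the next token.
    """
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Mask everything below the k-th largest logit
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    if top_p is not None:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumulative > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # keep the token that crosses p
        remove[..., 0] = False
        # Scatter the mask back to the original vocabulary order
        remove = remove.scatter(-1, sorted_indices, remove)
        logits = logits.masked_fill(remove, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1)
```

For example, `sample_next_token(logits[:, -1, :], temperature=0.8, top_p=0.95)` corresponds to the general-purpose settings recommended later in this section.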
KV Cache Optimization
During generation, cache key and value matrices to avoid recomputing attention for previous tokens.
Implementation
```python
def generate_with_kv_cache(model, idx, max_new_tokens):
    """Generate text with KV caching for efficiency.

    Assumes the model exposes a cache interface (use_cache / past_kv),
    as in Hugging Face-style GPT implementations.
    """
    past_kvs = None
    for _ in range(max_new_tokens):
        if past_kvs is None:
            # First iteration: process full context
            logits, past_kvs = model(idx, use_cache=True)
        else:
            # Subsequent iterations: only process the new token
            logits, past_kvs = model(idx[:, -1:], past_kv=past_kvs, use_cache=True)
        logits = logits[:, -1, :]
        # Sample next token
        idx_next = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```

Benefits
- Typically 2-3× faster generation
- Each step processes only the new token instead of recomputing attention over the whole sequence (the cache itself grows linearly with context length)
- Essential for long-context generation
- Standard in production LLMs (GPT-4, Claude, Llama)
How It Works
Without cache: Recompute attention for entire sequence each step
- Step 1: Compute attention for tokens [0]
- Step 2: Compute attention for tokens [0, 1]
- Step 3: Compute attention for tokens [0, 1, 2]
- Total: O(n³) operations to generate n tokens (step t costs O(t²))
With cache: Only compute attention for new token
- Step 1: Compute K, V for token [0], cache them
- Step 2: Compute K, V for token [1], concatenate with cache
- Step 3: Compute K, V for token [2], concatenate with cache
- Total: O(n²) operations to generate n tokens (step t costs O(t))
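To make the cached step concrete, here is a minimal single-head sketch (shapes and the past_kv convention are assumptions for illustration, not the exact interface used above) of attention that appends new keys and values to a running cache:

```python
import math
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, past_kv=None):
    """Attention for newly generated position(s) against a running KV cache.

    q, k_new, v_new: (B, T_new, d) projections for the new token(s).
    past_kv: optional tuple of cached (K, V), each of shape (B, T_past, d).
    Returns the attention output and the updated (K, V) cache.
    """
    if past_kv is not None:
        k = torch.cat([past_kv[0], k_new], dim=1)  # (B, T_past + T_new, d)
        v = torch.cat([past_kv[1], v_new], dim=1)
    else:
        k, v = k_new, v_new
    # Only the new queries attend over the cache: O(T_new * T_total) work per step,
    # instead of recomputing the full T_total x T_total attention matrix.
    # (A causal mask is still needed when T_new > 1, e.g. while processing the prompt.)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T_new, T_total)
    out = F.softmax(scores, dim=-1) @ v                       # (B, T_new, d)
    return out, (k, v)
```

In a full model, each transformer layer keeps its own (K, V) pair; that collection is what past_kvs carries between steps in the generation loop above.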
Practical Recommendations
General Text Generation
```python
# Recommended: top-p with moderate temperature
temperature = 0.8
top_p = 0.95
```

Creative Writing

```python
# Higher temperature, top-p for diversity
temperature = 1.2
top_p = 0.95
```

Code Generation

```python
# Lower temperature, top-k for focus
temperature = 0.2
top_k = 10
```

Factual Question Answering

```python
# Very low temperature, greedy or small top-k
temperature = 0.1
top_k = 5
# Or just greedy
```

Conversational AI

```python
# Moderate settings for natural conversation
temperature = 0.9
top_p = 0.9
```

Key Insights
- No free lunch: Trade-off between diversity and quality
- Temperature is powerful: Single parameter dramatically changes behavior
- Top-p > top-k: Adaptive nucleus sampling often works better than fixed top-k
- Context matters: Best strategy depends on use case
- KV cache is essential: Must-have for production deployments
- Repetition penalty: Many systems add penalties for repeated tokens (not covered here)
Related Concepts
- GPT Architecture - The model used for generation
- Causal Attention - Enables autoregressive generation
- Language Model Training - How models learn to generate
- Tokenization - Converting tokens back to text
Learning Resources
Papers
- The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) - Introduces nucleus (top-p) sampling
- Hierarchical Neural Story Generation (Fan et al., 2018) - Analysis of generation strategies
- Beam Search Strategies for Neural Machine Translation (Freitag & Al-Onaizan, 2017)
Implementation Guides
- Hugging Face Generation Guide - Comprehensive generation strategies
- nanoGPT - See the `generate()` method for a clean implementation
- The Illustrated GPT-2 - Visual guide to generation
Video Tutorials
- Andrej Karpathy - Let’s Build GPT - Implements generation from scratch
- Stanford CS224N - Text Generation - Lecture on generation strategies
Interactive Tools
- OpenAI Playground - Experiment with temperature and other parameters
- Transformer Explainer - Visualize generation process