1.3 How LLMs Generate Text
AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Autoregressive generation · Temperature · Top-p · Sampling strategies
Official Docs: OpenAI — Text Generation · Hugging Face — Generation Strategies
Autoregressive Generation
LLMs generate one token at a time, appending each new token to the context before predicting the next one.
Prompt: "The capital of France is"
Step 1: model sees prompt → predicts " Paris"
Step 2: appends " Paris", sees updated → predicts "."
Step 3: appends ".", sees updated → predicts end-of-sequence token → generation ends
At each step the model outputs a probability distribution over the entire vocabulary. A sampling strategy selects the next token from that distribution.
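The loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `TOY_MODEL`, `next_token_distribution`, `generate`, and the `<eos>` marker are all hypothetical names, and the probabilities are made up. A real LLM conditions on the whole context rather than only the last token.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# The numbers are invented for illustration only.
TOY_MODEL = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {"<eos>": 1.0},
}


def next_token_distribution(context):
    """Return {token: probability} for the next token (toy lookup)."""
    return TOY_MODEL[context[-1]]


def generate(prompt_tokens, max_new_tokens=10, rng=None):
    """Autoregressive loop: sample a token, append it, repeat until <eos>."""
    rng = rng or random.Random(0)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        # Sampling strategy: draw one token according to its probability.
        token = rng.choices(tokens, weights=probs, k=1)[0]
        if token == "<eos>":
            break  # the stop token ends generation and is not emitted
        context.append(token)  # the new token becomes part of the input
    return context


prompt = ["The", " capital", " of", " France", " is"]
result = generate(prompt)
print("".join(result))
```

Each pass through the loop corresponds to one "Step" in the trace above: the model only ever sees the tokens produced so far.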
Temperature
Temperature $T$ rescales the model's raw scores (logits) $z_i$ before the softmax, which reshapes the probability distribution that is sampled from:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
| Temperature | Effect | Use case |
|---|---|---|
| 0.0 | Deterministic (always picks highest probability) | Code, JSON, facts |
| 0.3–0.5 | Focused, low variety | QA, summarisation |
| 0.7–1.0 | Balanced | General chat |
| > 1.0 | High creativity, less coherent | Brainstorming |
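To see the effect of the formula numerically, here is a small sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration. The formula is undefined at T = 0, so the sketch rejects that value; the "0.0" row in the table is best read as the limiting case of always taking the most likely token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Compute p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T=0 as argmax instead")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution toward uniform.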
Top-p (Nucleus Sampling)
Top-p keeps only the smallest set of tokens whose cumulative probability ≥ p, then re-normalises.
# top_p = 0.9 example
# Probs: {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, ...}
# Cumulative: 0.72 0.85 0.93 ← cut here
# Only sample from: [" Paris", " Lyon", " Rome"]
- top_p = 1.0: full vocabulary (default)
- top_p = 0.9: trim low-probability tail
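The nucleus cut-off can be sketched as a small filter over a token-to-probability mapping. The helper name `top_p_filter` is hypothetical, and the `" Berlin": 0.07` entry is an invented stand-in for the elided tail of the example so that the probabilities sum to 1.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then re-normalise the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # nucleus is complete; drop everything after this token
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# " Berlin" is a hypothetical filler for the "..." in the example above.
probs = {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, " Berlin": 0.07}
nucleus = top_p_filter(probs, 0.9)
print(list(nucleus))
```

After filtering, the next token is sampled from `nucleus` only, so the tokens in the trimmed tail can never be chosen.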
OpenAI API Example
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three European capitals."},
    ],
    temperature=0.3,  # the one sampling knob being adjusted
    top_p=1.0,        # left at its default (full vocabulary)
    max_tokens=128,   # cap on the length of the generated reply
)

# Print the text of the first (and only) completion.
print(response.choices[0].message.content)
Key Takeaways
- Generation is one token at a time — no look-ahead
- temperature=0 → deterministic; higher → more varied
- Adjust either temperature or top-p, not both at once
- Use temperature=0 for structured/factual tasks
Common Mistakes
- Setting both temperature AND top-p: OpenAI's own guide says to alter one and leave the other at its default. Setting both simultaneously makes behaviour hard to reason about.
- Using high temperature for factual tasks: on tasks that require exact facts (dates, numbers, names), higher temperatures (roughly above 0.5) make it more likely that a lower-probability, incorrect token is sampled, which raises the risk of hallucination.
- Assuming temperature=0 is always reproducible: OpenAI states that determinism at temperature=0 is best-effort, not guaranteed across API versions.
- Ignoring max_tokens: without a limit, the model may generate very long responses and incur unexpected costs.
Quick Quiz
Q1. A student sets temperature=2.0 for a medical Q&A bot. What problem will they encounter?
A1. High temperature flattens the probability distribution, causing the model to frequently pick unlikely tokens — producing incoherent or factually wrong answers. For medical use, temperature=0 is appropriate.
Q2. What does top_p=0.1 mean in practice?
A2. Only the top tokens whose cumulative probability reaches 10% are considered. This is very restrictive — the model almost always picks the single most likely token.
Q3. Why is text generation called "autoregressive"?
A3. Each token is generated by feeding the previous output back as input — the model regresses on its own outputs.
Q4. What parameter controls the maximum length of the generated response?
A4. max_tokens (OpenAI) or max_new_tokens (Hugging Face).
Student Exercise
Exercise 1.4 — Temperature exploration
Send the same prompt "Write a one-sentence story about a robot." to gpt-4o-mini with temperatures 0.0, 0.5, 1.0, and 1.5. Run each 3 times. Record how diverse the outputs are and what happens at extreme values.
Exercise 1.5 — Top-p vs temperature
Choose a creative writing task. Compare outputs from: temperature=1.0, top_p=1.0 vs temperature=1.0, top_p=0.5. Write 2 sentences explaining the observable difference.
Further Reading
- 📘 OpenAI Text Generation Guide
- 📘 Hugging Face Generation Strategies
- 📄 The Curious Case of Neural Text Degeneration (Holtzman et al., 2020 — nucleus sampling paper)
Next → 1.4 Model Landscape