1.3 How LLMs Generate Text

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Autoregressive generation · Temperature · Top-p · Sampling strategies

Official Docs: OpenAI — Text Generation · Hugging Face — Generation Strategies


Autoregressive Generation

LLMs generate one token at a time, appending each new token to the context before predicting the next one.

Prompt: "The capital of France is"

Step 1: model sees prompt → predicts " Paris"
Step 2: appends " Paris", sees updated → predicts "."
Step 3: appends ".", predicts an end-of-sequence token → generation ends

At each step the model outputs a probability distribution over the entire vocabulary. A sampling strategy selects the next token from that distribution.
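The loop described above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: `TOY_MODEL`, `next_token_distribution`, and the `<eos>` marker are hypothetical names invented here, and the probabilities are made up. A real LLM conditions on the whole context and scores its entire vocabulary at every step.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over next tokens.
# All tokens and probabilities below are invented for illustration.
TOY_MODEL = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {"<eos>": 1.0},
}


def next_token_distribution(context):
    """Return {token: probability} for the next token (toy: looks at the last token only)."""
    return TOY_MODEL[context[-1]]


def generate(prompt_tokens, max_new_tokens=10, rng=None):
    """Autoregressive loop: sample a token, append it to the context, repeat."""
    rng = rng or random.Random(0)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]  # the sampling step
        if token == "<eos>":  # end-of-sequence marker stops generation
            break
        context.append(token)  # the new token becomes part of the next input
    return context


prompt = ["The", " capital", " of", " France", " is"]
print("".join(generate(prompt)))
```

Generation stops either when the end-of-sequence marker is sampled or when `max_new_tokens` is reached, whichever comes first.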


Temperature

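Temperature T rescales the logits z_i (the model's raw scores for each token i) before the softmax, which reshapes the probability distribution that sampling draws from: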

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

| Temperature | Effect | Use case |
|---|---|---|
| 0.0 | Deterministic (always picks highest probability) | Code, JSON, facts |
| 0.3–0.5 | Focused, low variety | QA, summarisation |
| 0.7–1.0 | Balanced | General chat |
| > 1.0 | High creativity, less coherent | Brainstorming |
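To make the formula concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration. Because the formula divides by T, the sketch rejects T = 0; the usual reading of a temperature of 0 is greedy selection of the highest-scoring token rather than literal division by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Compute p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as greedy argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability; the result is unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the pattern in the table: a low temperature concentrates probability on the top token, while a high temperature flattens the distribution toward uniform.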

Top-p (Nucleus Sampling)

Top-p keeps only the smallest set of tokens whose cumulative probability ≥ p, then re-normalises.

# top_p = 0.9 example
# Probs: {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, ...}
# Cumulative: 0.72 0.85 0.93 ← cut here
# Only sample from: [" Paris", " Lyon", " Rome"]
  • top_p = 1.0 — full vocabulary (default)
  • top_p = 0.9 — trim low-probability tail
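The cut-off rule above can be sketched as a small function. This is an illustrative implementation, not the code any particular library uses: `top_p_filter` is a name invented here, and the " Berlin" entry is a made-up stand-in for the elided tail so that the example probabilities sum to 1.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p, then re-normalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # nucleus is complete; drop the rest of the tail
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# " Berlin" is an assumed placeholder for the remaining probability mass.
probs = {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, " Berlin": 0.07}

nucleus = top_p_filter(probs, 0.9)
print(list(nucleus))  # tokens that survive the cut
print(nucleus)        # their re-normalised probabilities
```

With a very small value such as `p=0.1`, the first token alone already reaches the threshold, so the nucleus contains only the single most likely token.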

OpenAI API Example

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three European capitals."},
    ],
    temperature=0.3,
    top_p=1.0,
    max_tokens=128,
)

print(response.choices[0].message.content)

Key Takeaways

  • Generation is one token at a time — no look-ahead
  • temperature=0 → near-deterministic (always the top token); higher → more varied
  • Adjust either temperature or top-p, not both at once
  • Use temperature=0 for structured/factual tasks

Common Mistakes

  1. Setting both temperature AND top-p: OpenAI's own guide says to alter one and leave the other at its default. Changing both at once makes behaviour hard to reason about.
  2. Using high temperature for factual tasks: higher temperatures make low-probability tokens more likely to be sampled, so values above roughly 0.5 raise the risk of wrong dates, numbers, and names on tasks that require exact facts.
  3. Assuming temperature=0 is always reproducible: OpenAI describes determinism as best-effort, so identical requests can still return different outputs, for example across API or model versions.
  4. Ignoring max_tokens: without a limit, the model may generate very long responses and incur unexpected costs.

Quick Quiz

Test Your Understanding

Q1. A student sets temperature=2.0 for a medical Q&A bot. What problem will they encounter?
A1. High temperature flattens the probability distribution, causing the model to frequently pick unlikely tokens — producing incoherent or factually wrong answers. For medical use, temperature=0 is appropriate.

Q2. What does top_p=0.1 mean in practice?
A2. Only the top tokens whose cumulative probability reaches 10% are considered. This is very restrictive — the model almost always picks the single most likely token.

Q3. Why is text generation called "autoregressive"?
A3. Each token is generated by feeding the previous output back as input — the model regresses on its own outputs.

Q4. What parameter controls the maximum length of the generated response?
A4. max_tokens (OpenAI) or max_new_tokens (Hugging Face).


Student Exercise

Exercise 1.4 — Temperature exploration
Send the same prompt "Write a one-sentence story about a robot." to gpt-4o-mini with temperatures 0.0, 0.5, 1.0, and 1.5. Run each 3 times. Record how diverse the outputs are and what happens at extreme values.

Exercise 1.5 — Top-p vs temperature
Choose a creative writing task. Compare outputs from: temperature=1.0, top_p=1.0 vs temperature=1.0, top_p=0.5. Write 2 sentences explaining the observable difference.


Further Reading

Next → 1.4 Model Landscape
