## 1.3 How LLMs Generate Text
> **Note:** AI-generated content may contain errors. Always verify against official sources.
**Key Concepts:** Autoregressive generation · Temperature · Top-p · Sampling strategies

**Official Docs:** OpenAI — Text Generation · Hugging Face — Generation Strategies
### Autoregressive Generation
LLMs generate one token at a time, appending each new token to the context before predicting the next one.
```text
Prompt: "The capital of France is"
Step 1: model sees prompt                         → predicts " Paris"
Step 2: appends " Paris", sees updated context    → predicts "."
Step 3: "." triggers a stop sequence              → generation ends
```
At each step the model outputs a probability distribution over the entire vocabulary. A sampling strategy selects the next token from that distribution.
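The loop is short enough to sketch directly. In the sketch below, `fake_logits` is a hypothetical stand-in for a real model (it simply favours one token depending on how the context ends); the rest of the loop (turn logits into probabilities, sample a token, append it, repeat until a stop token) mirrors real autoregressive decoding.

```python
import math
import random

random.seed(0)

# Toy vocabulary; "<eos>" is the stop token.
VOCAB = [" Paris", ".", "<eos>"]

def fake_logits(context):
    """Hypothetical stand-in for a real model: returns next-token logits
    based only on how the context string ends."""
    logits = [-50.0] * len(VOCAB)
    if context.endswith(" is"):
        logits[VOCAB.index(" Paris")] = 5.0   # strongly favour " Paris"
    elif context.endswith(" Paris"):
        logits[VOCAB.index(".")] = 5.0        # then a full stop
    else:
        logits[VOCAB.index("<eos>")] = 5.0    # then stop
    return logits

def sample(logits):
    """Sample one token from the softmax of the logits."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]   # softmax numerators
    return random.choices(VOCAB, weights=weights, k=1)[0]

def generate(prompt, max_steps=10):
    context = prompt
    for _ in range(max_steps):
        token = sample(fake_logits(context))
        if token == "<eos>":
            break
        context += token   # append and feed the longer context back in
    return context

print(generate("The capital of France is"))
# → The capital of France is Paris.
```

Note there is no look-ahead: each token is committed to the context before the next one is predicted.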
### Temperature
Temperature reshapes the probability distribution before sampling:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
| Temperature | Effect | Use case |
|---|---|---|
| 0.0 | Deterministic (always picks the highest-probability token) | Code, JSON, facts |
| 0.3–0.5 | Focused, low variety | QA, summarisation |
| 0.7–1.0 | Balanced | General chat |
| > 1.0 | High creativity, less coherent | Brainstorming |
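The formula above can be checked with a small sketch, using hypothetical logits `[4.0, 2.0, 1.0]` (not from any real model):

```python
import math

def softmax_with_temperature(logits, T):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                     # hypothetical next-token logits
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Running this shows lower T concentrating almost all probability mass on the top token, while higher T flattens the distribution toward uniform.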
### Top-p (Nucleus Sampling)
Top-p keeps only the smallest set of tokens whose cumulative probability ≥ p, then re-normalises.
```python
# top_p = 0.9 example
# Probs:      {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, ...}
# Cumulative:  0.72          0.85          0.93  ← cut here
# Only sample from: [" Paris", " Lyon", " Rome"]
```
- `top_p = 1.0`: full vocabulary (default)
- `top_p = 0.9`: trim the low-probability tail
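A minimal sketch of the filtering step, reproducing the cut from the example above (the probabilities are illustrative, not from a real model):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens (by descending probability) whose
    cumulative probability reaches p; zero out the rest and re-normalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:      # include the token that crosses p, then stop
            break
    kept_set = set(kept)
    filtered = [probs[i] if i in kept_set else 0.0 for i in range(len(probs))]
    total = sum(filtered)
    return [f / total for f in filtered]

# The nucleus is the first three tokens: {" Paris", " Lyon", " Rome"}.
probs = [0.72, 0.13, 0.08, 0.04, 0.03]
print([round(q, 3) for q in top_p_filter(probs, p=0.9)])
# → [0.774, 0.14, 0.086, 0.0, 0.0]
```

After filtering, sampling proceeds as usual over the re-normalised distribution.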
### OpenAI API Example
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three European capitals."},
    ],
    temperature=0.3,
    top_p=1.0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```
### Key Takeaways
- Generation is one token at a time; there is no look-ahead
- `temperature=0` → deterministic; higher values → more varied output
- Adjust either temperature or top-p, not both at once
- Use `temperature=0` for structured/factual tasks
Next → 1.4 Model Landscape