Understanding Temperature and Top-p in Large Language Models: A Technical Deep Dive
This article delves into the technical mechanisms behind Temperature and Top-p, two key parameters that control the predictability and creativity of text generated by Large Language Models (LLMs). In our previous post, The Role of Temperature and Top-p in LLM Accuracy and Creativity, we explored how these settings influence the LLM’s output by affecting the token selection process. We saw that higher values produce more varied and creative text, while lower values yield more deterministic and predictable results, each suitable for different applications.
In this follow-up, we’ll examine how these values affect the token selection algorithm under the hood. While you can use LLMs through APIs without knowing the inner workings of Temperature and Top-p, understanding the details will give you a deeper intuition and let you fine-tune your applications for optimal results.
Logits and Softmax: From Raw Scores to Probabilities #
When the LLM generates the list of potential tokens, it assigns each token a raw value called a logit. This value can be either positive or negative, and its actual range depends on the implementation of the LLM. It represents the model’s confidence in each token being the next word in the sequence.
Let’s reuse the example from our previous article, where the LLM predicts the next token for the text ‘The quick brown fox jumps over the’. Listing raw logits instead of the calculated probabilities might yield the following table.
Token | Logit |
---|---|
lazy | 2.0000 |
quick | 1.0000 |
tired | 0.0000 |
slow | -1.0000 |
clumsy | -2.0000 |
awkward | -2.0000 |
Then, in the next step, the LLM uses a function called Softmax to convert the logits into probability scores. The Softmax function works by raising e to the power of each logit (e^logit) and then normalizing these exponentiated values so they sum to 1.
Softmax is crucial because it transforms the raw logits into a probability distribution, making it possible for the decoding strategy to sample the next token.
Applying Softmax to the logits above (at the default Temperature of 1 and before any Top-p filtering) gives the following distribution.
Token | Logit | Probability (after Softmax) |
---|---|---|
lazy | 2.0000 | 0.6291 |
quick | 1.0000 | 0.2314 |
tired | 0.0000 | 0.0851 |
slow | -1.0000 | 0.0313 |
clumsy | -2.0000 | 0.0115 |
awkward | -2.0000 | 0.0115 |
In this example, there’s about a 63% chance that the next token will be lazy, and only a ~1% chance that it will be clumsy or awkward.
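If you want to reproduce these numbers yourself, here’s a minimal Python sketch (using NumPy) that applies Softmax to the illustrative logits from the table above. It’s a toy calculation, not the internals of any particular LLM.

```python
import numpy as np

# Illustrative logits from the table above (not taken from a real model)
tokens = ["lazy", "quick", "tired", "slow", "clumsy", "awkward"]
logits = np.array([2.0, 1.0, 0.0, -1.0, -2.0, -2.0])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # the resulting probabilities are unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

for token, p in zip(tokens, softmax(logits)):
    print(f"{token:<8} {p:.4f}")
# lazy 0.6291, quick 0.2314, tired 0.0851, slow 0.0313, clumsy 0.0115, awkward 0.0115
```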
Temperature: Scaling Logits to Control Randomness #
But before the LLM calculates the probabilities, it applies the selected Temperature by dividing each original logit by the Temperature value.
Dividing by a value less than 1 (like 0.2) increases the difference between the logits, making larger logits even larger and smaller logits even smaller (more negative). And conversely, dividing by a value greater than 1 (like 1.5) decreases the magnitude of the logits, making the differences between them smaller.
Here’s the example from above with a really low Temperature of 0.2.
Token | Logit | Logit/Temperature | Probability |
---|---|---|---|
lazy | 2.0000 | 10.0000 | 0.9933 |
quick | 1.0000 | 5.0000 | 0.0067 |
tired | 0.0000 | 0.0000 | 0.0000 |
slow | -1.0000 | -5.0000 | 0.0000 |
clumsy | -2.0000 | -10.0000 | 0.0000 |
awkward | -2.0000 | -10.0000 | 0.0000 |
As we can see, that almost completely removes all but the most likely token from the selection. Conversely, with a high Temperature such as 1.5, the differences between the logits shrink, which results in a more even probability distribution.
Token | Logit | Logit/Temperature | Probability |
---|---|---|---|
lazy | 2.0000 | 1.3333 | 0.4875 |
quick | 1.0000 | 0.6667 | 0.2503 |
tired | 0.0000 | 0.0000 | 0.1285 |
slow | -1.0000 | -0.6667 | 0.0660 |
clumsy | -2.0000 | -1.3333 | 0.0339 |
awkward | -2.0000 | -1.3333 | 0.0339 |
The default value for Temperature is 1, which keeps the original logits unchanged. Also note that while some services allow 0 for Temperature, setting Temperature to exactly 0 is mathematically problematic (division by zero). Most APIs interpret 0 as a very low temperature for practical purposes, leading to highly deterministic output. However, using a very small value like 0.0001 can give you more predictable and consistent behavior across different systems.
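The Temperature step itself is tiny; here’s a Python sketch of it using the same illustrative logits as above. The clamping of very small Temperature values to 0.0001 is an assumption for this toy example, mirroring the workaround just described.

```python
import numpy as np

tokens = ["lazy", "quick", "tired", "slow", "clumsy", "awkward"]
logits = np.array([2.0, 1.0, 0.0, -1.0, -2.0, -2.0])  # illustrative values from above

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def probabilities_with_temperature(logits, temperature):
    # Clamp very small Temperatures to avoid dividing by zero,
    # mirroring the 0.0001 workaround mentioned above.
    temperature = max(temperature, 1e-4)
    return softmax(logits / temperature)

for t in (0.2, 1.0, 1.5):
    probs = probabilities_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.4f}" for tok, p in zip(tokens, probs)))
# T=0.2 concentrates almost all probability on "lazy" (0.9933),
# while T=1.5 flattens the distribution (lazy drops to 0.4875).
```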
Top-p (Nucleus Sampling): Focusing on the Most Probable Tokens #
Top-p sampling, or nucleus sampling, is applied after the probabilities have been calculated with Softmax. It works by summing the probabilities of the tokens, from most to least probable, until the cumulative probability exceeds the Top-p value. The probabilities of the selected tokens are then adjusted (renormalized) so that they again add up to 1. This ensures we still have a valid probability distribution for sampling.
So for the example above, with a Temperature of 1.5 and a Top-p value of 0.75, we would get:
Token | Probability | Cumulative P | Top-P Status | Adjusted P |
---|---|---|---|---|
lazy | 0.4875 | 0.4875 | ✓ | 0.5627 |
quick | 0.2503 | 0.7378 | ✓ | 0.2889 |
tired | 0.1285 | 0.8663 | ✓ | 0.1483 |
slow | 0.0660 | 0.9323 | × | 0.0000 |
clumsy | 0.0339 | 0.9661 | × | 0.0000 |
awkward | 0.0339 | 1.0000 | × | 0.0000 |
Only the first 3 tokens would be included in the filtered list. Those would then get a new probability score, so the total adds up to 1.
Token | Probability |
---|---|
lazy | 0.5627 |
quick | 0.2889 |
tired | 0.1483 |
It’s important to note that Temperature and Top-p work in conjunction. Temperature modifies the initial probability distribution, and then Top-p sampling filters tokens from this modified distribution.
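Here’s a small Python sketch of just the Top-p step, taking the Temperature-1.5 probabilities from the table above as input. It follows the “keep tokens until the cumulative probability exceeds Top-p” rule described in this article; individual implementations may handle the boundary slightly differently.

```python
import numpy as np

def top_p_filter(tokens, probs, top_p):
    order = np.argsort(probs)[::-1]           # sort tokens by probability, highest first
    cumulative = np.cumsum(probs[order])
    # Keep tokens until the cumulative probability exceeds top_p;
    # the token that crosses the threshold is included.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize so they sum to 1 again
    return [tokens[i] for i in kept], kept_probs

tokens = ["lazy", "quick", "tired", "slow", "clumsy", "awkward"]
probs = np.array([0.4875, 0.2503, 0.1285, 0.0660, 0.0339, 0.0339])  # Temperature 1.5 values from above

for tok, p in zip(*top_p_filter(tokens, probs, top_p=0.75)):
    print(f"{tok:<8} {p:.4f}")
# lazy 0.5627, quick 0.2889, tired 0.1483
```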
Putting it all Together: How Temperature and Top-p Control LLM Output #
To summarize how Temperature and Top-p control LLM output, here’s a step-by-step breakdown (a short code sketch after the list ties the steps together):
- Generate Logits: The LLM first produces a list of potential next tokens, assigning each a raw score called a Logit. These logits represent the model’s initial prediction of token likelihood.
- Apply Temperature: The Temperature value is applied by dividing each Logit by the Temperature. This adjusts the distribution:
- Lower Temperature (e.g., 0.2) sharpens the distribution, making high-probability tokens even more likely and low-probability tokens even less likely.
- Higher Temperature (e.g., 1.5) flattens the distribution, making token probabilities more even.
- Convert to Probabilities (Softmax): The Softmax function then transforms the adjusted Logits into probabilities. These probabilities now sum up to 1 and represent the likelihood of each token being selected.
- Apply Top-p Filtering: If Top-p sampling is used, the algorithm sorts the tokens by probability and selects the smallest set of most probable tokens whose cumulative probability exceeds the Top-p value. Tokens outside this “nucleus” are discarded.
- Renormalize Probabilities: The probabilities of the selected tokens within the Top-p nucleus are then renormalized so they sum back up to 1.
- Sample Next Token: Finally, the algorithm randomly samples the next token from the resulting probability distribution. Tokens with higher adjusted probabilities are more likely to be chosen, but randomness is still involved.
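The following Python sketch strings these steps together into a single decoding step. The function name sample_next_token and the 0.0001 lower bound on Temperature are assumptions for this toy example, not part of any real LLM or API.

```python
import numpy as np

def sample_next_token(tokens, logits, temperature=1.0, top_p=1.0, rng=None):
    # Toy decoding step combining Temperature and Top-p (nucleus) sampling.
    # This is an illustrative sketch, not the implementation of any specific LLM or API.
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # Steps 1-2: apply Temperature (clamped to avoid dividing by zero)
    scaled = logits / max(temperature, 1e-4)

    # Step 3: Softmax turns the scaled logits into probabilities
    e = np.exp(scaled - scaled.max())
    probs = e / e.sum()

    # Step 4: keep the smallest set of top tokens whose cumulative probability exceeds top_p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    kept = order[: int(np.searchsorted(cumulative, top_p)) + 1]

    # Step 5: renormalize the remaining probabilities
    kept_probs = probs[kept] / probs[kept].sum()

    # Step 6: sample the next token from the filtered distribution
    return tokens[rng.choice(kept, p=kept_probs)]

tokens = ["lazy", "quick", "tired", "slow", "clumsy", "awkward"]
logits = [2.0, 1.0, 0.0, -1.0, -2.0, -2.0]  # illustrative values from the tables above

print(sample_next_token(tokens, logits, temperature=0.2, top_p=0.5))   # practically always "lazy"
print(sample_next_token(tokens, logits, temperature=1.5, top_p=0.95))  # varies between runs
```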
We can visualize this through a couple of examples. First, with a low Temperature (0.2) and a low Top-p (0.5).
Token | Logit | Logit/Temperature | Probability | Cumulative P | Top-P Status | Adjusted P |
---|---|---|---|---|---|---|
lazy | 2.0000 | 10.0000 | 0.9933 | 0.9933 | ✓ | 1.0000 |
quick | 1.0000 | 5.0000 | 0.0067 | 1.0000 | × | 0.0000 |
tired | 0.0000 | 0.0000 | 0.0000 | 1.0000 | × | 0.0000 |
slow | -1.0000 | -5.0000 | 0.0000 | 1.0000 | × | 0.0000 |
clumsy | -2.0000 | -10.0000 | 0.0000 | 1.0000 | × | 0.0000 |
awkward | -2.0000 | -10.0000 | 0.0000 | 1.0000 | × | 0.0000 |
As seen above, that combination will select lazy every time. Conversely, we would get a more random output with a high Temperature (1.5) and a high Top-p (0.95).
Token | Logit | Logit/Temperature | Probability | Cumulative P | Top-P Status | Adjusted P |
---|---|---|---|---|---|---|
lazy | 2.0000 | 1.3333 | 0.4875 | 0.4875 | ✓ | 0.5046 |
quick | 1.0000 | 0.6667 | 0.2503 | 0.7378 | ✓ | 0.2591 |
tired | 0.0000 | 0.0000 | 0.1285 | 0.8663 | ✓ | 0.1330 |
slow | -1.0000 | -0.6667 | 0.0660 | 0.9323 | ✓ | 0.0683 |
clumsy | -2.0000 | -1.3333 | 0.0339 | 0.9661 | ✓ | 0.0351 |
awkward | -2.0000 | -1.3333 | 0.0339 | 1.0000 | × | 0.0000 |
In this example, lazy will only be selected in about 50% of all outputs.
By experimenting with these parameters, developers and data scientists can tailor LLM-driven or generative AI applications — ranging from chatbots to content generation — so the resulting text is either precise and factual or highly creative, depending on the use case. Subtle adjustments to Temperature and Top-p often make a significant difference in user experience and output quality.
Also read the third and final part Playing with Temperature and Top-p in Open AI’s API where we look at actual probabilities from the Open AI API.