The Role of Temperature and Top-p in LLM Accuracy and Creativity
To control the output of large language models (LLMs), developers often turn to two key parameters: Temperature and Top-p. These parameters let you tune the balance between predictability and creativity in an LLM’s responses, making the same model suitable for a wider range of applications.
Lower values are great for accuracy-focused tasks like code generation or information retrieval, ensuring more predictable results. In contrast, higher values inject randomness, sparking creativity in tasks like copywriting, story generation, or brainstorming.
How LLMs Predict the Next Word
At their core, LLMs operate by predicting the next word in a sequence. They do this by generating a list of potential tokens (words or parts of words) that could logically follow the current text. Each token receives a probability score reflecting its likelihood of appearing next, based on the LLM’s training data. The model then uses a decoding strategy, a specific selection algorithm, to pick the next token from this ranked list. This is where Temperature and Top-p come into play: they guide how the decoding strategy selects the next token.
Consider this example: the LLM is trying to complete the sentence ‘The quick brown fox jumps over the’. Below is a potential set of next tokens and their probabilities.
Token | Probability |
---|---|
lazy | 0.75 |
sleeping | 0.10 |
tired | 0.05 |
fast | 0.03 |
small | 0.02 |
large | 0.01 |
red | 0.01 |
green | 0.01 |
blue | 0.01 |
agile | 0.005 |
clumsy | 0.005 |
In this instance, “lazy” holds the highest probability, making it the most likely choice, while “clumsy” has a near-zero probability.
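To make the mechanics concrete, here is a minimal Python sketch of a sampling-based decoding strategy over this exact list (the `candidates` dictionary is hard-coded from the table above purely for illustration, not taken from any real model):

```python
import random

# Candidate next tokens and their probabilities, from the table above.
candidates = {
    "lazy": 0.75, "sleeping": 0.10, "tired": 0.05, "fast": 0.03,
    "small": 0.02, "large": 0.01, "red": 0.01, "green": 0.01,
    "blue": 0.01, "agile": 0.005, "clumsy": 0.005,
}

# A sampling decoder draws one token at random, weighted by probability.
# A greedy decoder would instead always take the top-ranked token:
# max(candidates, key=candidates.get) -> "lazy".
next_token = random.choices(
    list(candidates), weights=list(candidates.values()), k=1
)[0]
print(next_token)  # "lazy" about 75% of the time
```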
Temperature: Controlling Randomness in Output
Temperature controls the randomness of the output by adjusting the probability distribution of tokens. Lower values increase the probability of the most likely tokens, leading to more predictable outputs, while higher values distribute the probability more evenly, introducing more variability and surprise.
For instance, with a low Temperature value like 0.2, the probability distribution might shift to something like this:
Token | Probability |
---|---|
lazy | 0.999 |
sleeping | 0.001 |
tired | 0.000 |
fast | 0.000 |
small | 0.000 |
large | 0.000 |
red | 0.000 |
green | 0.000 |
blue | 0.000 |
agile | 0.000 |
clumsy | 0.000 |
Observe how, at this low Temperature, the probability overwhelmingly concentrates on “lazy,” virtually eliminating other options. With a higher Temperature of 1.5, by contrast, the probabilities might shift to something like this:
Token | Probability |
---|---|
lazy | 0.457 |
sleeping | 0.157 |
tired | 0.078 |
fast | 0.031 |
small | 0.015 |
large | 0.009 |
red | 0.009 |
green | 0.009 |
blue | 0.009 |
agile | 0.006 |
clumsy | 0.006 |
As you can see, the higher Temperature flattens the probability distribution, making the choice of the next token less obvious.
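Mechanically, Temperature divides the model’s raw scores (logits) before they are converted into probabilities; applied directly to an existing probability distribution, that is equivalent to raising each probability to the power 1/T and re-normalizing. Here is a minimal sketch building on the `candidates` dictionary from the earlier snippet (the function name is illustrative, and because the tables above are rounded illustrations rather than real model output, the printed numbers will not match them exactly):

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution.

    Raising each probability to 1/T and re-normalizing is equivalent to
    dividing the logits by T before the softmax.
    """
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

print(round(apply_temperature(candidates, 0.2)["lazy"], 3))  # ~1.0: "lazy" dominates
print(round(apply_temperature(candidates, 1.5)["lazy"], 3))  # ~0.519: much flatter
```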
Applying Temperature for Different Tasks
For creative tasks like writing stories or brainstorming ideas, a higher Temperature is your friend. It encourages the LLM to occasionally pick less common tokens, potentially leading your text in unexpected and interesting directions. Consider the sentence “The horse’s mane was”. With a higher Temperature, the model might offer a wider range of possibilities:
Token | Probability |
---|---|
long | 0.60 |
flowing | 0.30 |
silky | 0.05 |
gossamer | 0.03 |
braided | 0.01 |
unruly | 0.005 |
If ‘long’ is chosen, the sentence might become “The horse’s mane was long, flowing gracefully in the wind”, a standard description. However, if ‘unruly’ is selected, the sentence could transform into “The horse’s mane was unruly, a tangled, rebellious mass of dark, thick hair”, a more striking and unusual description.
However, for tasks demanding accuracy and predictability, such as generating technical documentation or code, a lower Temperature is generally preferred. This encourages the LLM to stick to the most probable and straightforward tokens. For example, consider this list of candidate completions for the snippet `public void greet`:
Token | Probability |
---|---|
User | 0.85 |
Customer | 0.05 |
Client | 0.04 |
Visitor | 0.03 |
Patron | 0.02 |
Individual | 0.01 |
Here, `public void greetUser(String name)` is preferable to `public void greetIndividual(String name)` in most cases.
In essence, adjust the Temperature based on your goal: higher for exploration and creativity, lower for precision and accuracy.
Top-p: Focusing on the Most Likely Tokens
Top-p, also known as nucleus sampling, limits token selection to the smallest set of the most probable tokens whose combined probability reaches the Top-p threshold. With a Top-p value of 1, no tokens are filtered out, so the probability list looks exactly the same.
Token | Probability |
---|---|
lazy | 0.75 |
sleeping | 0.10 |
tired | 0.05 |
fast | 0.03 |
small | 0.02 |
large | 0.01 |
red | 0.01 |
green | 0.01 |
blue | 0.01 |
agile | 0.005 |
clumsy | 0.005 |
Lowering it to 0.9, however, limits selection to only the top choices, the smallest set of tokens whose probabilities add up to at least 0.9:
Token | Probability | New Probability |
---|---|---|
lazy | 0.75 | 0.833 |
sleeping | 0.10 | 0.111 |
tired | 0.05 | 0.056 |
Note: after the token list is truncated based on the Top-p value, the probabilities of the remaining tokens are re-normalized to ensure they still sum to 1, as seen in the third column.
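That truncate-and-re-normalize step is straightforward to sketch in Python (again with an illustrative function name): sort the tokens by probability, keep the smallest set whose cumulative probability reaches the threshold, then divide each survivor by the kept total. With the `candidates` dictionary from earlier and a Top-p of 0.9, it reproduces the three rows above:

```python
def apply_top_p(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest top-ranked set of tokens whose cumulative
    probability reaches top_p, then re-normalize the survivors to sum to 1."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return {tok: p / cumulative for tok, p in kept.items()}

print(apply_top_p(candidates, 0.9))
# {'lazy': 0.833..., 'sleeping': 0.111..., 'tired': 0.055...}
```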
Temperature adjusts the likelihood of all tokens, making low-probability tokens more or less likely. In contrast, Top-p acts as a filter, entirely removing less probable tokens from consideration. Therefore, combining a lower Top-p value with a higher Temperature can lead to output that is more varied than with a low Temperature alone, but still generally stays within the realm of the more probable tokens.
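Putting the two together, here is a sketch of how they might be chained (assuming the common implementation order of Temperature first, then Top-p filtering, then sampling, and reusing the illustrative helpers above):

```python
import random

def sample_next_token(probs: dict[str, float],
                      temperature: float = 1.0, top_p: float = 1.0) -> str:
    """Temperature reshapes the whole distribution, Top-p then filters out
    the unlikely tail, and one token is sampled from whatever survives."""
    filtered = apply_top_p(apply_temperature(probs, temperature), top_p)
    return random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]

# Higher Temperature with a moderate Top-p: varied, but still plausible.
print(sample_next_token(candidates, temperature=1.5, top_p=0.9))
```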
Conclusion
To summarize, Temperature and Top-p are key parameters that determine the balance between randomness and determinism in LLM text generation. For accuracy-driven tasks, lower values are preferred; for creative endeavors, higher values are ideal.
Keep in mind that the examples presented here are somewhat exaggerated to clearly illustrate the effects of these parameters. In practice, the differences produced by subtle adjustments might be less dramatic, especially in shorter responses.