Playing with Temperature and Top-p in OpenAI's API

In the previous two articles, we explored the impact of Temperature and Top-p on large language model (LLM) output: The Role of Temperature and Top-p in LLM Accuracy and Creativity provided a practical overview, while Understanding Temperature and Top-p in Large Language Models - A Technical Deep Dive delved into the technical details.

In both articles, we used hypothetical examples to illustrate these concepts. However, some APIs, like OpenAI’s Chat Completions API, expose the actual token probabilities. This allows us to examine real-world examples and validate our understanding of these parameters.

OpenAI refers to these values as logprobs in their API, representing the logarithmic probabilities computed as log(softmax(logit)). This approach mitigates the risk of extremely small probabilities being rounded down to zero during computation.
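As a rough illustration of why working in log space helps (this is just a sketch of the idea, not necessarily how OpenAI computes it):

import math

def log_softmax(logits):
    # log(softmax(x_i)) = x_i - max(x) - log(sum(exp(x_j - max(x))))
    # Staying in log space avoids forming tiny probabilities that could underflow to zero.
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_sum for x in logits]

logprobs = log_softmax([5.2, 2.1, -3.0])
print([math.exp(lp) for lp in logprobs])  # back to probabilities, summing to 1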

Random tokens in a deterministic system #

Large Language Models are designed to be deterministic up to the token sampling stage. However, when interacting with the OpenAI API, you might notice that the probability values can vary between API calls, even with identical inputs.

I can’t say for certain why that is. I wasn’t able to find an official explanation, or any conclusive unofficial one either. Perhaps it has to do with floating-point arithmetic differing across the hardware in OpenAI’s large data centers, or perhaps it’s an intentional perturbation OpenAI added to keep their actual weights a secret.

The important thing to note is that the reported probability values can vary noticeably between API calls, so if we’re interested in a particular value, it’s worth running the request several times and averaging the results.
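For example, here is a minimal sketch of what that averaging could look like, built around the same kind of API call that the next section walks through in detail (average_top_probs is just an illustrative helper name, not part of the API):

import math
from collections import defaultdict

import openai

def average_top_probs(prompt, runs=5, top_logprobs=10):
    # Average each token's reported probability over several identical calls
    # to smooth out the run-to-run variation described above.
    totals = defaultdict(float)
    for _ in range(runs):
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": prompt}],
            logprobs=True,
            top_logprobs=top_logprobs,
            max_completion_tokens=1,
        )
        for e in response.choices[0].logprobs.content[0].top_logprobs:
            totals[e.token] += math.exp(e.logprob)
    # Tokens that don't appear in every run simply contribute 0 for the runs they miss.
    return {token: total / runs for token, total in totals.items()}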

Fetching probabilities #

To get the probabilities, we have to set logprobs to true and specify how many alternatives we want to see for each token with top_logprobs (between 1 and 20). In the example below, we also use max_completion_tokens to limit the response to a single token, although you can request probabilities for multiple tokens at once.

To convert the returned log probabilities (logprobs) back to standard probabilities, you need to compute the exponential function:

probability = e^(logprob).

import openai
import math

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "The quick brown fox jumps over the",
        },
    ],
    logprobs=True,
    top_logprobs=10,
    max_completion_tokens=1,
)

content = response.choices[0].logprobs.content[0]

print(f"Selected token: '{content.token}', Probability: {math.exp(content.logprob)}")

for e in content.top_logprobs:
    print(f"Token: '{e.token}', Probability: {math.exp(e.logprob)}")

The API call above might yield a probability list similar to this:

Token | Probability
lazy  | 0.92566
The   | 0.05918
Sorry | 0.00801
I’m   | 0.00179
Lazy  | 0.00084
l     | 0.00066
the   | 0.00040
I     | 0.00031
      | 0.00031
What  | 0.00031

As expected, ’lazy’ emerges as the most probable next token.

Understanding Raw Probabilities #

It’s important to understand that these probabilities are derived directly from the model’s raw logits, prior to the application of Temperature or Top-p. Consequently, modifying the Temperature setting won’t alter these initial probabilities, and adjusting Top-p won’t restrict the number of tokens returned.
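One way to see this for yourself is to request logprobs at two very different Temperature settings and compare them (a sketch reusing the imports and call from the earlier example; keep the run-to-run drift discussed above in mind):

for temperature in (0.2, 1.5):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "The quick brown fox jumps over the",
            },
        ],
        logprobs=True,
        top_logprobs=5,
        max_completion_tokens=1,
        temperature=temperature,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    print(temperature, [(e.token, round(math.exp(e.logprob), 5)) for e in top])

Apart from the small drift between calls, both lists should report essentially the same probabilities.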

Applying Temperature and Top-p #

To apply the Temperature parameter to logprobs, we first divide each logprob by the Temperature value and then exponentiate the result:

new_probability = e^(logprob / temperature).

Top-p filtering involves selecting tokens in descending order of probability until their cumulative probability exceeds the Top-p threshold. Importantly, the token that pushes the cumulative probability over the Top-p value is included in the filtered set.

After applying Temperature scaling and Top-p filtering, we need to renormalize the probabilities to ensure they sum to 1. That is done by summing the remaining probabilities and dividing each one by that total.
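As a quick sanity check on the arithmetic, this small snippet (using the probabilities from the table above and the math module imported earlier; exact values drift slightly between calls) reproduces the ’lazy’ entry of the temperature=1.5 table shown further down:

probs = [0.92566, 0.05918, 0.00801, 0.00179, 0.00084,
         0.00066, 0.00040, 0.00031, 0.00031, 0.00031]
# Apply Temperature scaling: e^(logprob / 1.5) for every token
scaled = [math.exp(math.log(p) / 1.5) for p in probs]
# Renormalize and print the 'lazy' value: ~0.797, the "After Temperature" column
print(scaled[0] / sum(scaled))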

Putting it all together, the code looks something like this.

def print_temperature_top_p(logprobs, temperature=1.0, top_p=1.0):
    temperature_probs = {}

    # 1. Convert logprobs to probabilities and apply Temperature scaling
    for obj in logprobs:
        temperature_probs[obj.token] = math.exp(obj.logprob / temperature)

    # 2. Normalize probabilities after Temperature scaling
    total_prob = sum(temperature_probs.values())
    temperature_probs = {k: v / total_prob for k, v in temperature_probs.items()}

    # 3. Apply Top-p (nucleus) filtering
    #    (assumes tokens arrive sorted by descending probability, as top_logprobs are)
    cumulative_prob = 0
    top_p_filtered_probs = {}
    for token, prob in temperature_probs.items():
        if cumulative_prob <= top_p:
            top_p_filtered_probs[token] = prob
        else:
            break

        cumulative_prob += prob

    # 4. Normalize probabilities of the tokens that passed the Top-p filter
    total_prob = sum(top_p_filtered_probs.values())
    top_p_filtered_probs = {k: v / total_prob for k, v in top_p_filtered_probs.items()}

    # 5. Print the results
    print(
        "Token | Original Probability | After Temperature | Top-p Filtered Probability"
    )
    for obj in logprobs:
        original_prob = math.exp(obj.logprob)
        temp_prob = temperature_probs.get(obj.token, 0)
        final_prob = top_p_filtered_probs.get(obj.token, 0)
        print(
            f"{obj.token:<20} | {original_prob:.4f} | {temp_prob:.4f} | {final_prob:.4f}"
        )

You can use this function with the content.top_logprobs obtained from the previous code example, like this:

print_temperature_top_p(content.top_logprobs, temperature=0.2, top_p=0.5)
print_temperature_top_p(content.top_logprobs, temperature=1.5, top_p=0.95)

One output for temperature=0.2 and top_p=0.5 is shown below:

Token | Probability | After Temperature | Cumulative P | Top-p Status | Final P
lazy  | 0.92566     | 1.00000           | 1.00000      | ✓            | 1.00000
The   | 0.05918     | 0.00000           | 1.00000      | ×            | 0.00000
Sorry | 0.00801     | 0.00000           | 1.00000      | ×            | 0.00000
I’m   | 0.00179     | 0.00000           | 1.00000      | ×            | 0.00000
Lazy  | 0.00084     | 0.00000           | 1.00000      | ×            | 0.00000
l     | 0.00066     | 0.00000           | 1.00000      | ×            | 0.00000
the   | 0.00040     | 0.00000           | 1.00000      | ×            | 0.00000
I     | 0.00031     | 0.00000           | 1.00000      | ×            | 0.00000
      | 0.00031     | 0.00000           | 1.00000      | ×            | 0.00000
What  | 0.00031     | 0.00000           | 1.00000      | ×            | 0.00000

As expected, with a low Temperature and Top-p, the model becomes highly deterministic, assigning virtually all probability mass to the ’lazy’ token.

And for temperature=1.5 and top_p=0.95:

Token | Probability | After Temperature | Cumulative P | Top-p Status | Final P
lazy  | 0.92566     | 0.79676           | 0.79676      | ✓            | 0.83193
The   | 0.05918     | 0.12739           | 0.92415      | ✓            | 0.13301
Sorry | 0.00801     | 0.03358           | 0.95773      | ✓            | 0.03506
I’m   | 0.00179     | 0.01235           | 0.97008      | ×            | 0.00000
Lazy  | 0.00084     | 0.00749           | 0.97757      | ×            | 0.00000
l     | 0.00066     | 0.00634           | 0.98392      | ×            | 0.00000
the   | 0.00040     | 0.00454           | 0.98846      | ×            | 0.00000
I     | 0.00031     | 0.00385           | 0.99231      | ×            | 0.00000
      | 0.00031     | 0.00385           | 0.99615      | ×            | 0.00000
What  | 0.00031     | 0.00385           | 1.00000      | ×            | 0.00000

With a higher Temperature and Top-p, the probability distribution becomes more diffuse, allowing for more diverse token selection.

Jupyter Notebook #

For further exploration, I’ve created a Jupyter Notebook that provides a starting point for experimenting with Temperature, Top-p, and OpenAI’s log probabilities. It includes the code examples used in this article.

https://github.com/Lundgren/llm-notebooks/blob/master/notebooks/playing-with-temperature-and-top-p-in-open-ais-api.ipynb