LLMs are known to show creativity and produce diverse outputs. But how do they do it?
This post explores various methods to achieve diversity and creative unpredictability in LLM outputs.
Determinism in LLM Outputs
Large Language Models (LLMs), by design, exhibit non-deterministic behavior: they do not generate identical responses to the same input, and each query, even when repeated under similar conditions, can yield a different outcome. This variability is fundamental to the design of LLMs, enhancing their ability to produce diverse and creative outputs.
Take for example these two responses:
Input: Answer in 10 words. What is an apple?
Response 1: A juicy, sweet, and crunchy fruit that grows on trees.
Response 2: A juicy, sweet, and crunchy fruit that’s often eaten fresh.
At the core of this difference is the way the next token is sampled.
To illustrate this point, let’s take an example of “A quick brown fox”.
Given these words as input, the model will generate the next token in the sequence, which, according to common knowledge, is "jumps". So the model will assign the highest probability to the "jumps" token.
Input: A quick brown fox
The model takes these input words as tokens and outputs a logit (a raw, unnormalized score) for each token in the vocabulary for the next-token prediction. These logits are converted to normalized probabilities using the softmax function.
Normalized Probabilities:
```
tensor([[0.01, 0.05, 0.2, 0.02, ...]])
```
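As a minimal sketch of this conversion (the logit values below are made up for illustration), PyTorch's softmax turns the raw scores into probabilities that sum to 1:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a handful of candidate next tokens (illustrative values only).
logits = torch.tensor([[0.9, 2.5, 3.9, 1.6]])

# Softmax converts the unnormalized logits into probabilities that sum to 1.
probs = F.softmax(logits, dim=-1)
print(probs, probs.sum())
```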
The next step is to select the next token based on these probabilities. A straightforward approach would be to simply choose the token with the highest probability (greedy decoding). However, this method is not optimal, as it leads to deterministic behavior in the model’s output.
The deterministic method has several drawbacks:
- During the inference phase, the tokens generated by the model are fed back as input for subsequent predictions. If these tokens are inaccurate, the errors are perpetuated through each cycle of feedback, leading to a degradation in the quality of the output.
- Additionally, deterministic selection tends to cause repetitive sequences of tokens. Such repetition can render the text output unnatural, robotic, and lacking in coherence.
To enhance the quality and variability of the output, it is preferable to employ a sampling method. This method involves selecting the next token based on a probabilistic distribution of the predicted probabilities, rather than merely choosing the token with the highest probability.
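As a rough sketch of the difference (with made-up probabilities over a tiny vocabulary), greedy selection versus a weighted random draw looks like this in PyTorch:

```python
import torch

# Hypothetical next-token probabilities over a tiny vocabulary (illustrative values only).
vocab = ["sits", "runs", "jumps", "disappears", "sleeps"]
probs = torch.tensor([0.01, 0.05, 0.20, 0.02, 0.72])

# Deterministic (greedy): always pick the single most probable token.
greedy_idx = torch.argmax(probs).item()

# Sampling: draw a token at random, weighted by its probability.
sampled_idx = torch.multinomial(probs, num_samples=1).item()

print("greedy :", vocab[greedy_idx])   # the same token on every run
print("sampled:", vocab[sampled_idx])  # can differ from run to run
```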
Sampling
Sampling refers to the technique of selecting a specific subset from a larger set of data, which in this context is referred to as the population.
In the case of language models, this subset consists of a single token chosen from the output probabilities of the entire vocabulary. This selection process is crucial and acts as a control knob for diverse and contextually appropriate responses.
Sampling from a Multinomial Distribution
The selection of the next token in a sequence is performed using a probabilistic approach based on a multinomial distribution. Here’s how it works:
Probability Assignment: First, the model computes the logits for each potential next token and applies the softmax function to convert these logits into normalized probabilities.
Random Sampling: A random number between 0 and 1 is generated. This number is used to select the first token whose cumulative probability exceeds this random value.
For instance, consider the phrase “A quick brown fox”. After processing this input, the model predicts the probabilities for potential next tokens. Let’s explore how the token “jumps” might be selected based on these probabilities:
- The model calculates the logits for each possible next token and applies the softmax function to derive the probabilities.
- A random number is generated. The token corresponding to the first cumulative probability that is greater than this random number is chosen as the next token in the sequence.
This method ensures that while the most probable token (“jumps” in our example) is often selected, there is still room for less likely tokens to be chosen occasionally, thereby introducing variability and creativity into the text generated by the model.
Example of Token Selection Using Cumulative Probabilities
Consider the following table which lists tokens along with their individual and cumulative probabilities:
| Token | Probability | Cumulative Probability |
| --- | --- | --- |
| sits | 0.01 | 0.01 |
| runs | 0.05 | 0.06 |
| jumps | 0.2 | 0.26 |
| disappears | 0.02 | 0.28 |
Imagine a scenario where a random number, say 0.15, is generated to determine the next token. The selection process would proceed as follows:
| Token | Probability | Cumulative Probability | Random Number | Selection Outcome |
| --- | --- | --- | --- | --- |
| sits | 0.01 | 0.01 | 0.15 | Not Selected |
| runs | 0.05 | 0.06 | 0.15 | Not Selected |
| jumps | 0.2 | 0.26 | 0.15 | Selected |
| disappears | 0.02 | 0.28 | 0.15 | Not Evaluated |
In this example, the token “jumps” is selected because its cumulative probability of 0.26 is the first to exceed the random number of 0.15. This illustrates how the model can select a token based on a probabilistic approach, ensuring that while the most probable token is often chosen, other less likely tokens can also be selected occasionally, thereby adding diversity and unpredictability to the generated text.
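A minimal sketch of this cumulative-probability selection, reusing the illustrative numbers from the table above (in a real model the probabilities would cover the full vocabulary and sum to 1):

```python
import torch

tokens = ["sits", "runs", "jumps", "disappears"]
probs = torch.tensor([0.01, 0.05, 0.20, 0.02])  # illustrative subset of the vocabulary

cumulative = torch.cumsum(probs, dim=0)  # [0.01, 0.06, 0.26, 0.28]
r = 0.15  # the random number from the example
# In practice: r = torch.rand(1).item() * cumulative[-1].item()

# Pick the first token whose cumulative probability exceeds r.
idx = int(torch.searchsorted(cumulative, torch.tensor(r), right=True))
print(tokens[idx])  # "jumps"
```

PyTorch packages this whole procedure as torch.multinomial (a single weighted draw over the probabilities), which is also what nanoGPT’s generate loop relies on to pick the next token.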
The technique of choosing the next token from a set of possible tokens is known as sampling from a Multinomial Distribution.
Why is it called multinomial? The term multinomial is used because the selection isn’t binary; instead, there are multiple potential tokens that could be chosen next, each with its own associated probability. This method reflects the reality that multiple outcomes are possible, each with a different likelihood.
Why is this approach effective?
This method is effective because a randomly generated number is more likely to fall within a range that corresponds to a higher-probability token. A token with a probability of 0.5, for example, covers half of the cumulative probability range, so it is statistically more likely for the random number to fall within that range, leading to the selection of the corresponding token.
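A quick toy simulation of this claim (using the illustrative probabilities from the table, renormalized so they sum to 1) shows that each token is picked roughly in proportion to its probability:

```python
import torch
from collections import Counter

tokens = ["sits", "runs", "jumps", "disappears"]
probs = torch.tensor([0.01, 0.05, 0.20, 0.02])
probs = probs / probs.sum()  # renormalize the illustrative subset

# Draw 10,000 next tokens and count how often each one is picked.
draws = torch.multinomial(probs, num_samples=10_000, replacement=True)
counts = Counter(tokens[i] for i in draws.tolist())
print(counts)  # "jumps" appears roughly 0.20 / 0.28 ≈ 71% of the time
```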
Ways to control determinism
In the context of token selection, once we have established probabilities for all potential tokens in our vocabulary, we can apply further filtering techniques such as Top-k and Top-p sampling to refine our selection process.
Top-k Sampling
Top-k sampling involves selecting the k tokens with the highest probabilities and then randomly sampling from this subset. This method allows us to control the diversity and creativity of the generated text. By reducing the value of k, we narrow the pool of likely tokens, making the output more deterministic and predictable, as lower probability tokens are excluded from the selection process.
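A minimal sketch of top-k filtering, loosely following the approach used in nanoGPT (keep the k largest logits, mask out the rest, then sample); the logits are made up for illustration:

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int) -> int:
    """Sample a token index from only the k most probable tokens."""
    # Keep the k largest logits; everything below the k-th largest gets -inf,
    # so its probability becomes 0 after the softmax.
    topk_vals, _ = torch.topk(logits, k)
    filtered = logits.clone()
    filtered[filtered < topk_vals[-1]] = float("-inf")
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Hypothetical logits over a tiny vocabulary (illustrative values only).
logits = torch.tensor([0.5, 1.2, 3.0, 0.1, 2.4])
print(sample_top_k(logits, k=2))  # only indices 2 and 4 can ever be returned
```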
Top-p Sampling
Top-p sampling, on the other hand, is similar to Top-k but focuses on the cumulative probability. Instead of selecting a fixed number of tokens (k), it selects the smallest set of most probable tokens whose cumulative probability exceeds a predefined threshold (p). For example, if we set top-p to 0.5:
- We first sort the token probabilities in descending order.
- We then calculate the cumulative probability and continue to add tokens until the cumulative probability exceeds 0.5.
- Finally, we renormalize the probabilities within this refined subset and sample a token from it.
This method ensures that the selection is not limited to a fixed number of the most probable tokens, but instead includes however many tokens are needed to collectively reach the threshold probability, thus maintaining a balance between predictability and variability in the generated text.
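A minimal sketch of top-p (nucleus) sampling following the steps above (sort, accumulate until the threshold is reached, renormalize, sample); the logits are again made up for illustration:

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token up to and including the first one that pushes the
    # cumulative probability to at least p.
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
    nucleus = sorted_probs[:cutoff]
    nucleus = nucleus / nucleus.sum()  # renormalize within the nucleus
    choice = torch.multinomial(nucleus, num_samples=1).item()
    return sorted_idx[choice].item()

# Hypothetical logits over a tiny vocabulary (illustrative values only).
logits = torch.tensor([0.5, 1.2, 3.0, 0.1, 2.4])
print(sample_top_p(logits, p=0.5))
```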
Logit Scaling (Temperature)
Logit scaling, controlled by a temperature parameter, adjusts the distribution of token probabilities. A temperature below 1 sharpens the distribution, amplifying higher probabilities and leading to the selection of more probable tokens.
Conversely, a temperature above 1 flattens the distribution, boosting lower probabilities and increasing diversity but potentially reducing coherence. This technique allows for a customizable balance between predictability and creativity in text generation.
Hint: Mathematically, the logits are simply divided by the temperature T before the softmax is applied, so the probabilities become softmax(z / T).
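A minimal sketch of the effect (the logits are made up; lower temperatures concentrate the probability mass, higher ones spread it out):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.5, 1.2, 3.0, 0.1, 2.4])  # hypothetical logits

for temperature in (0.5, 1.0, 2.0):
    # Dividing the logits by the temperature sharpens (T < 1) or
    # flattens (T > 1) the distribution produced by the softmax.
    probs = F.softmax(logits / temperature, dim=-1)
    next_idx = torch.multinomial(probs, num_samples=1).item()
    print(temperature, [f"{p:.3f}" for p in probs.tolist()], "->", next_idx)
```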
Here is Andrej Karpathy’s implementation of some of the techniques discussed in this post, from the nanoGPT project:
Generate from nanoGPT
Citation

Chaturvedi, Pranav. (June 2024). How LLMs Introduce Creativity. https://pranavchat.com/llms/2024/05/12/sampling-in-llms.html

or

```bibtex
@article{chaturvedi2024sampling,
  title  = {How LLMs Introduce Creativity},
  author = {Chaturvedi, Pranav},
  year   = {2024},
  month  = {June},
  url    = {https://pranavchat.com/llms/2024/05/12/sampling-in-llms.html}
}
```
Attributions
- nanoGPT Implementation by Andrej Karpathy