December 19, 2024
Perplexity is, historically speaking, one of the “standard” evaluation metrics for language models. And while recent years have seen a surge in more complex and robust metrics, including LLM-based evaluations, perplexity still has a lot of value as a component in your evaluation suite. If you want to build effective evaluation pipelines—or just understand what researchers mean when they report perplexity scores—you need to have a grasp on what perplexity is and how it can be implemented.
Perplexity seeks to quantify the “uncertainty” a model experiences when predicting the next token in a sequence. High uncertainty occurs when the model is unsure about the next word or token in a sequence. This can happen when the input is ambiguous or the model hasn’t seen similar examples during training. Quantifying uncertainty in language models helps us judge when they might need human oversight or further training, allowing us to handle those cases differently. This is especially critical in high-stakes situations, like with medical or legal advice, where an overconfident wrong answer could have serious consequences.
But this is just scratching the surface of perplexity. In this article, I want to go in depth, covering perplexity’s mathematical basis, underlying intuitions, and limitations. I’ll show you how to implement perplexity from scratch in Python, and how to add perplexity to your evaluation suite using Opik, our open-source LLM evaluation framework.
Let’s dive in!
Perplexity was first introduced in 1977 by a team of IBM researchers working on speech recognition. The team, led by Frederick Jelinek, was looking for a metric that could measure the difficulty a statistical model experienced while making a prediction. As an interesting aside, Jelinek is the original author of the famous quote “Every time I fire a linguist, the performance of the speech recognizer goes up.”
The key insight of the initial perplexity paper is that by applying concepts from information theory to a model’s internal state, we can begin to quantify more subtle qualities of a model. While the original authors were looking for a metric to approximate the “difficulty” of speech recognition tasks, researchers working on NLP quickly recognized perplexity as relevant to their work as well.
Throughout the 1980s and 1990s, perplexity emerged as the key metric for evaluating the performance of n-gram models. Perplexity was used to measure how well these models captured linguistic patterns by quantifying the average uncertainty of predictions. “Uncertainty” was calculated using entropy and its close mathematical relative, cross-entropy, both of which we’ll explore in more detail shortly.
Perplexity remains a primary benchmark to this day and is a popular metric for evaluating sequential neural networks (including the GPT family of models). Its many advantages, and its historical role in benchmarking, make it common even in contemporary research. At the same time, its many limitations make it insufficient as a standalone evaluation metric, especially for modern LLMs.
In order to gain a more intuitive understanding of perplexity and its pros and cons, we need to first explore the underlying mathematics. Namely, we need to understand entropy and cross-entropy. If you already feel comfortable with these topics, feel free to skip the following section.
Perplexity, as an evaluation metric, has its roots in information theory and probabilistic modeling, building on Claude Shannon’s work on entropy in the 1940s. Shannon used language entropy to describe the amount of information in a message, framing it as the number of binary digits needed to encode each letter of a natural-language text:
“The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average of binary digits required per letter of the original language.” – (Claude Shannon in Prediction and Entropy of Printed English, 1951)
As described by Shannon in the 40s and 50s, language entropy quantifies the average amount of information contained in a word or sequence of words, reflecting how unpredictable the next word is based on previous context. In other words, language entropy refers to the degree of uncertainty or unpredictability in a language’s word distribution.
In one experiment from that paper, Shannon counted how many guesses it took a human being to correctly predict each letter (including spaces) in a sentence, given only the preceding letters in the sequence.
Entropy is calculated as

$$H(P) = -\sum_{i} p(w_i) \log p(w_i)$$

where $p(w_i)$ is the probability of the $i$th word occurring in a given context, and the summation runs over all possible words in the vocabulary. The negative sign ensures that entropy is a non-negative value, since $\log p(w_i)$ is never positive for a probability between 0 and 1.
Higher entropy values indicate lower predictability and greater diversity in word choice, while lower values suggest more predictable language patterns, reflecting the underlying complexity of the language being modeled.
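To make the contrast between predictable and unpredictable distributions concrete, here is a minimal sketch in plain Python. The two four-word distributions are hypothetical values chosen for illustration, and the `entropy` helper is not part of any library.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution (in bits when base=2)."""
    # Terms with p == 0 are skipped: they contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical next-word distributions over a four-word vocabulary
uniform = [0.25, 0.25, 0.25, 0.25]  # every word equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # one word dominates

print(f"uniform: {entropy(uniform):.3f} bits")
print(f"peaked:  {entropy(peaked):.3f} bits")
```

The uniform distribution gives the largest entropy a four-way choice can have (2 bits), while the peaked distribution comes out far lower because the next word is almost always the same.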
Because the output of a large language model is typically a probability distribution calculated across all possible output tokens, entropy is very straightforward to calculate. A natural next question is how we might use entropy to train a language model, and that is where entropy’s close relative cross-entropy comes in.
While entropy measures the average uncertainty in a single probability distribution, cross-entropy quantifies the difference between two probability distributions. In the case of language modeling, these would be the true distribution, P, and the model’s predicted distribution, Q. In this way cross-entropy provides a way to assess how well the model’s predictions approximate the actual distribution of the data.
Cross-entropy is calculated as

$$H(P, Q) = -\sum_{i} p(x_i) \log q(x_i)$$

Here, $p(x_i)$ represents the true probability of outcome $x_i$ and $q(x_i)$ denotes the predicted probability. Lower cross-entropy values indicate higher certainty, as they imply that the model’s probabilities are more closely aligned with the actual distribution.
In the context of model training, cross-entropy is used as the backbone of many loss functions. It measures the difference between how likely the model is to select a given token, versus the true likelihood that a given token is correct.
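As a rough illustration of how cross-entropy behaves as a loss, the sketch below compares two hypothetical predicted distributions against a one-hot "true" distribution. The `cross_entropy` helper and all probability values are assumptions made up for this example.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats between a true distribution p and a prediction q."""
    # Outcomes with zero true probability contribute nothing to the sum.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical setup: the correct next token is index 2 of a four-token vocabulary,
# so the true distribution is one-hot.
p_true = [0.0, 0.0, 1.0, 0.0]
q_good = [0.05, 0.05, 0.85, 0.05]  # most probability mass on the correct token
q_poor = [0.40, 0.30, 0.10, 0.20]  # most probability mass on wrong tokens

print(f"good prediction: {cross_entropy(p_true, q_good):.3f} nats")
print(f"poor prediction: {cross_entropy(p_true, q_poor):.3f} nats")
```

With a one-hot true distribution, the sum collapses to the negative log-probability the model assigned to the correct token, which is why this quantity is often called the negative log-likelihood in training code.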
Once you have a grasp of entropy and cross-entropy, perplexity follows intuitively.
Like entropy and cross-entropy, perplexity also quantifies a model’s uncertainty in predicting the next token in a sequence. So, why not just use entropy or cross-entropy? It turns out perplexity is far more intuitive in explaining model behavior.
Mathematically speaking, perplexity is defined as the exponentiated average negative log-likelihood of the predicted words in a sequence. Or, less verbosely, perplexity is cross-entropy with the exponential function applied. This transformation might seem somewhat arbitrary at first, but it actually makes a big difference, especially in terms of interpretability.
Because cross-entropy is a negative log measure, when we take its exponential, we “undo” the log, converting the measure from log space back into an inverse probability. This value represents the tangible count of likely choices the model considers at each step, or the “effective branching factor.”
Perplexity, then, is essentially a measure of how many options the model finds plausible on average, with lower values indicating fewer options (more confident predictions) and higher values indicating more options (greater uncertainty).
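Written out, the definition looks like the following. This is a sketch in standard notation rather than a formula taken from a specific paper: $N$ is the number of predicted tokens, $q(x_i \mid x_{<i})$ is the probability the model assigns to token $x_i$ given the preceding tokens, and the logarithm is assumed to be natural.

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid x_{<i}) \right)

% Special case: if the model assigns probability 1/k to every observed token,
% each term is \log(1/k) = -\log k, so
\mathrm{PPL} = \exp(\log k) = k
```

The special case is where the "effective branching factor" reading comes from: a model that is always choosing among $k$ equally likely tokens has a perplexity of exactly $k$.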
| Metric | What it measures |
| --- | --- |
| Entropy | Measures the average uncertainty in a single probability distribution P |
| Cross-entropy | Measures how well a predicted distribution Q approximates the true distribution P |
| Perplexity | Exponentiation of cross-entropy; measures how many likely candidate tokens the model is choosing between |
In summary, entropy measures the inherent uncertainty in a true probability distribution, reflecting the average unpredictability of outcomes, such as words in a language. Cross-entropy extends this concept by measuring the difference between the true distribution of the data and the predicted distribution from a model, penalizing inaccurate predictions. Perplexity builds on cross-entropy by transforming it into a more interpretable form, using the exponential function to express how many equally likely word choices the model is effectively considering.
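Note that the perplexity score of a language model on a sequence of tokens is computed by averaging the negative log-likelihood of each predicted token and then exponentiating that average (equivalently, it is the geometric mean of the inverse probabilities the model assigned to the observed tokens). This means that if a language model has a perplexity of 10, the model is, on average, as uncertain as if it were selecting between 10 equally likely options for the next word.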
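The "equally likely options" reading can be checked with a few lines of plain Python. This is a minimal sketch: the `perplexity` helper and the probability lists are hypothetical, standing in for the probabilities a model assigned to the tokens that actually occurred.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that spreads probability evenly over 10 options at every step
uniform_over_10 = [0.1] * 5

# A hypothetical mix of confident and unsure predictions
mixed = [0.9, 0.5, 0.05, 0.8]

print(f"uniform over 10 options: {perplexity(uniform_over_10):.2f}")
print(f"mixed confidence:        {perplexity(mixed):.2f}")
```

The uniform case comes out at 10 (up to floating-point error), matching the branching-factor interpretation. In the mixed case, the single low-probability token (0.05) pulls the score up noticeably, since the average is taken in log space.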
Using this intuition, a lower perplexity score is better because it indicates that a model is effectively “choosing” between fewer viable options for the next word and is “less surprised.” A higher perplexity score, on the other hand, indicates more “uncertainty.”
Of course, it is entirely possible for a language model to be “confident” and “incorrect,” so perplexity should not be confused with an accuracy metric. But we’ll dive into more of perplexity’s limitations later on. First, let’s explore some of its advantages.
As mentioned, one of the biggest advantages of perplexity is that it is highly intuitive and explainable in a field that is notoriously opaque. This is a notable advantage over learned metrics like BERTScore and LLM-as-a-Judge metrics like G-Eval.
Having an estimate of a model’s certainty is also especially useful when using an LLM to plan or guide actions. While high certainty suggests the model has strong backing for a given prediction, low certainty can prompt further human oversight or additional checks before execution.
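One way this could look in practice is a simple gate that routes uncertain outputs to a human. The sketch below is illustrative only: the `needs_review` helper, the threshold of 20, and the per-token log-probabilities are all hypothetical, and a real threshold would need to be tuned for the model and task.

```python
import math

def sequence_perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def needs_review(token_log_probs, threshold=20.0):
    """Flag an output for human review when its perplexity exceeds `threshold`.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    return sequence_perplexity(token_log_probs) > threshold

# Hypothetical per-token log-probabilities for two generated answers
confident_answer = [-0.1, -0.3, -0.2, -0.05]
uncertain_answer = [-3.5, -2.9, -4.1, -3.2]

for name, log_probs in [("confident", confident_answer), ("uncertain", uncertain_answer)]:
    ppl = sequence_perplexity(log_probs)
    print(f"{name}: perplexity={ppl:.2f}, needs_review={needs_review(log_probs)}")
```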
Perplexity is also computationally straightforward to calculate, making it fast and efficient. This also allows practitioners to evaluate model performance in real-time during training, helping to identify issues and improvements promptly and leading to faster development cycles.
As we’ll see in the next section on perplexity’s limitations, it is not a be-all and end-all evaluation metric for LLMs. However, given its explainability and low overhead, perplexity is a quick and useful first-pass metric that works well when used in conjunction with other LLM evaluation metrics.
The most important limitation of perplexity is that it does not convey a model’s “understanding.” Perplexity is strictly a measure of uncertainty, and a model being uncertain doesn’t mean it is right or wrong. A model may be correct but unconfident or wrong but confident. So, a perplexity score isn’t a measure of accuracy, just of confidence.
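A toy comparison shows how perplexity and accuracy can disagree. In this sketch, perplexity is measured against the reference tokens and accuracy is top-1 agreement with them. Both "models" are hypothetical probability tables invented for the example.

```python
import math

def perplexity(distributions, targets):
    """Perplexity of the probabilities assigned to the correct tokens."""
    nll = [-math.log(dist[t]) for dist, t in zip(distributions, targets)]
    return math.exp(sum(nll) / len(nll))

def accuracy(distributions, targets):
    """Fraction of positions where the highest-probability token is correct."""
    hits = 0
    for dist, target in zip(distributions, targets):
        predicted = max(range(len(dist)), key=lambda i: dist[i])
        hits += int(predicted == target)
    return hits / len(targets)

targets = [0, 0, 0]  # the correct token is index 0 at every position

# Model A always ranks a wrong token first, but keeps 0.4 on the correct one
model_a = [[0.4, 0.6, 0.0]] * 3
# Model B ranks the correct token first, but only barely
model_b = [[0.34, 0.33, 0.33]] * 3

for name, model in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: perplexity={perplexity(model, targets):.2f}, "
          f"accuracy={accuracy(model, targets):.0%}")
```

Model A ends up with the lower (better) perplexity even though its top prediction is never correct, while Model B is always right but has the higher perplexity.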
It is also difficult to use perplexity as a benchmark between models. Perplexity scores are influenced by various model-specific factors, such as tokenization method, dataset, pre-processing steps, vocabulary size, and context length. For example, a character-level model may have a lower perplexity than a word-level model, but that doesn’t necessarily mean the character-level model is better.
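The tokenization effect is easy to see with some arithmetic. The sketch below assumes, purely for illustration, that two models assign the same total negative log-likelihood (40 nats) to the same sentence, but one splits it into 10 word-level tokens and the other into 50 character-level tokens.

```python
import math

def perplexity_from_total_nll(total_nll, num_tokens):
    """Perplexity is exp(total NLL / number of tokens), so the token count matters."""
    return math.exp(total_nll / num_tokens)

# Hypothetical numbers: same sentence, same total negative log-likelihood (in nats)
total_nll = 40.0

word_level = perplexity_from_total_nll(total_nll, num_tokens=10)
char_level = perplexity_from_total_nll(total_nll, num_tokens=50)

print(f"word-level perplexity: {word_level:.2f}")
print(f"char-level perplexity: {char_level:.2f}")
```

Both models explain the sentence equally well overall, yet the character-level score is far lower simply because the same total is divided across more tokens. Per-token perplexities are only comparable when the tokenization is the same.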
A model can also achieve a low perplexity score by assigning high probabilities to common words, like articles and conjunctions, leading to a misleadingly low score. Overfit models can show low perplexity but lack true understanding. Research also suggests that perplexity doesn’t correlate well with an LLM’s long-term understanding, likely because it fails to capture long-term dependencies. Additionally, perplexity can be skewed by punctuation and repeated text spans, which lower scores but don’t necessarily improve text quality.
While perplexity has limitations, it remains a valuable first-pass metric when combined with other task-specific LLM evaluation metrics, offering both interpretability and efficiency.
Using what we’ve learned so far about perplexity, let’s implement it from scratch in Python so we can apply it directly to our LLM outputs. Note that because perplexity is such a common evaluation metric, there are several pre-built modules to implement it in Python, including Hugging Face’s evaluate.metrics.perplexity and perplexed from Stability AI. But coding the metric from scratch will help build intuition for what perplexity is actually doing under the hood. Later, we’ll test our function out on GPT-2 and learn how to automatically track the perplexity scores of our LLM using a custom metric in Opik.
Throughout our implementation, we’ll be using PyTorch and HuggingFace’s Transformers library.
Our basic perplexity function will take logits and target labels as inputs.
Logits are the raw scores output by the model for each token in the vocabulary for a given position in the input sequence. For each position in the sequence, the model outputs a vector of logits, where each entry in that vector corresponds to a token in the vocabulary.
The targets will be the ground truth label tensors.
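To calculate perplexity, we’ll need to:
1. Convert the logits to log probabilities.
2. Gather the log probabilities of the correct target tokens.
3. Take the mean negative log-likelihood over all tokens.
4. Exponentiate the mean negative log-likelihood.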
import torch

def calculate_perplexity(logits, target):
    """
    Calculate perplexity from logits and target labels.

    Args:
    - logits (torch.Tensor): Logits output from the model (batch_size, seq_length, vocab_size).
    - target (torch.Tensor): Ground truth labels (batch_size, seq_length).

    Returns:
    - perplexity (float): The perplexity score.
    """
    # Convert logits to log probabilities
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

    # Gather the log probabilities for the correct target tokens
    # log_probs has shape (batch_size, seq_length, vocab_size)
    # target has shape (batch_size, seq_length)
    # The gather method will pick the log probabilities of the true target tokens
    target_log_probs = log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)

    # Calculate the negative log likelihood
    negative_log_likelihood = -target_log_probs

    # Calculate the mean negative log likelihood over all tokens
    mean_nll = negative_log_likelihood.mean()

    # Calculate perplexity as exp(mean negative log likelihood)
    perplexity = torch.exp(mean_nll)

    return perplexity.item()

# Example usage
# Simulate a batch of logits (batch_size=2, seq_length=4, vocab_size=10)
logits = torch.randn(2, 4, 10)
# Simulate ground truth target tokens
target = torch.tensor([[1, 2, 3, 4], [4, 3, 2, 1]])
# Calculate perplexity
perplexity = calculate_perplexity(logits, target)
print(f'Perplexity: {perplexity}')
The function above calculates perplexity from a mathematical perspective, but it requires some adjustments to handle raw text, as you would encounter in real-world scenarios.
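As a sanity check on the from-scratch calculation, the same number can be obtained by exponentiating PyTorch's built-in cross-entropy loss. The sketch below repeats the log-softmax, gather, mean, and exp steps inline so that it runs on its own. The random logits and fixed seed are only there to make the comparison reproducible.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Simulated logits (batch_size=2, seq_length=4, vocab_size=10) and targets
logits = torch.randn(2, 4, 10)
target = torch.tensor([[1, 2, 3, 4], [4, 3, 2, 1]])

# Manual route: log-softmax, gather the target log-probs, mean, exp
log_probs = F.log_softmax(logits, dim=-1)
target_log_probs = log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
manual_ppl = torch.exp(-target_log_probs.mean())

# Built-in route: cross_entropy expects (N, vocab_size) logits and (N,) targets,
# and averages over all tokens by default
builtin_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
builtin_ppl = torch.exp(builtin_loss)

print(f"manual:   {manual_ppl.item():.4f}")
print(f"built-in: {builtin_ppl.item():.4f}")
```

The two values should agree up to floating-point error, which is another way of seeing that perplexity is just the exponential of the average cross-entropy loss.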
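Now that we’ve covered the math behind perplexity, let’s modify the function to work with the inputs and outputs of a large language model. For this version of our function we’ll want to:
1. Tokenize a batch of input texts, padding them to a uniform length.
2. Pass the batch through the model to get logits.
3. Shift the logits and labels so that each position predicts the next token.
4. Mask out the padding positions.
5. Compute a perplexity score for each sequence, plus the mean across the batch.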
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (e.g., GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assign the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

def calculate_batch_perplexity(input_texts):
    """
    Calculate perplexity for a batch of input texts using a pretrained language model.

    Args:
    - input_texts (List[str]): A list of input texts to evaluate.

    Returns:
    - dict: "perplexities" (List[float], one score per input text) and
      "mean_perplexity" (float, the arithmetic mean of those scores).
    """
    # Tokenize the batch of texts with padding for uniform length
    inputs = tokenizer(
        input_texts, return_tensors="pt", padding=True, truncation=True
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Pass the input batch through the model to get logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

    # Shift the logits and input_ids to align targets correctly
    # Logits dimensions are: (batch_size, seq_length, vocab_size)
    shift_logits = logits[:, :-1, :]  # Ignore the last token's logits
    shift_labels = input_ids[:, 1:]   # Skip the first token in the labels

    # Compute log probabilities
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)

    # Gather the log probabilities for the correct tokens
    target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)

    # Mask out positions corresponding to padding tokens
    target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)

    # Compute the mean negative log-likelihood for each sequence
    negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)

    # Compute perplexity for each sequence
    perplexities = torch.exp(negative_log_likelihood)

    # Take the mean while perplexities is still a tensor (torch.mean fails on a list)
    mean_perplexity_score = perplexities.mean().item()

    return {"perplexities": perplexities.tolist(), "mean_perplexity": mean_perplexity_score}

# Example usage
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A journey of a thousand miles begins with a single step."
]
print(f"Perplexity scores: {calculate_batch_perplexity(texts)}")
This function takes as input a list of texts, and outputs a dictionary containing a list of perplexity scores for each text in the input list, as well as the average perplexity score of text sequences in the input list.
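To see why the padding mask matters, the sketch below compares masked and unmasked per-sequence perplexities on a toy tensor. The log-probability values are hypothetical. In the real function, the mask is the attention mask sliced to line up with the shifted labels.

```python
import torch

# Hypothetical log-probabilities of the target tokens for two sequences.
# The second sequence has only two real tokens; its last two positions are padding,
# where the model's "predictions" are meaningless.
target_log_probs = torch.tensor([
    [-1.0, -2.0, -1.5, -0.5],
    [-1.0, -2.0, -9.0, -9.0],
])
mask = torch.tensor([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
])

# Unmasked: padding positions are treated as real predictions
naive_ppl = torch.exp(-target_log_probs.mean(dim=-1))

# Masked: zero out padding, then divide by the number of real tokens per sequence
masked_log_probs = target_log_probs * mask.to(target_log_probs.dtype)
masked_ppl = torch.exp(-masked_log_probs.sum(dim=-1) / mask.sum(dim=-1))

print("unmasked:", naive_ppl.tolist())
print("masked:  ", masked_ppl.tolist())
```

The first sequence has no padding, so both routes agree. For the second, the unmasked score is inflated by the padded positions, while the masked score reflects only the two real tokens.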
Note that taking the average of perplexity scores across texts of different lengths can lead to a skewed overall perplexity score for a couple of reasons.
First, perplexity scores tend to be more stable for longer sequences, while shorter sequences may have higher variance, leading to outliers. Second, taking a simple arithmetic mean across scores for texts of varying lengths can give disproportionate weight to tokens in shorter sequences. Nevertheless, using the arithmetic mean is currently the most common approach to calculating overall perplexity, so we use it here for the sake of consistency.
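The difference between the two averaging strategies can be sketched in plain Python. The helper names and per-token log-probabilities below are hypothetical. The token-weighted variant pools every token's negative log-likelihood before exponentiating, which is one alternative to averaging per-sequence scores.

```python
import math

def mean_of_sequence_perplexities(sequences):
    """Arithmetic mean of per-sequence perplexities (every sequence counts equally)."""
    ppls = [math.exp(-sum(seq) / len(seq)) for seq in sequences]
    return sum(ppls) / len(ppls)

def token_weighted_perplexity(sequences):
    """Perplexity over all tokens pooled together (every token counts equally)."""
    total_nll = -sum(sum(seq) for seq in sequences)
    total_tokens = sum(len(seq) for seq in sequences)
    return math.exp(total_nll / total_tokens)

# Hypothetical per-token log-probabilities: a short, hard sequence and a long, easy one
short_hard = [-4.0, -4.0]
long_easy = [-1.0] * 18
sequences = [short_hard, long_easy]

print(f"mean of per-sequence perplexities: {mean_of_sequence_perplexities(sequences):.2f}")
print(f"token-weighted perplexity:         {token_weighted_perplexity(sequences):.2f}")
```

Here the two-token sequence dominates the arithmetic mean even though it accounts for only 2 of the 20 tokens, which is the skew described above.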
In the real world, you’ll likely want to use an evaluation framework to implement LLM metrics. In this section, we’ll implement perplexity in Opik, Comet’s open source LLM evaluation framework.
Here, we use our original perplexity function and modify it slightly to implement it as a custom Opik metric with a `score` method that returns a `ScoreResult` object:
import torch
from opik.evaluation.metrics import base_metric, score_result

class Perplexity(base_metric.BaseMetric):
    """
    Perplexity (PPL) is a common LLM evaluation metric defined as the exponentiated average
    negative log-likelihood of a sequence.

    For more information on perplexity, see:
    https://en.wikipedia.org/wiki/Perplexity

    Args:
        name: The name of the metric, perplexity.
    """
    def __init__(
        self,
        name: str = "Perplexity",
    ):
        super().__init__(name=name)

    def score(
        self, input_ids: torch.Tensor, logits: torch.Tensor, attention_mask: torch.Tensor
    ) -> score_result.ScoreResult:
        """
        Calculate the perplexity score of each token given the previous tokens in the sequence.

        Args:
            input_ids: input ids of the text sequence input to the model (torch.Tensor)
            logits: output logits of the model (torch.Tensor)
            attention_mask: attention mask

        Returns:
            score_result.ScoreResult: A ScoreResult object
        """
        # Shift the logits and input_ids to align targets correctly
        shift_logits = logits[:, :-1, :]  # Ignore the last token's logits
        shift_labels = input_ids[:, 1:]   # Skip the first token in the labels

        # Compute log probabilities
        log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)

        # Gather the log probabilities for the correct tokens
        target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)

        # Mask out positions corresponding to padding tokens
        target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)

        # Compute the mean negative log-likelihood for each sequence
        negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)

        # Take the exp(negative_log_likelihood)
        perplexities = torch.exp(negative_log_likelihood)

        # Take the mean of perplexity scores and convert it to a plain Python float
        mean_perplexity_score = torch.mean(perplexities).item()

        return score_result.ScoreResult(value=mean_perplexity_score, name=self.name)

perplexity = Perplexity()
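After defining perplexity as a custom metric, we can use it by:
1. Wrapping our LLM application with Opik’s `@track` decorator.
2. Defining an evaluation task that returns the arguments the metric’s `score` method expects.
3. Passing the dataset, the task, and the metric to `evaluate`.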
You can find the full code in the Colab.
from opik import track
from opik.evaluation import evaluate

@track
def your_llm_application(input: str) -> dict:
    # Tokenize the batch of texts with padding for uniform length
    inputs = tokenizer(
        input, return_tensors="pt", padding=True, truncation=True
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Pass the input batch through the model to get logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    return {"input_ids": input_ids,
            "logits": outputs.logits,
            "attention_mask": attention_mask}

@track
def evaluation_task(x):
    llm_outputs = your_llm_application(x['input'])
    # The keys returned here match the argument names of Perplexity.score
    return {
        "input_ids": llm_outputs['input_ids'],
        "logits": llm_outputs['logits'],
        "attention_mask": llm_outputs['attention_mask']
    }

# `dataset` is an Opik dataset that must be created beforehand (not shown here)
evaluation = evaluate(
    experiment_name="My ppl experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[perplexity],
    experiment_config={
        "model": model_name
    }
)
And here is what the output of your evaluation should look like from within the Opik UI:
Perplexity is extremely popular for its intuitiveness and efficiency, but it only provides a partial picture of a language model’s performance. It captures a model’s certainty about its predictions but, notably, it does not convey a model’s “understanding.”
For a more complete understanding of a model’s behavior, perplexity should be used alongside other evaluation metrics, such as accuracy and fluency, as well as task-specific metrics like relevance, coherence, factuality, and hallucination detection. Because of its computational efficiency, perplexity is particularly useful as a first-pass metric, but has significant limitations that require additional evaluation methods to address.
More nuanced evaluation methods include using an LLM-as-a-judge, but these methods are also often less interpretable. Especially when they rely on the same language model that is being evaluated, they can introduce biases, circular reasoning, and high variability in results. These limitations make it essential to pair LLM-as-a-judge metrics with other evaluation methods, like perplexity, which has been shown to outperform an LLM-as-a-judge prompted with basic instructions at estimating text quality.
Using perplexity as part of a “suite” of metrics is useful beyond just the “extra cover” of additional metrics, however. Seeing where these metrics diverge can help you identify problematic data and points of failure in your evaluation suite.
For example, a high perplexity and high accuracy score may indicate that while the model is correct in specific answers, it is uncertain overall and needs additional training. Likewise, a model with low perplexity and low coherence may produce text it is confident in, but that doesn’t flow logically, which may not be acceptable for your application and which could point to issues with sentence structure in the training data. Conversely, high perplexity and high coherence suggest the model is uncertain about its predictions even when producing coherent text. As a final example, if both hallucination detection scores and perplexity scores are high, the model is both uncertain and likely producing fabricated content, suggesting potential weaknesses in grounding or fact-based reasoning within the training pipeline. Monitoring these divergences helps identify specific areas for model and data improvement to better align with your model’s intended performance.
In summary, perplexity is a valuable metric for evaluating language models because it measures their confidence in predicting text sequences. While it offers useful insights, perplexity should be used alongside other metrics to get a fuller picture of model performance. This approach helps highlight specific strengths and weaknesses, allowing for more targeted improvements and reliable assessments of model quality.
If you found this article useful, follow me on LinkedIn and Twitter for more content!