What are Evals?

When you start working with Large Language Models (LLMs), you’ll quickly come across the concept of evals. Evals, evals, evals!
As a product manager building a product that uses LLMs to achieve a specific result, you’ll need to make sure that the product works as expected and that its users have the best possible experience. This means performing evaluations, or evals for short.
The key here is to understand that it’s not just people building LLMs themselves, the big tier-one companies like OpenAI, Anthropic and Cohere, that need to understand evals, but also companies building products where a key part of the product’s functionality uses an LLM under the hood. Your product is calling an LLM and you need to make sure it’s working as well as it can.
Furthermore, if you’re a cybersecurity professional, your team may be building internal products and solutions that use LLMs, and you need to understand evals in order to build secure, reliable, high-quality internal products and solutions.
What are evals?
AI evaluation refers to the process of assessing the performance, accuracy, and reliability of AI models designed to generate content, such as text, images, music, or videos. Model evaluation metrics are crucial for assessing AI models to ensure they meet the desired objectives. - Miquido
LLMs change all the time. Usually they get better, but sometimes they get worse at certain skills. Evals help us understand how well the LLM works, and that’s especially important if we’re using it to build a product or service. So we need to test it to make sure it’s working as expected, and we test it by running “evaluations” or “evals”.
Determinism: next best token
The process of predicting the next plausible “best” token is probabilistic in nature. This means that, LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there’s often more than one appropriate response. - Confident AI
LLMs are not as deterministic as code. With code, we can be reasonably sure that a function will produce the same output given the same input. This is not so easy with LLMs. They may not always produce the same output given the same input, and testing them may not be as easy as running the code.
Also, it’s not so easy to see the “code changes” in a new version of an LLM. With code, we could see the git commit history, review the pull request comments, read the release notes, etc. With LLMs, we don’t have exactly the same information. So we need to find other ways to understand how the LLM is performing, and that means evals.
How do you run evals?
There are three main types of evals (summarized from this great article):
-
Human Evaluations
Human evaluations incorporate feedback mechanisms directly into your product interface, such as thumbs-up/thumbs-down buttons or comment boxes beside LLM responses, allowing users to rate content while subject-matter experts can provide specialized feedback for prompt optimization or model fine-tuning through RLHF, though these evaluations tend to be sparse, lack specificity about what users liked or disliked, and hiring professional labelers can significantly increase costs.
-
Code-Based Evaluations
Code-based evaluations implement automated checks on API calls or generated code to verify functionality against objective criteria like error-free execution and syntax requirements, offering a quick and cost-effective implementation method for assessing technical accuracy without human intervention, but struggling to evaluate nuanced aspects like readability or contextual appropriateness that require human judgment, making them less suitable for creative or explanatory content.
-
LLM-Based Evaluations
LLM-based evaluations employ an external “judge” language model to assess your primary LLM system’s outputs through carefully crafted prompts, delivering scalable, human-like assessment at a fraction of the cost while providing detailed explanations for ratings, though this approach requires initial investment in creating and calibrating the evaluation system with a small set of human-labeled examples to ensure proper alignment with user preferences.
Code based evals
Here’s an example from Cohere.
In this code, it’s simply getting a simple score from an LLM based on a prompt and a criterion, where the result is a score between 0 and 1. That result is then used as part of a larger process to evaluate a series of outputs.
import cohere
co = cohere.ClientV2(api_key="<YOUR API KEY>")
response = co.chat(
model="command-a-03-2025",
messages=[
{
"role": "user",
"content": """
You are an AI grader that given an output and a criterion, grades the completion based on
the prompt and criterion. Below is a prompt, a completion, and a criterion with which to grade
the completion. You need to respond according to the criterion instructions.
## Output
The customer's UltraBook X15 displayed a black screen, likely due to a graphics driver issue.
Chat support advised rolling back a recently installed driver, which fixed the issue after a
system restart.
## Criterion
Rate the ouput text with a score between 0 and 1. 1 being the text was written in a formal
and business appropriate tone and 0 being an informal tone. Respond only with the score.
""",
}
],
)
print(response.message.content[0].text)
Using this idea of getting a score or a simple answer that we can use as a metric to evaluate the LLM, we can build larger evaluation systems, and of course there are many tools that can help us do that.
Red Team evals
LLM red-teaming is a testing technique where you simulate attacks or feed adversarial inputs to uncover vulnerabilities in the system. This is a crucial step in evaluating AI system safety for high-risk applications. - Evidently AI
As evals are effectively testing LLM output against a criterion, red team evals are testing the LLM against adversarial inputs, and one could see how evals and red teaming are related.
Red teaming could include:
- Adversarial testing: Deliberately crafting prompts designed to make the LLM generate harmful, biased, or otherwise problematic outputs
- Jailbreaking attempts: Trying different techniques to bypass content filters or safety guardrails
- Prompt injection exploration: Testing if the model can be manipulated into ignoring its instructions or revealing sensitive information
- Exploitation of tool-use capabilities: Testing if an LLM with access to external tools or APIs can be manipulated to misuse these capabilities
- Stress testing for edge cases: Pushing the model into unusual or extreme scenarios to observe behavior
Conclusion
Testing code has always been key to building secure, reliable, and high-quality products. LLMs are no different. Evals are a key part of building secure, reliable, and high-quality products that are BASED on LLMs. The ability to run evals is becoming a core competency for product managers and cybersecurity professionals alike.