> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://humanloop.com/docs/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://humanloop.com/docs/_mcp/server.

> Learn how to set up and use Humanloop's evaluation framework to test and track the performance of your prompts.

A key part of successful prompt engineering and deployment for LLMs is a robust evaluation framework. In this section we provide guides for how to set up Humanloop's evaluation framework in your projects.

The core entity in the Humanloop evaluation framework is an **evaluator** - a function you define which takes an LLM-generated log as an argument and returns an **evaluation**. The evaluation is typically either a boolean or a number, indicating how well the model performed according to criteria you determine based on your use case.

## Types

Currently, you can define your evaluators in two different ways:

* **Python** - using our in-browser editor, define simple Python functions to act as evaluators
* **LLM** - use language models to evaluate themselves! Our evaluator editor allows you to define a special-purpose prompt which passes data from the underlying log to a language model. This type of evaluation is particularly useful for more subjective evaluation such as verifying appropriate tone-of-voice or factuality given an input set of facts.

## Modes: Monitoring vs. testing

Evaluation is useful for both testing new model configs as you develop them and for monitoring live deployments that are already in production.

To handle these different use cases, there are two distinct modes of evaluator - **online** and **offline**.

### Online

Online evaluators are for use on logs generated in your project, including live in production. Typically, they are used to monitor deployed model performance over time.

Online evaluators can be set to run automatically whenever logs are added to a project. The evaluator takes the `log` as an argument.

### Offline

Offline evaluators are for use with predefined test [**datasets**](/docs/v4/datasets) in order to evaluate models as you iterate in your prompt engineering workflow, or to test for regressions in a CI environment.

A test dataset is a collection of **datapoints**, which are roughly analogous to unit tests or test cases in traditional programming. Each datapoint specifies inputs to your model and (optionally) some target data.

When you run an offline evaluation, Humanloop iterates through each datapoint in the dataset and triggers a fresh LLM generation using the inputs of the testcase and the model config being evaluated. For each test case, your evaluator function will be called, taking as arguments the freshly generated `log` and the `testcase` datapoint that gave rise to it. Typically, you would write your evaluator to perform some domain-specific logic to determine whether the model-generated `log` meets your desired criteria (as specified in the datapoint 'target').

## Humanloop-hosted vs. self-hosted

Conceptually, evaluation runs have two components:

1. Generation of logs from the datapoints
2. Evaluating those logs.

Using the Evaluations API, Humanloop offers the ability to generate logs either within the Humanloop runtime, or self-hosted. Similarly, evaluations of the logs can be performed in the Humanloop runtime (using evaluators that you can define in-app) or self-hosted (see our [guide on self-hosted evaluations](./self-hosted-evaluations)).

In fact, it's possible to mix-and-match self-hosted and Humanloop-runtime generations and evaluations in any combination you wish. When creating an evaluation via the API, set the `hl_generated` flag to `False` to indicate that you are posting the logs from your own infrastructure (see our [guide on evaluating externally-generated logs](./evaluating-externally-generated-logs)). Include an evaluator of type `External` to indicate that you will post evaluation results from your own infrastructure. You can include multiple evaluators on any run, and these can include any combination of `External` (i.e. self-hosted) and Humanloop-runtime evaluators.

***

title: Evaluating LLM Applications
authors: \["Peter Hayes"]
type: Blog
date: 2024-02-06
draft: false
published: true
tags: \["llm", "gpt-4", "evals"]

summary:
An overview of evaluating LLM applications. The emerging evaluation framework,
parallels to traditional software testing and some guidance on best practices.

## thumbnail: /blog/evaluating-llm-apps/EvalLLMAppsThumbnail2.png

An ever-increasing number of companies are using large language models (LLMs) to
transform both their product experiences and internal operations. These kinds of
foundation models represent a new computing platform. The process of
[prompt engineering](https://humanloop.com/blog/prompt-engineering-101) is
replacing aspects of software development and the scope of what software can
achieve is rapidly expanding.

In order to effectively leverage LLMs in production, having confidence in how
they perform is paramount. This represents a unique challenge for most companies
given the inherent novelty and complexities surrounding LLMs. Unlike traditional
software and non-generative machine learning (ML) models, evaluation is
subjective, hard to automate and the risk of the system going embarrassingly
wrong is higher.

This post provides some thoughts on evaluating LLMs and discusses some emerging
patterns I've seen work well in practice from experience with thousands of teams
deploying LLM applications in production.

# **LLMs are not all you need**

It’s important to first understand the basic makeup of what we are evaluating
when working with LLMs in production. As the models get increasingly more
powerful, a significant amount of effort is spent trying to give the model the
appropriate context and access required to solve a task.

For the current generation of models, at the core of any LLM app is usually some
combination of the following components:

* **LLM model** - the core reasoning engine; an API into OpenAI, Anthropic,
  Google, or open source alternatives like
  [Mistral](https://mistral.ai/news/mixtral-of-experts/).
* **Prompt template** - the boilerplate instructions to your model, which are
  shared between requests. This is generally versioned and managed like code
  using formats like the
  [.prompt](/docs/v4/prompt-file-format) file.
* **Data sources** - to provide the relevant context to the model; often
  referred to as retrieval augmented generation (RAG). Examples being
  traditional relational databases, graph databases, and
  [vector databases](/docs/v4/guides/set-up-semantic-search).
* **Memory** - like a data source, but that builds up a history of previous
  interactions with the model for re-use.
* **Tools** - provides access to actions like API calls and code execution
  empowering the model to interact with external systems where appropriate.
* **Agent control flow** - some form of looping logic that allows the model to
  make multiple generations to solve a task before hitting some stopping
  criteria.
* **Guardrails** - a check that is run on the output of the model before
  returning the output to the user. This can be simple logic, for example
  looking for certain keywords, or another model. Often triggering fallback to
  human-in-the-loop workflows

## **LLM apps are complex systems**

These individual components represent a large and unique design space to
navigate. The configuration of each one requires careful consideration; it's no
longer just strictly prompt engineering.

For example, take the vector database - now a mainstay for the problem of
providing the relevant chunks of context to the model, for a particular query,
from a larger corpus of documents. There is a near infinite number of open or
closed source vector stores to choose from. Then there is the embedding model
(that also has its own design choices), retrieval technique, similarity metric,
how to chunk your documents, how to sync your vector store... and the list goes
on.

Not only that, but there are often complex interactions between these components
that are hard to predict. For example, maybe the performance of your prompt
template is weirdly sensitive to the format of the separator tokens you forgot
to strip when chunking your documents in the vector database (a real personal
anecdote).

Furthermore, we're seeing applications that have multiple specialist blocks of
these components chained together to solve a task. This all adds to the
challenge of evaluating the resulting complex system. Specialist tooling is
increasingly a necessity to help teams build robust applications.

Like for testing in traditional software development, the goal of a good LLM
evaluation framework is to provide confidence that the system is working as
expected and also transparency into what might be causing issues when things go
wrong. Unlike traditional software development, a significant amount of
experimentation and collaboration is required when building with LLMs. From
prompt engineering with domain experts, to tool integrations with engineers. A
systematic way to track progress is required.

# **Take lessons from traditional software**

A large proportion of teams now building great products with LLMs aren't
experienced ML practitioners. Conveniently many of the goals and best practices
from software development are broadly still relevant when thinking about LLM
evals.

## **Automation and continuous integration is still the goal**

Competent teams will traditionally set up robust test suites that are run
automatically against every system change before deploying to production. This
is a key aspect of continuous integration (CI) and is done to protect against
regressions and ensure the system is working as the engineers expect. Test
suites are generally made up of 3 canonical types of tests: unit, integration
and end-to-end.

* **Unit** - very numerous, target a specific atom of code and are fast to run.
* **Integration** - less numerous, cover multiple chunks of code, are slower to
  run than unit tests and may require mocking external services.
* **End-to-end** - emulate the experience of an end UI user or API caller; they
  are slow to run and oftentimes need to interact with a live version of the
  system.

The most effective mix of test types for a given system often sparks debate.
Yet, the role of automated testing as part of the deployment lifecycle,
alongside the various trade-offs between complexity and speed, remain valuable
considerations when working with LLMs.

## **Unit tests are tricky for LLMs**

There are however a number of fundamental differences with LLM native products
when it comes to this type of testing. Of the test types, the most difficult to
transfer over to LLMs is the unit test because of:

* **Randomness** - LLMs produce probabilities over words which can result in
  random variation between generations for the same prompt. Certain
  applications, like task automation, require deterministic predictions. Others,
  like creative writing, demand diversity.
* **Subjectivity** - we often times want LLMs to produce natural human-like
  interactions. This requires more nuanced approaches to evaluation because of
  the inherent subjectivity of the correctness of outputs, which may depend on
  context or user preferences.
* **Cost and latency** - given the computation involved, running SOTA LLMs can
  come with a significant cost and tend to have relatively high latency;
  especially if configured as an agent that can take multiple steps.
* **Scope** - LLMs are increasingly capable of solving broader less well-defined
  tasks, resulting in the scope of what we are evaluating often being a lot more
  open-ended than in traditional software applications.

As a result, the majority of automation efforts in evaluating LLM apps take the
form of integration and end-to-end style tests and should be managed as such
within CI pipelines.

## **Observability needs to evolve**

There is also the important practice of monitoring the system in production.
Load and usage patterns in the wild can be unexpected and lead to bugs.
Traditional observability solutions like [Datadog](https://www.datadoghq.com/)
and [New Relic](https://newrelic.com/) monitor the health of the system and
provide alerts when things go wrong; usually based on simple heuristics and
error codes. This tends to fall short when it comes to LLMs. The more capable
and complex the system, the harder it can be to determine something actually
went wrong and the more important observability and traceability is.

Furthermore, one of the promises of building with LLMs is the potential to more
rapidly intervene and experiment. By tweaking instructions you can fix issues
and improve performance. Another advantage is that less technical teams can be
more involved in building; the
[makeup of the teams](https://humanloop.com/blog/how-to-build-the-right-team-for-generative-ai)
is evolving. This impacts what's needed from an observability solution in this
setting. A tighter integration between observability data and the development
environment to make changes is more beneficial, as well as usability for
collaborating with product teams and domain experts outside of engineering. This
promise of more rapid and sometimes non-technical iteration cycles also
increases the importance of robust regression testing.

Before delving more into the stages of evaluation and how they relate to
existing CI and observability concepts, it's important to understand more about
the different types of evaluations in this space.

# **Types of evaluation can vary significantly**

When evaluating one or more components of an LLM block, different types of
evaluations are appropriate depending on your goals, the complexity of the task
and available resources. Having good coverage over the components that are
likely to have an impact over the overall quality of the system is important.

These different types can be roughly characterized by the return type and the
source of, as well as the criteria for, the judgment required.

## **Judgment return types are best kept simple**

The most common judgment return types are familiar from traditional data science
and machine learning frameworks. From simple to more complex:

* **Binary** - involves a yes/no, true/false, or pass/fail judgment based on
  some criteria.
* **Categorical** - involves more than two categories; for exampling adding an
  abstain or maybe option to a binary judgment.
* **Ranking** - the relative quality of output from different samples or
  variations of the model are being ranked from best to worst based on some
  criteria. Preference based judgments are often used in evaluating the quality
  of a ranking.
* **Numerical** - involves a score, a percentage, or any other kind of numeric
  rating.
* **Text** - a simple comment or a more detailed critique. Often used when a
  more nuanced or detailed evaluation of the model's output is required.
* **Multi-task** - combines multiple types of judgment simultaneously. For
  example, a model's output could be evaluated using both a binary rating and a
  free-form text explanation.

Simple individual judgments can be easily aggregated across a dataset of
multiple examples using well known metrics. For example, for classification
problems, [precision](https://en.wikipedia.org/wiki/Accuracy_and_precision),
[recall](https://en.wikipedia.org/wiki/Precision_and_recall) and
[F1](https://en.wikipedia.org/wiki/F-score) are typical choices. For rankings,
there are metrics like
[NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain),
[Elo ratings](https://en.wikipedia.org/wiki/Elo_rating_system) and
[Kendall's Tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient).
For numerical judgments there are variations of the
[Bleu score](https://blog.modernmt.com/understanding-mt-quality-bleu-scores/).

I find that in practice binary and categorical types generally cover the
majority of use cases. They have the added benefit of being the most straight
forward to source reliably. The more complex the judgment type, the more
potential for ambiguity there is and the harder it becomes to make inferences.

## Model sourced judgments are increasingly promising

Sourcing judgments is an area where there are new and evolving patterns around
foundation models like LLMs. At Humanloop, we've standardised around the
following canonical sources:

* **Heuristic/Code** - using simple deterministic rules based judgments against
  attributes like cost, token usage, latency, regex rules on the output, etc.
  These are generally fast and cheap to run at scale.
* **Model (or 'AI')** - using other foundation models to provide judgments on
  the output of the component. This allows for more qualitative and nuanced
  judgments for a fraction of the cost of human judgments.
* **Human** - getting gold standard judgments from either end users of your
  application, or internal domain experts. This can be the most expensive and
  slowest option, but also the most reliable.

![](https://humanloop.com/blog/evaluating-llm-apps/LLMEvals.jpg)

Model judgments in particular are increasingly promising and an active research
area. The paper [Judging LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)
demonstrates that an appropriately prompted GPT-4 model achieves over 80%
agreement with human judgments when rating LLM model responses to questions on a
scale of 1-10; that's equivalent to the levels of agreement between humans.

Such evaluators can be equally effective in evaluating the important non-LLM
components, such as the retrieval component in RAG. In
[Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/pdf/2309.15217.pdf)
a GPT-3 model is tasked with extracting the most relevant sentences from the
retrieved context. A numeric judgment for relevance is then computed using the
ratio of the number of relevant to irrelevant sentences, which was also found to
be highly correlated with expert human judgments.

However, there are risks to consider. The same reasons that evaluating LLMs is
hard apply to using them as evaluators. Recent research has also shown LLMs to
have biases that can contaminate the evaluation process. In
[Benchmarking Cognitive Biases in Large Language Models as Evaluators](https://arxiv.org/pdf/2309.17012.pdf)
they measure 6 cognitive biases across 15 different LLM variations. They find
that simple details such as the order of the results presented to the model can
have material impact on the evaluation.

![](https://humanloop.com/blog/evaluating-llm-apps/BiasLLMEvals.png)

The takeaway here is that it's important to still experiment with performance on
your target use cases before trusting LLM evaluators - evaluate the evaluator!
All the usual prompt engineering techniques such as including few-shot examples
are just as applicable here. In addition, fine-tuning specialist, more
economical evaluator models using human judgements can be a real unlock.

I believe teams should consider shifting more of their human judgment efforts up
a level to focus on helping improve model evaluators. This will ultimately lead
to a more scalable, repeatable and cost-effective evaluation process. As well as
one where the human expertise can be more targeted on the most important high
value scenarios.

## **Judgment criteria is where most of the customisation happens**

The actual criteria for the judgment is what tends to be most specific to the
needs of a particular use case. This will either be defined in code, in a prompt
(or in the parameters of a model), or just in guidelines depending on whether
it's a code, model or human based evaluator.

There are lots of broad themes to crib from. Humanloop for example provides
templates for popular use cases and best practises, with the ability to
experiment and customize. There are categories like general performance
(latency, cost and error thresholds), behavioural (tone of voice, writing style,
diversity, factuality, relevance, etc.), ethical (bias, safety, privacy, etc.)
and user experience (engagement, satisfaction, productivity, etc.).

Unsurprisingly, starting with a small set of evaluators that cover the most
important criteria is wise. These can then be adapted and added to over time as
requirements are clarified and new edge cases uncovered. Trade-offs are often
necessary between these criteria. For example, a more diverse set of responses
might be more engaging, but also more likely to contain errors and higher
quality can often come at a cost in terms of latency.

Thinking about these criteria upfront for your project can be a good hack to
ensure your team deeply understand the end goals of the application.

# **Different stages of evaluation are necessary**

As discussed with the distinction between CI and observability; different stages
of the app development lifecycle will have different evaluation needs. I've
found this lifecycle to naturally still consist of some sort of planning and
scoping exercise, followed by cycles of development, deployment and monitoring.

These cycles are then repeated during the lifetime of the LLM app in order to
intervene and improve performance. The stronger the teams, the more agile and
continuous this process tends to be.

Development here will include both the typical app development; orchestrating
your LLM blocks in code, setting up your UIs, etc, as well more LLM specific
interventions and experimentation; including prompt engineering, context
tweaking, tool integration updates and fine-tuning - to name a few. Both the
choices and quality of interventions to
[optimize your LLM performance](https://humanloop.com/blog/optimizing-llms) are
much improved if the right evaluation stages are in place. It facilitates a more
data-driven, systematic approach.

From my experience there are 3 complementary stages of evaluation that are
highest ROI in supporting rapid iteration cycles of the LLM block related
interventions:

1. **Interactive** - it's useful to have an interactive playground-like editor
   environment that allows rapid experimentation with components of the model
   and provides immediate evaluator feedback. This usually works best on a
   relatively small number of scenarios. This allows teams (both technical and
   non-technical) to quickly explore the design space of the LLM app and get an
   informal sense of what works well.

2. **Batch offline** - benchmarking or regression testing the most promising
   variations over a larger curated set of scenarios to provide a more
   systematic evaluation. Ideally a range of different evaluators for different
   components of the app can contribute to this stage, some comparing against
   gold standard expected results for the task. This can fit naturally into
   existing CI processes.

3. **Monitoring online** - post deployment, real user interactions can be
   evaluated continuously to monitor the performance of the model. This process
   can drive alerts, gather additional scenarios for offline evaluations and
   inform when to make further interventions. Staging deployments through
   internal environments, or beta testing with selected cohorts of users first,
   are usually super valuable.

![](https://humanloop.com/blog/evaluating-llm-apps/EvalsStages.png)

It's usually necessary to co-evolve to some degree the evaluation framework
alongside the app development as more data becomes available and requirements
are clarified. The ability to easily version control and share across stages and
teams both the evaluators and the configuration of your app can significantly
improve the efficiency of this process.

# **High quality datasets are still paramount**

Lack of access to high quality data will undermine any good evaluation
framework. A good evaluation dataset should ideally be representative of the
full distribution of behaviours you expect to see and care about in production,
considering both the inputs and the expected outputs. It's also important to
keep in mind that coverage of the expected behaviours for the individual
components of your app is important.

Here are some strategies that I think are worth considering: leveraging
public/academic benchmarks, collecting data from your own systems and creating
synthetic data.

## **Pay attention to academic and public benchmarks**

There are well cited academic benchmarks that have been curated to evaluate the
general capabilities of LLMs. For AI leaders, these can be helpful to reference
when choosing which base models to build with originally, or to graduate to when
things like scale and cost start to factor in. For example the
[Large Model Systems Organizations](https://lmsys.org/) maintains
[Chatbot Arena](https://chat.lmsys.org/) where they have crowd-sourced over 200k
human preferences votes to rank LLMs, both commercial and open source, as well
as recording the performance on academic multi-task reasoning benchmarks like
[MMLU](https://arxiv.org/abs/2009.03300).

![](https://humanloop.com/blog/evaluating-llm-apps/ChatbotArena.png)

Another great resource in the same vein is
[Hugging Face datasets](https://huggingface.co/docs/datasets/index), where they
also maintain a leaderboard of how all the latest OSS models perform across a
range of tasks using the
[Eleuther LLM evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness).

![](https://humanloop.com/blog/evaluating-llm-apps/HuggingFaceLeaderBoard.png)

More domain specific academic datasets may also be particularly relevant for
your target use case and can be used to warm start your evaluation efforts; for
example if you were working on
[medical related tasks](https://huggingface.co/datasets/AdaptLLM/medicine-tasks).

## **Real product interactions are the most valuable source of data**

Arguably the best form of dataset comes from real user interactions. Useful
sources of this kind of data are actually the interactive and monitoring stages
discussed above.

With access to an interactive environment for prompt engineering (or a test
version of your application), internal domain experts can synthesize examples of
the kinds of interactions they expect to see in production. These interactions
should be recorded throughout the course of initial experimentation to form a
benchmark dataset for subsequent offline evaluations.

For leveraging real end-user interactions, a tighter integration between
observability data and the development environment that manages evaluations
makes it easier to curate real scenarios into your benchmark datasets over time.

![](https://humanloop.com/blog/evaluating-llm-apps/HLEditor.png)

Something worth careful consideration to maximise the impact of end-user
interactions is to set up your application to
[capture rich feedback](/docs/v4/guides/evaluation/evaluating-with-human-feedback)
from users form the start. This is an example of an online evaluator that relies
on human judgments, which can be used to filter for particularly interesting
scenarios to add to benchmark datasets.

Feedback doesn't need to be only explicit from the user; it can be provided
implicitly in the way they interact with the system. For example,
[github copilot reportedly](https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals.html#secret-sauce-3-telemetry)
monitors whether the code suggestion was accepted at various time increments
after the suggestion was made, as well as whether the user made any edits to the
suggestion before accepting it.

## **Synthetic data is on the rise**

Once you have a small amount of high quality data leveraging LLMs to generate
additional input examples can help bootstrap to larger datasets. By utilizing
few-shot prompting and including a representative subset of your existing data
within the prompt, you can guide the synthesizer model to generate a wide range
of supplementary examples.

A quick pointer here is to prompt the model to generate a batch of examples at a
time, rather than one at a time, such that you can encourage characteristics
like diversity between examples. Or, similarly, feed previously generated
examples back into your prompt. For instance, for a customer service system,
prompts could be designed to elicit responses across a variety of emotional
states, from satisfaction to frustration.

A specific example of this is model red-teaming, or synthesizing adversarial
examples. This is where you use the synthesizer model to generate examples that
are designed to break the system. For example, in
[Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286),
they uncover offensive replies, data leakage and other vulnerabilities in an LLM
chat-bot using variations of few-shot prompts to generate adversarial questions.
They also leverage a pre-trained offensive classifier to help automate their
evaluation process. However, it is worth noting they too point out the
limitations caused by LLM biases that limits diversity. They ultimately need to
generate and filter hundreds of thousands of synthetic examples.

![](https://humanloop.com/blog/evaluating-llm-apps/RedLM.png)

As with LLM evaluators, all the same rigour and tools should be applied to
evaluating the quality of the synthetic data generator model before trusting it.

# **Looking forward...**

This is a rapidly evolving area of research and practice. Here's a few areas
that I'm particularly excited about working more on at Humanloop over the coming
months that we'll touch on further in future posts:

* Increasing adoption of AI based evaluators for all components of these
  systems, with improved support for fine-tuning and specialisation happening at
  this level. The existence of OpenAI's
  [Superalignment team](https://openai.com/blog/introducing-superalignment)
  shows there is focus here on the research front.

* Supporting more multi-modal applications deployed in production, with more
  text, image, voice and even video based models coming online.

* More complex agent-based workflows and experimenting with more multi-agent
  setups and how evaluation needs to adapt to supervise these systems.

* Moving towards more end-to-end optimization for the components of these
  complex systems. A robust set of evaluators can provide an objective to
  measure performance, coupled with data synthesization to simulate the system.

At Humanloop, we've built an integrated solution for managing the development
lifecycle of LLM apps from first principles, which includes some of the
evaluation challenges discussed in this post. Please
[reach out](https://humanloop.com/demo) if you'd like to learn more.

[//]: # "## **Common pitfalls** ##"

[//]: # "- Not considering evaluation criteria early enough in the project"

[//]: # "- Relying too much on traditional data science methodologies"

[//]: # "- Being too waterfall"

[//]: # "- Not evaluating your evaluator"

[//]: # "- Not systematically capturing end user feedback"

[//]: # "- Not baking into CI"

[//]: # "- Thinking long context windows are a silver bullet"

[//]: # "- Exploding costs"

[//]: # "- Not exploring finetuning specialist evaluation models"