Run an Evaluation

In this guide, we walk through how to run an Evaluation to compare multiple Prompt versions across a Dataset, with both the Prompts and the Evaluators running on Humanloop.

Paid Feature

This feature is not available for the Free tier. Please contact us if you wish to learn more about our Enterprise plan.

An Evaluation on Humanloop brings together a Dataset, a set of Evaluators and the different versions of a Prompt you want to compare.

The Dataset contains testcases describing the inputs (and optionally the expected results) for a given task. The Evaluators define the criteria for judging the performance of the Prompts when executed using these inputs.

Each of the Prompt versions you want to compare is run against the same Dataset, producing Logs; Evaluators then provide judgements on these Logs. The Evaluation uses these judgements to produce a summary report, allowing you to systematically compare the performance of the different Prompt versions.
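To make this data flow concrete, here is a purely illustrative, plain-Python sketch of the pieces an Evaluation brings together. The field names are invented for illustration and do not reflect the Humanloop schema.

```python
# Purely illustrative, plain-Python sketch; field names are invented and
# do not reflect the Humanloop schema.

# A Dataset is a collection of testcases: inputs plus an optional expected result.
dataset = [
    {
        "inputs": {"query": "How do I create a Dataset on Humanloop?"},
        "target": "Point the user to the Datasets tab and the upload options.",
    },
    {
        "inputs": {"query": "Which model providers does Humanloop support?"},
        "target": "List the supported model providers.",
    },
]

# Each Prompt version is run against every testcase, producing one Log per pair.
# Evaluators then attach judgements to each Log, for example:
example_log = {
    "prompt_version": "Support Agent (gpt-4o)",
    "inputs": dataset[0]["inputs"],
    "output": "You can create a Dataset from the Datasets tab by uploading a CSV...",
    "judgements": {"user_satisfaction": 0.9, "latency_s": 1.2, "cost_usd": 0.004},
}
```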

Prerequisites

Before you start, you should have created on Humanloop: the Prompt you want to evaluate, a Dataset of testcases, and the Evaluators you want to use to judge performance.

Combining Evaluators

You can combine multiple types of Evaluator in a single Evaluation. For example, you might use an AI Evaluator to judge the quality of the Prompt's output and a code Evaluator to check that the output stays within your latency and cost thresholds.

For this example, we’re going to evaluate the performance of a Support Agent that responds to user queries about Humanloop’s product and documentation. Our goal is to understand which of the gpt-4o, gpt-4o-mini and claude-3-5-sonnet-20240620 base models is most appropriate for this task.

Variations of the Support Agent Prompt, each using a different base model.
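If you prefer to manage these Prompt versions from code, the sketch below shows roughly how the three variations could be created with the Humanloop Python SDK. The prompts.upsert call, its parameters and the template are assumptions for illustration; check the current SDK reference before relying on them.

```python
# Sketch only: assumes the Humanloop Python SDK exposes prompts.upsert with
# these parameters; verify against the current SDK reference.
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_HUMANLOOP_API_KEY")

# Shared template for the Support Agent; {{query}} is filled from the Dataset inputs.
template = [
    {
        "role": "system",
        "content": (
            "You are a helpful support agent for Humanloop. "
            "Answer the user's question about the product and documentation: {{query}}"
        ),
    }
]

# One Prompt version per base model we want to compare.
for model in ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-20240620"]:
    client.prompts.upsert(
        path="Support Agent",
        model=model,
        template=template,
    )
```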

Run an Evaluation via UI

For Product and AI teams, triggering Evaluations against a Dataset from within the Humanloop UI makes it easy to systematically compare the performance of different Prompt versions and make informed decisions about which to deploy.

2. Set up the Evaluation

  • Select a Dataset using +Dataset.
  • Add the Prompt versions you want to compare using +Version. Note that you can multi-select versions in the modal, which results in one column per version.
  • Add the Evaluators you want to use to judge the performance of the Prompts using +Evaluator. By default, Cost, Tokens and Latency Evaluators are pre-selected.
Log Caching

By default, the system will re-use existing Logs for the chosen Dataset, Prompts and Evaluators. This makes it easy to extend reports without paying the cost of re-running your Prompts and Evaluators.

If you want to force the system to re-run the Prompts against the Dataset producing a new batch of Logs, you can select the Manage button in the setup panel and choose +New Batch.

  • Select Save to trigger the Evaluation report. The report below the setup panel will populate with a progress bar and a pending status as the Logs are generated on Humanloop.
In-progress Evaluation report
Using your Runtime

This guide assumes both the Prompt and Evaluator Logs are generated using the Humanloop runtime. For use cases that require more flexibility, the runtime for producing Logs can instead live in your code; see our guide on Logging, which also works with our Evaluations feature. A guide on running Evaluations with Logs generated in your code is coming soon!
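As a rough sketch of that pattern, the example below generates the output with your own runtime and then sends the resulting Log to Humanloop. The prompts.log call and its parameters are assumptions for illustration; follow the Logging guide for the exact API.

```python
# Sketch only: assumes the Humanloop Python SDK exposes a prompts.log method
# with these parameters; check the Logging guide and SDK reference.
from humanloop import Humanloop
from openai import OpenAI

humanloop = Humanloop(api_key="YOUR_HUMANLOOP_API_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_API_KEY")

query = "How do I create a Dataset?"
messages = [
    {"role": "system", "content": "You are a helpful support agent for Humanloop."},
    {"role": "user", "content": query},
]

# Generate the output with your own runtime...
completion = openai.chat.completions.create(model="gpt-4o", messages=messages)
output = completion.choices[0].message.content

# ...then send the resulting Log to Humanloop so Evaluators can judge it.
humanloop.prompts.log(
    path="Support Agent",
    prompt={"model": "gpt-4o"},
    messages=messages,
    output=output,
)
```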

3. Review the results

It will generally take at least a couple of minutes for the Evaluation report to be marked as completed while the system generates all the required Prompt and Evaluator Logs.

Once the report is completed, you can review the performance of the different Prompt versions using the Evaluators you selected.

  • The top spider plot provides a summary of the average Evaluator performance across all the Prompt versions. In our case, gpt-4o, although slightly slower and more expensive on average, is significantly better when it comes to User Satisfaction.
Evaluation Spider plot
  • Below the spider plot, you can see the breakdown of performance per Evaluator.
Evaluation Evaluator stats breakdown
  • To drill into and debug the Logs that were generated, select the Logs button at the top right of the Evaluation report. This brings you to the Evaluation Logs table, where you can filter and review Logs to better understand performance and replay them in our Prompt Editor.
Drill down to Evaluation Logs.

Run an Evaluation via API

For Engineering teams, triggering Evaluations via the API makes it possible to integrate the Evaluation process into existing pipelines.

Under Development

This content is currently under development. Please refer to our V4 documentation for the current docs.
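In the meantime, the following is a hypothetical sketch of what triggering an Evaluation from code could look like with the Python SDK. The evaluations.run call, its parameters and the file paths are assumptions and may not match the released API; defer to the V4 documentation linked above.

```python
# Hypothetical sketch only: the evaluations.run call and its parameters are
# assumptions and may not match the released SDK; defer to the V4 docs.
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_HUMANLOOP_API_KEY")

evaluation = client.evaluations.run(
    name="Support Agent base model comparison",
    file={"path": "Support Agent"},               # the Prompt under evaluation
    dataset={"path": "Support Agent Testcases"},  # example Dataset path
    evaluators=[
        {"path": "User Satisfaction"},
        {"path": "Latency"},
        {"path": "Cost"},
    ],
)

print(evaluation.id, evaluation.status)
```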

Next Steps

  • Incorporate this Evaluation process into your Prompt engineering and deployment workflow.
  • Set up Evaluations where the runtime for producing Logs lives in your code; see our guide on Logging.
  • Utilise Evaluations as part of your CI/CD pipeline; a minimal gating sketch is shown below.
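As a starting point for the CI/CD use case, here is a minimal gating sketch. It assumes you have already fetched aggregate Evaluator scores for the candidate Prompt version (for example, from a completed Evaluation report); the score names and thresholds are illustrative.

```python
# Minimal CI gate sketch: the score names and thresholds are illustrative
# assumptions; wire in real aggregate scores from your Evaluation report.
import sys

THRESHOLDS = {
    "user_satisfaction": 0.8,  # minimum acceptable average score
    "latency_s": 5.0,          # maximum acceptable average latency in seconds
}


def gate(scores: dict) -> bool:
    """Return True if the candidate Prompt version passes all thresholds."""
    satisfaction_ok = scores.get("user_satisfaction", 0.0) >= THRESHOLDS["user_satisfaction"]
    latency_ok = scores.get("latency_s", float("inf")) <= THRESHOLDS["latency_s"]
    return satisfaction_ok and latency_ok


if __name__ == "__main__":
    # In a real pipeline these would come from the completed Evaluation report.
    candidate_scores = {"user_satisfaction": 0.86, "latency_s": 2.4}
    if not gate(candidate_scores):
        sys.exit("Evaluation gate failed: candidate Prompt version did not meet thresholds.")
    print("Evaluation gate passed.")
```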