Run an Evaluation via the UI

In this guide, we will walk through how to evaluate several versions of a Prompt and compare the quality and performance of each.

An Evaluation on Humanloop combines a Dataset, a set of Evaluators, and the different versions of a Prompt that you want to compare.

The Dataset contains datapoints describing the inputs (and optionally the expected results) for a given task. The Evaluators define the criteria for judging the performance of the Prompts when executed using these inputs.

Prompts, when evaluated, produce Logs. These Logs are then judged by the Evaluators. You can view a summary of the Evaluators' judgments to systematically compare the performance of the different Prompt versions.
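The relationship between Datasets, Prompts, Logs and Evaluators can be sketched in plain Python. This is only a conceptual model, not the Humanloop SDK: the `Datapoint` and `Log` classes, the `run_prompt` stand-in (which fakes a model call) and both evaluator functions are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass
class Datapoint:
    inputs: Dict[str, str]
    target: Optional[str] = None  # the expected result is optional


@dataclass
class Log:
    version: str
    datapoint: Datapoint
    output: str


def run_prompt(version: str, datapoint: Datapoint) -> Log:
    # Hypothetical stand-in for executing a Prompt version against a model.
    question = datapoint.inputs["question"]
    return Log(version, datapoint, f"[{version}] answer to: {question}")


def mentions_question(log: Log) -> bool:
    # Toy Evaluator: does the output refer back to the question?
    return log.datapoint.inputs["question"] in log.output


def is_concise(log: Log) -> bool:
    # Toy Evaluator: is the output shorter than 60 characters?
    return len(log.output) < 60


def run_evaluation(
    versions: List[str],
    dataset: List[Datapoint],
    evaluators: Dict[str, Callable[[Log], bool]],
) -> Dict[str, Dict[str, float]]:
    """Produce one Log per (version, datapoint), then average each Evaluator."""
    summary: Dict[str, Dict[str, float]] = {}
    for version in versions:
        logs = [run_prompt(version, dp) for dp in dataset]
        summary[version] = {
            name: mean(float(judge(log)) for log in logs)
            for name, judge in evaluators.items()
        }
    return summary


dataset = [
    Datapoint({"question": "How do I create a Dataset?"}),
    Datapoint({"question": "What is an Evaluator?"}, target="A judging function"),
]
evaluators = {"mentions_question": mentions_question, "is_concise": is_concise}
summary = run_evaluation(["v1", "v2"], dataset, evaluators)
print(summary)
```

The summary has one entry per Prompt version and one average per Evaluator, which is the shape of comparison the Evaluation views present.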

Prerequisites

  • A set of Prompt versions you want to compare - see the guide on creating Prompts.
  • A Dataset containing datapoints for the task - see the guide on creating a Dataset.
  • At least one Evaluator to judge the performance of the Prompts - see the guides on creating Code, AI and Human Evaluators.

Run an Evaluation via the UI

For this example, we’re going to evaluate the performance of a Support Agent that responds to user queries about Humanloop’s product and documentation. Our goal is to understand which of the base models gpt-4o, gpt-4o-mini and claude-3-5-sonnet-20241022 is most appropriate for this task.

1

Navigate to the Evaluations tab of your Prompt

  • Go to the Prompt you want to evaluate and click the Evaluations tab at the top of the page.
  • Click the Evaluate button in the top right to create a new Evaluation.
  • Click the +Run button in the top right to create a new Evaluation Run.
Prompt Evaluations Run tab.
2

Set up an Evaluation Run

  • Select a Dataset using +Dataset.
  • Add the Prompt versions you want to compare using +Prompt.
  • Add the Evaluators you want to use to judge the performance of the Prompts using +Evaluator.
Log Caching

By default, the system re-uses Logs if they already exist for the chosen Dataset, Prompts and Evaluators. This makes it easy to extend an Evaluation Run without paying the cost of re-running your Prompts and Evaluators.

If you want to force the system to re-run the Prompts against the Dataset and produce a new batch of Logs, click the regenerate button next to the Logs count.
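The caching behaviour can be pictured with a small sketch. How Humanloop keys its cache internally is not stated in this guide, so the `(version, datapoint id)` key, the `generate_logs` function and its `regenerate` flag are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed cache key: (Prompt version, datapoint id). Hypothetical, for illustration only.
LogCache = Dict[Tuple[str, str], str]


def generate_logs(
    versions: List[str],
    datapoint_ids: List[str],
    run_prompt: Callable[[str, str], str],
    cache: LogCache,
    regenerate: bool = False,
) -> int:
    """Fill the cache with Logs and return how many Prompt calls were made."""
    calls = 0
    for version in versions:
        for dp_id in datapoint_ids:
            key = (version, dp_id)
            # Re-use an existing Log unless regeneration is forced.
            if regenerate or key not in cache:
                cache[key] = run_prompt(version, dp_id)
                calls += 1
    return calls


def fake_run(version: str, dp_id: str) -> str:
    # Hypothetical stand-in for a model call.
    return f"{version} output for {dp_id}"


cache: LogCache = {}
datapoints = ["dp1", "dp2"]

first = generate_logs(["gpt-4o", "gpt-4o-mini"], datapoints, fake_run, cache)
# Extending the Run with a third version only generates Logs for the new version.
extended = generate_logs(
    ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-20241022"], datapoints, fake_run, cache
)
# Forcing regeneration re-runs every version against every datapoint.
forced = generate_logs(
    ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-20241022"],
    datapoints,
    fake_run,
    cache,
    regenerate=True,
)
print(first, extended, forced)  # prints "4 2 6"
```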

  • Click Save. Humanloop will start generating Logs for the Evaluation.
In progress Evaluation run
Using your Runtime

This guide assumes that both the Prompt Logs and the Evaluator Logs are generated using the Humanloop runtime. For use cases that need more flexibility, the runtime that produces Logs can instead live in your code - see our guide on Logging, which also works with our Evaluations feature. A guide on running Evaluations with Logs generated in your code is coming soon.

3

Review the results

Once the Logs are produced, you can review the performance of the different Prompt versions by navigating to the Stats tab.

  • The spider plot at the top summarizes the average Evaluator performance across all the Prompt versions. In our case, gpt-4o, although slightly slower and more expensive on average, is significantly better when it comes to User Satisfaction.
Evaluation Spider plot
  • Below the spider plot, you can see the breakdown of performance per Evaluator.
Evaluation Evaluator stats breakdown
  • To drill into and debug the Logs that were generated, navigate to the Review tab at the top left of the Run page. The Review view allows you to better understand performance and replay Logs in our Prompt Editor.
Drill down to Evaluation Logs.
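The per-version averages that the Stats tab summarizes can be reproduced on paper with a short aggregation. The sketch below is an assumption-laden illustration: the `judgments` rows are invented numbers, not results from this guide, and the "Latency (s)" Evaluator name and `average_by_version` function are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (Prompt version, Evaluator name, judgment value). Invented values for illustration.
judgments: List[Tuple[str, str, float]] = [
    ("gpt-4o", "User Satisfaction", 5.0),
    ("gpt-4o", "User Satisfaction", 4.0),
    ("gpt-4o", "Latency (s)", 2.0),
    ("gpt-4o", "Latency (s)", 3.0),
    ("gpt-4o-mini", "User Satisfaction", 3.0),
    ("gpt-4o-mini", "User Satisfaction", 3.0),
    ("gpt-4o-mini", "Latency (s)", 1.0),
    ("gpt-4o-mini", "Latency (s)", 2.0),
]


def average_by_version(
    rows: List[Tuple[str, str, float]],
) -> Dict[str, Dict[str, float]]:
    """Average every Evaluator's judgments separately for each Prompt version."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for version, evaluator, value in rows:
        grouped[(version, evaluator)].append(value)

    stats: Dict[str, Dict[str, float]] = {}
    for (version, evaluator), values in grouped.items():
        stats.setdefault(version, {})[evaluator] = mean(values)
    return stats


stats = average_by_version(judgments)
for version, per_evaluator in stats.items():
    print(version, per_evaluator)
```

Each inner dictionary corresponds to one Prompt version's set of averages, one value per Evaluator, which is the kind of trade-off (here, satisfaction against latency) the spider plot makes visible.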

Next Steps

  • Incorporate this Evaluation process into your Prompt engineering and deployment workflow.
  • Set up Evaluations where the runtime for producing Logs lives in your code - see our guide on Logging.
  • Use Evaluations as part of your CI/CD pipeline.