
Manage multiple reviewers

How to split a review between multiple subject-matter experts to effectively and quickly evaluate your AI product.

Who is this for: This guide is for large teams that want to leverage their internal subject matter experts (SMEs) to evaluate the performance of their AI features.

Prerequisites

  • You have set up Evaluators. If not, follow our guide to create a Human Evaluator.
  • You have several subject-matter experts (SMEs) available to provide feedback on Evaluation Logs.

Divide work between SMEs

When you have a large Dataset to evaluate, it’s helpful to split the work between your SMEs to ensure that the evaluation is completed quickly and effectively.

1. Split the Dataset into chunks

Each Dataset consists of datapoints. Add an identifier to each datapoint to group them into chunks.

For example, we created a Dataset with 100 common customer support questions. In the CSV file, we added an identifier called “chunk” to each datapoint, splitting the whole Dataset into 10 equal parts.
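If your CSV does not yet have a chunk identifier, a short script can add one before you upload. The following is a minimal sketch that uses only the Python standard library; the helper names (`add_chunk_column`, `chunk_csv`) and the file paths are hypothetical and are not part of Humanloop. It assumes the CSV has a header row and that contiguous, near-equal chunks are acceptable.

```python
import csv


def add_chunk_column(rows, num_chunks, column="chunk"):
    """Return copies of `rows` with a 1-based chunk id in `column`.

    Rows are split into contiguous, near-equal parts. The input rows
    are not modified.
    """
    total = len(rows)
    chunked = []
    for index, row in enumerate(rows):
        # Integer arithmetic maps row positions 0..total-1 onto 1..num_chunks.
        chunk_id = index * num_chunks // total + 1
        chunked.append({**row, column: chunk_id})
    return chunked


def chunk_csv(src_path, dst_path, num_chunks=10, column="chunk"):
    """Read a CSV with a header row and write a copy with a chunk column."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if column not in fieldnames:
        fieldnames.append(column)

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(add_chunk_column(rows, num_chunks, column))
```

For instance, `chunk_csv("questions.csv", "questions_chunked.csv", num_chunks=10)` (hypothetical file names) would give a 100-row file ten chunks of ten datapoints each, matching the example above.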

To upload this CSV to Humanloop, create a new Dataset file, then click the Upload CSV button.

Alternatively, you can upload the Dataset via our SDK.

2. Run an Evaluation

Navigate to a Prompt you want to evaluate and create a new Evaluation Run.

Example of running an Evaluation on a Prompt.

3. Split the workload between SMEs

To split the workload between your SMEs, navigate to the Review tab, turn on Focus mode, and click on the Filters button. Filter the Dataset by identifiers, such as “chunk”, to split the review work into smaller pieces.
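Before you apply the filters, it can help to decide which SME reviews which chunk. The sketch below distributes chunk identifiers across reviewers in round-robin order so that no one receives more than one chunk above anyone else. The function name `assign_chunks` and the reviewer names are hypothetical; this is a planning aid outside Humanloop, not a platform feature.

```python
def assign_chunks(chunk_ids, reviewers):
    """Distribute chunk ids across reviewers in round-robin order.

    Returns a dict that maps each reviewer to the list of chunk ids
    they should filter on in the Review tab.
    """
    if not reviewers:
        raise ValueError("at least one reviewer is required")

    assignments = {name: [] for name in reviewers}
    for position, chunk_id in enumerate(chunk_ids):
        reviewer = reviewers[position % len(reviewers)]
        assignments[reviewer].append(chunk_id)
    return assignments


# Ten chunks shared between three hypothetical reviewers.
plan = assign_chunks(range(1, 11), ["alice", "bob", "carol"])
for reviewer, chunks in plan.items():
    print(reviewer, chunks)
```

Each reviewer then receives one filtered link per assigned chunk, as described in the next step.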

4. Send the URL to your SMEs

After you have filtered the Dataset, copy the URL and send it to your SME. When they open the link, they will only see the relevant chunk of the Dataset.

The view that an SME will see when they open the link. It shows only the chunk of the Dataset that is relevant to them.

5. Monitor progress

As the SMEs provide judgments on the outputs, we display the overall progress and the number of outstanding judgments. When the final judgment is given, the Evaluation is marked as complete.

Improve the Prompt

With judgments from your SMEs, you can now better understand the model’s performance and iterate on your Prompt to improve the model outputs.

Completed Evaluation with judgments from SMEs.

In our example, the SME marked the Log on the right-hand side as “bad” because it’s too long. To take action, click on the Log ID above the Log output to open the Log drawer. In the drawer, click on the Editor -> button to load this Log in the Prompt Editor. Now, modify the instructions to explicitly state that the model should provide a concise answer.

Next steps

We’ve successfully split the work among multiple SMEs to effectively evaluate the performance of our AI product. Explore next:

  • If your SMEs gave negative judgments on the Logs, see our guide on Comparing and Debugging Prompts.
  • Find out more about Human Evaluators to capture feedback that is most relevant to your use case.