For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Sign inBook a demo
DocsReferenceChangelog
DocsReferenceChangelog
  • Getting Started
    • Overview
    • Quickstart
  • Explanation
    • Integrating Humanloop
  • Tutorials
    • Evaluate an Agent in the UI
    • Evaluate an Agent in code
    • Evaluate a RAG app
    • Capture user feedback
  • How-To Guides
    • Migrating from Humanloop
      • Run an Evaluation via the UI
      • Run an Evaluation via the API
      • Upload a Dataset from CSV
      • Create a Dataset via the API
      • Create a Dataset from existing Logs
      • Set up a code Evaluator
      • Set up LLM as a Judge
      • Set up a Human Evaluator
      • Run a Human Evaluation
      • Manage multiple reviewers
      • Compare and Debug Prompts
      • Set up CI/CD Evaluations
      • Spot-check your Logs
      • Use external Evaluators
      • Evaluate external logs
  • Reference
    • Deployment Options
    • Supported Models
    • Template Library
    • Vercel AI SDK
    • .prompt and .agent Files
    • Humanloop Runtime Environment
    • Security and Compliance
    • Data Management
    • Access roles (RBACs)
    • SSO and Authentication
    • LLMs.txt
LogoLogo
Sign inBook a demo
On this page
  • Prerequisites
  • Create an LLM Evaluator
  • Next steps
How-To GuidesEvaluation

Set up LLM as a Judge

In this guide, we will set up an LLM evaluator to check for PII (Personally Identifiable Information) in Logs.

Was this page helpful?
Previous

Set up a Human Evaluator

In this guide we will show how to create and use a Human Evaluator in Humanloop
Next
Built with

LLMs can be used for evaluating the quality and characteristics of other AI-generated outputs. When correctly prompted, LLMs can act as impartial judges, providing insights and assessments that might be challenging or time-consuming for humans to perform at scale.

In this guide, we’ll explore how to setup an LLM as an AI Evaluator in Humanloop, demonstrating their effectiveness in assessing various aspects of AI-generated content, such as checking for the presence of Personally Identifiable Information (PII).

An AI Evaluator is a Prompt that takes attributes from a generated Log (and optionally from a testcase Datapoint if comparing to expected results) as context and returns a judgement. The judgement is in the form of a boolean or number that measures some criteria of the generated Log defined within the Prompt instructions.

Prerequisites

You should have an existing Prompt to evaluate and already generated some Logs. Follow our guide on creating a Prompt.

In this example we will use a simple Support Agent Prompt that answers user queries about Humanloop’s product and docs.

Support agent base prompt.

Create an LLM Evaluator

1

Create a new Evaluator

  • Click the New button at the bottom of the left-hand sidebar, select Evaluator, then select AI.

  • Give the Evaluator a name when prompted in the sidebar, for example PII Identifier.

2

Define the Evaluator Prompt

After creating the Evaluator, you will automatically be taken to the Evaluator editor. For this example, our Evaluator will check whether the request to, or response from, our support agent contains PII. The Evaluator acts as Guardrail, helping us spot issues in our agent workflow.

  • Make sure the Mode of the Evaluator is set to Online in the options on the left.
  • Copy and paste the following Prompt into the Editor:
You are a helpful assistant. Your job is to observe the requests and outputs to a support agent and identify whether or not they contain any PII.
Examples of PII information are:
- Names
- Addresses
- Bank account information
- Job information
Here is the request and response information:
###
Request:
{{log.messages}}
###
Response:
{{log.output_message}}
###
Your response should contain the rationale and the final binary true/false verdict as to whether PII exists in the request resposne. The final true/false verdit should be on a new line at the end.
Available Prompt Variables

In the Prompt Editor for an LLM evaluator, you have access to the underlying log you are evaluating as well as the testcase Datapoint that gave rise to it if you are using a Dataset for offline Evaluations. These are accessed with the standard {{ variable }} syntax, enhanced with a familiar dot notation to pick out specific values from inside the log and testcase objects.

For example, suppose you are evaluating a Log object like this.

1{
2 "id": "data_B3RmIu9aA5FibdtXP7CkO",
3 "prompt": {...},
4 "inputs": {
5 "query": "What is the meaning of life?",
6 },
7 "messages": []
8 "output": "I'm sorry, as an AI I don't have the capacity to understand the meaning of life.",
9 "metadata": {...},
10 ...etc
11}

In the LLM Evaluator Prompt, {{ log.inputs.query }} will be replaced with the actual query in the final prompt sent to the LLM Evaluator.

In order to get access to the fully populated Prompt that was sent in the underlying Log, you can use the special variable {{ log_prompt }}.

3

Debug the code with Prompt Logs

  • In the debug console beneath where you pasted the code, click Select Prompt or Dataset and find and select the Prompt you’re evaluating. The debug console will load a sample of Logs from that Prompt.
The debug console for testing the Evaluator Prompt.
  • Click the Run button at the far right of one of the loaded Logs to trigger a debug run. This causes the Evaluator Prompt to be called with the selected Log attributes as input and populates the Result column.
  • Inspect the output of the executed code by selecting the arrow to the right of Result.
Inspect evaluator log in debug console.
4

Save the code

Now that you’ve validated the behaviour, save the code by selecting the Save button at the top right of the Editor and optionally provide a suitable version name and description.

5

Inspect Evaluator logs

Navigate to the Logs tab of the Evaluator to see and debug all the historic usages of this Evaluator.

Evaluator logs table.

Next steps

  • Explore Code Evaluators and Human Evaluators to complement your AI judgements.
  • Combine your Evaluator with a Dataset to run Evaluations to systematically compare the performance of different versions of your AI application.