Set up a code Evaluator

In this guide, we show how to create and use a code Evaluator in Humanloop.

A code Evaluator is a Python function that takes a generated Log (and optionally a testcase Datapoint, if comparing to expected results) as input and returns a judgement. The judgement is a boolean or a number that measures some criterion of the generated Log, as defined in the code.
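
As a minimal sketch of that shape, the hypothetical Evaluator below returns a number rather than a boolean. The function name `output_word_count` is illustrative only, and the sketch assumes the Log is dictionary-like with the generated text under `log["output"]`, as in the examples later in this guide.

```python
def output_word_count(log):
    """Return the number of whitespace-separated words in the generated output."""
    # Assumption: the Log is dict-like and stores the generated text under "output".
    # A missing or empty output is treated as zero words.
    output = log.get("output") or ""
    return len(output.split())
```

A numeric judgement like this can be tracked over time, whereas a boolean judgement records a pass or fail for each Log.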

Code Evaluators provide a flexible way to evaluate the performance of your AI applications, allowing you to re-use existing evaluation packages as well as define custom evaluation heuristics.

We support a fully featured Python environment; details on the supported packages can be found in the environment reference.

Prerequisites

You should have an existing Prompt to evaluate and should already have generated some Logs. If you don't, follow our guide on creating a Prompt.

In this example, we’ll reference a Prompt that categorises a user query about Humanloop’s product and docs by which feature it relates to.

An example Prompt with a variable `{{query}}`.

Create a code Evaluator

1

Create a new Evaluator

  • Click the New button at the bottom of the left-hand sidebar, select Evaluator, then select Code.
Create code evaluator.
  • Give the Evaluator a name when prompted in the sidebar, for example Category Validator.
2

Define the Evaluator code

After creating the Evaluator, you will automatically be taken to the code editor. For this example, our Evaluator will check that the feature category returned by the Prompt is from the list of allowed feature categories. We want to ensure our categoriser isn’t hallucinating new features.

  • Make sure the Mode of the Evaluator is set to Online in the options on the left.
  • Copy and paste the following code into the code editor:
Python
ALLOWED_FEATURES = [
    "Prompt Editor",
    "Model Integrations",
    "Online Monitoring",
    "Offline Evaluations",
    "Dataset Management",
    "User Management",
    "Roles Based Access Control",
    "Deployment Options",
    "Collaboration",
    "Agents and chaining"
]

def validate_feature(log):
    print(f"Full log output: \n {log['output']}")
    # Parse the final line of the log output to get the returned category
    feature = log["output"].split("\n")[-1]
    return feature in ALLOWED_FEATURES
Code Organisation

You can define multiple functions in the code editor to organise your evaluation logic. The final function defined is used as the main Evaluator entry point; it takes the Log argument and returns a valid judgement.
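
The following is a sketch of how that organisation might look, not a prescribed layout. The helper names `extract_final_line` and `is_allowed` are hypothetical, and the allowed list is shortened for illustration. Unlike the example above, this variant strips whitespace and skips trailing blank lines before checking the category.

```python
# Shortened list, for illustration only.
ALLOWED_FEATURES = ["Prompt Editor", "Model Integrations", "Online Monitoring"]


def extract_final_line(text):
    """Helper: return the last non-empty line of the text, stripped of whitespace."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def is_allowed(feature):
    """Helper: check whether the category is in the allowed list."""
    return feature in ALLOWED_FEATURES


# Defined last, so this is the function treated as the Evaluator entry point.
def validate_feature(log):
    return is_allowed(extract_final_line(log["output"]))
```

Keeping the parsing and the membership check in separate helpers makes each piece easier to exercise on its own in the debug console.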

3

Debug the code with Prompt Logs

  • In the debug console beneath where you pasted the code, click Select Prompt or Dataset and find and select the Prompt you’re evaluating. The debug console will load a sample of Logs from that Prompt.
The debug console for testing the code.
  • Click the Run button at the far right of one of the loaded Logs to trigger a debug run. This causes the code to be executed with the selected Log as input and populates the Result column.
  • Inspect the output of the executed code by selecting the arrow to the right of Result.
Inspect evaluator log in debug console.
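
Alongside the debug console, you could also exercise the function locally before pasting it in. The sketch below repeats the Evaluator from step 2 (without the `print` call) and runs it against hand-written dictionaries that stand in for real Logs. The sample outputs are invented for illustration, and only the `output` key is populated, which is an assumption about the minimum this Evaluator needs.

```python
ALLOWED_FEATURES = [
    "Prompt Editor",
    "Model Integrations",
    "Online Monitoring",
    "Offline Evaluations",
    "Dataset Management",
    "User Management",
    "Roles Based Access Control",
    "Deployment Options",
    "Collaboration",
    "Agents and chaining",
]


def validate_feature(log):
    # Parse the final line of the log output to get the returned category
    feature = log["output"].split("\n")[-1]
    return feature in ALLOWED_FEATURES


# Hand-written stand-ins for real Logs (hypothetical data).
sample_logs = [
    {"output": "The user is asking about versions.\nPrompt Editor"},
    {"output": "Telepathy Mode"},   # a hallucinated category
    {"output": "Prompt Editor\n"},  # trailing newline: the final "line" is empty
]

results = [validate_feature(log) for log in sample_logs]
print(results)  # [True, False, False]
```

The third case shows that an output ending in a newline fails the check, because the final line is then an empty string. If your Prompt can produce trailing whitespace, you may want to strip the output before splitting.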
4

Save the code

Now that you’ve validated the behaviour, save the code by selecting the Save button at the top right of the Editor and optionally provide a suitable version name and description.

5

Inspect Evaluator logs

Navigate to the Logs tab of the Evaluator to see and debug all the historic usages of this Evaluator.

Evaluator logs table.

Monitor a Prompt

Now that you have an Evaluator, you can use it to monitor the performance of your Prompt by linking it so that it is automatically run on new Logs.

1

Link the Evaluator to the Prompt

  • Navigate to the Dashboard of your Prompt.
  • Select the Monitoring button above the graph and select Connect Evaluators.
  • Find and select the Evaluator you just created and click Choose.
Select Evaluator for monitoring.
Linking Evaluators for Monitoring

You can link to a deployed version of the Evaluator by choosing an environment, such as production, or you can link to a specific version of the Evaluator. If you want changes deployed to your Evaluator to be automatically reflected in Monitoring, link to the environment; otherwise, link to a specific version.

This linking results in:

  • An additional graph on your Prompt dashboard showing the Evaluator results over time.
  • An additional column in your Prompt Versions table showing the aggregated Evaluator results for each version.
  • An additional column in your Logs table showing the Evaluator results for each Log.

2

Generate new Logs

Navigate to the Editor tab of your Prompt and generate a new Log by entering a query and clicking Run.

3

Inspect the Monitoring results

Navigate to the Logs tab of your Prompt and see the result of the linked Evaluator against the new Log. You can filter on this value in order to create a Dataset of interesting examples.

See the results of monitoring on your logs.

Evaluating a Dataset

When running a code Evaluator on a Dataset, you can compare a generated Log to each Datapoint’s target. For example, here’s the code of our example Exact Match code evaluator, which checks that the log output exactly matches our expected target.

Python
def exact_match(log, testcase):
    target = testcase["target"]["output"]
    generation = log["output"]

    return target == generation
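
Exact matching can be stricter than you need when outputs differ only in case or whitespace. As a hedged variation, the sketch below returns a similarity score between 0.0 and 1.0 using `difflib.SequenceMatcher` from the Python standard library. The function name `fuzzy_match` is hypothetical, and the sketch assumes the same `testcase["target"]["output"]` and `log["output"]` structure as the Exact Match example.

```python
from difflib import SequenceMatcher


def fuzzy_match(log, testcase):
    """Return a similarity score in [0.0, 1.0] instead of a strict boolean."""
    target = testcase["target"]["output"]
    generation = log["output"]

    # Normalise case and surrounding whitespace so trivial differences don't count.
    target_norm = target.strip().lower()
    generation_norm = generation.strip().lower()

    # ratio() returns 1.0 for identical strings and lower values as they diverge.
    return SequenceMatcher(None, target_norm, generation_norm).ratio()
```

Because this returns a number rather than a boolean, results can be compared by degree of closeness to the target rather than as a simple pass or fail.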

Next steps

  • Explore AI Evaluators and Human Evaluators to complement your code-based judgements for more qualitative and subjective criteria.
  • Combine your Evaluator with a Dataset to run Evaluations to systematically compare the performance of different versions of your AI application.