Set up a code Evaluator

In this guide we will show how to create and use a code Evaluator in Humanloop.

A code Evaluator is a Python function that takes a generated Log (and, optionally, a testcase Datapoint when comparing against expected results) as input and returns a judgement. The judgement is a boolean or a number that measures some criterion of the generated Log, as defined in the code.
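
As a minimal sketch of the two-argument form, the function below compares a Log's output with an expected value taken from a Datapoint. The parameter name `testcase` and the `target` and `output` keys on it are assumptions about the input shape, not something this guide confirms; check the environment reference for the actual fields.

```python
def exact_match(log, testcase):
    # Assumed shape: the expected answer is stored at testcase["target"]["output"].
    expected = testcase["target"]["output"]
    generated = log["output"]
    # Boolean judgement: True when the texts match, ignoring surrounding whitespace.
    return generated.strip() == expected.strip()
```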

Code Evaluators provide a flexible way to evaluate the performance of your AI applications, allowing you to re-use existing evaluation packages as well as define custom evaluation heuristics.
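
A judgement can also be a number rather than a boolean. The sketch below reuses Python's standard-library `difflib` to score how closely the generated output resembles an expected string. It assumes `difflib` is available in the Evaluator environment and, as a hypothetical input shape, that the expected text lives at `testcase["target"]["output"]`.

```python
from difflib import SequenceMatcher

def similarity_score(log, testcase):
    # Assumed shape: the expected text is stored at testcase["target"]["output"].
    expected = testcase["target"]["output"]
    generated = log["output"]
    # ratio() returns a float between 0 and 1; 1.0 means the strings are identical.
    return SequenceMatcher(None, generated, expected).ratio()
```

A numeric score like this suits criteria where near misses should count for something, whereas a boolean suits strict pass or fail checks.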

We support a fully featured Python environment; details on the supported packages can be found in the environment reference.

Prerequisites

You should have an existing Prompt to evaluate and have already generated some Logs. If you have not, follow our guide on creating a Prompt.

In this example, we’ll reference a Prompt that categorises a user query about Humanloop’s product and docs by which feature it relates to.

An example Prompt with a variable `{{query}}`.

Create a code Evaluator

1

Create a new Evaluator

  • Click the New button at the bottom of the left-hand sidebar, select Evaluator, then select Code.
Create code evaluator.
  • Give the Evaluator a name when prompted in the sidebar, for example Category Validator.
2

Define the Evaluator code

After creating the Evaluator, you will automatically be taken to the code editor. For this example, our Evaluator will check that the feature category returned by the Prompt is from the list of allowed feature categories. We want to ensure our categoriser isn’t hallucinating new features.

  • Make sure the Mode of the Evaluator is set to Online in the options on the left.
  • Copy and paste the following code into the code editor:
Python
ALLOWED_FEATURES = [
    "Prompt Editor",
    "Model Integrations",
    "Online Monitoring",
    "Offline Evaluations",
    "Dataset Management",
    "User Management",
    "Roles Based Access Control",
    "Deployment Options",
    "Collaboration",
    "Agents and chaining"
]

def validate_feature(log):
    print(f"Full log output: \n {log['output']}")
    # Parse the final line of the log output to get the returned category
    feature = log["output"].split("\n")[-1]
    return feature in ALLOWED_FEATURES
Code Organisation

You can define multiple functions in the code editor to organise your evaluation logic. The last function defined is used as the main Evaluator entry point; it takes the Log argument and returns a valid judgement.
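
A sketch of this layout, with a helper function followed by the entry point. The helper `extract_feature` is a hypothetical name introduced for illustration, and the allowed list is shortened. Note that this variant skips trailing blank lines and trims whitespace, so it is slightly more lenient than the example above.

```python
# Shortened list for illustration only.
ALLOWED_FEATURES = {"Prompt Editor", "Online Monitoring", "Dataset Management"}

def extract_feature(output):
    # Helper: return the last non-empty line of the output, trimmed of whitespace.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def validate_feature(log):
    # Defined last, so this function serves as the Evaluator entry point.
    return extract_feature(log["output"]) in ALLOWED_FEATURES
```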

3

Debug the code with Prompt Logs

  • In the debug console beneath where you pasted the code, click Select Prompt or Dataset and find and select the Prompt you’re evaluating. The debug console will load a sample of Logs from that Prompt.
The debug console for testing the code.
  • Click the Run button at the far right of one of the loaded Logs to trigger a debug run. This causes the code to be executed with the selected Log as input and populates the Result column.
  • Inspect the output of the executed code by selecting the arrow to the right of Result.
Inspect evaluator log in debug console.
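
If you want to sanity-check the logic before pasting it into the editor, a debug run can be roughly approximated locally. The sample Logs below are hypothetical and contain only an `output` field, whereas real Logs carry more fields; the allowed list is shortened for brevity.

```python
# Shortened list for illustration only.
ALLOWED_FEATURES = ["Prompt Editor", "Online Monitoring"]

def validate_feature(log):
    # Same logic as the Evaluator: check the final line of the output.
    feature = log["output"].split("\n")[-1]
    return feature in ALLOWED_FEATURES

# Hypothetical sample Logs standing in for those loaded by the debug console.
sample_logs = [
    {"output": "The query is about editing.\nPrompt Editor"},
    {"output": "The query is about billing.\nBilling Portal"},
]

results = [validate_feature(log) for log in sample_logs]
print(results)  # [True, False]
```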
4

Commit the code

Now that you’ve validated the behaviour, commit the code by selecting the Commit button at the top right of the Editor and providing a suitable commit message describing your changes.

5

Inspect Evaluator logs

Navigate to the Logs tab of the Evaluator to see and debug all past runs of this Evaluator.

Evaluator logs table.

Monitor a Prompt

Now that you have an Evaluator, you can use it to monitor the performance of your Prompt by linking it so that it is automatically run on new Logs.

2

Generate new Logs

Navigate to the Editor tab of your Prompt and generate a new Log by entering a query and clicking Run.

3

Inspect the Monitoring results

Navigate to the Logs tab of your Prompt and see the result of the linked Evaluator against the new Log. You can filter on this value in order to create a Dataset of interesting examples.

See the results of monitoring on your logs.

Next steps

  • Explore AI Evaluators and Human Evaluators to complement your code-based judgements for more qualitative and subjective criteria.
  • Combine your Evaluator with a Dataset to run Evaluations to systematically compare the performance of different versions of your AI application.