A code Evaluator is a Python function that takes a generated Log as input (and optionally a testcase Datapoint, if comparing to expected results) and returns a judgement. The judgement is a boolean or a number that measures how well the generated Log meets a criterion defined in the code.
Code Evaluators provide a flexible way to evaluate the performance of your AI applications, allowing you to re-use existing evaluation packages as well as define custom evaluation heuristics.
We support a fully featured Python environment; details on the supported packages can be found in the environment reference.
You should have an existing Prompt to evaluate and have already generated some Logs. If you don't have one yet, follow our guide on creating a Prompt.
In this example, we’ll reference a Prompt that categorises a user query about Humanloop’s product and docs by which feature it relates to.

Create a new code Evaluator and name it Category Validator. After creating the Evaluator, you will automatically be taken to the code editor. For this example, our Evaluator will check that the feature category returned by the Prompt is from the list of allowed feature categories. We want to ensure our categoriser isn’t hallucinating new features.
You can define multiple functions in the code editor to organize your evaluation logic. The final function defined is used as the main Evaluator entry point: it takes the Log as its argument and returns a valid judgement.
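As a rough sketch of what such an Evaluator might look like, the code below uses a helper function plus a final entry-point function. It assumes the Log is passed in as a dictionary whose `output` key holds the generated text, and that the Prompt writes the category on the last line of its output. The `ALLOWED_FEATURES` list and the function names are hypothetical placeholders, not the exact code from the product.

```python
# Hypothetical allow-list: replace with the categories your Prompt may return.
ALLOWED_FEATURES = [
    "Prompt Editor",
    "Model Integrations",
    "Online Monitoring",
    "Offline Evaluations",
    "Dataset Management",
]


def extract_feature(output: str) -> str:
    """Helper: treat the last non-empty line of the output as the category."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def validate_feature(log: dict) -> bool:
    """Entry point (the final function defined).

    Assumes the Log is a dict with an "output" key containing the generated
    text. Returns True only if the category is in the allow-list.
    """
    feature = extract_feature(log.get("output") or "")
    return feature in ALLOWED_FEATURES
```

Because `validate_feature` is defined last, it would be treated as the entry point; `extract_feature` only organizes the parsing logic. A missing or empty output yields `False` rather than raising an error.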


Now that you have an Evaluator, you can use it to monitor the performance of your Prompt by linking it so that it is automatically run on new Logs.

You can link to a deployed version of the Evaluator by choosing an
environment, such as production, or you can link to a specific version of the
Evaluator. If you want changes deployed to your Evaluator to be automatically
reflected in Monitoring, link to the environment; otherwise, link to a specific
version.
This linking results in:
- An additional graph on your Prompt dashboard showing the Evaluator results over time.
- An additional column in your Prompt Versions table showing the aggregated Evaluator results for each version.
- An additional column in your Logs table showing the Evaluator results for each Log.
Navigate to the Editor tab of your Prompt and generate a new Log by entering a query and clicking Run.
Navigate to the Logs tab of your Prompt and see the result of the linked Evaluator against the new Log. You can filter on this value in order to create a Dataset of interesting examples.

When running a code Evaluator on a Dataset, you can compare a generated Log to each Datapoint’s target. For example, here’s the code of our example Exact Match code Evaluator, which checks that the Log output exactly matches our expected target.
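A minimal sketch of an exact-match check is shown here. It assumes the Evaluator receives the Log and the testcase Datapoint as dictionaries, with the generated text under `log["output"]` and the expected text under `testcase["target"]["output"]`. These key names are assumptions about the data shape, so adjust them to match your own Logs and Datapoints.

```python
def exact_match(log: dict, testcase: dict) -> bool:
    """Return True if the generated output equals the Datapoint's target.

    Assumed shapes (not confirmed by this page):
      log      -> {"output": "<generated text>", ...}
      testcase -> {"target": {"output": "<expected text>"}, ...}
    """
    target = (testcase.get("target") or {}).get("output")
    generation = log.get("output")

    # Design choice in this sketch: a missing target or a missing
    # generation counts as a failed match rather than an error.
    if target is None or generation is None:
        return False

    return target == generation
```

The comparison is strict: differences in whitespace or casing cause the match to fail. If that is too strict for your use case, normalise both strings (for example with `.strip()`) before comparing.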