In this guide, we will demonstrate how to create and use online evaluators to observe the performance of your models.

Paid Feature

This feature is not available for the Free tier. Please contact us if you wish to learn more about our Enterprise plan.

Create an online evaluator

Prerequisites

  • You need to have access to evaluations.
  • You also need to have a Prompt – if not, please follow our Prompt creation guide.
  • Finally, you need at least a few logs in your project. Use the Editor to generate some logs if you don’t have any yet.

To set up an online Python evaluator:

1

Go to the Evaluations page in one of your projects and select the Evaluators tab

2

Select + New Evaluator and choose Code Evaluator in the dialog

Selecting the type of a new evaluator
3

From the library of presets on the left-hand side, we’ll choose Valid JSON for this guide. You’ll see a pre-populated evaluator with Python code that checks that the output of our model is grammatically valid JSON.

The evaluator editor after selecting **Valid JSON** preset
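As a rough idea of what this kind of check involves (a sketch, not necessarily the exact preset code), a Python evaluator along these lines attempts to parse the log's output and returns whether parsing succeeded. It assumes the log is passed in as a dictionary with an `output` field, which you can confirm in the debug console described in the next step:

```python
import json

def evaluator(log):
    # `log` is the log dictionary shown in the debug console; `output`
    # is assumed to hold the model's raw text output.
    try:
        json.loads(log.get("output", ""))
        return True
    except json.JSONDecodeError:
        return False
```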
4

In the debug console at the bottom of the dialog, click Random logs from project. The console will be populated with five datapoints from your project.

The debug console (you can resize this area to make it easier to view the logs)
5

Click the Run button at the far right of one of the log rows. After a moment, you’ll see the Result column populated with a True or False.

The **Valid JSON** evaluator returned `True` for this particular log, indicating the text output by the model was grammatically correct JSON.
6

Explore the log dictionary in the table to understand what is available on the Python object passed into the evaluator.
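For example, here is a sketch of an evaluator that uses more than just the output. The field names are assumptions; confirm them against the log dictionary shown in the debug console for your project:

```python
def evaluator(log):
    # Field names below are assumptions; inspect the log dictionary in
    # the debug console to confirm which keys your project's logs contain.
    output = log.get("output", "")   # the model's text output
    inputs = log.get("inputs", {})   # the template inputs for this log
    # Illustrative check: every input value should appear in the output.
    return all(str(value) in output for value in inputs.values())
```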

7

Click Create on the left side of the page.

Activate an evaluator for a project

1

On the new **Valid JSON** evaluator in the Evaluators tab, toggle the switch to on. The evaluator is now activated for the current project.

Activating the new evaluator to run automatically on your project.
2

Go to the Editor, and generate some fresh logs with your model.

3

Over in the Logs tab you’ll see the new logs. The Valid JSON evaluator runs automatically on these new logs, and the results are displayed in the table.

The **Logs** table includes a column for each activated evaluator in your project. Each activated evaluator runs on any new logs in the project.

Track the performance of models

Prerequisites

  • A Humanloop Prompt with a reasonable amount of data.
  • An Evaluator activated in that project.

To track the performance of different versions of your Prompts:

1

Go to the Dashboard tab.

In the table at the bottom, choose a subset of the project’s model configs.

2

Use the graph controls

At the top of the page, select the date range and time granularity of interest.

3

Review the relative performance

For each activated Evaluator shown in the graphs, you can see the relative performance of the model configs you selected.

Available Modules

The following Python modules are available to be imported in your code evaluators:

  • re
  • math
  • random
  • datetime
  • json (useful for validating JSON grammar as per the example above)
  • jsonschema (useful for more fine-grained validation of JSON output - see the in-app example and the sketch after this list)
  • sqlglot (useful for validating SQL query grammar)
  • requests (useful to make further LLM calls as part of your evaluation - see the in-app example for a suggestion of how to get started).
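As an illustration of the jsonschema module, the sketch below validates the model output against a hypothetical schema. Adapt the schema to whatever structure your Prompt is expected to return; the in-app example remains the reference:

```python
import json
from jsonschema import ValidationError, validate

# Hypothetical schema: adapt it to the structure your Prompt should return.
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer"],
}

def evaluator(log):
    try:
        validate(instance=json.loads(log.get("output", "")), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```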