Evaluating externally generated Logs

If running your infrastructure to generate logs, you can still leverage the Humanloop evaluations suite via our API. The workflow looks like this:

Trigger the creation of an evaluation run
Loop through the datapoints in your dataset and perform generations on your side
Post the generated logs to the evaluation run

This works with any evaluator - if you have configured a Humanloop-runtime evaluator, these will be automatically run on each log you post to the evaluation run; or, you can use self-hosted evaluators and post the results to the evaluation run yourself (see Self-hosted evaluations).

Prerequisites

You need to have access to evaluations
You also need to have a project created - if not, please first follow our project creation guides.
You need to have a dataset in your project. See our dataset creation guide if you don’t yet have one.
You need a model configuration to evaluate, so create one in the Editor.

Setting up the script

Install the latest version of the Humanloop Python SDK

$ pip install humanloop

In a new Python script, import the Humanloop SDK and create an instance of the client

1 humanloop = Humanloop(
2     api_key=YOUR_API_KEY, # Replace with your Humanloop API key
3 )

Retrieve the ID of the Humanloop project you are working in

You can find this in the Humanloop app.

1 PROJECT_ID = ... # Replace with the project ID

Retrieve the dataset you’re going to use for evaluation from the project

1 # Retrieve a dataset
2 DATASET_ID = ... # Replace with the dataset ID you use for evaluation.
3 								 # This must be a dataset in the project you are working on.
4 datapoints = humanloop.datasets.list_datapoints(DATASET_ID).records

Set up the model config you are evaluating

If you constructed this in Humanloop, retrieve it by calling:

1 config = humanloop.model_configs.get(id=CONFIG_ID)

Alternatively, if your model config lives outside the Humanloop system, post it to Humanloop with the register model config endpoint.

Either way, you need the ID of the config.

1 CONFIG_ID = <YOUR_CONFIG_ID>

In the Humanloop app, create an evaluator

We’ll create a Valid JSON checker for this guide.

Visit the Evaluations tab, and select Evaluators
Click + New Evaluator and choose Code from the options.
Select the Valid JSON preset on the left.
Choose the mode Offline in the settings panel on the left.
Click Create.
Copy your new evaluator’s ID from the address bar. It starts with evfn_.

1 EVALUATOR_ID = <YOUR_EVALUATOR_ID>

Create an evaluation run with `hl_generated` set to `False`

This tells the Humanloop runtime that it should not trigger evaluations but wait for them to be posted via the API.

1 evaluation_run = humanloop.evaluations.create(
2     project_id=PROJECT_ID,
3     config_id=CONFIG_ID,
4     dataset_id=DATASET_ID,
5     evaluator_ids=[EVALUATOR_ID],
6     hl_generated=False,
7 )

By default, the evaluation status after creation is pending. Before sending the generation logs, set the status to running.

1 humanloop.evaluations.update_status(id=evaluation_run.id, status="running")

Iterate through the datapoints in the dataset, produce a generation and post the evaluation

1 for datapoint in datapoints:
2 		# Use the datapoint to produce a log with the model config you are testing.
3     # This will depend on whatever model calling setup you are using on your side.
4     # For simplicity, we simply log a hardcoded
5     log = {
6         "project_id": PROJECT_ID,
7         "config_id": CONFIG_ID,
8         "messages":  [*config.chat_template, *datapoint.messages],
9         "output": "Hello World!",
10     }
11 
12     print(f"Logging generation for datapoint {datapoint.id}")
13     humanloop.evaluations.log(
14         evaluation_id=evaluation_run.id,
15         log=log,
16         datapoint_id=datapoint.id,
17     )

Run the full script above.

If everything goes well, you should now have posted a new evaluation run to Humanloop and logged all the generations derived from the underlying datapoints.

The Humanloop evaluation runtime will now iterate through those logs and run the Valid JSON evaluator on each. To check progress:

Visit your project in the Humanloop app and go to the Evaluations tab.

You should see the run you recently created; click through to it, and you’ll see rows in the table showing the generations.

In this case, all the evaluations returned False because the “Hello World!” string wasn’t valid JSON. Try logging something valid JSON to check that everything works as expected.

Full Script

For reference, here’s the full script to get started quickly.

1 from humanloop import Humanloop
2 
3 API_KEY = <YOUR_API_KEY>
4 
5 humanloop = Humanloop(
6     api_key=API_KEY,
7 )
8 
9 PROJECT_ID = <YOUR_PROJECT_ID>
10 DATASET_ID = <YOUR_DATASET_ID>
11 CONFIG_ID = <YOUR_CONFIG_ID>
12 EVALUATOR_ID = <YOUR_EVALUATOR_ID>
13 
14 # Retrieve the datapoints in the dataset.
15 datapoints = humanloop.datasets.list_datapoints(dataset_id=DATASET_ID).records
16 
17 # Retrieve the model config
18 config = humanloop.model_configs.get(id=CONFIG_ID)
19 
20 # Create the evaluation run
21 evaluation_run = humanloop.evaluations.create(
22     project_id=PROJECT_ID,
23     config_id=CONFIG_ID,
24     dataset_id=DATASET_ID,
25     evaluator_ids=[EVALUATOR_ID],
26     hl_generated=False,
27 )
28 print(f"Started evaluation run {evaluation_run.id}")
29 
30 # Set the status of the run to running.
31 humanloop.evaluations.update_status(id=evaluation_run.id, status="running")
32 
33 # Iterate the datapoints and log a generation for each one.
34 for i, datapoint in enumerate(datapoints):
35 		# Produce the log somehow. This is up to you and your external setup!
36   	log = {
37         "project_id": PROJECT_ID,
38         "config_id": CONFIG_ID,
39         "messages":  [*config.chat_template, *datapoint.messages],
40         "output": "Hello World!", # Hardcoded example for demonstration
41     }
42 
43     print(f"Logging generation for datapoint {datapoint.id}")
44     humanloop.evaluations.log(
45         evaluation_id=evaluation_run.id,
46         log=log,
47         datapoint_id=datapoint.id,
48     )
49 
50 print(f"Completed evaluation run {evaluation_run.id}")

It’s also a good practice to wrap the above code in a try-except block and to mark the evaluation run as failed (using update_status) if an exception causes something to fail.