Self-hosted evaluations | Humanloop Docs

For some use cases, you may wish to run your evaluation process outside of Humanloop, as opposed to running the evaluators we offer in our Humanloop runtime.

For example, you may have implemented an evaluator that uses your own custom model or which has to interact with multiple systems. In these cases, you can continue to leverage the datasets you have curated on Humanloop, as well as consolidate all of the results alongside the prompts you maintain in Humanloop.

In this guide, we’ll show an example of setting up a simple script to run such a self-hosted evaluation using our Python SDK.

Prerequisites

You need to have access to evaluations
You also need to have a Prompt – if not, please follow our Prompt creation guide.
You need to have a dataset in your project. See our dataset creation guide if you don’t yet have one.
You need to have a model config that you’re trying to evaluate - create one in the Editor.

Setting up the script

Install the latest version of the Humanloop Python SDK:

$ pip install humanloop

In a new Python script, import the Humanloop SDK and create an instance of the client:

1 from humanloop import Humanloop
2 
3 humanloop = Humanloop(
4     api_key=YOUR_API_KEY, # Replace with your API key
5 )

Retrieve the ID of the Humanloop project you are working in - you can find this in the Humanloop app

1 PROJECT_ID = ... # Replace with the project ID

Retrieve the dataset you’re going to use for evaluation from the project

1 # Retrieve a dataset
2 DATASET_ID = ... # Replace with the dataset ID you are using for evaluation (this should be inside the project)
3 datapoints = humanloop.datasets.list_datapoints(DATASET_ID).records

Create an external evaluator

1 # Create an external evaluator
2 evaluator = humanloop.evaluators.create(
3     name="My External Evaluator",
4     description="An evaluator that runs outside of Humanloop runtime.",
5     type="external",
6     arguments_type="target_required",
7     return_type="boolean",
8 )

Retrieve the model config you’re evaluating

1 CONFIG_ID = ... # Replace with the model config ID you are evaluating (should be inside the project)
2 model_config = humanloop.model_configs.get(CONFIG_ID)

Initiate an evaluation run in Humanloop

1 evaluation_run = humanloop.evaluations.create(
2     project_id=PROJECT_ID,
3     config_id=CONFIG_ID,
4     evaluator_ids=[EVALUATOR_ID],
5     dataset_id=DATASET_ID,
6 )

After this step, you’ll see a new run in the Humanloop app, under the Evaluations tab of your project. It should have status running.

Iterate through the datapoints in your dataset and use the model config to generate logs from them

1 logs = []
2 for datapoint in datapoints:
3     log = humanloop.chat_model_config(
4         project_id=PROJECT_ID,
5         model_config_id=model_config.id,
6         inputs=datapoint.inputs,
7         messages=[
8             {key: value for key, value in dict(message).items() if value is not None}
9             for message in datapoint.messages
10         ],
11         source_datapoint_id=datapoint.id,
12     ).data[0]
13     logs.append((log, datapoint))

Evaluate the logs using your own evaluation logic and post the results back to Humanloop

In this example, we use an extremely simple evaluation function for clarity.

1 for log, datapoint in logs:
2     # The datapoint's 'target' field tells us the correct answer for this datapoint
3     expected_answer = str(datapoint.target["answer"])
4 
5     # The log output is what the model produced
6     model_output = log.output
7 
8     # The evaluation is a boolean, indicating whether the model was correct.
9     result = expected_answer == model_output
10 
11     # Post the result back to Humanloop.
12     evaluation_result_log = humanloop.evaluations.log_result(
13         log_id=log.id,
14         evaluator_id=evaluator.id,
15         evaluation_run_external_id=evaluation_run.id,
16         result=result,
17     )

Mark the evaluation run as completed

1 humanloop.evaluations.update_status(id=evaluation_run.id, status="completed")

Review the results

After running this script with the appropriate resource IDs (project, dataset, model config), you should see the results in the Humanloop app, right alongside any other evaluations you have performed using the Humanloop runtime.