Set up evaluations using API

In this guide, we’ll walk through an example of using our API to create a dataset and trigger an evaluation.

Paid Feature

This feature is not available for the Free tier. Please contact us if you wish to learn more about our Enterprise plan.

API Options

This guide uses our Python SDK. All of the endpoints used are available in our TypeScript SDK and directly via the API.

Prerequisites

Create evaluation

We’ll go through how to use the SDK in a Python script to set up a project, create a dataset, and finally trigger an evaluation.

Set up a project

1

Import Humanloop and set your Humanloop and OpenAI API keys.

from humanloop import Humanloop

HUMANLOOP_API_KEY = "<YOUR HUMANLOOP KEY>"
OPENAI_API_KEY = "<YOUR OPENAI KEY>"

# Initialize the Humanloop client
humanloop = Humanloop(
    api_key=HUMANLOOP_API_KEY,
    openai_api_key=OPENAI_API_KEY,
)
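
If you prefer not to hard-code keys in the script, you can read them from environment variables instead. A minimal sketch, assuming you have exported HUMANLOOP_API_KEY and OPENAI_API_KEY in your shell (the variable names are just a convention, not required by the SDK):

import os

from humanloop import Humanloop

# Assumes HUMANLOOP_API_KEY and OPENAI_API_KEY are set in the environment
humanloop = Humanloop(
    api_key=os.environ["HUMANLOOP_API_KEY"],
    openai_api_key=os.environ["OPENAI_API_KEY"],
)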
2

Create a project and register your first model config

We’ll use OpenAI’s GPT-4 for extracting product feature names from customer queries in this example. The first model config created against the project is automatically deployed:

# Create a project
project = humanloop.projects.create(name="evals-guide")
project_id = project.id

# Create the first model config for the project, which will automatically be deployed
model_config = humanloop.model_configs.register(
    project_id=project_id,
    model="gpt-4",
    name="Entity extractor v0",
    endpoint="chat",
    chat_template=[
        {
            "role": "system",
            "content": "Extract the name of the feature or issue the customer is describing. "
            "Possible features are only: evaluations, experiments, fine-tuning \n"
            "Write your response in json format as follows:"
            ' \n {"feature": "feature requested", "issue": "description of issue"}',
        }
    ],
)
config_id = model_config.config.id

If you log in to your Humanloop account, you will now see your project with a single model config defined.

Create a dataset

Follow the steps in our guide to Upload a Dataset via API.
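
If you prefer to stay in the same script, the snippet below mirrors the dataset setup from the full script at the end of this guide: it defines example test case data, creates a dataset against the project, and uploads the datapoints. It also defines the data and dataset_id variables used in the steps below; the customer message is illustrative only.

# Example test case data: each datapoint has messages, a target and (empty) inputs
data = [
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?",
            }
        ],
        "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
        "inputs": {},
    },
]

# Create a dataset against the project
dataset = humanloop.datasets.create(
    project_id=project_id,
    name="Target feature requests",
    description="Target feature request json extractions",
)
dataset_id = dataset.id

# Upload the test datapoints to the dataset
datapoints = humanloop.datasets.create_datapoint(
    dataset_id=dataset_id,
    body=data,
)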

1

Now test your model manually by generating a log for one of the datapoints’ messages:

import json

# Generate a log against the deployed model config
log = humanloop.chat_deployed(
    project_id=project_id,
    messages=data[0]["messages"],
    inputs={"features": "evaluations, experiments, fine-tuning"},
).data[0]

print(json.dumps(log))

You can see from the output field in the response that the model has done a good job of extracting the mentioned features in the desired json format:

{
  "id": "data_aVUA2QZPHaQTnhoOCG7yS",
  "model_config_id": "config_RbbfjXOkEnzYK6PS8cS96",
  "messages": [
    {
      "role": "system",
      "content": "Extract the name of the feature or issue the customer is describing. Possible features are only: evaluations, experiments, fine-tuning \nWrite your response in json format as follows: \n {\"feature\": \"feature requested\", \"issue\": \"description of issue\"}"
    },
    {
      "role": "user",
      "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?"
    }
  ],
  "output": "{\"feature\": \"evaluations\", \"issue\": \"trouble understanding how to use the evaluations feature\"}",
  "finish_reason": "stop"
}

Create an evaluator

Now that you have a project with a model config and a dataset defined, you can create an evaluator that determines whether a log generated by the model meets the success criteria, using the target defined in the test datapoint.

1

Create an evaluator to determine if the extracted JSON is correct and test it against the generated log and the corresponding test datapoint:

# Define an evaluator
import json
from json import JSONDecodeError


def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]

    try:
        # The model is expected to produce valid JSON output, but it could fail to do so.
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature
    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then it fails
        return False


# Try out the evaluator against the generated log and the corresponding test datapoint
print(f"Test case result: {check_feature_json(log, data[0])}")
Test case result: True
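
To sanity-check the failure path as well, you can pass the evaluator a hand-made log whose output is not valid JSON (this log dict is hypothetical, purely for illustration):

# A hypothetical log with malformed output: the evaluator should return False
bad_log = {"output": "not valid json"}
print(f"Invalid JSON result: {check_feature_json(bad_log, data[0])}")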
2

Submit this evaluator to Humanloop

This means it can be used for future evaluations triggered via the UI or the API:

import inspect

# The evaluator must be sent as a string, so we convert it first
json_imports = "import json\nfrom json import JSONDecodeError\n"
evaluator_code = json_imports + inspect.getsource(check_feature_json)

# Send evaluator to Humanloop
evaluator = humanloop.evaluators.create(
    name="Feature request json",
    description="Validate that the json returned by the model matches the target json",
    code=evaluator_code,
    arguments_type="target_required",
    return_type="boolean",
)
evaluator_id = evaluator.id

In your Humanloop project, you will now see the evaluator defined.

Launch an evaluation

1

Launch an evaluation

You can now launch an evaluation against the model config using the dataset and evaluator. In practice, you can include more than one evaluator:

# Finally, trigger an evaluation
# dataset_id comes from the dataset created in the "Create a dataset" step above
evaluation = humanloop.evaluations.create(
    project_id=project_id,
    evaluator_ids=[evaluator_id],
    config_id=config_id,
    dataset_id=dataset_id,
)

Navigate to your Humanloop account to see the evaluation results. Initially it will be in a pending state, but will quickly move to completed given the small number of test cases. The datapoints generated by your model as part of the evaluation will also be recorded in your project’s logs table.
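
If you would rather wait for the result in code than in the UI, you can poll the evaluation until it completes. This is a sketch only: it assumes the SDK exposes an evaluations.get method that returns an object with a status field, so check your SDK version for the exact method name and response shape.

import time

# Assumption: evaluations.get(id=...) exists and returns an object with a `status`
# field (e.g. "pending", "running", "completed"). Adjust to match your SDK version.
while True:
    current = humanloop.evaluations.get(id=evaluation.id)
    print(f"Evaluation status: {current.status}")
    if current.status == "completed":
        break
    time.sleep(5)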

Create evaluation - full script

Here is the full script, which you can copy, paste, and run in your Python environment:

from humanloop import Humanloop
import inspect
import json
from json import JSONDecodeError


HUMANLOOP_API_KEY = "<YOUR HUMANLOOP API KEY>"
OPENAI_API_KEY = "<YOUR OPENAI API KEY>"

# Initialize the Humanloop client
humanloop = Humanloop(
    api_key=HUMANLOOP_API_KEY,
    openai_api_key=OPENAI_API_KEY,
)

# Create a project
project = humanloop.projects.create(name="evals-guide")
project_id = project.id

# Create the first model config for the project, which will automatically be deployed
model_config = humanloop.model_configs.register(
    project_id=project_id,
    model="gpt-4",
    name="Entity extractor v0",
    chat_template=[
        {
            "role": "system",
            "content": "Extract the name of the feature or issue the customer is describing. "
            "Possible features are only: evaluations, experiments, fine-tuning \n"
            "Write your response in json format as follows:"
            ' \n {"feature": "feature requested", "issue": "description of issue"}',
        }
    ],
    endpoint="chat",
    temperature=0.5,
)
config_id = model_config.config.id

# Example test case data
data = [
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?",
            }
        ],
        "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
        "inputs": {},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi there, I'm interested in fine-tuning a language model using your software. Can you explain the process and provide any best practices or guidelines?",
            }
        ],
        "target": {
            "feature": "fine-tuning",
            "issue": "process explanation and best practices",
        },
        "inputs": {},
    },
]

# Create a dataset
dataset = humanloop.datasets.create(
    project_id=project_id,
    name="Target feature requests",
    description="Target feature request json extractions",
)

# Create test datapoints for the dataset
datapoints = humanloop.datasets.create_datapoint(
    dataset_id=dataset.id,
    body=data,
)

# Generate a log
log = humanloop.chat_deployed(
    project_id=project_id,
    messages=data[0]["messages"],
).data[0]


# Define an evaluator
def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]

    try:
        # The model is expected to produce valid JSON output, but it could fail to do so.
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature

    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then it fails
        return False


# Try out the evaluator
print(f"Test case result: {check_feature_json(log, data[0])}")

# The evaluator must be sent as a string, so we convert it first
json_imports = "import json\nfrom json import JSONDecodeError\n"
evaluator_code = json_imports + inspect.getsource(check_feature_json)

# Send evaluator to Humanloop
evaluator = humanloop.evaluators.create(
    name="Feature request json",
    description="Validate that the json returned by the model matches the target json",
    code=evaluator_code,
    arguments_type="target_required",
    return_type="boolean",
)

# Finally, trigger an evaluation
evaluation = humanloop.evaluations.create(
    project_id=project_id,
    evaluator_ids=[evaluator.id],
    config_id=config_id,
    dataset_id=dataset.id,
)

# Now navigate to your project's evaluations tab on Humanloop to inspect the results