Set up evaluations using API

Paid Feature

This feature is not available on the Free tier. Please contact us if you wish to learn more about our Enterprise plan.

API Options

This guide uses our Python SDK. All of the endpoints used are available in our TypeScript SDK and directly via the API.

Prerequisites

Install and initialize the SDK

First you need to install and initialize the SDK. If you have already done this, skip to the next section.

Open up your terminal and follow these steps:

Install the Humanloop SDK:

pip install humanloop

Initialize the SDK with your Humanloop API key (you can get it from the Organization Settings page).

from humanloop import Humanloop
humanloop = Humanloop(api_key="<YOUR HUMANLOOP KEY>")

# Check that the authentication was successful
print(humanloop.prompts.list())
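Rather than pasting the key into the script, you may prefer to read it from the environment. A minimal sketch, assuming an environment variable named `HUMANLOOP_API_KEY` (both the variable name and the `load_api_key` helper are illustrative choices, not part of the SDK):

```python
import os


def load_api_key(var_name: str = "HUMANLOOP_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name your own environment sets.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key
```

The returned value can then be passed as `api_key=load_api_key()` when constructing the `Humanloop` client shown above.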

Create evaluation

We’ll go through how to use the SDK in a Python script to set up a project, create a dataset and then finally trigger an evaluation.

Set up a project

Import Humanloop and set your Humanloop and OpenAI API keys.

from humanloop import Humanloop

HUMANLOOP_API_KEY = "<YOUR HUMANLOOP KEY>"
OPENAI_API_KEY = "<YOUR OPENAI KEY>"

# Initialize the Humanloop client
humanloop = Humanloop(
    api_key=HUMANLOOP_API_KEY,
    openai_api_key=OPENAI_API_KEY,
)

Create a project and register your first model config

We’ll use OpenAI’s GPT-4 for extracting product feature names from customer queries in this example. The first model config created against the project is automatically deployed:

# Create a project
project = humanloop.projects.create(name="evals-guide")
project_id = project.id

# Create the first model config for the project, which will automatically be deployed
model_config = humanloop.model_configs.register(
    project_id=project_id,
    model="gpt-4",
    name="Entity extractor v0",
    endpoint="chat",
    chat_template=[
        {
            "role": "system",
            "content": "Extract the name of the feature or issue the customer is describing. "
            "Possible features are only: evaluations, experiments, fine-tuning \n"
            "Write your response in json format as follows:"
            ' \n {"feature": "feature requested", "issue": "description of issue"}',
        }
    ],
)
config_id = model_config.config.id

If you log in to your Humanloop account, you will now see your project with a single model config defined.

Create a dataset

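Follow the steps in our guide to Upload a Dataset via API. The snippets that follow assume that guide leaves you with a `data` list of test datapoints and the ID of the created dataset (`dataset_id`); the full script at the end of this page shows the example data.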

Now test your model manually by generating a log for one of the datapoints’ messages:

# Generate a log
log = humanloop.chat_deployed(
    project_id=project_id,
    messages=data[0]["messages"],
    inputs={"features": "evaluations, experiments, fine-tuning"},
).data[0]

import json
print(json.dumps(log))

You can see from the output field in the response that the model has extracted the mentioned feature in the desired JSON format:

{
  "id": "data_aVUA2QZPHaQTnhoOCG7yS",
  "model_config_id": "config_RbbfjXOkEnzYK6PS8cS96",
  "messages": [
    {
      "role": "system",
      "content": "Extract the name of the feature or issue the customer is describing. Possible features are only: evaluations, experiments, fine-tuning \nWrite your response in json format as follows: \n {\"feature\": \"feature requested\", \"issue\": \"description of issue\"}"
    },
    {
      "role": "user",
      "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?"
    }
  ],
  "output": "{\"feature\": \"evaluations\", \"issue\": \"trouble understanding how to use the evaluations feature\"}",
  "finish_reason": "stop"
}
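Note that `output` is itself a JSON-encoded string nested inside the response, so it needs a second decoding step before its fields can be read. A small sketch using the `output` value from the response above (`parse_output` is a hypothetical helper name, not an SDK function):

```python
import json

# The `output` field copied from the example response above.
raw_output = (
    '{"feature": "evaluations", '
    '"issue": "trouble understanding how to use the evaluations feature"}'
)


def parse_output(output: str) -> dict:
    """Decode the model's output string; return {} if it is not a JSON object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


parsed = parse_output(raw_output)
print(parsed["feature"])  # prints: evaluations
```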

Create an evaluator

Now that you have a project with a model config and a dataset defined, you can create an evaluator that will determine the success criteria for a log generated from the model using the target defined in the test datapoint.

Create an evaluator to determine if the extracted JSON is correct and test it against the generated log and the corresponding test datapoint:

# Define an evaluator
import json
from json import JSONDecodeError


def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]

    try:
        # The model is expected to produce valid JSON output but it could fail to do so.
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature
    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then it fails
        return False

# Try out the evaluator on the log generated earlier
print(f"Test case result: {check_feature_json(log, data[0])}")

$ Test case result: True
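Before submitting, it can help to exercise the evaluator's failure paths as well as the passing case. The sketch below repeats the evaluator so that it runs on its own, then applies it to three hand-written logs. These logs are made-up illustrations in the same shape as the response above, not real model output:

```python
import json
from json import JSONDecodeError


def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]
    try:
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature
    except JSONDecodeError:
        return False


# Test case in the same shape as the example datapoints.
testcase = {"target": {"feature": "evaluations", "issue": "needs step-by-step guide"}}

# Hypothetical logs covering a pass, a mismatch, and non-JSON output.
sample_logs = {
    "matching feature": {"output": '{"feature": "evaluations", "issue": "unclear docs"}'},
    "wrong feature": {"output": '{"feature": "experiments", "issue": "unclear docs"}'},
    "invalid JSON": {"output": "The customer is asking about evaluations."},
}

results = {name: check_feature_json(log, testcase) for name, log in sample_logs.items()}
for name, passed in results.items():
    print(f"{name}: {passed}")
```

Only the first log passes. One case this evaluator does not cover: output that is valid JSON but not an object, such as a bare list, would raise an `AttributeError` at `output.get` rather than return `False`.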

Submit this evaluator to Humanloop

This means it can be used for future evaluations triggered via the UI or the API:

import inspect

# The evaluator must be sent as a string, so we convert it first
json_imports = "import json\nfrom json import JSONDecodeError\n"
evaluator_code = json_imports + inspect.getsource(check_feature_json)

# Send evaluator to Humanloop
evaluator = humanloop.evaluators.create(
    name="Feature request json",
    description="Validate that the json returned by the model matches the target json",
    code=evaluator_code,
    arguments_type="target_required",
    return_type="boolean",
)
evaluator_id = evaluator.id
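Because the evaluator is sent as a string, it has to run without anything else from your script. One way to check this locally, before the `create` call above, is to execute the string in an empty namespace. This is only a sketch: the source is written out as a literal here so the example runs on its own, and how Humanloop actually executes the code is not covered by this guide.

```python
# Stand-in for `evaluator_code`; written as a literal so this runs standalone.
evaluator_code = '''import json
from json import JSONDecodeError


def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]
    try:
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature
    except JSONDecodeError:
        return False
'''

# Execute the string in a fresh namespace so no names leak in from this script.
namespace = {}
exec(evaluator_code, namespace)
evaluator_fn = namespace["check_feature_json"]

# Hand-written log and test case, for illustration only.
sample_log = {"output": '{"feature": "fine-tuning", "issue": "process"}'}
sample_testcase = {"target": {"feature": "fine-tuning"}}
print(evaluator_fn(sample_log, sample_testcase))  # prints: True
```

If the `import` lines were left out of the string, the call would fail with a `NameError`, which is the mistake the `json_imports` prefix guards against.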

In your Humanloop project you will now see an evaluator defined.

Launch an evaluation

You can now launch an evaluation against the model config using the dataset and evaluator. In practice you can include more than one evaluator:

# Finally trigger an evaluation
evaluation = humanloop.evaluations.create(
    project_id=project_id,
    evaluator_ids=[evaluator_id],
    config_id=config_id,
    dataset_id=dataset_id,
)

Navigate to your Humanloop account to see the evaluation results. Initially it will be in a pending state, but will quickly move to completed given the small number of test cases. The datapoints generated by your model as part of the evaluation will also be recorded in your project’s logs table.

Create evaluation - full script

Here is the full script you can copy and paste and run in your Python environment:

from humanloop import Humanloop
import inspect
import json
from json import JSONDecodeError


HUMANLOOP_API_KEY = "<YOUR HUMANLOOP API KEY>"
OPENAI_API_KEY = "<YOUR OPENAI API KEY>"

# Initialize the Humanloop client
humanloop = Humanloop(
    api_key=HUMANLOOP_API_KEY,
    openai_api_key=OPENAI_API_KEY,
)

# Create a project
project = humanloop.projects.create(name="evals-guide")
project_id = project.id

# Create the first model config for the project, which will automatically be deployed
model_config = humanloop.model_configs.register(
    project_id=project_id,
    model="gpt-4",
    name="Entity extractor v0",
    chat_template=[
        {
            "role": "system",
            "content": "Extract the name of the feature or issue the customer is describing. "
            "Possible features are only: evaluations, experiments, fine-tuning \n"
            "Write your response in json format as follows:"
            ' \n {"feature": "feature requested", "issue": "description of issue"}',
        }
    ],
    endpoint="chat",
    temperature=0.5,
)
config_id = model_config.config.id

# Example test case data
data = [
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?",
            }
        ],
        "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
        "inputs": {},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi there, I'm interested in fine-tuning a language model using your software. Can you explain the process and provide any best practices or guidelines?",
            }
        ],
        "target": {
            "feature": "fine-tuning",
            "issue": "process explanation and best practices",
        },
        "inputs": {},
    },
]

# Create a dataset
dataset = humanloop.datasets.create(
    project_id=project_id,
    name="Target feature requests",
    description="Target feature request json extractions",
)

# Create test datapoints for the dataset
datapoints = humanloop.datasets.create_datapoint(
    dataset_id=dataset.id,
    body=data,
)

# Generate a log
log = humanloop.chat_deployed(
    project_id=project_id,
    messages=data[0]["messages"],
).data[0]


# Define an evaluator
def check_feature_json(log, testcase):
    expected_feature = testcase["target"]["feature"]

    try:
        # The model is expected to produce valid JSON output but it could fail to do so.
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature

    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then it fails
        return False


# Try out the evaluator
print(f"Test case result: {check_feature_json(log, data[0])}")

# The evaluator must be sent as a string, so we convert it first
json_imports = "import json\nfrom json import JSONDecodeError\n"
evaluator_code = json_imports + inspect.getsource(check_feature_json)

# Send evaluator to Humanloop
evaluator = humanloop.evaluators.create(
    name="Feature request json",
    description="Validate that the json returned by the model matches the target json",
    code=evaluator_code,
    arguments_type="target_required",
    return_type="boolean",
)

# Finally trigger an evaluation, using the ID of the dataset created above
evaluation = humanloop.evaluations.create(
    project_id=project_id,
    evaluator_ids=[evaluator.id],
    config_id=config_id,
    dataset_id=dataset.id,
)

# Now navigate to your project's evaluations tab on Humanloop to inspect the results