Run an Evaluation via the API

In this guide, we will walk through how to programmatically evaluate several versions of a Prompt and compare their quality and performance.

An Evaluation on Humanloop brings together a Dataset, a set of Evaluators, and the different versions of a Prompt you want to compare.

In this guide, we use a Dataset to evaluate the performance of different Prompt versions. To learn how to evaluate Prompts without a Dataset, see the guide on Spot-check your Logs.

Prerequisites

You will need a Humanloop account, a Humanloop API key, and the Humanloop Python SDK installed (pip install humanloop).

Run an Evaluation

For this guide, we’re going to evaluate the performance of a Support Agent that responds to user queries about Humanloop’s product and documentation. Our goal is to understand which of the base models gpt-4o, gpt-4o-mini, and claude-3-5-sonnet-20241022 is best suited to this task.

1. Create a Prompt

Create a Support Agent Prompt with three versions, each using a different base model.

Python
from humanloop import Humanloop

humanloop = Humanloop(api_key="YOUR_API_KEY")

system_message = "You are a helpful assistant. Your job is to respond to FAQ style queries about the Humanloop documentation and platform. Be polite and succinct."

gpt_4o = humanloop.prompts.upsert(
    path="Run Evaluation via API/Support Agent",
    model="gpt-4o",
    endpoint="chat",
    template=[
        {
            "content": system_message,
            "role": "system",
        }
    ],
    provider="openai",
    commit_message="gpt-4o",
)

gpt_4o_mini = humanloop.prompts.upsert(
    path="Run Evaluation via API/Support Agent",
    model="gpt-4o-mini",
    endpoint="chat",
    template=[
        {
            "content": system_message,
            "role": "system",
        }
    ],
    provider="openai",
    commit_message="gpt-4o-mini",
)

sonnet = humanloop.prompts.upsert(
    path="Run Evaluation via API/Support Agent",
    model="claude-3-5-sonnet-20241022",
    endpoint="chat",
    template=[
        {
            "content": system_message,
            "role": "system",
        }
    ],
    provider="anthropic",
    commit_message="claude-3-5-sonnet-20241022",
)

# store prompt versions for later use
prompt_versions = [gpt_4o.version_id, gpt_4o_mini.version_id, sonnet.version_id]
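
If you want to sanity-check one of these versions before running a full Evaluation, you can call it directly. The snippet below is an optional sketch using prompts.call (if your SDK version exposes it); it is not part of the Evaluation flow, and the version_id and messages parameters shown here are assumptions you should check against the Prompts API reference.

Python
# Optional sanity check: call one of the Prompt versions directly.
# The version_id and messages parameters are assumptions; confirm them
# against the Prompts API reference.
response = humanloop.prompts.call(
    path="Run Evaluation via API/Support Agent",
    version_id=gpt_4o.version_id,
    messages=[{"role": "user", "content": "How do I create an API key?"}],
)
print(response)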

2. Create a Dataset

We will define sample datapoints containing user messages and target responses for the Support Agent Prompt, and create a Dataset from them.

Python
dataset = humanloop.datasets.upsert(
    path="Run Evaluation via API/Dataset with user questions",
    datapoints=[
        {
            "messages": [{
                "role": "user",
                "content": "How do I manage my organization's API keys?",
            }],
            "target": {"answer": "Hey, thanks for your question. Here are the steps: 1. Log in to the Humanloop Dashboard \n\n2. Click on \"Organization Settings.\"\n If you do not see this option, you might need to contact your organization admin to gain the necessary permissions.\n\n3. Within the settings or organization settings, select the option labeled \"API Keys\" on the left. Here you will be able to view and manage your API keys.\n\n4. You will see a list of existing API keys. You can perform various actions, such as:\n - **Generate New API Key:** Click on the \"Generate New Key\" button if you need a new API key.\n - **Revoke an API Key:** If you need to disable an existing key, find the key in the list and click the \"Revoke\" or \"Delete\" button.\n - **Copy an API Key:** If you need to use an existing key, you can copy it to your clipboard by clicking the \"Copy\" button next to the key.\n\n5. **Save and Secure API Keys:** Make sure to securely store any new or existing API keys you are using. Treat them like passwords and do not share them publicly.\n\nIf you encounter any issues or need further assistance, it might be helpful to engage with an engineer or your IT department to ensure you have the necessary permissions and support.\n\nDo you need help with anything else?"},
        },
        {
            "messages": [{
                "role": "user",
                "content": "Hey, can I use my code evaluator for monitoring my legal-copilot prompt?",
            }],
            "target": {"answer": "Hey, thanks for your question. Here are the steps: 1. Navigate to your Prompt dashboard. \n 2. Select the `Monitoring` button on the top right of the Prompt dashboard \n 3. Within the modal, select the Version of the Evaluator you want to turn on for monitoring. \n\nDo you need help with anything else?"},
        },
    ],
    action="set",
    commit_message="Add two new questions and answers",
)
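
As an optional check, you can fetch the datapoints back to confirm the Dataset was stored as expected. This uses list_datapoints, the same call used later in this guide, on the dataset object returned by datasets.upsert above.

Python
# Optional check: list the datapoints stored in the Dataset
datapoints = humanloop.datasets.list_datapoints(dataset.id)
for datapoint in datapoints:
    print(datapoint.id)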

3. Create an Evaluation

We create an Evaluation and then a Run for each Prompt version, using the Dataset we just created, to compare the performance of the different Prompts.

For this guide, we selected the Semantic similarity, Cost and Latency Evaluators. You can find these Evaluators in the Example Evaluators folder in your workspace.

The "Semantic similarity" Evaluator measures how similar the model's response is to the expected output, on a scale from 1 to 5 (1 is very dissimilar, 5 is very similar).

Python
evaluation = humanloop.evaluations.create(
    name="Evaluation via API",
    file={
        "path": "Run Evaluation via API/Support Agent",
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Cost"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

# Create a Run for each prompt version
for prompt_version in prompt_versions:
    humanloop.evaluations.create_run(
        id=evaluation.id,
        dataset={"path": "Run Evaluation via API/Dataset with user questions"},
        version={"version_id": prompt_version},
    )

4. Inspect the Evaluation stats

Once the Runs have completed, you can inspect the Evaluation Stats to see a summary of the Evaluators' judgments.

Python
evaluation_stats = humanloop.evaluations.get_stats(
    id=evaluation.id,
)

print(evaluation_stats.report)
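
Runs can take a short while to complete, so the report may initially be partial. The sketch below polls the stats endpoint until the Evaluation has finished; it assumes the stats response exposes a status field that reads "completed" once all Runs are done, so check the EvaluationStats schema in the API reference for the exact field name.

Python
import time

# Poll until the Evaluation has finished before printing the report.
# Assumes the stats response has a `status` field that becomes
# "completed" when all Runs are done; verify against the API reference.
while True:
    evaluation_stats = humanloop.evaluations.get_stats(id=evaluation.id)
    if evaluation_stats.status == "completed":
        break
    time.sleep(5)

print(evaluation_stats.report)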

Alternatively, you can see detailed stats in the Humanloop UI. Navigate to the Prompt, click on the Evaluations tab at the top of the page, and select the Evaluation you just created. The stats are displayed in the Stats tab.

Drill down to Evaluation Logs.

Run an Evaluation using your runtime

If you choose to execute Prompts using your own runtime, you can still benefit from Humanloop Evaluations. In the code snippet below, we run Evaluators hosted on Humanloop against Logs produced by the OpenAI client.

Python
import openai

# Create a new Humanloop Prompt for the version we will run ourselves
prompt = humanloop.prompts.upsert(
    path="Run Evaluation via API/Support Agent my own runtime",
    model="gpt-4o",
    endpoint="chat",
    template=[
        {
            "content": "You are a helpful assistant. Your job is to respond to FAQ style queries about the Humanloop documentation and platform. Be polite and succinct.",
            "role": "system",
        }
    ],
    provider="openai",
)

# Create the Evaluation
evaluation = humanloop.evaluations.create(
    name="Evaluation via API using my own runtime",
    file={
        "path": "Run Evaluation via API/Support Agent my own runtime",
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Cost"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

# Use the Dataset created in the previous steps
# (`dataset` is the response returned by datasets.upsert above)
datapoints = humanloop.datasets.list_datapoints(dataset.id)

openai_client = openai.OpenAI(api_key="USE_YOUR_OPENAI_API_KEY")

# Create a Run tying together the Prompt version and Dataset version
run = humanloop.evaluations.create_run(
    id=evaluation.id,
    dataset={"version_id": dataset.version_id},
    version={"version_id": prompt.version_id},
)

# For each datapoint in the Dataset, generate a chat completion with the
# OpenAI client and log the result against the Run
for datapoint in datapoints:
    chat_completion = openai_client.chat.completions.create(
        messages=datapoint.messages, model=prompt.model
    )
    humanloop.prompts.log(
        id=prompt.id,
        run_id=run.id,
        version_id=prompt.version_id,
        source_datapoint_id=datapoint.id,
        output_message=chat_completion.choices[0].message,
        messages=datapoint.messages,
    )
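
Because the Evaluators attached to this Evaluation run on Humanloop, you can summarise the results for these externally generated Logs the same way as before:

Python
# Fetch the aggregated Evaluator results for the externally generated Logs
evaluation_stats = humanloop.evaluations.get_stats(id=evaluation.id)
print(evaluation_stats.report)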

Next steps