Run an Evaluation via the API
In this guide, we will walk through how to programmatically evaluate multiple Prompt versions to compare their quality and performance.
An Evaluation on Humanloop combines a Dataset, a set of Evaluators, and the different Prompt versions you want to compare.
In this guide, we use a Dataset to evaluate the performance of different Prompt versions. To learn how to evaluate Prompts without a Dataset, see the guide on Spot-check your Logs.
Prerequisites
- A set of Prompt versions you want to compare - see the guide on creating Prompts.
- A Dataset containing datapoints for the task - see the guide on creating a Dataset via API.
- At least one Evaluator to judge the performance of the Prompts - see the guides on creating Code, AI and Human Evaluators.
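The code sketches in this guide use the Humanloop Python SDK. They are minimal illustrations rather than drop-in scripts, so check the SDK reference for exact method signatures. Install the SDK and initialize the client with your API key:

```python
# pip install humanloop
from humanloop import Humanloop

# Initialize the client with your Humanloop API key.
humanloop = Humanloop(api_key="<YOUR_HUMANLOOP_API_KEY>")
```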
Run an Evaluation
For this guide, we’re going to evaluate the performance of a Support Agent that responds to user queries about Humanloop’s product and documentation.
Our goal is to understand which base model (gpt-4o, gpt-4o-mini or claude-3-5-sonnet-20241022) is most appropriate for this task.
Create a Prompt
Create a Support Agent Prompt with three versions, each using a different base model.
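Below is a minimal sketch of this step in Python. The file path, system message, and the attribute name version_id on the upsert response are illustrative assumptions; adjust them to your own setup.

```python
system_message = (
    "You are a helpful assistant answering questions about "
    "Humanloop's product and documentation."
)

# Upsert one Prompt version per base model we want to compare,
# keeping track of the version IDs for the Evaluation later on.
prompt_versions = {}
for model in ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-20241022"]:
    prompt = humanloop.prompts.upsert(
        path="Evals demo/Support Agent",
        model=model,
        template=[{"role": "system", "content": system_message}],
    )
    prompt_versions[model] = prompt.version_id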
Create a Dataset
We define sample data containing user messages and the desired responses for the Support Agent Prompt, and create a Dataset from these datapoints.
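A sketch of the same step in Python. The datapoints, Dataset path, and the "set" action are illustrative; the sample messages and target outputs below are placeholder data.

```python
# Example datapoints: a user message plus the response we would ideally see.
datapoints = [
    {
        "messages": [
            {"role": "user", "content": "How do I create a new Prompt?"}
        ],
        "target": {
            "output": "Click 'New File' in the sidebar and select Prompt."
        },
    },
    {
        "messages": [
            {"role": "user", "content": "Does Humanloop have a Python SDK?"}
        ],
        "target": {
            "output": "Yes, Humanloop provides Python and TypeScript SDKs."
        },
    },
]

# Upsert the Dataset; "set" replaces any existing datapoints at this path.
dataset = humanloop.datasets.upsert(
    path="Evals demo/Support Queries",
    datapoints=datapoints,
    action="set",
)
```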
Create an Evaluation
We create an Evaluation Run to compare the performance of the different Prompts using the Dataset we just created.
For this guide, we select the Semantic similarity, Cost and Latency Evaluators. You can find these Evaluators in the Example Evaluators folder in your workspace.
The "Semantic similarity" Evaluator measures how similar the model's response is to the expected output, on a scale from 1 to 5 (1 is very dissimilar and 5 is very similar).
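A Python sketch of this step, assuming the Example Evaluators live at the paths shown and that creating the Evaluation followed by one Run per Prompt version is the pattern you want; check the API reference for the exact parameters.

```python
# Create the Evaluation and attach the Evaluators we want to use.
evaluation = humanloop.evaluations.create(
    name="Support Agent base model comparison",
    file={"path": "Evals demo/Support Agent"},
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Cost"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

# Create one Run per Prompt version so they are compared side by side
# on the same Dataset.
for model, version_id in prompt_versions.items():
    humanloop.evaluations.create_run(
        id=evaluation.id,
        dataset={"path": "Evals demo/Support Queries"},
        version={"version_id": version_id},
    )
```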
Inspect the Evaluation stats
Once the Runs have completed, you can inspect the Evaluation Stats to see a summary of the Evaluators' judgments.
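For example, with the Python SDK you could fetch the stats once the Runs have finished; the exact shape of the returned stats object may differ, so inspect it before relying on specific fields.

```python
# Fetch aggregated Evaluator results for the Evaluation.
stats = humanloop.evaluations.get_stats(id=evaluation.id)
print(stats)
```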
Alternatively, you can view detailed stats in the Humanloop UI. Navigate to the Prompt, click the Evaluations tab at the top of the page, and select the Evaluation you just created. The stats are displayed in the Stats tab.
Run an Evaluation using your runtime
If you choose to execute Prompts using your own runtime, you can still benefit from Humanloop Evaluations. In the code snippet below, we run Evaluators hosted on Humanloop against Logs produced by the OpenAI client.
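The sketch below illustrates this pattern under a few assumptions: the `evaluation`, `dataset` and `system_message` objects come from the earlier steps, datapoint messages are returned as plain dicts, and the OpenAI client is configured separately. It creates a Run without attaching a Prompt version, generates responses locally with the OpenAI client, and logs them back to Humanloop so the hosted Evaluators can judge them.

```python
from openai import OpenAI

openai_client = OpenAI(api_key="<YOUR_OPENAI_API_KEY>")

# Create a Run with no Prompt version attached; we supply the Logs ourselves.
run = humanloop.evaluations.create_run(
    id=evaluation.id,
    dataset={"path": "Evals demo/Support Queries"},
)

# Retrieve the datapoints and generate a response for each one locally.
retrieved = humanloop.datasets.get(id=dataset.id, include_datapoints=True)
for datapoint in retrieved.datapoints:
    messages = [{"role": "system", "content": system_message}] + [
        {"role": m["role"], "content": m["content"]} for m in datapoint.messages
    ]
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )

    # Log the output against the Run; the Humanloop-hosted Evaluators
    # attached to the Evaluation will then judge it.
    humanloop.prompts.log(
        path="Evals demo/Support Agent",
        prompt={
            "model": "gpt-4o",
            "template": [{"role": "system", "content": system_message}],
        },
        messages=datapoint.messages,
        output=completion.choices[0].message.content,
        source_datapoint_id=datapoint.id,
        run_id=run.id,
    )
```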
Next steps
- Learn how to set up LLM as a Judge to evaluate your AI applications.