Evaluate an Agent in code
Evaluate and improve the performance of an LLM agent using Humanloop.
Humanloop offers first-class support for agents on its runtime. This tutorial provides guidance on how to instrument existing code-based agentic systems with Humanloop.
Check out our Evaluating Agents in UI tutorial for a more streamlined experience, including tool calling on the Humanloop runtime and autonomous agents.
Working with LLMs is daunting: you are dealing with a black box that outputs unpredictable results.
Humanloop provides tools to make your development process systematic, bringing it closer to traditional software testing and quality assurance.
In this tutorial, we’ll build an agentic question-and-answer system and use Humanloop to iterate on its performance. The agent will provide simple, child-friendly answers using Wikipedia as its source of factual information.
Prerequisites
Account setup
Create a Humanloop Account
If you haven’t already, create an account or log in to Humanloop.
Add an OpenAI API Key
If you’re the first person in your organization, you’ll need to add an API key to a model provider.
- Go to OpenAI and grab an API key.
- In Humanloop Organization Settings, set up OpenAI as a model provider.
Using the Prompt Editor will use your OpenAI credits in the same way that the OpenAI playground does. Keep your API keys for Humanloop and the model providers private.
Install dependencies
Install the project’s dependencies:
The Humanloop SDK requires Python 3.9 or higher. Optionally, create a virtual environment to keep dependencies tidy.
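For example, assuming the Python version of the agent uses the openai and requests packages alongside the Humanloop SDK:

```bash
pip install humanloop openai requests
```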
Initial agent code
Let’s build the first iteration of the agent. We’ll use OpenAI’s function calling to connect the agent to the Wikipedia API. The agent will also be allowed to refine its answer through multiple iterations if the initial tool call doesn’t yield relevant information.
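Here is a minimal Python sketch of such an agent. Treat it as a starting point rather than a reference implementation: the `search_wikipedia` helper, the model choice, and the prompt wording are all assumptions you can adapt.

```python
import json

import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Tool definition exposed to the model via OpenAI function calling.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_wikipedia",
            "description": "Search Wikipedia and return the top matching extract.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search term"}
                },
                "required": ["query"],
            },
        },
    }
]


def search_wikipedia(query: str) -> str:
    """Return the introductory extract of the best-matching Wikipedia page."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "prop": "extracts",
            "exintro": True,
            "explaintext": True,
            "generator": "search",
            "gsrsearch": query,
            "gsrlimit": 1,
        },
        timeout=10,
    )
    pages = response.json().get("query", {}).get("pages", {})
    if not pages:
        return "No results found."
    return next(iter(pages.values())).get("extract", "No extract available.")


def call_model(messages: list):
    """Single LLM step: the model may answer or request a Wikipedia search."""
    system_message = {
        "role": "system",
        "content": "You answer questions using Wikipedia as your source of facts.",
    }
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[system_message] + messages,
        tools=TOOLS,
    )
    return completion.choices[0].message


def ask_question(question: str) -> str:
    """Agent loop: call the model, run any tool calls, repeat until answered."""
    messages = [{"role": "user", "content": question}]
    for _ in range(5):  # cap the number of model/tool-call turns
        message = call_model(messages)
        if not message.tool_calls:
            return message.content
        messages.append(message)
        for tool_call in message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": search_wikipedia(**arguments),
                }
            )
    return "I could not find a confident answer."


def main():
    print(ask_question("Why is the sky blue?"))


if __name__ == "__main__":
    main()
```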
Run the agent (e.g. `python main.py`) and check that it works as expected.
Add Humanloop to the agent
Humanloop offers first-class support for agentic systems, plus the ability to effortlessly switch between providers.
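As a first step, initialize the Humanloop client and wrap the agent's entry point in a Flow. The sketch below assumes the SDK's `flow` decorator and an illustrative `QA Agent/Agent` path; each run of the decorated function then creates a Log on Humanloop capturing the question and the final answer.

```python
from humanloop import Humanloop

humanloop = Humanloop(api_key="<YOUR_HUMANLOOP_API_KEY>")


# Every call to the decorated function creates a Log against the
# "QA Agent/Agent" Flow, recording its inputs and output.
@humanloop.flow(path="QA Agent/Agent")
def ask_question(question: str) -> str:
    ...  # body unchanged from the previous step
```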
Evaluate the agent
How Evaluators work
Evaluators are callables that take the Log’s dictionary representation as input and return a judgment. The Evaluator’s judgment should respect the `return_type` present in the Evaluator’s specification.
The Evaluator can take an additional `target` argument to compare the Log against. The target is provided in an Evaluation context by the validation Dataset.
For more details, check out our Evaluator explanation.
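For instance, a minimal code Evaluator might look like the sketch below; the exact keys on the Log dictionary and the shape of the target depend on the File and Dataset being evaluated.

```python
def is_concise(log: dict) -> bool:
    """Target-free Evaluator: judge the Log's output on its own."""
    output = log.get("output") or ""
    return len(output.split()) <= 100


def contains_expected_answer(log: dict, target: dict) -> bool:
    """Evaluator that compares the Log against the Dataset's target."""
    expected = target.get("answer", "")
    return expected.lower() in (log.get("output") or "").lower()
```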
Let’s check if the agent respects the requirement of providing easy-to-understand answers.
We will create an Evaluation to benchmark the performance of the agent. An Evaluation requires a Dataset and at least one Evaluator.
Create LLM judge
We will use an LLM judge to automatically evaluate the agent’s responses.
We will define the Evaluator in code, but you can also manage Evaluators in the UI.
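The sketch below creates the judge through the Evaluators API. The `spec` fields and the `{{log.output}}` template variable are assumptions about the API's shape; if they do not match your SDK version, create an equivalent Evaluator in the UI instead.

```python
humanloop.evaluators.upsert(
    path="QA Agent/Comprehension",
    spec={
        "arguments_type": "target_free",
        "return_type": "number",
        "evaluator_type": "llm",
        "prompt": {
            "model": "gpt-4o",
            "endpoint": "complete",
            "temperature": 0,
            "template": (
                "You must decide whether an explanation is simple enough for a "
                "child to understand. Shorter answers with less jargon are better.\n\n"
                "<answer>\n{{log.output}}\n</answer>\n\n"
                "Output a single score between 1 and 10, where 10 is the easiest "
                "to understand, and nothing else."
            ),
        },
    },
)
```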
Replace your `main` function with the following:
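This is a sketch of the evaluation run, assuming the SDK exposes an `evaluations.run` utility and using the client initialized earlier; the Dataset path and questions are illustrative.

```python
def main():
    humanloop.evaluations.run(
        name="Comprehension check",
        file={
            "path": "QA Agent/Agent",
            "callable": ask_question,
        },
        evaluators=[{"path": "QA Agent/Comprehension"}],
        dataset={
            "path": "QA Agent/Questions",
            "datapoints": [
                {"inputs": {"question": "Why is the sky blue?"}},
                {"inputs": {"question": "Where does the sun go at night?"}},
                {"inputs": {"question": "Why do leaves change color in autumn?"}},
            ],
        },
    )


if __name__ == "__main__":
    main()
```

Running the script prints a link to the Evaluation in the terminal.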
Iterate and evaluate again
The score of the initial setup is quite low. Click the Evaluation link from the terminal and switch to the Logs view. You will see that the model tends to provide elaborate answers.

Let’s modify the LLM prompt inside `call_model`:
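For example, extend the system message so the model targets a young audience and keeps answers short (the wording below is just a starting point):

```python
def call_model(messages: list):
    """Single LLM step, now instructed to answer for a young audience."""
    system_message = {
        "role": "system",
        "content": (
            "You answer questions using Wikipedia as your source of facts. "
            "Explain the answer in a way a five-year-old would understand: "
            "use two or three short sentences and avoid jargon."
        ),
    }
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[system_message] + messages,
        tools=TOOLS,
    )
    return completion.choices[0].message
```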
Run the agent again and let the Evaluation finish.
Click the Evaluation link again. The agent’s performance has improved significantly.

Add detailed logging
Up to this point, we have treated the agent as a black box, reasoning about its behavior by looking at the inputs and outputs.
Let’s use Humanloop logging to observe the step-by-step actions taken by the agent.
Modify `main.py`:
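A sketch using the SDK's `prompt` and `tool` decorators alongside the `flow` decorator added earlier; the File paths are illustrative and the function bodies stay the same.

```python
# Each decorated function creates its own Log, nested under the Flow Log of
# the agent run that invoked it.

@humanloop.tool(path="QA Agent/Search Wikipedia")
def search_wikipedia(query: str) -> str:
    ...  # body unchanged


@humanloop.prompt(path="QA Agent/Call Model")
def call_model(messages: list):
    ...  # body unchanged


@humanloop.flow(path="QA Agent/Agent")
def ask_question(question: str) -> str:
    ...  # body unchanged
```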
Evaluate the agent again. When it’s done, head to your workspace and click the Agent Flow on the left. Select the Logs tab from the top of the page.

The decorators divide the code into logical components, allowing you to observe the steps taken to answer a question. Every step taken by the agent creates a Log.
Next steps
We’ve built a complex agentic workflow and learned how to use Humanloop to add logging to it and evaluate its performance.
Take a look at these resources to learn more about evals on Humanloop:
- Learn how to create a custom dataset for your project.
- Learn more about using LLM Evaluators on Humanloop.