Evaluate an agent

Evaluate and improve your agent with an LLM judge.

This tutorial demonstrates how to evaluate and iterate on a more complex AI feature.

Prerequisites
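
To follow along you need a Humanloop API key, an OpenAI API key, and a Python environment with the packages imported in the code below. A minimal install, assuming the standard PyPI package names for those imports:

Terminal
$pip install humanloop openai wikipedia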

Create the agent

We will build an agent that answers questions asked by children. The agent queries Wikipedia and replies with an easy-to-understand explanation.

Let’s create the initial version of our agent. Add the following to a file called agent.py:

agent.py
import json
import os

from humanloop import Humanloop
from openai import OpenAI

import wikipedia


openai = OpenAI(api_key="YOUR_OPENAI_KEY")
humanloop = Humanloop(api_key="YOUR_HUMANLOOP_KEY")


def search_wikipedia(query: str) -> dict:
    """Search the internet to get data sources for a query."""
    try:
        # Let Wikipedia suggest a relevant page
        page = wikipedia.page(query)
        return {
            "title": page.title,
            "content": page.content,
            "url": page.url,
        }
    except Exception:
        return {
            "title": "",
            "content": "No results found",
            "url": "",
        }


TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "search_wikipedia",
        "description": "Search the internet to get data sources for a query.",
        "parameters": {
            "type": "object",
            "required": ["query"],
            "properties": {
                "query": {"type": "string"},
            },
            "additionalProperties": False,
        },
    },
}


def agent(messages: list[dict], tool_call: bool) -> dict:
    """Call the model, optionally exposing the Wikipedia tool."""
    args = {
        "messages": messages,
        "model": "gpt-4o",
        "temperature": 0,
    }
    if tool_call:
        args["tools"] = [TOOL_SCHEMA]
    response = openai.chat.completions.create(**args)
    return {
        "content": response.choices[0].message.content,
        "tool_calls": [
            tool_call.to_dict()
            for tool_call in response.choices[0].message.tool_calls or []
        ],
    }


def workflow(question: str) -> str:
    """Find a relevant Wikipedia source, then answer the question based on it."""
    source = None
    messages = [
        {
            "role": "system",
            "content": (
                "You must find a good source to answer a "
                "question using the provided Wikipedia tool."
            ),
        },
        {
            "role": "user",
            "content": question,
        },
    ]
    # Maximum 3 attempts to find a relevant source
    for _ in range(3):
        response = agent(messages, tool_call=True)

        if "FINISHED" in (response["content"] or "") or not response["tool_calls"]:
            # Model chose a source to answer the question
            break

        tool_call = response["tool_calls"][0]
        arguments = json.loads(tool_call["function"]["arguments"])
        source = search_wikipedia(**arguments)
        messages.extend(
            [
                {
                    "role": "assistant",
                    "content": f"Found a source called {source['title']}",
                },
                {
                    "role": "user",
                    "content": (
                        "Is this a relevant source? Output 'FINISHED' "
                        "or rephrase the question to essential "
                        "terms and call the tool again to get a "
                        "new information source."
                    ),
                },
            ]
        )

    source = source or {
        "title": "",
        "content": "No relevant source found",
        "url": "",
    }
    answer = agent(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a summarizer. Answer "
                    "the question based on this source: "
                    f"{source['content']}"
                ),
            },
            {
                "role": "user",
                "content": question,
            },
        ],
        tool_call=False,
    )["content"]
    return f"{answer}\n\nSource: {source['url']}"
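
Before setting up the Evaluation, it can help to sanity-check the agent end to end. The block below is an optional, temporary check (the question is just an example); the evaluation steps in the next section replace it:

agent.py
if __name__ == "__main__":
    # Temporary sanity check: ask one example question and print the answer with its source.
    print(workflow("Why is the sky blue?"))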

Evaluate the agent

1. Create LLM judge

We rely on an LLM judge to automatically rate how easy the agent's answers are to understand.

Add the following to the main block of agent.py:

agent.py
if __name__ == "__main__":
    humanloop.evaluators.upsert(
        path="QA Agent/Comprehension",
        spec={
            "arguments_type": "target_free",
            "return_type": "number",
            "evaluator_type": "llm",
            "prompt": {
                "model": "gpt-4o",
                "endpoint": "complete",
                "template": (
                    "You must decide if an explanation is simple "
                    "enough to be understood by a 5-year old. "
                    "A better explanation is shorter and uses less jargon. "
                    "Rate the answer from 1 to 10, where 10 is the best.\n"
                    "\n<Question>\n{{log.inputs.question}}\n</Question>\n\n"
                    "\n<Answer>\n{{log.output}}</Answer>\n\n"
                    "First provide your rationale, then on a newline, output your judgment."
                ),
                "provider": "openai",
                "temperature": 0,
            },
        },
    )

2. Add dataset

Create a file called dataset.jsonl and add the following:

1{"inputs": {"question": "Why is the sky blue?"}}
2{"inputs": {"question": "Where does the sun go at night?"}}
3{"inputs": {"question": "Why do birds fly?"}}
4{"inputs": {"question": "What makes rainbows?"}}
5{"inputs": {"question": "Why do we have to sleep?"}}
6{"inputs": {"question": "How do fish breathe underwater?"}}
7{"inputs": {"question": "Why do plants need water?"}}
8{"inputs": {"question": "How does the moon stay in the sky?"}}
9{"inputs": {"question": "What are stars made of?"}}
10{"inputs": {"question": "Why do we have seasons?"}}
11{"inputs": {"question": "How does the TV work?"}}
12{"inputs": {"question": "Why do dogs wag their tails?"}}
13{"inputs": {"question": "What makes cars go?"}}
14{"inputs": {"question": "Why do we need to brush our teeth?"}}
15{"inputs": {"question": "What do ants eat?"}}
16{"inputs": {"question": "Why does the wind blow?"}}
17{"inputs": {"question": "How do airplanes stay in the air?"}}
18{"inputs": {"question": "Why does the ocean look so big?"}}
19{"inputs": {"question": "What makes the grass green?"}}
20{"inputs": {"question": "Why do we have to eat vegetables?"}}
21{"inputs": {"question": "How do butterflies fly?"}}
22{"inputs": {"question": "Why do some animals live in the zoo?"}}
23{"inputs": {"question": "How do magnets stick to the fridge?"}}
24{"inputs": {"question": "What makes fire hot?"}}
25{"inputs": {"question": "Why do leaves change color?"}}
26{"inputs": {"question": "What happens when we flush the toilet?"}}
27{"inputs": {"question": "Why do we have belly buttons?"}}
28{"inputs": {"question": "What makes the clouds move?"}}
29{"inputs": {"question": "Why do we have eyebrows?"}}
30{"inputs": {"question": "How do seeds turn into plants?"}}
31{"inputs": {"question": "Why does the moon change shape?"}}
32{"inputs": {"question": "Why do bees make honey?"}}
33{"inputs": {"question": "What makes ice melt?"}}
34{"inputs": {"question": "Why do we sneeze?"}}
35{"inputs": {"question": "How do trains stay on the tracks?"}}
36{"inputs": {"question": "Why do stars twinkle?"}}
37{"inputs": {"question": "Why can't we see air?"}}
38{"inputs": {"question": "What makes the Earth spin?"}}
39{"inputs": {"question": "Why do frogs jump?"}}
40{"inputs": {"question": "Why do cats purr?"}}
41{"inputs": {"question": "How do phones let us talk to people far away?"}}
42{"inputs": {"question": "Why does the moon follow us?"}}
43{"inputs": {"question": "What makes lightning?"}}
44{"inputs": {"question": "Why does it snow?"}}
45{"inputs": {"question": "Why do we have shadows?"}}
46{"inputs": {"question": "Why do boats float?"}}
47{"inputs": {"question": "What makes our heart beat?"}}
48{"inputs": {"question": "Why do some animals sleep all winter?"}}
49{"inputs": {"question": "Why do we have to wear shoes?"}}
50{"inputs": {"question": "What makes music?"}}

3. Add the Evaluation

Add this to the main block of agent.py, below the evaluator code:

agent.py
if __name__ == "__main__":
    # ...

    # Read the evaluation dataset
    with open("dataset.jsonl", "r") as fp:
        dataset = [json.loads(line) for line in fp]

    humanloop.evaluations.run(
        name="QA Agent Answer Comprehensiveness",
        file={
            "path": "QA Agent/Workflow",
            "callable": workflow,
        },
        evaluators=[
            {
                "path": "QA Agent/Comprehension",
            }
        ],
        dataset={
            "path": "QA Agent/Children Comprehension",
            "datapoints": dataset,
        },
        workers=8,
    )

4. Run the agent

Run agent.py and let the Evaluation finish:

Terminal
$python agent.py
>
>Navigate to your Evaluation:
>https://app.humanloop.com/project/fl_9CCIoTzySPfUFeIxfYE6g/evaluations/evr_67tEc2DiR83fy9iTaqyPA/stats
>
>Flow Version ID: flv_9ECTrfeZYno2OIj9KAqlz
>Run ID: rn_67tEcDYV6mqUS86hD8vrP
>
>Running 'Workflow' over the Dataset 'Children Comprehension' using 8 workers
>[##############--------------------------] 15/50 (30.00%) | ETA: 14
>
>...
>
>📊 Evaluation Results for QA Agent/Workflow
>+------------------------+---------------------+
>| | Latest |
>+------------------------+---------------------+
>| Run ID | 67tEc |
>+------------------------+---------------------+
>| Version ID | None |
>+------------------------+---------------------+
>| Added | 2024-11-19 21:49:02 |
>+------------------------+---------------------+
>| Evaluators | |
>+------------------------+---------------------+
>| QA Agent/Comprehension | 3.24 |
>+------------------------+---------------------+

Iterate and evaluate again

The score of the initial setup is quite low. Click the Evaluation link printed in the terminal and switch to the Logs view. You will see that the model tends to provide elaborate answers that are hard for a child to follow.

Let’s modify the summarization step inside workflow:

agent.py
def workflow(question: str) -> str:
    # ...

    answer = agent(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a summarizer. Answer questions "
                    "so a 5-year old can understand. Answer "
                    "the question based on this source: "
                    f"{source['content']}"
                ),
            },
            {
                "role": "user",
                "content": question,
            },
        ],
        tool_call=False,
    )["content"]
    return f"{answer}\n\nSource: {source['url']}"

Run the agent again and let the Evaluation finish:

Terminal
$python agent.py
>
>Flow Version ID: flv_9ECTrfeZYno2OIj9KAqlz
>Run ID: rn_WnIwPSI7JFKEtwTS0l3mj
>
>Navigate to your Evaluation:
>https://app.humanloop.com/project/fl_9CCIoTzySPfUFeIxfYE6g/evaluations/rn_WnIwPSI7JFKEtwTS0l3mj/stats
>
>Running 'Workflow' over the Dataset 'Children Comprehension' using 8 workers
>[######################------------------] 34/50 (68.00%) | ETA: 14
>
>...
>
>+------------------------+---------------------+---------------------+
>| | Control | Latest |
>+------------------------+---------------------+---------------------+
>| Run ID | 67tEc | WnIwP |
>+------------------------+---------------------+---------------------+
>| Version ID | None | None |
>+------------------------+---------------------+---------------------+
>| Added | 2024-11-19 22:05:17 | 2024-11-19 22:24:13 |
>+------------------------+---------------------+---------------------+
>| Evaluators | | |
>+------------------------+---------------------+---------------------+
>| QA Agent/Comprehension | 3.24 | 8.04 |
>+------------------------+---------------------+---------------------+
>
>Change of [4.80] for Evaluator QA Agent/Comprehension

Click the Evaluation link again. The agent’s performance has improved significantly.

Add detailed logging

If you use a programming language not supported by the SDK, or want more control, see our guide on logging through the API for an alternative to decorators.

Let’s use Humanloop to get more insight into our agent’s behavior. Modify agent.py:

agent.py
@humanloop.tool(path="QA Agent/Search Wikipedia")
def search_wikipedia(query: str) -> dict:
    ...


@humanloop.prompt(path="QA Agent/Agent")
def agent(messages: list[dict], tool_call: bool) -> dict:
    ...


@humanloop.flow(path="QA Agent/Workflow")
def workflow(question: str) -> str:
    ...

Evaluate the agent again. When it's done, click the Workflow Flow in the sidebar and select the Logs tab at the top of the page.

The trace captures the agent searching for the best article to answer a question.

The decorators version the agent's components and log individual steps. Each Flow Log contains a trace of the Logs involved in answering a question.

Next steps

We’ve built a complex agentic workflow and learned how to use Humanloop to add logging to it and evaluate its performance.