Use external Evaluators

Integrate your existing evaluation process with Humanloop.

LLM and code Evaluators generally run in the Humanloop runtime environment. The advantage of this is that they can be used as monitoring Evaluators and that evaluations can be triggered directly from the Humanloop UI.

Your setup, however, can be more complex: your Evaluator might have library dependencies that are not available in the runtime environment, your LLM Evaluator might involve multiple reasoning steps, or you might prefer to manage the logic yourself.

External Evaluators address this: they are registered with Humanloop, but their code definition remains in your environment. To evaluate a Log, you run the logic yourself and send the resulting judgment to Humanloop.
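
For illustration, an external Evaluator can be as simple as a plain Python function that receives a Log (a dictionary with the evaluated File's inputs and output) and returns a judgment. The sketch below previews the Evaluator we will register later in this tutorial, which judges answers by their length:

def easy_to_understand(log: dict) -> bool:
    # Judge the Log by its output: short answers are easier to understand.
    return len(log["output"]) < 100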

In this tutorial, we will build a chat agent that answers questions asked by children, and evaluate its performance using an external Evaluator.

Create the agent

We reuse the chat agent from our evaluating an agent tutorial.

Let’s create the initial version of our agent. Add the following to an agent.py file:

agent.py
from humanloop import Humanloop
from openai import OpenAI
from openai.types.chat.chat_completion_message import ChatCompletionMessage as Message
import wikipedia
import json


openai = OpenAI(api_key="ADD YOUR KEY HERE")
humanloop = Humanloop(api_key="ADD YOUR KEY HERE")


def search_wikipedia(query: str) -> dict:
    """Search Wikipedia to get up-to-date information for a query."""
    try:
        page = wikipedia.page(query)
        return {
            "title": page.title,
            "content": page.content,
            "url": page.url,
        }
    except Exception as _:
        return {
            "title": "",
            "content": "No results found",
            "url": "",
        }


def call_model(messages: list[Message]) -> Message:
    """Calls the model with the given messages."""
    system_message = {
        "role": "system",
        "content": (
            "You are an assistant that helps to answer user questions. You should "
            "leverage wikipedia to answer questions so that the information is up to date. "
            "If the response from Wikipedia does not seem relevant, rephrase the question "
            "and call the tool again. Then finally respond to the user."
        ),
    }
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[system_message] + messages,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "search_wikipedia",
                    "description": "Search the internet to get up to date answers for a query.",
                    "parameters": {
                        "type": "object",
                        "required": ["query"],
                        "properties": {
                            "query": {"type": "string"},
                        },
                        "additionalProperties": False,
                    },
                },
            }
        ],
    )
    return response.choices[0].message.to_dict(exclude_unset=False)


def call_agent(question: str) -> str:
    """Calls the main agent loop and returns the final result."""
    messages = [{"role": "user", "content": question}]
    # Retry for a relevant response 3 times at most
    for _ in range(3):
        response = call_model(messages)
        messages.append(response)
        if response["tool_calls"]:
            # Call wikipedia to get up-to-date information
            for tool_call in response["tool_calls"]:
                source = search_wikipedia(
                    **json.loads(tool_call["function"]["arguments"])
                )
                messages.append(
                    {
                        "role": "tool",
                        "content": json.dumps(source),
                        "tool_call_id": tool_call["id"],
                    }
                )
        else:
            # Respond to the user
            return response["content"]


if __name__ == "__main__":
    result = call_agent("Where does the sun go at night?")
    print(result)

Run the agent to check if it works:

$python agent.py
>
>Okay! Imagine the Earth is like a big ball, and we live on it.
>The sun doesn't really “go” anywhere—it stays in the same spot,
>shining all the time. But our Earth is spinning like a top!

Evaluate the agent

1. Define external Evaluator

The Evaluator takes a log argument, which represents the Log created by calling call_agent.

Let’s add a simple Evaluator that checks if the agent’s answers are too long. Add this to the agent.py file:

if __name__ == "__main__":
    def easy_to_understand(log):
        return len(log["output"]) < 100

2. Add dataset

Create a file called dataset.jsonl and add the following:

1{"inputs": {"question": "Why is the sky blue?"}}
2{"inputs": {"question": "Where does the sun go at night?"}}
3{"inputs": {"question": "Why do birds fly?"}}
4{"inputs": {"question": "What makes rainbows?"}}
5{"inputs": {"question": "Why do we have to sleep?"}}
6{"inputs": {"question": "How do fish breathe underwater?"}}
7{"inputs": {"question": "Why do plants need water?"}}
8{"inputs": {"question": "How does the moon stay in the sky?"}}
9{"inputs": {"question": "What are stars made of?"}}
10{"inputs": {"question": "Why do we have seasons?"}}
11{"inputs": {"question": "How does the TV work?"}}
12{"inputs": {"question": "Why do dogs wag their tails?"}}
13{"inputs": {"question": "What makes cars go?"}}
14{"inputs": {"question": "Why do we need to brush our teeth?"}}
15{"inputs": {"question": "What do ants eat?"}}
16{"inputs": {"question": "Why does the wind blow?"}}
17{"inputs": {"question": "How do airplanes stay in the air?"}}
18{"inputs": {"question": "Why does the ocean look so big?"}}
19{"inputs": {"question": "What makes the grass green?"}}
20{"inputs": {"question": "Why do we have to eat vegetables?"}}
21{"inputs": {"question": "How do butterflies fly?"}}
22{"inputs": {"question": "Why do some animals live in the zoo?"}}
23{"inputs": {"question": "How do magnets stick to the fridge?"}}
24{"inputs": {"question": "What makes fire hot?"}}
25{"inputs": {"question": "Why do leaves change color?"}}
26{"inputs": {"question": "What happens when we flush the toilet?"}}
27{"inputs": {"question": "Why do we have belly buttons?"}}
28{"inputs": {"question": "What makes the clouds move?"}}
29{"inputs": {"question": "Why do we have eyebrows?"}}
30{"inputs": {"question": "How do seeds turn into plants?"}}
31{"inputs": {"question": "Why does the moon change shape?"}}
32{"inputs": {"question": "Why do bees make honey?"}}
33{"inputs": {"question": "What makes ice melt?"}}
34{"inputs": {"question": "Why do we sneeze?"}}
35{"inputs": {"question": "How do trains stay on the tracks?"}}
36{"inputs": {"question": "Why do stars twinkle?"}}
37{"inputs": {"question": "Why can't we see air?"}}
38{"inputs": {"question": "What makes the Earth spin?"}}
39{"inputs": {"question": "Why do frogs jump?"}}
40{"inputs": {"question": "Why do cats purr?"}}
41{"inputs": {"question": "How do phones let us talk to people far away?"}}
42{"inputs": {"question": "Why does the moon follow us?"}}
43{"inputs": {"question": "What makes lightning?"}}
44{"inputs": {"question": "Why does it snow?"}}
45{"inputs": {"question": "Why do we have shadows?"}}
46{"inputs": {"question": "Why do boats float?"}}
47{"inputs": {"question": "What makes our heart beat?"}}
48{"inputs": {"question": "Why do some animals sleep all winter?"}}
49{"inputs": {"question": "Why do we have to wear shoes?"}}
50{"inputs": {"question": "What makes music?"}}

3. Add Evaluation

Instantiate an Evaluation using the client’s evaluations.run utility. easy_to_understand is an external Evaluator, so we provide its definition via the callable argument. At runtime, evaluations.run will call the function and submit the judgment to Humanloop.

agent.py
if __name__ == "__main__":
    def easy_to_understand(log):
        return len(log["output"]) < 100

    # Read the evaluation dataset
    with open("dataset.jsonl", "r") as fp:
        dataset = [json.loads(line) for line in fp]

    humanloop.evaluations.run(
        name="QA Agent Answer Comprehensiveness",
        file={
            "path": "QA Agent/Agent",
            "callable": call_agent,
        },
        evaluators=[
            {
                "path": "QA Agent/Comprehension",
                "callable": easy_to_understand,
                "args_type": "target_free",
                "return_type": "boolean",
            }
        ],
        dataset={
            "path": "QA Agent/Children Questions",
            "datapoints": dataset,
        },
        workers=8,
    )

4. Run the evaluation

Terminal
$python agent.py
>
>Navigate to your Evaluation:
>https://app.humanloop.com/project/fl_9CCIoTzySPfUFeIxfYE6g/evaluations/evr_67tEc2DiR83fy9iTaqyPA/stats
>
>Flow Version ID: flv_9ECTrfeZYno2OIj9KAqlz
>Run ID: rn_67tEcDYV6mqUS86hD8vrP
>
>Running 'Agent' over the Dataset 'Children Questions' using 8 workers
>[##############--------------------------] 15/50 (30.00%) | ETA: 14
>
>...
>
>📊 Evaluation Results for QA Agent/Agent
>+------------------------+---------------------+
>| | Latest |
>+------------------------+---------------------+
>| Run ID | 67tEc |
>+------------------------+---------------------+
>| Version ID | 9ECTr |
>+------------------------+---------------------+
>| Added | 2024-11-19 21:49:02 |
>+------------------------+---------------------+
>| Evaluators | |
>+------------------------+---------------------+
>| QA Agent/Comprehension | 3.24 |
>+------------------------+---------------------+

Click on the link to see the results when the Evaluation is complete.

Add detailed logging

If you use a programming language not supported by the SDK, or want more control, see our guide on logging through the API for an alternative to decorators.
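
As a rough sketch of the idea (the endpoint, auth header, and payload field names below are assumptions for illustration only; see the API reference for the exact request shapes), logging through the API means sending authenticated HTTP requests instead of decorating your functions:

import os

import requests

# Hypothetical sketch: create a Log over HTTP instead of using the SDK
# decorators. The endpoint, auth header, and field names are assumptions;
# consult the Humanloop API reference for the exact request shape.
response = requests.post(
    "https://api.humanloop.com/v5/flows/log",  # assumed endpoint
    headers={"X-API-KEY": os.environ["HUMANLOOP_API_KEY"]},  # assumed auth header
    json={
        "path": "QA Agent/Agent",
        "inputs": {"question": "Where does the sun go at night?"},
        "output": "The Earth spins, so at night the sun shines on the other side.",
    },
)
print(response.status_code)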

Up to this point, we have treated the agent as a black box, reasoning about its behavior by looking at the inputs and outputs.

Let’s use Humanloop logging to observe the step-by-step actions taken by the agent.

Modify agent.py:

agent.py
@humanloop.tool(path="QA Agent/Search Wikipedia")
def search_wikipedia(query: str) -> dict:
    ...


@humanloop.prompt(path="QA Agent/Prompt")
def call_model(messages: list[Message]) -> Message:
    ...


@humanloop.flow(path="QA Agent/Agent")
def call_agent(question: str) -> str:
    ...

Evaluate the agent again. When it’s done, head to your workspace and click the Agent Flow on the left. Select the Logs tab from the top of the page.

The trace captures the agent searching for the best article to answer a question.

The decorators divide the code into logical components, allowing you to observe the steps taken to answer a question. Every step taken by the agent creates a Log.
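
Note that the decorated functions are called exactly as before. For example, the __main__ block from earlier still works unchanged; each call to call_agent now records a trace on Humanloop, with a Log for every Prompt and Tool step taken along the way:

if __name__ == "__main__":
    result = call_agent("Where does the sun go at night?")
    print(result)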

Next steps

You’ve learned how to integrate your existing evaluation process with Humanloop.

Learn more about Humanloop’s features in these guides: