Evaluate an agent
Evaluate and improve the performance of an LLM agent.
Working with LLMs is daunting: you are dealing with a black box that outputs unpredictable results.
Humanloop provides tools to make your development process systematic, bringing it closer to traditional software testing and quality assurance.
In this tutorial, we will use Humanloop to evaluate the quality of a chat agent's answers and demonstrate how to use the results to improve the agent's performance.
Prerequisites
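You will need a Humanloop account and API key, an OpenAI API key, and a Python 3 environment. The sketches in this tutorial also assume the humanloop, openai, and wikipedia Python packages are installed.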
Create the agent
We will build an agent that answers questions asked by children. The agent queries Wikipedia and replies with an easy-to-understand explanation.
Let's create the initial version of our agent. Add the following to an `agent.py` file.
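One minimal sketch, assuming the `openai` and `wikipedia` Python packages are installed and an `OPENAI_API_KEY` environment variable is set (the model name and prompt wording are illustrative):

```python
# agent.py
import wikipedia
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def search_wikipedia(query: str) -> str:
    """Fetch a short Wikipedia extract to ground the agent's answer."""
    try:
        title = wikipedia.search(query)[0]
        return wikipedia.summary(title, auto_suggest=False)
    except Exception:
        return "No Wikipedia article found."


def call_model(question: str, context: str) -> str:
    """Ask the LLM to answer the question using the Wikipedia context."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "You answer questions asked by children. "
                    "Use the provided Wikipedia context to explain the answer."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content


def agent(question: str) -> str:
    """Answer a question by combining Wikipedia search with the LLM."""
    context = search_wikipedia(question)
    return call_model(question, context)


def main():
    question = input("Ask a question: ")
    print(agent(question))


if __name__ == "__main__":
    main()
```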
Run the agent with `python agent.py` to check that it works.
Evaluate the agent
Let’s check if the agent respects the requirement of providing easy-to-understand answers.
We will create an Evaluation to benchmark the performance of the agent. An Evaluation requires a Dataset and at least one Evaluator.
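For the Dataset, a handful of questions a child might ask is enough. An illustrative example you can add to `agent.py` (the datapoint shape with an `inputs` dictionary follows Humanloop's convention, but the questions themselves are made up):

```python
# A small illustrative Dataset: each datapoint holds the inputs the agent expects.
dataset = [
    {"inputs": {"question": "Why is the sky blue?"}},
    {"inputs": {"question": "Where does rain come from?"}},
    {"inputs": {"question": "Why do we have seasons?"}},
    {"inputs": {"question": "How do planes stay in the air?"}},
]
```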
Create LLM judge
We will use an LLM judge to automatically evaluate the agent’s responses.
We will define the Evaluator in code, but you can also manage Evaluators in the UI.
Add this to `agent.py`, updating `main` to run the Evaluation.
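A sketch of what this could look like, reusing `openai_client`, `agent`, and `dataset` from the earlier snippets. The `humanloop.evaluations.run(...)` call, its parameter names, and the shape of the Log passed to the judge are assumptions about the Humanloop SDK, so check the SDK reference for your version:

```python
from humanloop import Humanloop

humanloop = Humanloop(api_key="<YOUR HUMANLOOP API KEY>")


def easy_to_understand(log):
    """LLM judge: rates how child-friendly an answer is, from 1 to 10."""
    answer = log["output"]  # assumption: the judged Log exposes the agent's answer as "output"
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[
            {
                "role": "system",
                "content": (
                    "You assess whether an answer is easy for a child to understand. "
                    "Reply with a single integer from 1 (too complicated) to 10 "
                    "(very easy to understand)."
                ),
            },
            {"role": "user", "content": answer},
        ],
    )
    raw = response.choices[0].message.content.strip()
    return int(raw) if raw.isdigit() else 1  # fall back to the lowest score on unexpected output


def main():
    # Replaces the interactive main() from the earlier sketch.
    # Argument names and the evaluator configuration keys are assumptions;
    # see the Humanloop SDK reference for the exact signature.
    humanloop.evaluations.run(
        name="Child-friendliness check",
        file={"path": "QA Agent/Agent Flow", "callable": agent},
        dataset={"path": "QA Agent/Questions", "datapoints": dataset},
        evaluators=[
            {
                "path": "QA Agent/Easy to understand",
                "callable": easy_to_understand,
                "args_type": "target_free",
                "return_type": "number",
            }
        ],
    )
```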
Iterate and evaluate again
The score of the initial setup is quite low. Click the Evaluation link printed in the terminal and switch to the Logs view. You will see that the model tends to provide elaborate answers.
Let's modify the LLM prompt inside `call_model`.
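For example, the system message can be tightened so the model is pushed toward short, simple answers; the wording below is only one possibility, and the rest of the function is unchanged apart from the prompt:

```python
def call_model(question: str, context: str) -> str:
    """Answer with a short, simple explanation a young child can follow."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You answer questions asked by children aged 5 to 10. "
                    "Use the provided Wikipedia context, but reply in at most three "
                    "short sentences, using simple words and a friendly example. "
                    "Avoid jargon and technical terms."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
```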
Run the agent again and let the Evaluation finish.
Click the Evaluation link again. The agent’s performance has improved significantly.
Add detailed logging
Up to this point, we have treated the agent as a black box, reasoning about its behavior by looking at the inputs and outputs.
Let’s use Humanloop logging to observe the step-by-step actions taken by the agent.
Modify `agent.py`.
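A sketch of how this could look, assuming the Humanloop SDK's `flow`, `prompt`, and `tool` decorators; the decorator arguments and file paths are illustrative, and the function bodies stay as before:

```python
# Wrap each step of the agent so Humanloop records a Log per step.
# Decorator names, arguments, and paths are assumptions; check the SDK reference.

@humanloop.tool(path="QA Agent/Search Wikipedia")
def search_wikipedia(query: str) -> str:
    ...  # body unchanged from the earlier sketch


@humanloop.prompt(path="QA Agent/Call Model")
def call_model(question: str, context: str) -> str:
    ...  # body unchanged from the earlier sketch


@humanloop.flow(path="QA Agent/Agent Flow")
def agent(question: str) -> str:
    context = search_wikipedia(question)
    return call_model(question, context)
```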
Evaluate the agent again. When it’s done, head to your workspace and click the Agent Flow on the left. Select the Logs tab from the top of the page.
The decorators divide the code into logical components, allowing you to observe the steps taken to answer a question. Every step taken by the agent creates a Log.
Next steps
We've built an agentic workflow and learned how to use Humanloop to add logging to it and evaluate its performance.
Take a look at these resources to learn more about evals on Humanloop:
- Take a look at our guide on creating a custom dataset for your project.
- Learn more about using LLM Evaluators on Humanloop.