Evaluate an agent
Evaluate and improve your agent with an LLM judge.
This tutorial demonstrates how to evaluate and iterate on a more complex AI feature: an agentic workflow that combines a tool call with LLM steps.
Prerequisites
Create the agent
We will build an agent that answers questions asked by children. The agent queries Wikipedia and replies with an easy-to-understand explanation.
Let’s create the initial version of our agent. Add the following to an agent.py file.
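A minimal sketch of what agent.py could look like, assuming the openai and wikipedia Python packages are installed and an OPENAI_API_KEY environment variable is set. The function names search_wikipedia, workflow, and main are the ones referenced later in this tutorial; the model name and prompt wording are illustrative.

```python
import os

import wikipedia
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def search_wikipedia(query: str) -> str:
    """Return the content of the Wikipedia page that best matches the query."""
    try:
        page_title = wikipedia.search(query)[0]
        return wikipedia.page(page_title, auto_suggest=False).content
    except (IndexError, wikipedia.exceptions.WikipediaException):
        return "No results found."


def workflow(question: str) -> str:
    """Answer a child's question using a Wikipedia article."""
    article = search_wikipedia(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the child's question using the Wikipedia article below.\n\n"
                    f"Article:\n{article[:10000]}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


def main():
    print(workflow("Why is the sky blue?"))


if __name__ == "__main__":
    main()
```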
Evaluate the agent
Create an LLM judge
We rely on an LLM judge to automatically rate how comprehensible the agent’s answers are.
Add the judge to agent.py.
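One way to express the judge is as a plain Python function that asks an LLM to score comprehensibility from 1 to 10. This is a sketch: the rubric wording is illustrative, and it assumes the evaluator is called with the generated Log as a dict that has an output field (check the Evaluators reference for the exact callable signature).

```python
JUDGE_INSTRUCTIONS = (
    "You are assessing whether an answer is easy for a child to understand. "
    "Rate the answer on a scale from 1 (incomprehensible) to 10 (very easy to understand). "
    "Reply with the number only."
)


def comprehension_judge(log: dict) -> int:
    """LLM judge: rate how easy the agent's answer is for a child to understand."""
    # Reuses the OpenAI `client` created earlier in agent.py.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": str(log["output"])},
        ],
    )
    # The judge is instructed to reply with a bare number, so parse it directly.
    return int(response.choices[0].message.content.strip())
```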
Then update main to run an Evaluation over a small dataset of questions.
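A sketch of the Evaluation run, assuming the Humanloop Python SDK’s humanloop.evaluations.run helper and a HUMANLOOP_API_KEY environment variable. The file and evaluator paths, the example questions, and the exact argument names are placeholders that may differ in your SDK version.

```python
from humanloop import Humanloop

# Reuses the `os` import, `workflow`, and `comprehension_judge` defined earlier in agent.py.
humanloop = Humanloop(api_key=os.environ["HUMANLOOP_API_KEY"])


def main():
    humanloop.evaluations.run(
        name="Comprehensibility check",
        file={"path": "QA Agent/Workflow", "callable": workflow},
        evaluators=[
            {
                "path": "QA Agent/Comprehension",
                "callable": comprehension_judge,
                "args_type": "target_free",
                "return_type": "number",
            }
        ],
        dataset={
            "path": "QA Agent/Questions",
            "datapoints": [
                {"inputs": {"question": "Why is the sky blue?"}},
                {"inputs": {"question": "Where does the sun go at night?"}},
                {"inputs": {"question": "Why do birds fly south in winter?"}},
            ],
        },
    )
```

Running python agent.py now executes the Evaluation and prints a link to the results in the terminal; this is the Evaluation link referred to below.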
Iterate and evaluate again
The score of the initial setup is quite low. Click the Evaluation link printed in the terminal and switch to the Logs view. You will see that the model tends to provide elaborate answers that are hard for a child to follow.
Let’s modify the summarization step inside workflow.
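For example, the system prompt inside workflow could be tightened to force short, simple answers. The wording below is only a suggestion:

```python
def workflow(question: str) -> str:
    """Answer a child's question in a few short, simple sentences."""
    article = search_wikipedia(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are answering a question asked by a child. "
                    "Use the Wikipedia article below, but reply in at most three short "
                    "sentences, using simple words and no jargon.\n\n"
                    f"Article:\n{article[:10000]}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```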
Run the agent again and let the Evaluation finish.
Click the Evaluation link again. The agent’s performance has improved significantly.
Add detailed logging
Let’s use Humanloop to get more insight into our agent’s behavior. Modify agent.py.
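A sketch of the instrumented version, assuming the Humanloop SDK’s tool, prompt, and flow decorators. The paths are illustrative, and the LLM call is factored into a separate summarize_article helper so the prompt step can be versioned on its own; check the SDK reference for the exact decorator arguments.

```python
# Reuses the `humanloop` client, the OpenAI `client`, and the wikipedia import
# created earlier in agent.py; the functions below replace the undecorated versions.


@humanloop.tool(path="QA Agent/Search Wikipedia")
def search_wikipedia(query: str) -> str:
    """Return the content of the Wikipedia page that best matches the query."""
    try:
        page_title = wikipedia.search(query)[0]
        return wikipedia.page(page_title, auto_suggest=False).content
    except (IndexError, wikipedia.exceptions.WikipediaException):
        return "No results found."


@humanloop.prompt(path="QA Agent/Summarize Article")
def summarize_article(question: str, article: str) -> str:
    """Summarize the article for a child in a few short, simple sentences."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are answering a question asked by a child. "
                    "Use the Wikipedia article below, but reply in at most three short "
                    "sentences, using simple words and no jargon.\n\n"
                    f"Article:\n{article[:10000]}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


@humanloop.flow(path="QA Agent/Workflow")
def workflow(question: str) -> str:
    """Answer a child's question: search Wikipedia, then summarize it simply."""
    article = search_wikipedia(question)
    return summarize_article(question, article)
```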
Evaluate the agent again. When it’s done, open the Workflow Flow from the sidebar on the left and select the Logs tab at the top of the page.
The decorators version the agent’s components and log the individual steps. Each Flow Log contains a trace of the Logs involved in answering a question.
Next steps
We’ve built a complex agentic workflow and learned how to use Humanloop to add logging to it and evaluate its performance.
- Take a look at our guide on creating a custom dataset for your project.
- Learn more about using LLM Evaluators on Humanloop.