Quickly evaluate your LLM apps and improve them, all managed in code.

This tutorial will take you through running your first Eval with Humanloop. You’ll learn how to trigger an Evaluation from code, interpret the resulting eval report on Humanloop, and use it to improve your AI features.

Prerequisites

  • A Humanloop account and an API key (create one at https://app.humanloop.com/account/api-keys).
  • The Humanloop Python SDK installed: pip install humanloop

Create an evals script

Create a file named eval.py and add the following code:

from humanloop import Humanloop

# Get API key at https://app.humanloop.com/account/api-keys
hl = Humanloop(api_key="<YOUR Humanloop API KEY>")

checks = hl.evaluations.run(
    name="Initial Test",
    file={
        "path": "Scifi/App",
        "callable": lambda messages: (
            "I'm sorry, Dave. I'm afraid I can't do that."
            if messages[-1]["content"].lower() == "hal"
            else "Beep boop!"
        ),  # replace the lambda with your AI model
    },
    dataset={
        "path": "Scifi/Tests",
        "datapoints": [
            {
                "messages": [
                    {"role": "system", "content": "You are an AI that responds like famous sci-fi AIs."},
                    {"role": "user", "content": "HAL"},
                ],
                "target": {"output": "I'm sorry, Dave. I'm afraid I can't do that."},
            },
            {
                "messages": [
                    {"role": "system", "content": "You are an AI that responds like famous sci-fi AIs."},
                    {"role": "user", "content": "R2D2"},
                ],
                "target": {"output": "Beep boop beep!"},
            },
        ],
    },  # replace with your test-dataset
    evaluators=[
        {"path": "Example Evaluators/Code/Exact match"},
        {"path": "Example Evaluators/Code/Latency"},
        {"path": "Example Evaluators/AI/Semantic similarity"},
    ],
)

This sets up the basic structure of an Evaluation:

  1. A callable function that you want to evaluate. It should take your inputs and/or messages and return a string. The “file” argument defines the callable as well as where the evaluation results will appear on Humanloop (a sketch of a model-backed callable follows this list).
  2. A test Dataset of inputs and/or messages to run your function over, plus optional expected targets to evaluate against.
  3. A set of Evaluators to provide judgements on the output of your function. This example uses default Evaluators that come with every Humanloop workspace. Evaluators can also be defined locally and pushed to the Humanloop runtime (a local Evaluator sketch appears a little further below).
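
As a concrete illustration of point 1, the lambda above is just a stand-in: any function that takes the datapoint’s messages (and/or inputs) and returns a string will do. Below is a minimal sketch of a model-backed callable using the OpenAI Python SDK; the openai package, the OPENAI_API_KEY environment variable and the gpt-4o-mini model name are assumptions for this example, not requirements of Humanloop.

import os

from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def scifi_app(messages: list[dict]) -> str:
    # Forward the datapoint's chat messages to a model and return its reply as a string.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; use whatever your app uses
        messages=messages,
    )
    return response.choices[0].message.content

# In eval.py, pass it via the "file" argument in place of the lambda:
# file={"path": "Scifi/App", "callable": scifi_app},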

The hl.evaluations.run call returns a checks object that contains the results of the eval per Evaluator.
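
Point 3 above mentions locally defined Evaluators. The sketch below shows the general shape of one: a plain Python function that judges a Log against its datapoint’s target, passed to evaluations.run instead of (or alongside) a hosted Evaluator path. The “callable”, “args_type” and “return_type” keys are assumptions about the SDK’s local-Evaluator interface, so verify them against the Humanloop Evaluators reference before relying on them.

def exact_match_local(log, datapoint):
    # Hypothetical local Evaluator: compare the generated output to the expected target.
    return log["output"] == datapoint["target"]["output"]

# Passed via the evaluators argument of hl.evaluations.run(...).
# The dict keys below are assumptions -- check the SDK reference for the exact interface.
local_evaluators = [
    {
        "path": "Scifi/Exact match (local)",
        "callable": exact_match_local,
        "args_type": "target_required",
        "return_type": "boolean",
    },
]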

Run your script

Run your script with the following command:

$ python eval.py

You will see a URL to view your evals on Humanloop. A summary of progress and the final results will be displayed directly in your terminal:

Eval progress and url in terminal
Eval results in terminal

View the results

Navigate to the URL provided in your terminal to see the results of running your script on Humanloop. This Stats view will show you the live progress of your local eval runs as well as summary statistics of the final results. Each new run will add a column to your Stats view, allowing you to compare the performance of your LLM app over time.

The Logs and Review tabs allow you to drill into individual datapoints and view the outputs of different runs side-by-side to understand how to improve your LLM app.

Eval results in the Humanloop UI

Make a change and re-run

Your first run resulted in a Semantic similarity score of 3 (out of 5) and an Exact match score of 0. Try making a change to your callable to improve the output, then re-run your script. A second run will be added to your Stats view and the difference in performance will be displayed.
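
For instance, the second datapoint expects “Beep boop beep!” while the callable returns “Beep boop!”, so Exact match fails on it. One possible tweak, written as a named function (hypothetical, but grounded in the test Dataset above):

def scifi_app_v2(messages):
    # Return the exact strings the test Dataset expects: the HAL quote, else R2D2's reply.
    if messages[-1]["content"].lower() == "hal":
        return "I'm sorry, Dave. I'm afraid I can't do that."
    return "Beep boop beep!"

Swap this in as the “callable” in eval.py and re-run; the Exact match score should improve on the new run.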

Re-run eval results in terminal
Re-run eval results in UI

Next steps

Now that you’ve run your first eval on Humanloop, you can:

  • Explore our detailed tutorial on evaluating a real RAG app where you’ll learn about versioning your app, customising logging, adding Evaluator thresholds and more.
  • Create your own Dataset of test cases to evaluate your LLM app against.
  • Create your own Evaluators to provide judgements on the output of your LLM app.