Evaluate external logs

Run an Evaluation on Humanloop with your own logs.

This guide demonstrates how to run an Evaluation on Humanloop using your own logs. This is useful if you have existing logs in an external system and want to evaluate them on Humanloop with minimal setup.

In this guide, we use the example of a JSON file containing chat messages between users and customer support agents, and walk through uploading these logs to Humanloop and creating an Evaluation with them.

Prerequisites

The code in this guide uses the Python SDK. To follow along, you will need to have the SDK installed and configured. While the code snippets are in Python, the same steps can be performed using the TypeScript SDK or via the API directly. If you are using the API directly, you will need to have an API key.
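If you need a starting point, a minimal client setup might look like the following sketch. It assumes your API key is stored in a HUMANLOOP_API_KEY environment variable; adapt it to however you manage credentials.

import os

from humanloop import Humanloop

# Create the SDK client used by all snippets below.
# Assumes the API key is available as an environment variable.
humanloop = Humanloop(api_key=os.environ["HUMANLOOP_API_KEY"])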

The example JSON data used in this guide can be found in the Humanloop Cookbook. To follow along, download conversations-a.json and conversations-b.json from the assets folder.

Evaluate your external logs

We’ll start by loading data from conversations-a.json, which represents logs recorded by an external system.

import json

with open("conversations-a.json") as f:
    data = json.load(f)

In this example, data is a list of conversations, each of which is a list of chat messages between a user and the support agent.
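Each conversation is expected to be in OpenAI-style chat format. As an illustrative sketch (an assumed shape, not the actual file contents), one entry might look like:

# Hypothetical example of one conversation in `data` (illustrative only).
messages = [
    {"role": "user", "content": "Can you help me plan a trip to Lisbon?"},
    {"role": "assistant", "content": "Of course! When are you planning to travel?"},
]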


Upload logs to Humanloop

These steps are suitable if you do not already have an Evaluation on Humanloop. The Upload new logs step demonstrates a simpler process if you already have an Evaluation you want to add a new set of logs to.

Upload the logs with the log(...) method. This will automatically create a Flow on Humanloop.

We additionally pass in some attributes identifying the configuration of the system that generated these logs. attributes accepts arbitrary values, and is used for versioning the Flow. Here, it allows us to associate this set of logs with a specific version of the support agent.

log_ids = []
for messages in data:
    log = humanloop.flows.log(
        path="External logs demo/Travel planner",
        flow={"attributes": {"agent-version": "1.0.0"}},  # Optionally add attributes to identify this version of the support agent.
        messages=messages,
    )
    log_ids.append(log.id)

This creates a new Flow on Humanloop named Travel planner. To confirm the logging succeeded, navigate to the Logs tab of the Flow and view the uploaded Logs. Each Log should correspond to one conversation and contain its list of messages.

Flow Logs

We will also use the newly created Flow version when creating our Run, so capture its ID.

version_id = log.version_id

Create an Evaluation Run

Next, create an Evaluation on Humanloop. Within the Evaluation, create a Run which will contain the Logs.

Here, we’ll use the example “Helpfulness” LLM-as-a-judge Evaluator. This will automatically rate the helpfulness of the support agent across our logs.

evaluation = humanloop.evaluations.create(
    name="Past records",
    # NB: you can use `path` or `id` for references on Humanloop
    file={"path": "External logs demo/Travel planner"},
    evaluators=[
        # Replace with your Evaluators
        {"path": "Example Evaluators/AI/Helpfulness"},
    ],
)
run = humanloop.evaluations.create_run(
    id=evaluation.id,
    version={"version_id": version_id},  # Associate this Run with the Flow version created above.
)

Assign Logs to the Run

Finally, add the Flow Logs to the Run.

humanloop.evaluations.add_logs_to_run(
    id=evaluation.id,
    run_id=run.id,
    log_ids=log_ids,
)

Review the Evaluation

You have now created an Evaluation on Humanloop and added Logs to it.

Evaluation on Humanloop

Go to the Humanloop UI to view the Evaluation. Within the Evaluation, go to the Logs tab. Here, you can view your uploaded logs as well as the Evaluator judgments.

Logs tab of Evaluation

The following steps will guide you through adding a different set of logs to a new Run for comparison.


Upload new logs

If you already have an Evaluation that you want to add a new set of logs to, you can start from here.

To start from this point, retrieve the ID of the Evaluation: open it in the Humanloop UI and copy the ID from the URL. This is the segment of the URL after evaluations/, e.g. evr_....
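If you're continuing in a fresh script, you can then fetch the Evaluation by its ID so the later snippets have an evaluation object to reference. A minimal sketch, assuming the SDK's evaluations.get method and a placeholder ID:

# Replace the placeholder with the ID copied from the URL.
evaluation = humanloop.evaluations.get(id="evr_...")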

Now that we have an Evaluation on Humanloop, we can add a separate set of logs to it and compare their performance against the previous set.

While we could achieve this by repeating the steps above, with an existing Evaluation we can log directly to a new Run, which is simpler.

For this example, we’ll continue with the Evaluation created in the previous section, and add a new Run with the data from conversations-b.json. These represent a set of logs from a prototype version of the support agent.

with open("conversations-b.json") as f:
    data = json.load(f)

Create a new Run

Create a new Run within the Evaluation that will contain this set of logs.

run = humanloop.evaluations.create_run(
    id=evaluation.id,
)

Log to the Run

Pass the run_id argument in your log(...) call to associate the Log with the Run.

# Add the new data to the Run
for messages in data:
    log = humanloop.flows.log(
        path="External logs demo/Travel planner",
        flow={"attributes": {"agent-version": "2.0.0"}},
        messages=messages,
        # Pass `run_id` to associate the Log with the Run.
        run_id=run.id,
    )

Compare the results

View the Evaluation on Humanloop. It will now contain two Runs.

In the Stats tab of the Evaluation, you can now compare the performance of the two sets of logs.

In our case, the second set of logs (on the right) scores lower on helpfulness.

Evaluation with two Runs on Humanloop

Stats tab showing box plots for the two Runs
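You can also retrieve these results programmatically. A sketch, assuming the SDK exposes the Evaluation stats endpoint as evaluations.get_stats:

# Fetch aggregate stats for the Evaluation, covering both Runs.
stats = humanloop.evaluations.get_stats(id=evaluation.id)
print(stats)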

Next steps

The above examples demonstrate how you can quickly populate an Evaluation Run with your logs.

  • You can extend this Evaluation with custom Evaluators, such as using Code Evaluators to calculate metrics, or using Human Evaluators to set up your Logs to be reviewed by your subject-matter experts.
  • Now that you’ve set up an Evaluation, explore the other File types on Humanloop to see how they can better reflect your production systems, and how you can use Humanloop to version-control them. Here, we’ve used a Flow to represent a black-box system.