Evals from the UI

Evaluate and improve your LLM apps by running Evals in the Humanloop UI.

This tutorial will take you through running your first Eval. You’ll learn how to assess multiple Prompt versions to improve the quality of your AI products.

Running Evals

For this tutorial, we’re going to evaluate the performance of a simple Support Agent that responds to user queries.

1. Create a Prompt File

When you first open Humanloop, you’ll see your File navigation on the left. Click +New and create a Prompt. In the Prompt Editor, add the following Prompt instructions:

You are a support agent. Answer questions.

Commit this version.

2. Create a Dataset

A Dataset contains Datapoints describing the inputs to a task and, optionally, its expected results.

For this tutorial, we created a CSV file with 100 common customer support questions.

Create a new Dataset file, then click on the Upload CSV button to upload the file.
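If you don’t have a CSV to hand, the sketch below shows the shape of file this step expects: one row per Datapoint, with a column for the input and a column for the expected output. The column names (question and expected_answer) and the example rows are illustrative assumptions, not names Humanloop requires.

```python
import csv

# Illustrative only: "question" is the input the Prompt will receive and
# "expected_answer" is the reference output the Semantic similarity
# Evaluator will compare against. Neither name is mandated by Humanloop.
rows = [
    {"question": "How do I reset my password?",
     "expected_answer": "Use the 'Forgot password' link on the login page."},
    {"question": "Can I change my billing plan mid-cycle?",
     "expected_answer": "Yes. Plan changes take effect immediately and are prorated."},
]

with open("support_questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_answer"])
    writer.writeheader()
    writer.writerows(rows)
```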

3. Run your first Evaluation

Navigate to the Prompt you’ve just created and click on the Evaluation tab. Click on the Evaluate button to create a new Evaluation.

Create a new Run by clicking on the +Run button. Select the Dataset and Prompt version we created earlier.

Add Evaluators by clicking on the +Evaluator button. For this tutorial, we selected the Semantic similarity, Cost, and Latency Evaluators. You can find these Evaluators in the Example Evaluators folder.

The “Semantic similarity” Evaluator measures how similar the model’s response is to the expected output, on a scale from 1 (very dissimilar) to 5 (very similar).
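Under the hood this is an LLM-as-judge style check. Purely as a mental model, and not Humanloop’s actual implementation, a minimal judge could be sketched as follows; the judge prompt, the gpt-4o-mini model choice, and the single-digit reply format are all assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def semantic_similarity(response: str, expected: str) -> int:
    """Ask an LLM judge to rate similarity from 1 (very dissimilar) to 5 (very similar)."""
    judge_prompt = (
        "Rate how semantically similar the RESPONSE is to the EXPECTED output "
        "on a scale from 1 (very dissimilar) to 5 (very similar). "
        "Reply with a single digit only.\n\n"
        f"RESPONSE:\n{response}\n\nEXPECTED:\n{expected}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return int(completion.choices[0].message.content.strip())
```

In practice the example Evaluator handles the prompting, parsing, and error handling for you; the sketch is only meant to show where the 1 to 5 judgment comes from.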

Click Save. Humanloop will start generating Logs for the Evaluation.

4. Review results

After the Run is completed, you can review the generated Logs and their corresponding judgments in the Review tab. A summary across all Evaluators is displayed in the Stats tab.
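Conceptually, the Stats tab rolls the per-Log judgments up into a single summary figure per Evaluator. The toy sketch below illustrates that aggregation; the judgment values and field names are hypothetical.

```python
from statistics import mean

# Hypothetical judgments for three Logs; field names are illustrative.
judgments = [
    {"semantic_similarity": 4, "cost_usd": 0.0021, "latency_s": 1.8},
    {"semantic_similarity": 5, "cost_usd": 0.0019, "latency_s": 1.5},
    {"semantic_similarity": 3, "cost_usd": 0.0024, "latency_s": 2.1},
]

# One summary number per Evaluator, akin to what the Stats tab displays.
for metric in ("semantic_similarity", "cost_usd", "latency_s"):
    print(f"{metric}: {mean(j[metric] for j in judgments):.4f}")
```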

5. Make changes to your Prompt

After reviewing the results, we now have a better understanding of the Prompt’s behavior and can improve its performance. Navigate back to the Prompt Editor and change the instructions to:

You are a helpful assistant. Your job is to respond to users’ questions in a {{tone}} tone. Be polite and succinct.

Commit this new version.
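The {{tone}} placeholder is a Prompt template variable that is filled in at generation time, for example from a Datapoint’s inputs. The snippet below is a rough illustration of that substitution, not Humanloop’s templating code.

```python
import re

def render(template: str, inputs: dict[str, str]) -> str:
    """Replace {{variable}} placeholders with values from the inputs dict."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: inputs[m.group(1)], template)

instructions = (
    "You are a helpful assistant. Your job is to respond to users' questions "
    "in a {{tone}} tone. Be polite and succinct."
)
print(render(instructions, {"tone": "friendly"}))
```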

6. Run another Evaluation

We can now create a new Run against the updated Prompt. Click on the +Run button and select the newly committed Prompt version.

The Run page now shows two Runs, each evaluating a different Prompt version.

We can now see from the Stats view that the updated version performs better across the board. For a more detailed breakdown, navigate to the Logs or Stats tabs.

Next steps

Now that you’ve successfully run your first Eval, you can explore how to customize Evals for your use case: