Run Evaluation

Quickly evaluate your LLM apps and improve them, all managed in code.

The evaluations.run() function is a convenience function that lets you trigger evaluations from code. It will create the Evaluation, fetch the Dataset, generate a Log for each datapoint, and then run the Evaluators on each Log.

It supports evaluating arbitrary functions, Prompts stored on Humanloop, and Prompts defined in code.

Parameters

You can see the source code for the evaluations.run() function in Python and TypeScript.

name
string · Required

Name of the evaluation to help identify it

file
object · Required

Configuration for what is being evaluated. The evaluation will be stored on this File.

path
string · Required

Path to the evaluated File (a Prompt, Flow, Tool, Evaluator etc.) on Humanloop. If the File does not exist on Humanloop, it will be created.

Example: a path of "My Agent" will create a Flow File on Humanloop.

type
enum

flow (default), prompt, tool, evaluator

If the File does not exist on Humanloop, it will be created with this File type.

version
object

Pass in the details of the version of the File you want to evaluate.

For example, for a Flow you might pass in identifiers:

{
  "git_hash": "1234567890",
  "identifier": "rag-with-pinecone"
}

Or, for a Prompt, you can pass in the Prompt details and that version will be called during the evaluation:

{
  "model": "gpt-4",
  "template": [
    {
      "role": "user",
      "content": "You are a helpful assistant on the topic of {{topic}}."
    }
  ]
}

callable
function

Function to evaluate (optional if the File is runnable on Humanloop, like a Prompt). It will be called on each datapoint in your Dataset as callable(**datapoint.inputs, messages=datapoint.messages) and should return a single string output.
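
For example, a minimal callable for a Flow might look like the sketch below. The topic input and the hard-coded reply are illustrative assumptions; the only requirement is that the function accepts the keyword arguments produced by your datapoints and returns a string.

# A minimal sketch of a callable for a Flow File. The `topic` input and
# the reply are illustrative; the function just needs to accept the
# keyword arguments produced by your datapoints and return a string.
def my_flow(topic: str, messages: list) -> str:
    question = messages[-1]["content"]
    return f"A response about {topic} to: {question}"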

evaluators
array · Required

List of evaluators to judge the generated output

path
string · Required

Path to evaluator on Humanloop

args_type
string

The type of arguments the Evaluator expects - only required for local Evaluators

return_type
string

The type of return value the Evaluator produces - only required for local Evaluators

callable
function

The Evaluator function to run locally (optional if the Evaluator is runnable on Humanloop). It will be called on each generated output as callable(output) and should return a single string judgment.
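
As a sketch, a local Evaluator entry with a callable might look like the following. The path, the args_type/return_type values, and the length check are illustrative assumptions; check the Evaluator reference for the accepted values.

# Sketch of a local Evaluator. The path and the length-based judgment
# are illustrative; the args_type/return_type values are assumptions.
def concise_enough(output: str) -> str:
    return "true" if len(output) < 500 else "false"

evaluators = [
    {
        "path": "My Evaluators/Conciseness",
        "args_type": "target_free",
        "return_type": "boolean",
        "callable": concise_enough,
    }
]

You would then pass this list as the evaluators argument to evaluations.run().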

custom_logger
function

Optional function that logs the output judgment from your Evaluator to Humanloop. If provided, it will be called as judgment = callable(log_dict); log = custom_logger(client, judgment). Inside the custom_logger, you can use the Humanloop client to log the judgment to Humanloop. If not provided, your function must return a single string, and by default its source code is used to determine the version of the external Evaluator on Humanloop.
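
A rough sketch of the signature, assuming a hypothetical my_custom_logger; how you record the judgment with the client is up to you, so the body below is only a placeholder (consult the SDK reference for the exact logging method).

# Hypothetical custom_logger. The body is a placeholder: use `client`
# here to create a Log for the Evaluator on Humanloop.
def my_custom_logger(client, judgment):
    print(f"Recording judgment: {judgment}")  # placeholder for the logging call
    return judgment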

threshold
number

The threshold to check the evaluator result against
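
For example, an Evaluator entry with a threshold might look like this (the path and the 0.8 value are illustrative):

{"path": "Example Evaluators/AI/Semantic similarity", "threshold": 0.8}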

dataset
object · Required

Dataset to evaluate against

path
string · Required

Path to existing dataset on Humanloop. If the Dataset does not exist on Humanloop, it will be created.

datapoints
array

The datapoints to map your function over to produce the outputs required by the evaluation. Optional - if not provided, the evaluation will be run over the datapoints stored on Humanloop.
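
As a sketch, inline datapoints follow the shape used in the examples below, with inputs, messages and target keys; the path and values here are placeholders.

dataset = {
    "path": "Test/Dataset",
    "datapoints": [
        {
            "inputs": {"topic": "machine learning"},
            "messages": [{"role": "user", "content": "What is machine learning?"}],
            "target": {"output": "A reference answer to compare against."},
        }
    ],
}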

Return Type

Returns an EvaluationStats object (see the usage sketch after this list) containing:

  • run_stats: Array of statistics for each run
  • progress: Summary of evaluation progress
  • report: Detailed evaluation report
  • status: Current status of evaluation
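
A minimal sketch of inspecting the returned object, assuming the fields listed above:

# `evaluation` is the EvaluationStats object returned by evaluations.run()
# (see the examples below); field names follow the list above.
print(evaluation.status)    # current status of the evaluation
print(evaluation.report)    # detailed evaluation report
for run in evaluation.run_stats:
    print(run)              # per-run statistics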

Examples

1. Evaluating an Arbitrary Flow Function

To evaluate an arbitrary workflow, pass your function in as the callable key of the file object.

def my_flow_function(messages):
    # Your custom logic here
    return "Response based on messages"

evaluation = humanloop.evaluations.run(
    name="Custom Flow Evaluation",
    file={
        "path": "Custom/Flow",
        "type": "flow",
        "callable": my_flow_function
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Latency"}
    ],
    dataset={
        "path": "Test/Dataset",
        "datapoints": [
            {
                "messages": [
                    {"role": "user", "content": "Test question 1"}
                ]
            }
        ]
    }
)

2. Evaluating a Prompt on Humanloop

To evaluate a Prompt stored on Humanloop you simply supply a path to the Prompt and a list of Evaluators.

evaluation = humanloop.evaluations.run(
    name="Existing Prompt Evaluation",
    file={
        "path": "Existing/Prompt",
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Cost"}
    ],
    dataset={
        "path": "Existing/Dataset"
    }
)

3. Evaluating a Prompt in Code

To evaluate a Prompt defined in code you can pass in the model, template and other Prompt parameters to the file’s version object.

evaluation = humanloop.evaluations.run(
    name="Code Prompt Evaluation",
    file={
        "path": "Code/Prompt",
        "version": {
            "model": "gpt-4",
            "template": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant on the topic of {{topic}}."
                }
            ]
        },
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Latency"}
    ],
    dataset={
        "datapoints": [
            {
                "inputs": {"topic": "machine learning"},
                "messages": [{"role": "user", "content": "What is machine learning?"}],
                "target": {"output": "Machine learning is a subset of artificial intelligence..."}
            }
        ]
    }
)

Each example demonstrates a different way to use the evaluations.run() function. The function returns evaluation statistics that can be used to understand the performance of your LLM application according to the specified Evaluators.

You can view the results of your evaluation in the Humanloop UI by navigating to the specified file path, or by checking the evaluation stats programmatically using the returned object’s report field.