Run Evaluation

Quickly evaluate your LLM apps and improve them, all managed in code.

The evaluations.run() function is a convenience function that lets you trigger evaluations from code. It will create the Evaluation, fetch the Dataset, generate a Log for each datapoint, and then run the Evaluators on each Log.

It supports evaluating arbitrary functions, Prompts stored on Humanloop, and Prompts defined in code.

Parameters

You can see the source code for the evaluations.run() function in Python and TypeScript.

name
string · Required

Name of the evaluation to help identify it

file
object · Required

Configuration for what is being evaluated. The evaluation will be stored on this File.

path
string · Required

Path to the evaluated File (a Prompt, Flow, Tool, Evaluator etc.) on Humanloop. If the File does not exist on Humanloop, it will be created.

Example: a path of "My Agent" will create a Flow File on Humanloop.

type
enum

flow (default), prompt, tool, evaluator

If the File does not exist on Humanloop, it will be created with this File type.

version
object

Pass in the details of the version of the File you want to evaluate.

For example, for a Flow you might pass in identifiers:

{
  "git_hash": "1234567890",
  "identifier": "rag-with-pinecone"
}

Or, for a Prompt, you can pass in the Prompt details and that version will be called during the evaluation:

{
  "model": "gpt-4",
  "template": [
    {
      "role": "user",
      "content": "You are a helpful assistant on the topic of {{topic}}."
    }
  ]
}

callable
function

Function to evaluate (optional if the File is runnable on Humanloop, like a Prompt). It will be called on each datapoint in your Dataset as callable(**datapoint.inputs, messages=datapoint.messages) and should return a single string output.
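
For example, a minimal callable for a Flow might look like the sketch below. The topic input and the hard-coded reply are illustrative assumptions; the only requirement is that the function accepts the keyword arguments produced by your datapoints and returns a string.

# A minimal sketch of a callable for a Flow File. The `topic` input and
# the reply are illustrative; the function just needs to accept the
# keyword arguments produced by your datapoints and return a string.
def my_flow(topic: str, messages: list) -> str:
    question = messages[-1]["content"]
    return f"A response about {topic} to: {question}"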

evaluators
array · Required

List of evaluators to judge the generated output

path
string · Required

Path to evaluator on Humanloop

args_type
string

The type of arguments the Evaluator expects - only required for local Evaluators

return_type
string

The type of return value the Evaluator produces - only required for local Evaluators

callable
function

The Evaluator function to run locally (optional if the Evaluator is runnable on Humanloop). It will be called on each generated output as callable(output) and should return a single string judgment.
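
As a sketch, a local Evaluator entry with a callable might look like the following. The path, the args_type/return_type values, and the length check are illustrative assumptions; check the Evaluator reference for the accepted values.

# Sketch of a local Evaluator. The path and the length-based judgment
# are illustrative; the args_type/return_type values are assumptions.
def concise_enough(output: str) -> str:
    return "true" if len(output) < 500 else "false"

evaluators = [
    {
        "path": "My Evaluators/Conciseness",
        "args_type": "target_free",
        "return_type": "boolean",
        "callable": concise_enough,
    }
]

You would then pass this list as the evaluators argument to evaluations.run().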

custom_logger
function

Optional function that logs the output judgment from your Evaluator to Humanloop. If provided, it will be called as judgment = callable(log_dict); log = custom_logger(client, judgment). Inside the custom_logger, you can use the Humanloop client to log the judgment to Humanloop. If not provided, your function must return a single string, and by default its source code is used to determine the version of the external Evaluator on Humanloop.
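
A rough sketch of the signature, assuming a hypothetical my_custom_logger; how you record the judgment with the client is up to you, so the body below is only a placeholder (consult the SDK reference for the exact logging method).

# Hypothetical custom_logger. The body is a placeholder: use `client`
# here to create a Log for the Evaluator on Humanloop.
def my_custom_logger(client, judgment):
    print(f"Recording judgment: {judgment}")  # placeholder for the logging call
    return judgment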

threshold
number

The threshold to check the evaluator result against
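
For example, an Evaluator entry with a threshold might look like this (the path and the 0.8 value are illustrative):

{"path": "Example Evaluators/AI/Semantic similarity", "threshold": 0.8}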

dataset
object · Required

Dataset to evaluate against

path
string · Required

Path to existing dataset on Humanloop. If the Dataset does not exist on Humanloop, it will be created.

datapoints
array

The datapoints to map your function over to produce the outputs required by the evaluation. Optional - if not provided, the evaluation will be run over the datapoints stored on Humanloop.
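
As a sketch, inline datapoints follow the shape used in the examples below, with inputs, messages and target keys; the path and values here are placeholders.

dataset = {
    "path": "Test/Dataset",
    "datapoints": [
        {
            "inputs": {"topic": "machine learning"},
            "messages": [{"role": "user", "content": "What is machine learning?"}],
            "target": {"output": "A reference answer to compare against."},
        }
    ],
}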

Return Type

Returns an EvaluationStats object (see the usage sketch after this list) containing:

  • run_stats: Array of statistics for each run
  • progress: Summary of evaluation progress
  • report: Detailed evaluation report
  • status: Current status of evaluation
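
A minimal sketch of inspecting the returned object, assuming the fields listed above:

# `evaluation` is the EvaluationStats object returned by evaluations.run()
# (see the examples below); field names follow the list above.
print(evaluation.status)    # current status of the evaluation
print(evaluation.report)    # detailed evaluation report
for run in evaluation.run_stats:
    print(run)              # per-run statistics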

Examples

1. Evaluating an Arbitrary Flow Function

To evaluate an arbitrary workflow, pass your function in as the callable key of the file object.

def my_flow_function(messages):
    # Your custom logic here
    return "Response based on messages"

evaluation = humanloop.evaluations.run(
    name="Custom Flow Evaluation",
    file={
        "path": "Custom/Flow",
        "type": "flow",
        "callable": my_flow_function
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Latency"}
    ],
    dataset={
        "path": "Test/Dataset",
        "datapoints": [
            {
                "messages": [
                    {"role": "user", "content": "Test question 1"}
                ]
            }
        ]
    }
)

2. Evaluating a Prompt on Humanloop

To evaluate a Prompt stored on Humanloop you simply supply a path to the Prompt and a list of Evaluators.

evaluation = humanloop.evaluations.run(
    name="Existing Prompt Evaluation",
    file={
        "path": "Existing/Prompt",
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Cost"}
    ],
    dataset={
        "path": "Existing/Dataset"
    }
)

3. Evaluating a Prompt in Code

To evaluate a Prompt defined in code you can pass in the model, template and other Prompt parameters to the file’s version object.

evaluation = humanloop.evaluations.run(
    name="Code Prompt Evaluation",
    file={
        "path": "Code/Prompt",
        "version": {
            "model": "gpt-4",
            "template": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant on the topic of {{topic}}."
                }
            ]
        },
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},
        {"path": "Example Evaluators/Code/Latency"}
    ],
    dataset={
        "datapoints": [
            {
                "inputs": {"topic": "machine learning"},
                "messages": [{"role": "user", "content": "What is machine learning?"}],
                "target": {"output": "Machine learning is a subset of artificial intelligence..."}
            }
        ]
    }
)

Each example demonstrates a different way to use the evaluations.run() function. The function returns evaluation statistics that can be used to understand the performance of your LLM application according to the specified Evaluators.

You can view the results of your evaluation in the Humanloop UI by navigating to the specified file path, or by checking the evaluation stats programmatically using the returned object’s report field.