Run Evaluation
Quickly evaluate your LLM apps and improve them, all managed in code.
The `evaluations.run()` function is a convenience function that allows you to trigger evaluations from code. It will create the evaluation, fetch the Dataset, generate all the Logs, and then run the Evaluators on each Log.
It supports evaluating arbitrary functions, Prompts stored on Humanloop, and Prompts defined in code.
Parameters
You can see the source code for the `evaluations.run()` function in Python and TypeScript.
`name`
Name of the evaluation, to help identify it.

`file`
Configuration for what is being evaluated. The evaluation will be stored on this File.

- `type`: One of `flow` (default), `prompt`, `tool`, or `evaluator`. If the File does not exist on Humanloop, it will be created with this File type.
- `version`: The details of the version of the File you want to evaluate. For example, for a Flow you might pass in version identifiers, while for a Prompt you can pass in the Prompt details and it will be called (see the sketch after this list).
- `callable`: The function to evaluate (optional if the File is runnable on Humanloop, like a Prompt). It will be called over your Dataset as `callable(**datapoint.inputs, messages=datapoint.messages)` and should return a single string output.
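For illustration, a minimal sketch of two common `file` shapes, assuming the Python SDK; the paths, model, and template below are placeholders, and full end-to-end calls are shown in the Examples further down:

```python
def support_agent(question: str, **kwargs) -> str:
    # Hypothetical app code; receives each datapoint's inputs (and messages,
    # if any) as keyword arguments and returns a single string output.
    return f"Answer to: {question}"


# A Flow evaluated via a local callable (see Example 1 below).
flow_file = {
    "path": "Support/Agent Flow",  # illustrative path
    "type": "flow",
    "callable": support_agent,
}

# A Prompt defined in code: pass the model, template, etc. as the version
# and Humanloop will call it for you (see Example 3 below).
prompt_file = {
    "path": "Support/Answer Prompt",  # illustrative path
    "type": "prompt",
    "version": {
        "model": "gpt-4o",  # illustrative model
        "template": [
            {"role": "user", "content": "Answer the question: {{question}}"}
        ],
    },
}
```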
`evaluators`
List of Evaluators to judge the generated output. Each entry can reference an Evaluator on Humanloop or define a local Evaluator (a sketch of both follows this list):

- `path`: Path to the Evaluator on Humanloop.
- The type of arguments the Evaluator expects (only required for local Evaluators).
- The type of return value the Evaluator produces (only required for local Evaluators).
- `callable`: The function that judges the output (optional if the Evaluator is runnable on Humanloop). It will be called on the generated output as `callable(output)` and should return a single string output.
- `custom_logger`: Optional function that logs the output judgment from your Evaluator to Humanloop. If provided, it will be called as `judgment = callable(log_dict); log = custom_logger(client, judgment)`. Inside the `custom_logger`, you can use the Humanloop client to log the judgment to Humanloop. If not provided, your function must return a single string, and by default the code will be used to inform the version of the external Evaluator on Humanloop.
- `threshold`: The threshold to check the Evaluator result against.
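A sketch of both kinds of Evaluator entries, with illustrative Evaluator paths. The argument-type and return-type field names and values shown for the local Evaluator are assumptions, so check them against the SDK source linked above:

```python
def exact_match(output: str) -> str:
    # Called as callable(output) on each generated output;
    # returns a single string judgment.
    return "correct" if "refund" in output.lower() else "incorrect"


def log_judgment(client, judgment):
    # Sketch only: in practice you can use the Humanloop client here
    # to create a Log for the judgment on Humanloop.
    print(f"judgment: {judgment}")
    return judgment


evaluators = [
    # An Evaluator that already exists on Humanloop: a path (and optionally
    # a threshold to check results against) is enough.
    {"path": "Evaluators/Tone", "threshold": 0.8},
    # A local (external) Evaluator: provide the callable plus its argument
    # and return types. Field names and values below are assumptions.
    {
        "path": "Evaluators/Exact Match",
        "callable": exact_match,
        "args_type": "target_free",  # assumed field name and value
        "return_type": "text",       # assumed field name and value
        "custom_logger": log_judgment,
    },
]
```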
`dataset`
The Dataset to evaluate against.

- `path`: Path to an existing Dataset on Humanloop. If the Dataset does not exist on Humanloop, it will be created.
- `datapoints`: The datapoints to map your function over to produce the outputs required by the evaluation. Optional; if not provided, the evaluation will be run over the datapoints stored on Humanloop.
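For example, a `dataset` argument might look like the following, with an illustrative path and datapoints whose `inputs` match the keyword arguments your function expects:

```python
dataset = {
    "path": "Support/FAQ Dataset",  # illustrative path
    # Optional: if omitted, the datapoints already stored on Humanloop are used.
    "datapoints": [
        {"inputs": {"question": "How do I request a refund?"}},
        {"inputs": {"question": "Where can I download my invoice?"}},
    ],
}
```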
Return Type
Returns an `EvaluationStats` object containing:
- run_stats: Array of statistics for each run
- progress: Summary of evaluation progress
- report: Detailed evaluation report
- status: Current status of evaluation
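As a sketch, assuming the keyword arguments described above and attribute-style access to these fields (paths and names are placeholders):

```python
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_API_KEY")


def answer(question: str, **kwargs) -> str:
    # Hypothetical function under evaluation.
    return f"Answer to: {question}"


stats = client.evaluations.run(
    name="Support agent eval",                 # illustrative name
    file={"path": "Support/Agent Flow", "type": "flow", "callable": answer},
    evaluators=[{"path": "Evaluators/Tone"}],  # illustrative Evaluator path
    dataset={"path": "Support/FAQ Dataset"},   # existing Dataset on Humanloop
)

print(stats.status)    # current status of the evaluation
print(stats.progress)  # summary of evaluation progress
print(stats.report)    # detailed evaluation report
for run_stat in stats.run_stats:
    print(run_stat)    # statistics for each run
```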
Examples
1. Evaluating an Arbitrary Flow Function
To evaluate an arbitrary workflow, you can pass in the `callable` parameter on the `file` object.
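A sketch of what this can look like with the Python SDK; the paths, Evaluator names, and datapoints are illustrative:

```python
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_API_KEY")


def ask_agent(question: str, **kwargs) -> str:
    # Hypothetical workflow to evaluate: receives each datapoint's inputs
    # (and messages, if any) as keyword arguments and must return a single
    # string output. Replace with your RAG pipeline, agent, etc.
    return f"Here is my answer to: {question}"


stats = client.evaluations.run(
    name="Agent answer quality",           # illustrative name
    file={
        "path": "Support/Agent Flow",      # illustrative path; created if it doesn't exist
        "type": "flow",
        "callable": ask_agent,
    },
    evaluators=[
        {"path": "Evaluators/Tone"},       # illustrative Evaluator paths on Humanloop
        {"path": "Evaluators/Correctness"},
    ],
    dataset={
        "path": "Support/FAQ Dataset",     # illustrative path
        "datapoints": [
            {"inputs": {"question": "How do I request a refund?"}},
            {"inputs": {"question": "Where can I download my invoice?"}},
        ],
    },
)

print(stats.report)
```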
2. Evaluating a Prompt on Humanloop
To evaluate a Prompt stored on Humanloop, you simply supply a `path` to the Prompt and a list of Evaluators.
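For example, assuming a Prompt and Dataset that already exist on Humanloop at illustrative paths:

```python
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_API_KEY")

# The Prompt already lives on Humanloop, so no callable is needed:
# Humanloop calls the Prompt over the Dataset for you.
stats = client.evaluations.run(
    name="Answer prompt eval",               # illustrative name
    file={
        "path": "Support/Answer Prompt",     # illustrative path to the existing Prompt
        "type": "prompt",
    },
    evaluators=[
        {"path": "Evaluators/Tone"},         # illustrative Evaluator paths
        {"path": "Evaluators/Correctness"},
    ],
    dataset={"path": "Support/FAQ Dataset"},  # existing Dataset on Humanloop
)

print(stats.report)
```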
3. Evaluating a Prompt in Code
To evaluate a Prompt defined in code, you can pass in the `model`, `template`, and other Prompt parameters to the `file`'s `version` object.
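A sketch, assuming a chat-style template and illustrative paths; the model, temperature, and Evaluator shown are placeholders:

```python
from humanloop import Humanloop

client = Humanloop(api_key="YOUR_API_KEY")

stats = client.evaluations.run(
    name="Prompt-in-code eval",                # illustrative name
    file={
        "path": "Support/Answer Prompt",       # illustrative path
        "type": "prompt",
        # The Prompt is defined here in code via the version object;
        # Humanloop will call it over the Dataset.
        "version": {
            "model": "gpt-4o",                 # illustrative model
            "temperature": 0.0,                # assumed: other Prompt parameters can be passed alongside
            "template": [
                {
                    "role": "user",
                    "content": "Answer the customer question: {{question}}",
                }
            ],
        },
    },
    evaluators=[{"path": "Evaluators/Correctness"}],  # illustrative Evaluator path
    dataset={"path": "Support/FAQ Dataset"},          # existing Dataset on Humanloop
)

print(stats.report)
```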
Each example demonstrates a different way to use the `evaluations.run()` function. The function returns evaluation statistics that can be used to understand the performance of your LLM application according to the specified Evaluators.
You can view the results of your evaluation in the Humanloop UI by navigating to the specified File path, or check the evaluation stats programmatically using the returned object's `report` field.