Evaluating LLM Applications

Humanloop’s evaluation framework lets you rigorously test and track the performance of your LLM apps.

A key part of successful prompt engineering and deployment for LLMs is a robust evaluation framework. In this section we provide guides for how to set up Humanloop’s evaluation framework for your Prompts and Tools.

The core entity in the Humanloop evaluation framework is an Evaluator - a function you define which takes an LLM-generated Log as an argument and returns a judgment. The judgment is typically either a boolean or a number, indicating how well the model performed against criteria you choose for your use case.

Sources of Judgment

Currently, you can define three different Evaluator sources on Humanloop:

  • Code - using simple, deterministic, rule-based judgments against attributes like cost, token usage, latency, regex rules on the output, etc. These are generally fast and cheap to run at scale.
  • AI - using other foundation models to provide judgments on the output. This allows for more qualitative and nuanced judgments for a fraction of the cost of human judgments.
  • Human - getting gold standard judgments from either end users of your application, or internal domain experts. This can be the most expensive and slowest option, but also the most reliable.
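
As a rough illustration of the Code source, the sketch below defines one boolean and one numeric Evaluator in plain Python. It is not Humanloop's API: the Log is modeled as a plain dict, and the `output` and `latency_ms` keys, the function names, and the 2000 ms budget are all assumptions made for this example.

```python
import re

# Assumption: a Log is modeled as a dict with "output" (str) and
# "latency_ms" (float) keys. Real Log fields may be named differently.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def no_email_leak(log: dict) -> bool:
    """Boolean judgment: True when the output contains no email address."""
    return EMAIL_PATTERN.search(log["output"]) is None


def latency_score(log: dict, budget_ms: float = 2000.0) -> float:
    """Numeric judgment in [0, 1]: 1.0 at zero latency, 0.0 at or over budget."""
    return max(0.0, 1.0 - log["latency_ms"] / budget_ms)


example_log = {"output": "Contact support via the help page.", "latency_ms": 500.0}
leak_free = no_email_leak(example_log)
speed = latency_score(example_log)
```

Both functions are deterministic and touch only fields already on the Log, which is what makes this kind of Evaluator cheap to run on every Log.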

Online Monitoring vs. Offline Evaluation

Evaluators can be deployed on Humanloop both to test new versions of your Prompts and Tools during development and to monitor live apps that are already in production.

Online Monitoring

Evaluators are run against the Logs generated by your AI applications. Typically, they are used to monitor deployed model performance over time and check for drift or degradation. In this case the Evaluator takes a single argument: the Log generated by the model. It is expected to return a judgment based on that Log, which can be used to trigger alerts or other actions in your monitoring system.
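
One way to picture how single-Log judgments could feed an alert is a sliding-window pass rate. The sketch below is an illustration only, not how Humanloop implements monitoring: `PassRateMonitor`, `is_non_empty`, the window size, and the threshold are hypothetical, and the Log is again assumed to be a dict with an `output` key.

```python
from collections import deque
from typing import Callable


class PassRateMonitor:
    """Tracks boolean judgments over a sliding window and flags degradation."""

    def __init__(self, evaluator: Callable[[dict], bool],
                 window: int = 100, min_pass_rate: float = 0.9):
        self.evaluator = evaluator
        self.min_pass_rate = min_pass_rate
        self.judgments = deque(maxlen=window)  # oldest judgments fall off

    def observe(self, log: dict) -> bool:
        """Evaluate one Log; return True if an alert should fire."""
        self.judgments.append(bool(self.evaluator(log)))
        pass_rate = sum(self.judgments) / len(self.judgments)
        # Only alert once the window is full, to avoid noise on startup.
        window_full = len(self.judgments) == self.judgments.maxlen
        return window_full and pass_rate < self.min_pass_rate


def is_non_empty(log: dict) -> bool:
    """Hypothetical online Evaluator: the model produced some output."""
    return bool(log["output"].strip())


monitor = PassRateMonitor(is_non_empty, window=4, min_pass_rate=0.75)
alerts = [monitor.observe({"output": text}) for text in ["a", "b", "c", "", ""]]
```

In this run the alert fires only on the last Log, when two of the four most recent outputs are empty and the pass rate drops below the threshold.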

See our Monitoring guides for more details.

Offline Evaluations

Offline Evaluators are combined with predefined Datasets in order to evaluate your application as you iterate in your prompt engineering workflow, or to test for regressions in a CI environment.

A test Dataset is a collection of Datapoints, which are roughly analogous to unit tests or test cases in traditional programming. Each datapoint specifies inputs to your model and (optionally) some target data.

When you run an offline evaluation, a Log is generated from the inputs of each Datapoint using the version of the application being evaluated. Evaluators are then run against each Log to provide judgments, which are aggregated into an overall score for the application. In this case, Evaluators take two arguments: the generated Log and the Datapoint that gave rise to it.
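
The loop described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the Humanloop SDK: Datapoints are modeled as dicts with `inputs` and `target` keys, the Log as a dict with an `output` key, `toy_app` stands in for the application version under test, and the mean is just one possible aggregation.

```python
from statistics import mean
from typing import Callable


def exact_match(log: dict, datapoint: dict) -> bool:
    """Offline Evaluator: compares the Log output with the Datapoint target.

    Assumes every Datapoint carries a target, which is optional in general.
    """
    return log["output"].strip().lower() == datapoint["target"].strip().lower()


def run_offline_evaluation(dataset: list[dict],
                           generate: Callable[[dict], str],
                           evaluator: Callable[[dict, dict], bool]) -> float:
    """Generate a Log per Datapoint, judge it, and average the judgments."""
    judgments = []
    for datapoint in dataset:
        log = {"output": generate(datapoint["inputs"])}  # one Log per Datapoint
        judgments.append(float(evaluator(log, datapoint)))
    return mean(judgments)  # raises StatisticsError on an empty dataset


def toy_app(inputs: dict) -> str:
    """Stand-in for the application version being evaluated."""
    return {"France": "Paris", "Japan": "Kyoto"}.get(inputs["country"], "")


dataset = [
    {"inputs": {"country": "France"}, "target": "Paris"},
    {"inputs": {"country": "Japan"}, "target": "Tokyo"},
]
score = run_offline_evaluation(dataset, toy_app, exact_match)
```

Here `toy_app` answers one of the two Datapoints correctly, so the aggregated score is 0.5.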

See our guides on creating Datasets and running Evaluations for more details.

Humanloop runtime vs. your runtime

Evaluations require the following to be generated:

  1. Logs for the Datapoints.
  2. Evaluator results for those generated Logs.

Using the Evaluations API, Humanloop offers the ability to generate logs either within the Humanloop runtime, or within your own runtime. Similarly, Evaluators which are defined within the Humanloop UI can be executed in the Humanloop runtime, whereas Evaluators defined in your code can be executed in your runtime and the results posted back to Humanloop. This provides flexibility for supporting more complex evaluation workflows.

CI/CD Integration

Humanloop’s evaluation framework can be integrated into your CI/CD pipeline, allowing you to automatically test your AI applications as part of your development workflow. This integration enables you to catch potential regressions or performance issues before they make it to production.

One powerful way to leverage this integration is by triggering evaluation runs in GitHub Actions and having the results commented directly on your Pull Requests. This provides immediate feedback to developers and reviewers about the impact of changes on your AI application’s performance.
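
To make the regression-gate idea concrete, the sketch below turns a set of aggregated scores into a Markdown table suitable for a Pull Request comment, plus an exit code for the CI job. Everything here is hypothetical: the `summarize` helper, the Evaluator names, the scores, and the thresholds are invented for illustration, and fetching scores from Humanloop or posting the comment to GitHub is deliberately left out.

```python
def summarize(scores: dict[str, float],
              thresholds: dict[str, float]) -> tuple[str, bool]:
    """Build a Markdown results table and report whether every check passed."""
    lines = [
        "| Evaluator | Score | Threshold | Status |",
        "|---|---|---|---|",
    ]
    passed = True
    for name, threshold in thresholds.items():
        score = scores[name]
        ok = score >= threshold
        passed = passed and ok
        status = "pass" if ok else "fail"
        lines.append(f"| {name} | {score:.2f} | {threshold:.2f} | {status} |")
    return "\n".join(lines), passed


# Hypothetical aggregated scores from an evaluation run on a candidate version.
scores = {"exact_match": 0.82, "latency_score": 0.95}
thresholds = {"exact_match": 0.90, "latency_score": 0.80}

comment, passed = summarize(scores, thresholds)
# A CI script would typically end with sys.exit(exit_code) to fail the job.
exit_code = 0 if passed else 1
```

A non-zero exit code is what causes most CI systems to mark the job as failed, which is how a regression below a threshold would block a merge in this setup.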

To set up CI/CD evaluation, follow the guide on CI/CD Integration.