Changelog


Evals SDK Improvements

We’ve added a new run method for evals to our SDK. It provides a simpler entry point for evaluating your existing pipelines, both in your CI/CD and experimentation workflows. The method is currently available in Beta in the Python SDK and will soon be added to the major versions of both the Python and TypeScript SDKs.

In order to run an eval via the SDK, you need to provide:

  1. A callable function that takes your inputs/messages and returns a string
  2. A Dataset of inputs/messages to evaluate the function against
  3. A set of Evaluators to provide judgements

Here is a toy example using a simple OpenAI call as the function to evaluate.

from humanloop import Humanloop
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))


# First define the app you're evaluating
def call_digital_twin(person: str, messages: list) -> str:
    system_message = {
        "role": "system",
        "content": f"You are {person}"
    }
    chat_completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[system_message] + messages,
    )
    answer = chat_completion.choices[0].message.content
    return answer


# Then run an eval specifying the file location on Humanloop
checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin
    },
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI"}
    ],
    dataset={
        "path": "questions",
        "datapoints": [
            {
                "inputs": {"person": "Albert Einstein"},
                "messages": [{"role": "user", "content": "What is your most famous theory?"}]
            },
            {
                "inputs": {"person": "Serena Williams"},
                "messages": [{"role": "user", "content": "What trophy did you win most recently?"}]
            },
            {
                "inputs": {"person": "Marie Curie"},
                "messages": [{"role": "user", "content": "What was your greatest scientific achievement?"}]
            },
            {
                "inputs": {"person": "Leonardo da Vinci"},
                "messages": [{"role": "user", "content": "Which of your inventions are you most proud of?"}]
            },
            {
                "inputs": {"person": "Rosa Parks"},
                "messages": [{"role": "user", "content": "What motivated you to refuse giving up your seat?"}]
            }
        ]
    },
)

Running this will print status info and an eval summary in your CLI, and a new eval will appear on Humanloop at the displayed URL. Running it again under the same name will add runs to the existing eval.

Navigate to your evals:
https://app.humanloop.com/project/fl_euUV4BHoXqKWqFyZ1YD1o/evaluations/evr_6WhFaHdkbWH8ZaoddzyRD/stats
Running digital-twin flow callable over the Dataset questions using 4 workers
[########################################] 5/5 (100.00%) | DONE
⏳ Evaluation Progress
Total Logs: 10/10
Total Judgments: 30/30
📊 Eval results for digital-twin
+----------------+----------------------------------------+----------------------------------------+
| Version ID     | flv_VIP1eiemqbpWmlsr84BwN (eb37773f39) | flv_VIP1eiemqbpWmlsr84BwN (9de378a165) |
+----------------+----------------------------------------+----------------------------------------+
| Added          | 2024-10-08 03:46:11                    | 2024-10-08 03:51:52                    |
+----------------+----------------------------------------+----------------------------------------+
| Evaluators     |                                        |                                        |
+----------------+----------------------------------------+----------------------------------------+
| Latency        | 0.02                                   | 0.015                                  |
| Correctness-AI | 1.0                                    | 1.0                                    |
+----------------+----------------------------------------+----------------------------------------+

The run call returns a set of checks you can use to determine whether the eval passed or failed.

Introduce versioning

So far, the only thing distinguishing different eval runs under the same eval name is the timestamp at which they were run.

It can also be helpful to record the configuration of your system at the time the eval was run.

You can include arbitrary config within the version field of the file. If this version has been used before, Humanloop will automatically associate it with your run. If the config is new, we will automatically create a new version of your file for future reference.

import inspect

checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin,
        "version": {
            "version": "0.2.4",
            "code": inspect.getsource(call_digital_twin)
        }
    },
    dataset={...},
    evaluators=[...],
)

Leverage native Prompts

Using hl.evaluations.run(...) will by default create a Flow on Humanloop. Flows have the advantage of being able to represent more complex traces, but can’t be run natively within the Humanloop Editor.

It’s also possible to adapt the run call to instead evaluate Prompts by setting the file type to "prompt" and providing valid Prompt params in the version field.

checks = hl.evaluations.run(
    file={
        "path": "digital-twin-prompt",
        "type": "prompt",
        "version": {
            "model": "gpt-4o-mini",
            # {{person}} is a template variable filled from each datapoint's inputs
            "template": [{"role": "system", "content": "You are {{person}}"}]
        }
    },
    dataset={...},
    evaluators=[...],
)

Add Evaluator thresholds

You can also now provide a threshold value for each of your Evaluators. If provided, the returned checks will indicate whether the Evaluator’s average performance met the threshold.

checks = hl.evaluations.run(
    file={...},
    dataset={...},
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI", "threshold": 0.5}
    ],
)
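
In a CI/CD job you can then gate the build on those checks. The snippet below is an illustrative sketch only: it assumes each returned check carries the Evaluator path and a boolean pass/fail result, so confirm the actual field names against the SDK reference.

import sys

# Illustrative only: "path" and "passed" are assumed field names on the
# returned checks, not confirmed SDK attributes.
failing = [check for check in checks if not check.passed]
if failing:
    print("Eval failed for:", ", ".join(check.path for check in failing))
    sys.exit(1)
print("All Evaluator thresholds met.")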

Manage Directories via API

You can now manage directories directly using our API. This can be helpful for programmatically managing your workspace for bulk changes or dynamically creating folder structures.

To learn more about directories on Humanloop, check out our Directories page.
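
As a rough illustration, a script like the one below could build a nested folder structure for multiple teams. It assumes the SDK exposes the directories resource with create and list methods; confirm the exact method and parameter names against the API reference.

# Assumed SDK surface: hl.directories.create / hl.directories.list.
# Verify the method and parameter names against the API reference.
for team in ["team-a", "team-b"]:
    for stage in ["experiments", "production"]:
        hl.directories.create(path=f"{team}/{stage}")

# List directories to confirm the structure was created.
for directory in hl.directories.list():
    print(directory.path)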


Evals Comparison Mode

We’ve added a side-by-side comparison view to evals on Humanloop. This new view enables domain experts to review multiple outputs next to each other and provide judgements with easy-to-use, configurable controls.

Comparison Mode in Evals

To start using this new view, choose a File and click on the Evaluations submenu. Select the eval you want to view and navigate to the Review tab. This is particularly useful for comparing and contrasting outputs from different versions of your AI apps when providing relative judgements.


Bedrock support for Llama models

We’ve added support for Llama models through our AWS Bedrock integration.

AWS Bedrock Llama models in model selection dropdown in a Prompt Editor on Humanloop

You can now select Llama models in the model selection dropdown in the Prompt Editor and start using them in your Prompts. Humanloop supports tool calling for Llama 3.1 models, helping you to build more powerful AI applications.


Evaluation Names

You can now name your Evaluations in the UI and via the API. This is helpful for more easily identifying the purpose of your different Evaluations, especially when multiple teams are running different experiments.

Evaluation with a name

In the API, pass in the name field when creating your Evaluation to set the name. Note that names must be unique across all Evaluations for a specific File. In the UI, navigate to your Evaluation and you will see an option to rename it in the header.
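
For example, with the Python SDK the name can be set at creation time. The snippet below is a sketch that assumes an evaluations.create call accepting a name alongside a file reference; other required fields (such as the dataset and evaluators) are omitted, so check the API reference for the full signature.

# Sketch only: assumes hl.evaluations.create accepts a name alongside a
# file reference; other required fields are omitted for brevity.
evaluation = hl.evaluations.create(
    name="Digital twin regression suite",
    file={"path": "digital-twin"},
)
print(evaluation.id, evaluation.name)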


Introducing Flows

We’ve added a new key building block to our app with the first release of Flows. This release focuses on improving the code-first workflows for evaluating more complex AI applications like RAG and Agent-based apps.

Flows allow you to version your whole AI application on Humanloop (as opposed to just individual Prompts and Tools) and to log and evaluate the full trace of the important processing steps that occur when running your app.

See our cookbook tutorial for examples on how to use Flows in your code.
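
As a flavour of what this looks like in code, the sketch below logs a Flow trace containing a single Prompt step. It is illustrative only and assumes flows.log and prompts.log methods linked by a trace_parent_id, plus a flows.update_log call to close the trace; follow the cookbook tutorial for the exact, up-to-date usage.

# Illustrative sketch: method and parameter names are assumptions;
# see the cookbook tutorial for the exact usage.
flow_log = hl.flows.log(
    path="support-agent",
    flow={"attributes": {"version": "0.1.0"}},
    inputs={"question": "How do I reset my password?"},
)

# Log an intermediate Prompt step against the trace.
hl.prompts.log(
    path="support-agent/answer",
    prompt={"model": "gpt-4o-mini"},
    inputs={"question": "How do I reset my password?"},
    output="You can reset it from the account settings page.",
    trace_parent_id=flow_log.id,
)

# Mark the trace as complete once the app has finished.
hl.flows.update_log(
    log_id=flow_log.id,
    output="You can reset it from the account settings page.",
    trace_status="complete",
)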

Image of a Flow with logs

What’s next

We’ll soon add support for Evaluators to access all Logs inside a trace. We will also build on this by adding UI-first visualisations and management of your Flows.

We’ll sunset Sessions in favour of Flows in the near future. Reach out to us for guidance on how to migrate your Session-based workflows to Flows.


Bedrock support for Anthropic models

We’ve introduced a Bedrock integration on Humanloop, allowing you to use Anthropic’s models via the Bedrock API, leveraging your AWS-managed infrastructure.

AWS Bedrock Claude models in model selection dropdown in a Prompt Editor on Humanloop

To set this up, head to the API Keys tab in your Organization settings here. Enter your AWS credentials and configuration.

Bedrock keys dialog in Humanloop app

Once you’ve set up your Bedrock keys, you can select the Anthropic models in the model selection dropdown in the Prompt Editor and start using them in your Prompts.


OpenAI o1

We added same-day support for OpenAI’s new models, the o1 series. Unlike their predecessors, the o1 models have been designed to spend more time thinking before they respond. In practice, this means that when you call the API, time and tokens are spent doing chain-of-thought reasoning before you receive a response back.

o1 in the Humanloop Editor

Read more about this new class of models in OpenAI’s release note and their documentation.

These models are still in Beta and don’t yet support streaming or tool use; the temperature has to be set to 1, and there are specific rate limits in place.


Evals CI/CD Improvements

We’ve expanded our evals API to include new fields that allow you to more easily check on progress and render summaries of your Evals directly in your deployment logs.

The stats response now contains a status you can poll, as well as progress and report fields that you can print:

⏳ Evaluation Progress
Total Logs: 40/40
Total Judgments: 120/120
📊 Evaluation Results for evals_demo/answer-flow
+------------------------+---------------------------+---------------------------+
| Version id             | flv_xo7ZxnkkvcFcDJ9pwSrA9 | flv_foxO18ZHEgxQmwYJO4bR1 |
+------------------------+---------------------------+---------------------------+
| Created                | 2024-09-01 14:50:28       | 2024-09-02 14:53:24       |
+------------------------+---------------------------+---------------------------+
| Evaluators             |                           |                           |
+------------------------+---------------------------+---------------------------+
| evals_demo/exact_match | 0.8                       | 0.65                      |
| evals_demo/levenshtein | 7.5                       | 33.5                      |
| evals_demo/reasoning   | 0.3                       | 0.05                      |
+------------------------+---------------------------+---------------------------+
Navigate to Evaluation: https://app.humanloop.com/evaluations/evr_vXjRgufGzwuX37UY83Lr8
❌ Latest score [0.05] below the threshold [0.5] for evaluator evals_demo/reasoning.
❌ Regression of [-0.25] for evaluator evals_demo/reasoning
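
As an illustration, a deployment script can poll the stats endpoint and print the report. The sketch below assumes an evaluations.get_stats style method returning the status, progress and report fields described above; check the SDK reference for the exact names and status values.

import time

# Sketch: get_stats, the "completed" status value and the evaluation_id
# variable are assumptions for illustration.
stats = hl.evaluations.get_stats(id=evaluation_id)
while stats.status != "completed":
    print(stats.progress)
    time.sleep(5)
    stats = hl.evaluations.get_stats(id=evaluation_id)
print(stats.report)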

See our reference example for how to leverage Evals as part of your CI/CD pipeline to test for regressions in your AI apps.


Get All Deployed Versions via API

We’ve introduced a new Files API in our v5 API resources that lets you query all files simultaneously. This is useful when managing your workflows on Humanloop and you want to find all files that match specific criteria, such as having a deployment in a specific environment. Supported filters include file name, file type, and deployed environments. If there are additional access patterns you’d find useful, please reach out and let us know.
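
As a rough example, the snippet below lists Prompts deployed to a production environment. It assumes a files.list style SDK method and illustrative filter parameter names; consult the Files API reference for the exact parameters and pagination behaviour.

# Sketch: the method and filter names below are assumptions based on the
# filters described above; check the Files API reference for specifics.
files = hl.files.list(
    type="prompt",
    environment="production",
)
for file in files:
    print(file.path, file.type)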