October 2024


Evals Comparison Mode shortcuts

October 25th, 2024

We’ve added keyboard shortcuts to the side-by-side comparison view to help you and your domain experts work through review tasks more efficiently.

While on the Review tab, press the arrow keys ⬆️, ⬇️, ⬅️, or ➡️ to navigate between judgment cells, and press Enter to edit the judgment.

You can also press keys J and K to switch between datapoints without having to use your mouse or trackpad.


Onboarding improvements

October 24th, 2024

We’ve improved how we introduce Humanloop to new users. We now put more emphasis on the common workflow of triggering evals from code for existing AI apps.

Onboarding wizard with code snippet

Evals code example

When you first enter Humanloop, we’ll give you a code snippet demonstrating how to run an eval on a simple Prompt. You can skip this step and continue in the UI, and we’ll create an example Prompt for you.

If you’re new to Humanloop, these will introduce our key concepts of Prompts and Evals. If you’re already on Humanloop, you can find a similar example in our updated doc quickstarts.

Example Evaluators

As part of these improvements, Humanloop will now provide new organizations with a set of example Evaluators that showcase the range of use cases for the different Evaluator types (AI, code, and Human).

SDK utility

We’re also continuing to extend the utilities within our SDKs. We’ve added a humanloop.prompts.populate_template(...) utility function to the Python SDK to make it easier to use Prompt templates when making your own calls to the model provider. Coming to TypeScript soon.
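
As a rough sketch of how this fits into a workflow where you call the model provider yourself: the exact signature may differ from what’s shown below; we assume the utility takes a chat template using Humanloop’s {{variable}} syntax plus an inputs dict, and returns the populated messages.

import os

from humanloop import Humanloop
from openai import OpenAI

hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))

# A chat template using Humanloop's {{variable}} syntax
template = [{"role": "system", "content": "You are {{person}}"}]

# Assumed signature: populate_template(template, inputs) -> messages with
# the template variables filled in
messages = hl.prompts.populate_template(template, {"person": "Albert Einstein"})

# Make your own call to the model provider using the populated messages
chat_completion = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages + [{"role": "user", "content": "What is your most famous theory?"}],
)
print(chat_completion.choices[0].message.content)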


Claude 3.5 Sonnet

October 20th, 2024

We added same-day support for Anthropic’s new Claude 3.5 Sonnet model. This latest version is reported to have improved performance across the board over its predecessor, in particular on coding tasks - read more here.

Claude-3.5-Sonnet-in-Editor

We’ve added support for both our Anthropic and Bedrock provider integrations.
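
Here is a minimal sketch of trying the model out via the SDK; the prompts.call usage, Prompt path, and model identifier below are illustrative assumptions, so pick the model from the Editor’s model dropdown to be sure.

import os

from humanloop import Humanloop

hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))

# Illustrative call: the Prompt path is hypothetical and the model ID should be
# checked against the model dropdown for your provider integration.
response = hl.prompts.call(
    path="examples/claude-assistant",
    prompt={
        "model": "claude-3-5-sonnet-20241022",
        "template": [{"role": "system", "content": "You are a concise assistant."}],
    },
    messages=[{"role": "user", "content": "Write a haiku about code review."}],
)
print(response)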


Humanloop Status Page

October 18th, 2024

We’ve published a public status page covering any incidents or maintenance that may affect the Humanloop app, API, or website moving forward.

You can use this page to report problems and subscribe to timely updates on the status of our services.

This is part of an ongoing initiative to maintain reliability and trust in our services as we continue to scale.

Humanloop Status Page


Improved Dataset Upload

October 17th, 2024

We’ve added the ability to map your input and target columns to the columns in your .csv on upload. This provides more flexibility to users who predominantly use the UI to manage Datasets.

When you upload a CSV via the Dataset Editor, you will see a new mapping step that lets you select which columns map to which dataset fields.

To learn more about Datasets on Humanloop you can check out our Datasets page.


Evaluate Flow Log contents

October 16th, 2024

Flow Logs allow users to represent complex multi-step apps on Humanloop. Each step can be a Prompt, Tool, or Evaluator Log, or even another Flow Log. Logs can also be nested so that you can represent your app’s execution trace.

Prior to now, Evaluators could only reference the inputs and outputs of Flow Logs when providing judgments. We’ve now added the ability to access the entire contents of a Flow Log in an Evaluator. This allows you to write more complex Evaluators that can inspect the entire execution trace of your app.

How to use

The contents of the Flow Log are accessible via the new children field. Logs within the trace can also have children depending on the level of nesting in your code.

For example, if your Flow Log represents a conversation between a user and a chatbot, you can now write an Evaluator that inspects the entire conversation to make a judgment. Below is a simple example of checking how many steps there were in the conversation:

def count_number_steps(log):
    """Counts the number of steps in the Flow Log."""
    # This assumes there was no subsequent nesting
    return len(log.get('children', []))

Or maybe you want to count how many Logs in the trace returned an empty output, where there may have been nesting:

def count_null_output_logs(log):
    """Count the number of logs in the trace where output is null."""
    def count_null_output(log):
        """Helper function for recursively counting."""
        null_count = 1 if log.get('output') is None else 0
        for child in log.get('children', []):
            null_count += count_null_output(child)
        return null_count
    return count_null_output(log)

You can access children within any of the Evaluator Editors.

Evaluate Flow Log Contents


Export Datasets

October 14th, 2024

You can now export your Datasets on Humanloop to .csv directly from within the UI. This allows you to more rapidly iterate on your Dataset in spreadsheet tools before re-uploading.

Export Datasets Action

You can find this option both within the Dataset Editor and from your existing versions on the Dataset Dashboard.


Evals Comparison Mode Improvements

October 12th, 2024

We’re continuing to invest in our review UI to make it easier for you and your domain experts to work through review tasks more efficiently.

More easily access additional context

You can now hover over judgments in the Review tab to access additional context. This is especially useful for viewing information like instructions and additional outputs (such as when an LLM Evaluator outputs rationale alongside the final judgment). You can also click to expand to the full drawer view.

Comparison Mode Improvements - eval log details

You can now share links to a specific datapoint in the Review tab. Simply select the desired datapoint and copy the URL. Team members who open the link will be directed to the same datapoint.

Using the new text box at the top left of the Review tab, you can now jump to a specific datapoint by index.

This can be helpful when you need to split up the review amongst multiple team members; each can take a different range of datapoints to review.

Comparison Mode Improvements - navigate by index


Evaluations SDK Improvements

October 3rd, 2024

We’ve added a new run method for evaluations to our SDK. This provides a simpler entry point for evaluating your existing pipelines, both in your CI/CD and experimentation workflows. This is currently available in beta on Python and will soon be added to the major versions of both the Python and TypeScript SDKs.

In order to run an eval via the SDK, you need to provide:

  1. A callable function that takes your inputs/messages and returns a string
  2. A Dataset of inputs/messages to evaluate the function against
  3. A set of Evaluators to provide judgments

Here is a toy example using a simple OpenAI call as the function to evaluate.

from humanloop import Humanloop
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))


# First define the app you're evaluating
def call_digital_twin(person: str, messages: list) -> str:
    system_message = {
        "role": "system",
        "content": f"You are {person}"
    }
    chat_completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[system_message] + messages,
    )
    answer = chat_completion.choices[0].message.content
    return answer


# Then run an eval specifying the file location on Humanloop
checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin
    },
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI"}
    ],
    dataset={
        "path": "questions",
        "datapoints": [
            {
                "inputs": {"person": "Albert Einstein"},
                "messages": [{"role": "user", "content": "What is your most famous theory?"}]
            },
            {
                "inputs": {"person": "Serena Williams"},
                "messages": [{"role": "user", "content": "What trophy did you win most recently?"}]
            },
            {
                "inputs": {"person": "Marie Curie"},
                "messages": [{"role": "user", "content": "What was your greatest scientific achievement?"}]
            },
            {
                "inputs": {"person": "Leonardo da Vinci"},
                "messages": [{"role": "user", "content": "Which of your inventions are you most proud of?"}]
            },
            {
                "inputs": {"person": "Rosa Parks"},
                "messages": [{"role": "user", "content": "What motivated you to refuse giving up your seat?"}]
            }
        ]
    },
)

Running this will provide status info and an eval summary in your CLI, and a new eval will appear on Humanloop at the displayed URL. Running it again under the same name will add runs to the existing eval.

Navigate to your evaluations:
https://app.humanloop.com/project/fl_euUV4BHoXqKWqFyZ1YD1o/evaluations/evr_6WhFaHdkbWH8ZaoddzyRD/stats
Running digital-twin flow callable over the Dataset questions using 4 workers
[########################################] 5/5 (100.00%) | DONE
⏳ Evaluation Progress
Total Logs: 10/10
Total Judgments: 30/30
📊 Eval results for digital-twin
+----------------+----------------------------------------+----------------------------------------+
| Version ID | flv_VIP1eiemqbpWmlsr84BwN (eb37773f39) | flv_VIP1eiemqbpWmlsr84BwN (9de378a165) |
+----------------+----------------------------------------+----------------------------------------+
| Added | 2024-10-08 03:46:11 | 2024-10-08 03:51:52 |
+----------------+----------------------------------------+----------------------------------------+
| Evaluators | | |
+----------------+----------------------------------------+----------------------------------------+
| Latency | 0.02 | 0.015 |
| Correctness-AI | 1.0 | 1.0 |
+----------------+----------------------------------------+----------------------------------------+

It returns a set of checks you can use to determine whether the eval passed or failed.
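
For example, you could use the returned checks to gate a CI job. The exact shape of the return value isn’t shown above, so the pass/fail attribute below is an assumption:

import sys

# Assumes each check exposes a boolean pass/fail flag; adjust to the actual
# structure returned by hl.evaluations.run in your SDK version.
if not all(check.passed for check in checks):
    print("Eval checks failed - blocking this change.")
    sys.exit(1)
print("All eval checks passed.")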

Introduce versioning

So far, the only thing distinguishing different eval runs under the same eval name is the timestamp at which they were run.

It can also be helpful to record the configuration of your system at the time the eval was run.

You can include arbitrary config within the version field of the file. If this version has been used before, Humanloop will automatically associate it to your run. If the config is new, we will automatically create a new version of your file for future reference.

import inspect

checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin,
        "version": {
            "version": "0.2.4",
            "code": inspect.getsource(call_digital_twin)
        }
    },
    dataset={...},
    evaluators=[...],
)

Leverage native Prompts

Using hl.evaluations.run(...) will by default create a Flow on Humanloop. Flows have the advantage of being able to represent more complex traces, but can’t be run natively within the Humanloop Editor.

It’s also possible to adapt the run call to instead evaluate Prompts by defining the type as prompt and providing valid Prompt params in the version field.

checks = hl.evaluations.run(
    file={
        "path": "digital-twin-prompt",
        "type": "prompt",
        "version": {
            "model": "gpt-4o-mini",
            "template": [{"role": "system", "content": "You are {{person}}"}]
        }
    },
    dataset={...},
    evaluators=[...],
)

Add Evaluator thresholds

You can also now provide a threshold value for each of your Evaluators. If provided, the returned checks will indicate whether the average performance of each Evaluator met its threshold.

checks = hl.evaluations.run(
    file={...},
    dataset={...},
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI", "threshold": 0.5}
    ],
)

Manage Directories via API

October 1st, 2024

You can now manage directories directly using our API. This can be helpful when programmatically managing your workspace, for example to make bulk changes or dynamically create folder structures.
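
As a rough sketch with the Python SDK (the method and parameter names here are assumptions; see the API reference for the exact endpoints and fields):

import os

from humanloop import Humanloop

hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))

# Create a hypothetical folder structure for each team
for team in ["search", "support", "onboarding"]:
    hl.directories.create(path=f"teams/{team}/prompts")

# Inspect what now exists in the workspace
for directory in hl.directories.list():
    print(directory.path)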

To learn more about directories on Humanloop you can check out our Directories page.