We’ve added the ability to map your input and target columns to the columns in your .csv on upload. This provides more flexibility to users who predominantly use the UI to manage Datasets.
When you upload a CSV via the Dataset Editor, you will see a new mapping step that allows you to select which columns will be mapped to which dataset fields.
To learn more about Datasets on Humanloop you can check out our Datasets page.
We’ve added a new run method for evals to our SDK. This provides a simpler entry point for evaluating your existing pipelines, both in your CI/CD and experimentation workflows. It is currently available in Beta on Python and will soon be added to the major versions of both the Python and TypeScript SDKs.
In order to run an eval via the SDK, you need to provide:
A callable function that takes your inputs/messages and returns a string
A Dataset of inputs/messages to evaluate the function against
```python
# Then run an eval specifying the file location on Humanloop
checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin
    },
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI"}
    ],
    dataset={
        "path": "questions",
        "datapoints": [
            {
                "inputs": {"person": "Albert Einstein"},
                "messages": [{"role": "user", "content": "What is your most famous theory?"}]
            },
            {
                "inputs": {"person": "Serena Williams"},
                "messages": [{"role": "user", "content": "What trophy did you win most recently?"}]
            },
            {
                "inputs": {"person": "Marie Curie"},
                "messages": [{"role": "user", "content": "What was your greatest scientific achievement?"}]
            },
            {
                "inputs": {"person": "Leonardo da Vinci"},
                "messages": [{"role": "user", "content": "Which of your inventions are you most proud of?"}]
            },
            {
                "inputs": {"person": "Rosa Parks"},
                "messages": [{"role": "user", "content": "What motivated you to refuse giving up your seat?"}]
            }
        ]
    },
)
```
Running this will provide status info and an eval summary in your CLI, and a new eval will appear on Humanloop at the displayed URL. Running it again under the same name will add runs to the existing eval.
It returns a set of checks you can use to determine whether the eval passed or failed.
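For example, you might gate a CI/CD step on the returned checks. The exact shape of the return value isn’t shown here, so the field names below (`path`, `passed`) are assumptions for illustration only; consult the SDK reference for the actual return type.

```python
# Hypothetical shape of the returned checks: one entry per Evaluator,
# each with a boolean indicating whether it passed. Field names are
# illustrative, not the SDK's actual schema.
checks = [
    {"path": "Latency", "passed": True},
    {"path": "Cost", "passed": True},
    {"path": "Correctness-AI", "passed": False},
]

# Fail the CI job if any Evaluator check failed.
eval_passed = all(check["passed"] for check in checks)
print("PASS" if eval_passed else "FAIL")
```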
Introduce versioning
So far, the only thing distinguishing different eval runs under the same eval name is the timestamp at which they were run.
It can also be helpful to record the configuration of your system when running the eval.
You can include arbitrary config within the version field of the file. If this
version has been used before, Humanloop will automatically associate it with your run. If the
config is new, we will automatically create a new version of your file for future reference.
```python
import inspect

checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin,
        "version": {
            "version": "0.2.4",
            "code": inspect.getsource(call_digital_twin)
        }
    },
    dataset={...},
    evaluators=[...],
)
```
Leverage native Prompts
Using hl.evaluations.run(...) will by default create a Flow on Humanloop. Flows have
the advantage of being able to represent more complex traces, but can’t be run natively
within the Humanloop Editor.
It’s also possible to adapt the run call to instead evaluate Prompts
by defining the type as prompt and providing valid Prompt params in the version field.
```python
checks = hl.evaluations.run(
    file={
        "path": "digital-twin-prompt",
        "type": "prompt",
        "version": {
            "model": "gpt-4o-mini",
            "template": [{"role": "system", "content": "You are {{person}}"}]
        }
    },
    dataset={...},
    evaluators=[...],
)
```
Add Evaluator thresholds
You can also now provide a threshold value for each of your Evaluators.
If provided, the returned checks will report whether the average performance
of each Evaluator met its threshold.
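Conceptually, a threshold check compares an Evaluator’s average score over the run against the threshold you set. A minimal sketch of that logic (illustrative only, not the SDK’s implementation):

```python
def check_threshold(scores, threshold):
    """Return True if the average Evaluator score meets the threshold.

    This mirrors the described behaviour for illustration; the SDK
    performs this check for you when a threshold is provided.
    """
    average = sum(scores) / len(scores)
    return average >= threshold

# e.g. a boolean "Correctness-AI" evaluator over five datapoints:
# four passes out of five gives an average of 0.8, which meets 0.75.
passed = check_threshold([1, 1, 0, 1, 1], threshold=0.75)
```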
You can now manage directories directly using our API. This can be helpful for programmatically managing your workspace for bulk changes or dynamically creating folder structures.
To learn more about directories on Humanloop you can check out our Directories page.
We’ve added a side-by-side comparison view to evals on Humanloop. This new view enables domain experts to view multiple outputs side-by-side and provide judgments with easy-to-use, configurable controls.
To start using this new view, choose a File and click on the Evaluations submenu. Select the eval you want to view and navigate to the Review tab. This is particularly useful for comparing and contrasting the outputs of different versions of your AI apps when providing relative judgments.
We’ve added support for Llama models through our AWS Bedrock integration.
You can now select Llama models in the model selection dropdown in the Prompt Editor and start using them in your Prompts.
Humanloop supports tool calling for Llama 3.1 models, helping you to build more powerful AI applications.
You can now name your Evaluations in the UI and via the API. This is helpful for more easily identifying the purpose of your different Evaluations, especially when multiple teams are running different experiments.
In the API, pass in the name field when creating your Evaluation to set the name. Note that names must be unique across all Evaluations for a specific file. In the UI, navigate to your Evaluation and you will see an option to rename it in the header.
We’ve added a new key building block to our app with the first release of Flows. This release focuses on improving the code-first workflows for evaluating more complex AI applications like RAG and Agent-based apps.
Flows allow you to version your whole AI application on Humanloop (as opposed to just individual Prompts and Tools) and allows you to log and evaluate the full trace of the important processing steps that occur when running your app.
See our cookbook tutorial for examples on how to use Flows in your code.
What’s next
We’ll soon be extending support for allowing Evaluators to access all Logs inside a trace.
Additionally, we will build on this by adding UI-first visualisations and management of your Flows.
We’ll sunset Sessions in favour of Flows in the near future. Reach out to us for guidance on how to migrate your Session-based workflows to Flows.
We’ve introduced a Bedrock integration on Humanloop, allowing you to use Anthropic’s models via the Bedrock API, leveraging your AWS-managed infrastructure.
To set this up, head to the API Keys tab in your Organization settings here. Enter your AWS credentials and configuration.
Once you’ve set up your Bedrock keys, you can select the Anthropic models in the model selection dropdown in the Prompt Editor and start using them in your Prompts.
We added same-day support for OpenAI’s new models, the o1 series. Unlike their predecessors, the o1 models have been designed to spend more time thinking before they respond.
In practice this means that when you call the API, time and tokens are spent doing chain-of-thought reasoning before you receive a response back.
These models are still in Beta and don’t yet support streaming or tool use; the temperature has to be set to 1, and there are specific rate limits in place.
We’ve expanded our evals API to include new fields that allow you to more easily check on progress and render summaries of your Evals directly in your deployment logs.
The stats response now contains a status you can poll, and progress and report fields that you can print:
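For example, a deployment script might poll the stats until the eval completes. The sketch below simulates the stats responses with a stand-in function; the status, progress, and report field names follow this changelog, but the exact schema and the call that fetches stats are assumptions:

```python
import time

def poll_eval(get_stats, interval=5.0, max_wait=600.0):
    """Poll eval stats until completion.

    `get_stats` stands in for the SDK call that fetches the stats
    response; the field names used here are assumptions.
    """
    waited = 0.0
    while waited < max_wait:
        stats = get_stats()
        print(stats["progress"])      # human-readable progress line
        if stats["status"] == "completed":
            print(stats["report"])    # final summary for deployment logs
            return stats
        time.sleep(interval)
        waited += interval
    raise TimeoutError("eval did not complete in time")

# Simulated stats responses for illustration: completes on the third poll.
responses = iter([
    {"status": "running", "progress": "1/3 datapoints"},
    {"status": "running", "progress": "2/3 datapoints"},
    {"status": "completed", "progress": "3/3 datapoints",
     "report": "All checks passed"},
])
final = poll_eval(lambda: next(responses), interval=0.0)
```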