Committing and deploying UX improvements

We’ve made some improvements to the user experience around committing and deploying changes to your evaluators, tools and datasets.

Now, each editor has a consistent and reliable loading and saving experience. You can choose prior versions in the dropdown, making it easier to toggle between versions.

And, as you commit, you’ll also get the option to immediately deploy your changes. This reduces the number of steps needed to get your changes live.

Additional bug fixes:

  • Fixed the flickering issue on the datasets editor
  • Fixed the issue where the evaluator editor would lose the state of the debug drawer on commit.

Claude 3.5 Sonnet Support

Claude 3.5 Sonnet is now in Humanloop!

Sonnet is the latest and most powerful model from Anthropic.

2x the speed, 1/5th the cost, yet smarter than Claude 3 Opus.

Anthropic have now enabled streaming of tool calls, which is also supported in Humanloop.

Add your Anthropic key and select Sonnet in the Editor to give it a go.


Prompt and Tool version drawer in Evaluation reports

You can now click on the Prompt and Tool version tags within your Evaluation report to open a drawer with details. This provides the additional context needed when reasoning about the results, without having to navigate away.

Prompt drawer in Evaluation report

Status of Human Evaluators

With Humanloop Evaluation Reports, you can leverage multiple Evaluators for comparing your Prompt and Tool variations. Evaluators can be of different types: code, AI or Human, and the progress of the report is dependent on collecting all the required judgments. Human judgments generally take longer than the rest and are collected asynchronously by members of your team.

Human Evaluators

To better support this workflow, we’ve improved the UX around monitoring the status of judgments, with a new progress bar. Your Human Evaluators can now also update the status of the report when they’re done.

Human Evaluators

We’ve also added the ability to cancel ongoing Evaluations that are pending or running. Humanloop will then stop generating Logs and running Evaluators for this Evaluation report.

Faster Evaluations

Following the recent upgrades around Evaluation reports, we’ve improved the batching and concurrency for both calling models and getting the evaluation results. This has increased the speed of Evaluation report generation by 10x and the reports now update as new batches of logs and evaluations are completed to give a sense of intermediary progress.

Evaluation Comparison Reports

We’ve released Evaluation reports, which allow you to easily compare the performance of your different Prompts or Tools across multiple Evaluator criteria.

This generalises our previous concept of Evaluation runs, extending it with multiple complementary changes to help you get more from your evals. All your existing Evaluation runs have been migrated to Evaluation reports with a single evaluated Prompt or Tool. You can easily extend these existing runs to cover additional Evaluators and Prompts/Tools without having to regenerate existing logs.

Feature breakdown

We’ve introduced a new stats comparison view, including a radar chart that gives you a quick overview of how your versions compare across all Evaluators. Below it, your evaluated versions are shown in columns, forming a grid with a row per Evaluator you’ve selected.

The performance of each version for a given Evaluator is shown in a chart, where bar charts are used for boolean results, while box plots are used for numerical results providing an indication of variance within your Dataset.

Evaluation reports also introduce an automatic deduplication feature, which utilizes previous logs to avoid generating new logs for the same inputs. If a log already exists for a given evaluated-version-and-datapoint pair, it will automatically be reused. You can also override this behavior and force the generation of new logs for a report by creating a New Batch in the setup panel.

How to use Evaluation reports

To get started, head over to the Evaluations tab of the Prompt you’d like to evaluate, and click Evaluate in the top right.

This will bring you to a page where you can set up your Evaluation, choosing a Dataset, some versions to Evaluate and compare, and the Evaluators you’d like to use.

When you click Save, the Evaluation report will be created, and any missing Logs will be generated.

What’s next

We’re planning on improving the functionality of Evaluation reports by adding a more comprehensive detailed view, allowing you to get a more in-depth look at the generations produced by your Prompt versions. Together with this, we’ll also be improving Human evaluators so you can better annotate and aggregate feedback on your generations.

Azure Model Updates

You can now access the latest versions of GPT-4 and GPT-4o hosted on Azure in the Humanloop Editor and via our Chat endpoints.

Once you’ve configured your Azure key and endpoint in your organization’s provider settings, the model versions will show up in the Editor dropdown as follows:


Improved Logs Filtering

We’ve improved the ability to filter logs by time range. The API logs filter parameters start_date and end_date now support querying with more granularity. Previously the filters were limited to dates, such as 2024-05-22; now you can use hourly ranges as well, such as 2024-05-22 13:45.
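As a rough sketch of what this enables (the base URL and parameter passing here are illustrative; see the API reference for the exact Logs endpoint and query format), an hour-granularity filter might look like:

```python
from urllib.parse import urlencode

# Hypothetical example: filter logs to a one-hour window.
# start_date / end_date now accept hour-level values like "2024-05-22 13:45".
params = {
    "start_date": "2024-05-22 13:45",
    "end_date": "2024-05-22 14:45",
}

# Illustrative base URL; check the API docs for the real Logs endpoint.
url = "" + "?" + urlencode(params)
print(url)
```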

For more detail, please see the API documentation on our Logs endpoints.

Monitoring with deployed Evaluators

You can now connect deployed Evaluator versions for online monitoring of your Prompts and Tools.

This enables you to update the Evaluators for multiple Prompts or Tools at once, simply by deploying a new Evaluator version.


GPT-4o

Same day support for OpenAI’s new GPT-4o (Omni) model! You can now use it within the Humanloop Editor and chat APIs.

Find out more from OpenAI here.

Logs for Evaluators

For AI and Code Evaluators, you can now inspect and reference their logs as with Prompts and Tools. This provides greater transparency into how they are being used and improves the ability to debug and improve.

Further improvements to Human Evaluators are coming very soon…

Creating a new Evaluator file

Improved Evaluator management

Evaluators are now first class citizens alongside Prompts, Tools and Datasets. This allows for easier re-use, version control and helps with organising your workspace within directories.

You can create a new Evaluator by choosing Evaluator in the File creation dialog in the sidebar or on your home page.

Creating a new Evaluator file

Migration and backwards compatibility

We’ve migrated all of your Evaluators previously managed within Prompts > Evaluations > Evaluators to new Evaluator files. All your existing Evaluation runs will remain unchanged and online Evaluators will continue to work as before. Moving forward you should use the new Evaluator file to make edits and manage versions.

Log drawer in Editor

You can now open up the Log drawer directly in the Editor.

This enables you to see exactly what was sent to the provider as well as the tokens used and cost. You can also conveniently add feedback and run evaluators on that specific Log, or add it to a dataset.

To show the Logs just click the arrow icon beside each generated message or completion.

Groq support (Beta)

We have introduced support for models available on Groq to Humanloop. You can now try out the blazingly fast generations made with the open-source models (such as Llama 3 and Mixtral 8x7B) hosted on Groq within our Prompt Editor.

Groq achieves faster throughput using specialized hardware, their LPU Inference Engine. More information is available in their FAQ and on their website.

Note that their API service, GroqCloud, is still in beta and low rate limits are enforced.

Llama 3

Llama 3, Meta AI’s latest openly-accessible model, can now be used in the Humanloop Prompt Editor.

Llama 3 comes in two variants: an 8-billion parameter model that performs similarly to their previous 70-billion parameter Llama 2 model, and a new 70-billion parameter model. Both of these variants have an expanded context window of 8192 tokens.

More details and benchmarks against other models can be found on their blog post and model card.

Humanloop supports Llama 3 on the Replicate model provider, and on the newly-introduced Groq model provider.

Anthropic tool support (Beta)

Our Editor and deployed endpoints now support tool use with Anthropic’s Claude 3 models. Tool calling with Anthropic is still in Beta, so streaming is not yet supported.

To use tool calling for Claude in the Editor, you therefore need to first turn off streaming mode in the menu dropdown to the right of the Load button.

Cost, Tokens and Latency

We now compute Cost, Tokens and Latency for all Prompt logs by default across all model providers.

These values will now appear automatically as graphs in your Dashboard, as columns in your logs table and will be displayed in our Version and Log drawers.

Cohere Command-r

We’ve expanded the Cohere models with the latest command-r suite. You can now use these models in our Editor and via our APIs once you have set your Cohere API key.

More details can be found on their blog post.

Dataset Files & Versions

In our recent release, we promoted Datasets from being attributes managed within the context of a single Prompt, to a first-class Humanloop file type alongside Prompts and Tools.

This means you can curate Datasets and share them for use across any of the Prompts in your organization. It also means you get the full power of our file versioning system, allowing you to track and commit every change you make to Datasets and their Datapoints, with attribution and commit messages inspired by Git.

It’s now easy to understand which version of a Dataset was used in a given Evaluation run, and whether the most recent edits to the Dataset were included or not.

Read more on how to get started with datasets here.

This change lays the foundation for lots more improvements we have coming to Evaluations in the coming weeks. Stay tuned!

Mixtral 8x7B

Keeping you up to date with the latest open models, we’ve added support for Mixtral 8x7B to our Editor with a Replicate integration.

Mixtral 8x7B outperforms LLaMA 2 70B (already supported in Editor) with faster inference, with performance comparable to that of GPT-3.5. More details are available in its release announcement.

Additional Replicate models support via API

Through the Replicate model provider, additional open models can be used by specifying a model name via the API. The model name should take the same form as the ref used when calling the model with Replicate’s Python SDK.

For example, Vicuna, an open-source chatbot model based on finetuning LLaMA, can be used by specifying its model name alongside provider: "replicate" in your Prompt version.
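A minimal sketch of such a Prompt version payload (the provider value "replicate" is from this changelog; the model ref shown is a hypothetical example in the owner/model form used by Replicate’s Python SDK):

```python
# Hypothetical Prompt version config using an open model via Replicate.
# The exact model ref to use is whatever you would pass to Replicate's SDK.
prompt_version = {
    "provider": "replicate",
    "model": "replicate/vicuna-13b",  # illustrative ref, not a guaranteed ID
}
print(prompt_version["provider"])
```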

Surfacing uncommitted Versions

We now provide the ability to access your uncommitted Prompt Versions and associated Logs.

Adding to our recent changes around the Commit flow for Versions, we’ve added the ability to view any uncommitted versions in your Versions and Logs tables. This can be useful if you need to recover or compare to a previous state during your Prompt engineering and Evaluation workflows.

Uncommitted Versions are created when you make generations in our Editor without first committing what you are working on. In future, it will also be possible to create uncommitted versions when logging or generating using the API.

We’ve added new filter tabs to the Versions and Logs table to enable this:

  • New **All** and **From Committed Versions** filter tabs on the Logs table.
  • New **Committed** and **Uncommitted** tabs on the Versions table of your Prompt dashboard.

Improved navigation & sidebar

We’ve introduced a sidebar for easier navigation between your Prompts and Tools.

As new language models unlock more complex use cases, you’ll be setting up and connecting Prompts, Tools, and Evaluators. The new layout better reflects these emerging patterns, and switching between your files is now seamless with the directory tree in the sidebar.

You can also bring up the search dialog with Cmd+K and switch to another file using only your keyboard.

Claude 3

Introducing same day support for Claude 3, Anthropic’s new industry-leading models. Read more about the release here.

The release contains three models in ascending order of capability: Haiku, Sonnet, and Opus. This suite provides users with the different options to balance intelligence, speed, and cost for their specific use-cases.

Key takeaways

  1. Performance - a new leader. The largest of the three models, Opus, is claimed to outperform GPT-4 and Gemini Ultra on key benchmarks such as MMLU and HellaSwag. It even reached 84.9% on the HumanEval coding test set (vs GPT-4’s 67%) 🤯
  2. 200k context window with near-perfect recall on selected benchmarks. Opus reports 99% accuracy on the NIAH test, which measures how accurately a model can recall information given to it in a large corpus of data.
  3. Opus has vision. Anthropic claim that performance here is on par with that of other leading models (i.e. GPT-4 and Gemini). They say it’s most useful for inputting graphs, slides, etc. in an enterprise setting.
  4. Pricing - as compared to OpenAI:

Opus - $75 (2.5x GPT-4 Turbo)
Sonnet - $15 (50% of GPT-4 Turbo)
Haiku - $1.25 (1.6x GPT-3.5)

  5. How you can use it: The Claude 3 family is now available on Humanloop. Bring your API key to test, evaluate and deploy the publicly available models - Opus and Sonnet.

New Tool creation flow

You can now create Tools in the same way as you create Prompts and Directories. This makes Tools easier to discover and quicker to create.

To create a new Tool simply press the New button from the directory of your choice and select one of our supported Tools, such as JSON Schema tool for function calling or our Pinecone tool to integrate with your RAG pipelines.

Tool editor and deployments

You can now manage and edit your Tools in our new Tool Editor. This is found in each Tool file and lets you create and iterate on your Tools. We have also introduced deployments to Tools, so you can better control which versions of a Tool are used within your Prompts.

Tool Editor

This replaces the previous Tools section which has been removed. The editor will let you edit any of the tool types that Humanloop supports (JSON Schema, Google, Pinecone, Snippet, Get API) and commit new Versions.


Tools can now be deployed. You can pick a version of your Tool and deploy it. When deployed it can be used and referenced in a Prompt editor.

As an example: suppose you have a version of a Snippet tool with the signature snippet(key) and a key/value pair of “helpful”/“You are a helpful assistant”. If you decide you would rather change the value to “You are a funny assistant”, you can commit a new version of the Tool with the updated value. This won’t affect any of your Prompts that reference the Snippet tool until you deploy the second version, after which each Prompt will automatically start using the funny assistant prompt.

Prompt labels and hover cards

We’ve rolled out a unified label for our Prompt Versions to allow you to quickly identify your Prompt Versions throughout our UI. As we’re rolling out these labels across the app, you’ll have a consistent way of interacting with and identifying your Prompt Versions.

Label and hover card for a deployed Prompt Version

The labels show the deployed status and short ID of the Prompt Version. When you hover over these labels, you will see a card that displays the commit message and authorship of the committed version.

You’ll be able to find these labels in many places across the app, such as in your Prompt’s deployment settings, in the Logs drawer, and in the Editor.

The Prompt Version label and hover card in a Prompt Editor

As a quick tip, the color of the checkmark in the label indicates whether the version has been deployed: if the Prompt Version has not been deployed, the checkmark will be black.

A Prompt Version that has not been deployed

Committing Prompt Versions

Building on our terminology improvements from Project -> Prompt, we’ve now updated Model Configs -> Prompt Versions to improve consistency in our UI.

This is part of a larger suite of changes to improve the workflows around how entities are managed on Humanloop and to make them easier to work with and understand. We will also be following up soon with a new and improved major version of our API that encapsulates all of our terminology improvements.

In addition to just the terminology update, we’ve improved our Prompt versioning functionality to now use commits that can take commit messages, where you can describe how you’ve been iterating on your Prompts.

We’ve removed the need for names (and our auto-generated placeholder names) in favour of using explicit commit messages.

We’ll continue to improve the version control and file types support over the coming weeks.

Let us know if you have any questions around these changes!

Online evaluators for monitoring Tools

You can now use your online evaluators for monitoring the logs sent to your Tools. The results of this can be seen in the graphs on the Tool dashboard as well as on the Logs tab of the Tool.

To enable Online Evaluations follow the steps seen in our Evaluate models online guide.

Logging token usage

We’re now computing and storing the number of tokens used in both the requests to and responses from the model.

This information is available in the logs table UI and as part of the log response in the API. Furthermore you can use the token counts as inputs to your code and LLM based evaluators.

The number of tokens used in the request is called prompt_tokens and the number of tokens used in the response is called output_tokens.

This works consistently across all model providers, and whether or not you are streaming the responses, even though OpenAI, for example, do not return token usage stats when in streaming mode.
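As an illustration of how you might consume these fields (the field names prompt_tokens and output_tokens are from this changelog; the surrounding log structure here is a stand-in, not the exact API schema):

```python
# Illustrative log payload with the new token count fields.
log = {
    "id": "data_abc123",       # hypothetical log ID
    "prompt_tokens": 54,       # tokens used in the request to the model
    "output_tokens": 128,      # tokens used in the model's response
}

# e.g. a simple code evaluator could use these as inputs
total_tokens = log["prompt_tokens"] + log["output_tokens"]
is_long_generation = log["output_tokens"] > 100
print(total_tokens, is_long_generation)  # 182 True
```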

Prompt Version authorship

You can now view who authored a Prompt Version.

Prompt Version authorship in the Prompt Version slideover

We’ve also introduced a popover showing more Prompt Version details that shows when you mouseover a Prompt Version’s ID.

Prompt Version popover in the Logs slideover

Keep an eye out as we’ll be introducing this in more places across the app.

Filterable and sortable evaluations overview

We’ve made improvements to the evaluations runs overview page to make it easier for your team to find interesting or important runs.

The charts have been updated to show a single datapoint per run. Each chart represents a single evaluator, and shows the performance of the prompt tested in that run, so you can see at a glance how the performance of your prompt versions has evolved through time, and visually spot the outliers. Datapoints are color-coded by the dataset used for the run.

The table is now paginated and does not load your entire project’s list of evaluation runs in a single page load. The page should therefore load faster for teams with a large number of runs.

The columns in the table are now filterable and sortable, allowing you to, for example, filter just for the completed runs that test two specific prompt versions on a specific dataset, sorted by their performance under a particular evaluator.

Here, we've filtered the table on completed runs that tested three specific prompt versions of interest, and sorted to show those with the worst performance on the Valid JSON evaluator.

Projects rename and file creation flow

We’ve renamed Projects to Prompts and Tools as part of our move towards managing Prompts, Tools, Evaluators and Datasets as special-cased and strictly versioned files in your Humanloop directories.

This is a purely cosmetic change for now. Your Projects (now Prompts and Tools) will continue to behave exactly the same. This is the first step in a whole host of app layout, navigation and API improvements we have planned in the coming weeks.

If you are curious, please reach out to learn more.

New creation flow

We’ve also updated our file creation flow UI. When you go to create projects you’ll notice they are called Prompts now.

Control logging level

We’ve added a save flag to all of our endpoints that generate logs on Humanloop so that you can control whether the request and response payloads that may contain sensitive information are persisted on our servers or not.

If save is set to false, then no inputs, messages or outputs of any kind (including the raw provider requests and responses) are stored on our servers. This can be helpful for sensitive use cases where you can’t, for example, risk PII leaving your system.

Details of the model configuration and any metadata you send are still stored. Therefore you can still benefit from certain types of evaluators such as human feedback, latency and cost, as well as still track important metadata over time that may not contain sensitive information.

This includes all our chat and completion endpoint variations, as well as our explicit log endpoint.

```python
from humanloop import Humanloop

# You need to initialize the Humanloop SDK with your API key
humanloop = Humanloop(api_key="<YOUR Humanloop API KEY>")

# humanloop.complete_deployed(...) will call the active model config on your project.
# You can optionally set the save flag to False
complete_response = humanloop.complete_deployed(
    project="<YOUR PROJECT NAME>",
    save=False,
    inputs={"question": "I have an inquiry about my life insurance policy. Can you help?"},
)

# You can still retrieve the data_id and output as normal
data_id =[0].id
output =[0].output

# And log end user feedback that will still be stored, type="rating", value="good")
```

Logging provider request

We’re now capturing the raw provider request body alongside the existing provider response for all logs generated from our deployed endpoints.

This provides more transparency into how we map our provider-agnostic requests to specific providers. It can also be effective for troubleshooting the cases where we return well-handled provider errors from our API.

Add Evaluators to existing runs

You can now add an evaluator to any existing evaluation run. This is helpful in situations where you have no need to regenerate logs across a dataset, but simply want to run new evaluators across the existing run. By doing this instead of launching a fresh run, you can save the significant time and costs associated with unnecessarily regenerating logs, especially when working with large datasets.

Use the **Add Evaluator** button to run more evaluators across the logs in an existing evaluation run. This can be done on any runs, including those still running or already completed.

Improved Evaluation Debug Console

We’ve enhanced the usability of the debug console when creating and modifying evaluators. Now you can more easily inspect the data you are working with, and understand the root causes of errors to make debugging quicker and more intuitive.

On any row in the debug console, click the arrow next to a testcase to inspect the full entity in a slideover panel.

After clicking Run to generate a log from a testcase, you can inspect the full log right from the debug console, giving you clearer access to error messages or the model-generated content, as in the example below.

LLM Evaluators

We expect this feature to be most useful in the case of creating and debugging LLM evaluators. You can now inspect the log of the LLM evaluation itself right from the debug console, along with the original testcase and model-generated log, as described above.

After clicking Run on a testcase in the debug console, you’ll see the LLM Evaluation Log column populated with a button that opens a full drawer.

This is particularly helpful to verify that your evaluation prompt was correctly populated with data from the underlying log and testcase, and to help understand why the LLM’s evaluation output may not have been parsed correctly into the output values.

Tool projects

We have upgraded projects to now also work for tools. Tool projects are automatically created for tools you define as part of your model config in the Editor as well as tools managed at organization level.

It is now easier to access the logs from your tools and manage different versions like you currently do for your prompts.

Tool versioning

In the dashboard view, you can see the different versions of your tools. This will soon be expanded to link you to the source config and provide a more comprehensive view of your tool’s usage.


Any logs submitted via the SDK that relate to these tools will now appear in the Logs view of these projects. You can see this by following our sessions guide and logging a new tool via the SDK. This also works natively with online Evaluators, so you can start to layer in observability for the individual non-LLM components of your session.

Offline Evaluations via SDK

You can trigger evaluations on your tools projects similar to how you would for an LLM project with model configs. This can be done by logging to the tool project, creating a dataset, and triggering an evaluation run. A good place to start would be the Set up evaluations using API guide.

Support for new OpenAI Models

Following OpenAI’s latest model releases, you will find support for all the latest models in our Playground and Editor.

GPT-3.5-Turbo and GPT-4-Turbo

If your API key has access to the models, you’ll see the new release gpt-4-0125-preview and gpt-3.5-turbo-0125 available when working in Playground and Editor. These models are more capable and cheaper than their predecessors - see the OpenAI release linked above for full details.

We also support the new gpt-4-turbo-preview model alias, which points to the latest gpt-4-turbo model without specifying a specific version.

Embedding Models

Finally, the new embedding models - text-embedding-3-small and text-embedding-3-large are also available for use via Humanloop. The small model is 5x cheaper than the previous generation ada-002 embedding model, while the larger model significantly improves performance and maps to a much larger embedding space.

Improved evaluation run launcher

We’ve made some usability enhancements to the launch experience when setting up batch generation & evaluation runs.

It’s now clearer which model configs, datasets and evaluators you’ve selected. It’s also now possible to specify whether you want the logs to be generated in the Humanloop runtime, or if you’re going to post the logs from your own infrastructure via the API.

Cancellable evaluation runs

Occasionally, you may launch an evaluation run and then realise that you didn’t configure it quite the way you wanted. Perhaps you want to use a different model config or dataset, or would like to halt its progress for some other reason.

We’ve now made evaluation runs cancellable from the UI - see the screenshot below. This is especially helpful if you’re running evaluations over large datasets, where you don’t want to unnecessarily consume provider credits.

Cancellation button in the evaluation run page.

Faster offline evaluations

We’ve introduced batching to our offline Evaluations to significantly speed up runtime performance and also improved the robustness to things going wrong mid-run.

In addition to our recent enhancements to the Evaluations API, we’ve also made some significant improvements to our underlying orchestration framework which should mean your evaluation runs are now faster and more reliable. In particular, we now batch generations across the run - by default in groups of five, being conscious of potential rate limit errors (though this will soon be configurable).

Each batch runs its generations concurrently, so you should see much faster completion times - especially in runs across larger datasets.

Evaluation API enhancements

We’ve started the year by enhancing our evaluations API to give you more flexibility for self-hosting whichever aspects of the evaluation workflow you need to run in your own infrastructure - while leaving the rest to us!

Mixing and matching the Humanloop-runtime with self-hosting

Conceptually, evaluation runs have two components:

  1. Generation of logs for the datapoints using the version of the model you are evaluating.
  2. Evaluating those logs using Evaluators.

Now, using the Evaluations API, Humanloop offers the ability to generate logs either within the Humanloop runtime, or self-hosted (see our guide on external generations for evaluations).

Similarly, evaluating of the logs can be performed in the Humanloop runtime (using evaluators that you can define in-app), or self-hosted (see our guide on self-hosted evaluations).

It is now possible to mix and match self-hosted and Humanloop-runtime logs and evaluations in any combination you wish.

When creating an Evaluation (via the improved UI dialogue or via the API), you can set the new hl_generated flag to False to indicate that you are posting the logs from your own infrastructure. You can then also include an evaluator of type External to indicate that you will post evaluation results from your own infrastructure.

You can now also include multiple evaluators on any run, and these can include a combination of External (i.e. self-hosted) and Humanloop-runtime evaluators.
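To make the combination concrete, here is an illustrative request body for creating such an Evaluation. The hl_generated flag and the External evaluator type are from this changelog; the overall payload shape, field names and IDs are assumptions, so check the API reference for the exact schema:

```python
# Hypothetical Evaluation creation payload mixing self-hosted and
# Humanloop-runtime components.
evaluation_request = {
    "dataset_id": "ds_12345",       # illustrative dataset ID
    "hl_generated": False,          # logs will be posted from your own infra
    "evaluators": [
        {"id": "ev_code_quality"},              # Humanloop-runtime evaluator
        {"type": "External", "id": "ev_human"}, # results posted by you
    ],
}
print(len(evaluation_request["evaluators"]))  # 2
```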

Human Evaluators

We’ve introduced a new special type of ‘Human’ Evaluator to complement our existing code and AI based Evaluators.

There are many important evaluation use cases that require input from your internal domain experts, or product teams. Typically this is where you would like a gold standard judgement of how your LLM app is performing.

Our new Human Evaluator allows you to trigger a batch evaluation run as normal (from our UI as part of your prompt engineering process, or using our SDK as part of your CI/CD pipeline) and then queues the results ready for a human to provide feedback.

Once completed, the feedback is aggregated to give a top-line summary of how the model is performing. It can also be combined with automatic code and AI evaluators in a single run.

Set up your first Human Evaluator run by following our guide.

Return inputs flag

We’ve introduced a return_inputs flag on our chat and completion endpoints to improve performance for larger payloads.

As model context windows get increasingly large (for example, Claude with 200k tokens), it’s important to make sure our APIs remain performant. A contributor to response times is the size of the response payload being sent over the wire.

When you set this new flag to false, our responses will no longer contain the inputs that were sent to the model and so can be significantly smaller. This is the first in a sequence of changes to add more control to the caller around API behaviour.
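As a sketch (the return_inputs parameter name is from this changelog; the surrounding request shape is illustrative), a chat request opting out of echoed inputs might look like:

```python
# Hypothetical chat request payload using the new flag.
chat_request = {
    "project": "support-bot",   # illustrative project name
    "messages": [{"role": "user", "content": "Summarize the attached policy."}],
    "inputs": {"document": "<many thousands of tokens of context>"},
    "return_inputs": False,     # response will omit the (potentially large) inputs
}
print(chat_request["return_inputs"])  # False
```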

As always, we welcome suggestions, requests, and feedback should you have any.


Gemini

You can now use Google’s latest LLMs, Gemini, in Humanloop.


To use Gemini, first go to and generate an API key. Then, save this under the “Google” provider on your API keys page.

Head over to the playground, and you should see gemini-pro and gemini-pro-vision in your list of models.

You can also now use Gemini through the Humanloop API’s /chat endpoints.


Gemini offers support for multi-turn chats, tool calling, and multi-modality.

However, note that while gemini-pro supports multi-turn chats and tool calling, it does not support multi-modality. On the other hand, gemini-pro-vision supports multi-modality but not multi-turn chats or tool calling. Refer to Gemini’s docs for more details.

When providing images to Gemini, we’ve maintained compatibility with OpenAI’s API. This means that when using Humanloop, you can provide images either via an HTTP URL or with a base64-encoded data URL.
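For example, building a base64 data URL in the OpenAI-compatible message format might look like this (the message shape follows OpenAI’s vision format as described above; the image bytes are a stand-in):

```python
import base64

# Construct an OpenAI-compatible vision message for gemini-pro-vision.
image_bytes = b"<raw PNG bytes>"  # stand-in for a real image file's contents
data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": data_url}},
    ],
}
```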

Chat sessions in Editor

Your chat messages in Editor are now recorded as part of a session so you can more easily keep track of conversations.

After chatting with a saved prompt, go to the sessions tab and your messages will be grouped together.

If you want to do this with the API, it can be as simple as setting the session_reference_id – see our docs on sessions.

Improved Evaluator UI

We’ve improved the experience of creating and debugging your evaluators.

Now that you can access any property of the objects you’re testing, we’ve cleaned up the debug panel to make it easier to view the testcases that you load from a dataset or from your projects.

We’ve also clarified the return types that are expected as you create your evaluators.

Prompt diffs

Following our recent introduction of our .prompt file, you can now compare your model configs within a project with our new ‘diff’ view.

As you modify and improve upon your model configs, you might want to remind yourself of the changes that were made between different versions of your model config. To do so, you can now select 2 model configs in your project dashboard and click Compare to bring up a side-by-side comparison between them. Alternatively, open the actions menu and click Compare to deployed.

This diff compares the .prompt files representing the two model configs, and will highlight any differences such as in the model, hyperparameters, or prompt template.
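Conceptually this is similar to a textual diff of the two files; for instance, with Python’s difflib (the .prompt contents here are invented for illustration):

```python
import difflib

old = "model: gpt-4\ntemperature: 0.7\n---\nYou are a helpful assistant."
new = "model: gpt-4\ntemperature: 0.2\n---\nYou are a terse assistant."

# Line-by-line diff highlighting the changed hyperparameter and template
diff_lines = list(difflib.unified_diff(
    old.splitlines(), new.splitlines(),
    fromfile="v1.prompt", tofile="v2.prompt", lineterm="",
))
print("\n".join(diff_lines))
```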

LLM evals - improved data access

In order to help you write better LLM evaluator prompts, you now have finer-grained access to the objects you are evaluating.

It’s now possible to access any part of the log and testcase objects using familiar syntax like log.messages[0].content. Use the debug console to help understand what the objects look like when writing your prompts.
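To illustrate what that path syntax resolves to, here’s a toy resolver over plain dicts and lists (the Humanloop runtime handles this for you; `resolve` is just a sketch):

```python
import re

def resolve(obj, path: str):
    # Walk a "messages[0].content"-style path over nested dicts/lists.
    for part in re.findall(r"\w+|\[\d+\]", path):
        obj = obj[int(part[1:-1])] if part.startswith("[") else obj[part]
    return obj

log = {"messages": [{"role": "user", "content": "hello"}], "output": "hi"}
print(resolve(log, "messages[0].content"))  # hello
```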

Tool linking

It’s now possible to manage tool definitions globally for your organization and re-use them across multiple projects by linking them to your model configs.

Prior to this change, if you wanted to re-use the same tool definition across multiple model configs, you had to copy and paste the JSON schema snippet defining the name, description and parameters into your Editor for each case. And if you wanted to make changes to this tool, you had to recall which model configs it was saved to and update them inline one by one.

You can achieve this tool re-use by first defining an instance of our new JsonSchema tool available as another option in your global Tools tab. Here you can define a tool once, such as get_current_weather(location: string, unit: 'celsius' | 'fahrenheit'), and then link that to as many model configs as you need within the Editor as shown below.
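For reference, a definition for the get_current_weather example might look like the following JSON schema (field names follow OpenAI’s function-calling format; your exact definition may differ):

```json
{
  "name": "get_current_weather",
  "description": "Get the current weather for a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city, e.g. London"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}
```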

Importantly, updates to the get_current_weather JsonSchema tool defined here will then propagate automatically to all the model configs you’ve linked it to, without having to publish new versions of the prompt.

The old behaviour of defining the tool inline as part of your model config definition is still available for the cases where you do want changes in the definition of the tool to lead to new versions of the model-config.

Set up the tool

Navigate to the tools tab in your organisation and select the JsonSchema tool card.

With the dialog open, define your tool with name, description, and parameters values. Our guide for using OpenAI Function Calling in the playground can be a useful reference in this case.

Using the tool

In the editor of your target project, link the tool by pressing the Add Tool button and selecting your get_current_weather tool.

Improved log table UI

We’ve updated how we show logs and datapoints in their respective tables. You can now see the stack of inputs and messages in a cleaner interface rather than having them spread into separate columns.

Part of the updated Log Table. Inputs are now stacked with a more consistent and less-busy UI.

There will be more updates to improve how logs and prompts are shown in tables and drawers soon, so if you have ideas for improvements please let us know.

Introducing .prompt files

We’re introducing a .prompt file format for representing model configs in a format that’s both human-readable and easy to work with.

For certain use cases it can be helpful for engineers to also store their prompts alongside their app’s source code in their favourite version control system. The .prompt file is the appropriate artefact for this.

These .prompt files can be retrieved through both the API and through the Humanloop app.

Exporting via API

To fetch a .prompt file via the API, make a POST request to {id}/export, where {id} is the ID of the model config (beginning with config_).

Export from Humanloop

You can also export an existing model config as a .prompt file from the app. Find the model config within the project dashboard’s table of model configs and open the actions menu by clicking the three dots. Then click Export .prompt. (You can also find this button within the drawer that opens after clicking on a model config’s row).


Additionally, we’ve added the ability to view and edit your model configs in the .prompt file format within the Editor. Press Cmd-Shift-E in the Editor to swap over to a view of your .prompt file.

More details on our .prompt file format are available here. We’ll be building on this and making it more powerful. Stay tuned.

Improved RBACs

We’ve introduced more levels to our role-based access controls (RBACs).

We now distinguish between different roles to help you better manage your organization’s access levels and permissions on Humanloop.

This is the first in a sequence of upgrades we are making around RBACs.

Organization roles

Currently, everyone invited to the organization can access all projects (project-level access controls are coming soon).

A user can have one of the following roles:

**Admin:** The highest level of control. Admins can manage, modify, and oversee the organization’s settings and have full functionality across all projects.

**Developer:** (Enterprise tier only) Can deploy prompts, manage environments, and create and add API keys, but cannot access billing or invite others.

**Member:** (Enterprise tier only) The basic level of access. Can create and save prompts and run evaluations, but cannot deploy. Cannot see any org-wide API keys.

RBACs summary

Here is the full breakdown of roles and access:

| | Member | Developer | Admin |
| --- | --- | --- | --- |
| Create and manage Prompts | ✔️ | ✔️ | ✔️ |
| Inspect logs and feedback | ✔️ | ✔️ | ✔️ |
| Create and manage evaluators | ✔️ | ✔️ | ✔️ |
| Run evaluations | ✔️ | ✔️ | ✔️ |
| Create and manage datasets | ✔️ | ✔️ | ✔️ |
| Create and manage API keys | | ✔️ | ✔️ |
| Manage prompt deployments | | ✔️ | ✔️ |
| Create and manage environments | | ✔️ | ✔️ |
| Send invites | | | ✔️ |
| Set user roles | | | ✔️ |
| Manage billing | | | ✔️ |
| Change organization settings | | | ✔️ |

Self hosted evaluations

We’ve added support for managing evaluations outside of Humanloop in your own code.

There are certain use cases where you may wish to run your evaluation process outside of Humanloop, where the evaluator itself is defined in your code as opposed to being defined using our Humanloop runtime.

For example, you may have implemented an evaluator that uses your own custom model, or that has to interact with multiple systems. In such cases, it can be difficult to define these as a simple code or LLM evaluator within your Humanloop project.

With this kind of setup, our users have found it very beneficial to leverage the datasets they have curated on Humanloop, as well as consolidate all of the results alongside the prompts stored on Humanloop.

To better support this setting, we’re releasing additional API endpoints and SDK utilities. We’ve added endpoints that allow you to:

  • Retrieve your curated datasets
  • Trigger evaluation runs
  • Send evaluation results for your datasets generated using your custom evaluators

Below is a code snippet showing how you can use the latest version of the Python SDK to log an evaluation run to a Humanloop project. For a full explanation, see our guide on self-hosted evaluations.

```python
from humanloop import Humanloop

API_KEY = ...
humanloop = Humanloop(api_key=API_KEY)

# 1. Retrieve a dataset
DATASET_ID = ...
datapoints = humanloop.datasets.list_datapoints(DATASET_ID).records

# 2. Create an external evaluator
evaluator = humanloop.evaluators.create(
    name="My External Evaluator",
    description="An evaluator that runs outside of Humanloop runtime.",
    type="external",
    arguments_type="target_required",
    return_type="boolean",
)
# Or, retrieve an existing one:
# evaluator = humanloop.evaluators.get(EVALUATOR_ID)

# 3. Retrieve a model config
CONFIG_ID = ...
model_config = humanloop.model_configs.get(CONFIG_ID)

# 4. Create the evaluation run
PROJECT_ID = ...
evaluation_run = humanloop.evaluations.create(
    project_id=PROJECT_ID,
    config_id=CONFIG_ID,
    evaluator_ids=[evaluator.id],
    dataset_id=DATASET_ID,
)

# 5. Iterate the datapoints and trigger generations
logs = []
for datapoint in datapoints:
    log = humanloop.chat_model_config(
        project_id=PROJECT_ID,
        model_config_id=CONFIG_ID,
        inputs=datapoint.inputs,
        messages=[
            {key: value for key, value in dict(message).items() if value is not None}
            for message in datapoint.messages
        ],
    ).data[0]
    logs.append((log, datapoint))

# 6. Evaluate the results.
# In this example, we use an extremely simple evaluation, checking for an exact
# match between the target and the model's actual output.
for log, datapoint in logs:
    # The datapoint target tells us the correct answer.
    target = str(datapoint.target["answer"])

    # The log output is what the model said.
    model_output = log.output

    # The evaluation is a boolean, indicating whether the model was correct.
    result = target == model_output

    # Post the result back to Humanloop.
    evaluation_result_log = humanloop.evaluations.log_result(
        evaluation_id=evaluation_run.id,
        log_id=log.id,
        evaluator_id=evaluator.id,
        result=result,
    )

# 7. Complete the evaluation run.
humanloop.evaluations.update_status(id=evaluation_run.id, status="completed")
```

Chat response

We’ve updated the response models of all of our /chat API endpoints to include an output message object.

Up to this point, our chat and completion endpoints had a unified response model, where the content of the assistant message returned by OpenAI models was provided in the common output field for each returned sample. And any tool calls made were provided in the separate tool_calls field.

When making subsequent chat calls, the caller of the API had to use these fields to construct a message object to append to the history of messages. To improve this experience, we’ve added an output_message field to the chat response. This is additive and does not represent a breaking change.


```json
{
  "project_id": "pr_GWx6n0lv6xUu3HNRjY8UA",
  "data": [
    {
      "id": "data_Vdy9ZoiFv2B7iYLIh15Jj",
      "index": 0,
      "output": "Well, I gotta say, ...",
      "raw_output": "Well, I gotta say...",
      "finish_reason": "length",
      "model_config_id": "config_VZAPd51sJH7i3ZsjauG2Q",
      "messages": [
        {
          "content": "what's your best guess...",
          "role": "user"
        }
      ],
      "tool_calls": null
    }
  ]
}
```


```json
{
  "project_id": "pr_GWx6n0lv6xUu3HNRjY8UA",
  "data": [
    {
      "id": "data_Vdy9ZoiFv2B7iYLIh15Jj",
      "output_message": {
        "content": "Well, I gotta say, ...",
        "name": null,
        "role": "assistant",
        "tool_calls": null
      },
      "index": 0,
      "output": "Well, I gotta say, ...",
      "raw_output": "Well, I gotta say...",
      "finish_reason": "length",
      "model_config_id": "config_VZAPd51sJH7i3ZsjauG2Q",
      "messages": [
        {
          "content": "what's your best guess...",
          "role": "user"
        }
      ],
      "tool_calls": null
    }
  ]
}
```

Snippet tool

We’ve added support for managing common text ‘snippets’ (or ‘passages’, or ‘chunks’) that you want to reuse across your different prompts.

This functionality is provided by our new Snippet tool. A Snippet tool acts as a simple key/value store, where the key is the name of the common re-usable text snippet and the value is the corresponding text.

For example, you may have some common persona descriptions that you found to be effective across a range of your LLM features. Or maybe you have some specific formatting instructions that you find yourself re-using again and again in your prompts.

Before now, you would have to copy and paste between your editor sessions and keep track of which projects you edited. Now you can instead inject the text into your prompt using the Snippet tool.

Set up the tool

Navigate to the tools tab in your organisation and select the Snippet tool card.

When the dialog opens, start adding your key/value pairs. In the example below we’ve defined an Assistants Snippet tool that can be used to manage some common persona descriptions we feed to the LLM.

You can have up to 10 key/value snippets in a single snippet tool.

The name field is how you’ll access this tool in the editor. By setting the value to assistant below, you’ll be able to access this specific tool in the editor using the syntax {{ assistant(key) }}.

The key is how you’ll access the snippet later, so it’s recommended to choose something short and memorable.

The value is the passage of text that will be included in your prompt when it is sent to the model.

Use the tool

Now that your Snippets are set up, you can use them to populate strings in your prompt templates across your projects. Double curly bracket syntax is used to call a tool in the template.

The tool requires an input value to be provided for the key. In our editor environment the result of the tool will be shown populated top right above the chat.

Above we created an Assistants tool. To use it in the editor, use the {{ <your-tool-name>(key) }} syntax, so in this case {{ assistant(key) }}. When you add that, an inputs field appears where you can specify your key; in the screenshot above we used the helpful key to access the string “You are a helpful assistant. You like to tell jokes and if anyone asks your name is Sam.” This input field can be used to experiment with different key/value pairs to find the best one to suit your prompt.

The snippet will only render in the preview after running the chat

If you want to see the snippet corresponding to a key, you need to first run the conversation to fetch the string and see it in the preview.

If you have a specific key you would like to hardcode in the prompt, you can define it using the literal key value: {{ <your-tool-name>("key") }}, so in this case it would be {{ assistant("helpful") }}.

This is particularly useful because you can define passages of text once in a snippet tool and reuse them across multiple prompts, without needing to copy/paste them and manually keep them all in sync.
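The behaviour is essentially a key/value lookup spliced into the template at render time. Here is a toy re-implementation of the hardcoded-key form (the snippet content is from the walkthrough above; the `render` function itself is hypothetical):

```python
import re

snippets = {
    "helpful": "You are a helpful assistant. You like to tell jokes "
               "and if anyone asks your name is Sam.",
}

def render(template: str, tool_name: str = "assistant") -> str:
    # Replace {{ assistant("key") }} calls with the stored snippet text.
    pattern = r"\{\{\s*" + re.escape(tool_name) + r'\("([^"]+)"\)\s*\}\}'
    return re.sub(pattern, lambda m: snippets[m.group(1)], template)

print(render('System: {{ assistant("helpful") }}'))
```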

What’s next

Explore our other tools such as the Google or Pinecone Search. If you have other ideas for helpful integrations please reach out and let us know.

Quality-of-life app improvements

We’ve been shipping some quality-of-life “little big things” to improve your every day usage of the platform.

Project switcher throughout the app

We’ve added the project switcher throughout the app so it’s easier to jump between projects from anywhere.

The project switcher is now available everywhere.

We’ve tidied up the Editor

With all the new capabilities and changes (tools, images and more) we need to keep a tight ship to stop things from becoming too busy.

We’re unifying how we show all your logged generations in the Editor and in the logs and sessions. We’ve also changed the font to Inter to be legible at small font sizes.

The Editor and other places have had a clean up to aid the new capabilities of tool calling and vision.

No more accidental blank messages

We’ve also fixed issues where empty messages would get appended to the chat.

We’ve improved keyboard navigation

The keyboard shortcuts have been updated so it’s now easier to navigate the log tables (up/down keys) and to run generations in Editor (cmd/ctrl + enter).

Thanks for all your requests and tips. Please keep the feedback coming!

Claude 2.1

Today, Anthropic released its latest model, Claude 2.1, and we’ve added support for it in the Humanloop app.

The new model boasts a 200K context window and a reported 2x decrease in hallucination rates.

Additionally, this model introduces tool use to the line-up of Anthropic models. The feature is presently in beta preview, and we’ll be adding support for it to Humanloop in the coming days.

Read more about Claude 2.1 in the official release notes.

Parallel tool calling

We’ve added support for parallel tool calls in our Editor and API.

With the release of the latest OpenAI turbo models, the model can choose to respond with more than one tool call for a given query; this is referred to as parallel tool calling.

Editor updates

You can now experiment with this new feature in our Editor:

  • Select one of the new turbo models in the model dropdown.
  • Specify a tool in your model config on the left hand side.
  • Make a request that would require multiple calls to answer correctly.
  • As shown here for a weather example, the model will respond with multiple tool calls in the same message

API implications

We’ve added an additional field tool_calls to our chat endpoints response model that contains the array of tool calls returned by the model. The pre-existing tool_call parameter remains but is now marked as deprecated.

Each element in the tool_calls array has an id associated to it. When providing the tool response back to the model for one of the tool calls, the tool_call_id must be provided, along with role=tool and the content containing the tool response.

```python
from humanloop import Humanloop

# Initialize the Humanloop SDK with your API keys
humanloop = Humanloop(api_key="<YOUR Humanloop API KEY>")

# Form of message when providing the tool response to the model
chat_response = humanloop.chat_deployed(
    project_id="<YOUR PROJECT ID>",
    messages=[
        {
            "role": "tool",
            "content": "Horribly wet",
            "tool_call_id": "call_dwWd231Dsdw12efoOwdd",
        }
    ],
)
```

Python SDK improvements

We’ve improved the response models of our Python SDK and now give users better control over HTTP timeout settings.

Improved response model types

As of versions >= 0.6.0, our Python SDK methods now return Pydantic models instead of typed dicts. This improves developer ergonomics around typing and validations.

  • Previously, you had to use the [...] dictionary syntax to access response values:

```python
chat_response = humanloop.chat(
    # parameters
)
print(chat_response["data"][0]["output"])
```

  • With Pydantic-based response models, you can now use the . attribute syntax to access response values. To access the pre-0.6.0 response shape, you can still use the .raw namespace as specified in the Raw HTTP Response section.

```python
chat_response = humanloop.chat(
    # parameters
)
print(chat_response.data[0].output)
```

🚧 Breaking change

Moving to >= 0.6.0 does represent a breaking change in the SDK. The underlying API remains unchanged.

Support for timeout parameter

The default timeout used by aiohttp, which our SDK uses, is 300 seconds. For very large prompts and the latest models, this can cause timeout errors.

In the latest version of the Python SDK, we’ve increased the default timeout value to 600 seconds. If you are still experiencing timeout issues, you can override this by passing the new timeout argument to any of the SDK methods. For example, passing timeout=1000 will override the timeout to 1000 seconds.

Multi-modal models

We’ve introduced support for multi-modal models that can take both text and images as inputs!

We’ve laid the foundations for multi-modal model support as part of our Editor and API. The first model we’ve configured is OpenAI’s GPT-4 with Vision (GPT-4V). You can now select gpt-4-vision-preview in the models dropdown and add images to your chat messages via the API.

Let us know what other multi-modal models you would like to see added next!

Editor quick start

To get started with GPT-4V, go to the Playground, or Editor within your project.

  • Select gpt-4-vision-preview in the models dropdown.
  • Click the Add images button within a user’s chat message.
  • To add an image, either type a URL into the Image URL textbox or select “Upload image” to upload an image from your computer. If you upload an image, it will be converted to a Base64-encoded data URL that represents the image.
  • Note that you can add multiple images

To view the images within a log, find the log within the logs table and click on it to open it in a drawer. The images in each chat message can be viewed within this drawer.

API quick start

Assuming you have deployed your gpt-4-vision-preview based model config, you can now also include images in messages via the API.

```python
from humanloop import Humanloop

# Initialize the Humanloop SDK with your API keys
humanloop = Humanloop(api_key="<YOUR Humanloop API KEY>")

# humanloop.chat_deployed(...) will call the active model config on your project.
chat_response = humanloop.chat_deployed(
    project_id="<YOUR PROJECT ID>",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "detail": "high",
                        "url": ""
                    }
                }
            ]
        }
    ],
)
```

Any generations made will also be viewable from within your projects logs table.


There are some known limitations with the current preview iteration of OpenAI’s GPT-4 model to be aware of:

  • Image messages are only supported by the gpt-4-vision-preview model in chat mode.
  • GPT-4V model does not support tool calling or JSON mode.
  • You cannot add images to the first system message.

JSON mode and seed parameters

We’ve introduced new model config parameters for JSON mode and Seed in our Editor and API.

With the introduction of the new OpenAI turbo models, you can now set additional properties that impact the behaviour of the model: response_format and seed.

Further details

See further guidance from OpenAI on the JSON response format here and reproducing outputs using the seed parameter here.

These new parameters can now optionally contribute to your model config in our Editor and API. Updated values for response_format or seed will constitute new versions of your model on Humanloop.

JSON mode prompts

When using JSON mode with the new turbo models, you should still include formatting instructions in your prompt.

In fact, if you do not include the word ‘json’ anywhere in your prompt, OpenAI will currently return a validation error.
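The payoff is that the model’s output can then be parsed directly; for example (the model_output string below is a made-up sample response):

```python
import json

# A made-up sample of what a model might return with
# response_format={"type": "json_object"} set.
model_output = '{"city": "Paris", "confidence": 0.9}'

# With JSON mode enabled the output should always parse; without it,
# you'd want try/except error handling around this call.
parsed = json.loads(model_output)
print(parsed["city"])  # Paris
```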

LLM Evaluators

Until now, it’s been possible to trigger LLM-based evaluations by writing Python code that uses the Humanloop API to trigger the LLM generations.

Today, in order to make this increasingly important workflow simpler and more intuitive, we’re releasing LLM Evaluators, which require no Python configuration.

From the Evaluations page, click New Evaluator and select LLM Evaluator.

You can now choose between the existing Python Evaluators and our new LLM Evaluators.

Instead of a code editor, the right hand side of the page is now a prompt editor for defining instructions to the LLM Evaluator. Underneath the prompt, you can configure the parameters of the Evaluator (things like model, temperature etc.) just like any normal model config.

LLM Evaluator Editor.

In the prompt editor, you have access to a variety of variables that correspond to data from the underlying Log that you are trying to evaluate. These use the usual {{ variable }} syntax, and include:

  • log_inputs - the input variables that were passed in to the prompt template when the Log was generated
  • log_prompt - the fully populated prompt (if it was a completion mode generation)
  • log_messages - a JSON representation of the messages array (if it was a chat mode generation)
  • log_output - the output produced by the model
  • log_error - if the underlying Log was an unsuccessful generation, this is the error that was produced
  • testcase - when in offline mode, this is the testcase that was used for the evaluation.
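Under the hood, these placeholders are substituted with data from the underlying Log before the Evaluator prompt is sent to the model. A toy sketch of that substitution (the prompt and log values here are invented for illustration):

```python
import re

evaluator_prompt = (
    "You are assessing a chatbot answer.\n"
    "Inputs: {{ log_inputs }}\n"
    "Answer: {{ log_output }}\n"
    "Reply true if the answer is polite, false otherwise."
)

log_data = {"log_inputs": {"topic": "weather"}, "log_output": "Happy to help!"}

def populate(template: str, variables: dict) -> str:
    # Replace each {{ variable }} placeholder with its value from the Log.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )

print(populate(evaluator_prompt, log_data))
```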

Take a look at some of the presets we’ve provided on the left-hand side of the page for inspiration.

LLM Evaluator presets. You'll likely need to tweak these to fit your use case.

At the bottom of the page you can expand the debug console - this can be used to verify that your Evaluator is working as intended. We’ve got further enhancements coming to this part of the Evaluator Editor very soon.

Since an LLM Evaluator is just another model config managed within Humanloop, it gets its own project. When you create an LLM Evaluator, you’ll see that a new project is created in your organisation with the same name as the Evaluator. Every time the Evaluator produces a Log as part of its evaluation activity, that output will be visible in the Logs tab of that project.

Improved evaluator editor

Given our current focus on delivering a best-in-class evaluations experience, we’ve promoted the Evaluator editor to a full-page screen in the app.

In the left-hand pane, you’ll find drop-downs to:

  • Select the mode of the Evaluator - either Online or Offline, depending on whether the Evaluator is intended to run against pre-defined testcases or against live production Logs
  • Select the return type of the Evaluator - either boolean or number

Underneath that configuration you’ll find a collection of presets.

Preset selector.

Evaluation comparison charts

We’ve added comparison charts to the evaluation runs page to help you better compare your evaluation results. These can be found in the evaluation runs tab for each of your projects.

Comparing runs

You can use this to compare specific evaluation runs by selecting those in the runs table. If you don’t select any specific rows the charts show an averaged view of all the previous runs for all the evaluators.

Hiding a chart

To hide a chart for a specific evaluator you can hide the column in the table and it will hide the corresponding chart.

Comparison mode in Editor

You can now compare generations across Model Configs and inputs in Editor!

Quick start

To enter comparison mode, click New panel in the dropdown menu. This adds a new blank panel to the right.

Duplicate panel adds a new panel containing the same information as your current panel.


Clicking **New panel** in the dropdown menu...

... will open a new panel to the right.

Each panel is split into two sections: a Model Config section at the top and an Inputs & Chat section at the bottom. These can be collapsed and resized to suit your experimentation.

If you’ve made changes in one panel, you can copy them using the Copy button in the subsection’s header and paste them into the target panel using its corresponding Paste button.

The **Copy** button on the left panel will copy the new chat template...
... and the **Paste** button on the right panel will then update its chat template.

Other changes

Our recently-introduced local history has also been upgraded to save your full session even when you have multiple panels open.

The toggle to completion mode and the button to open history have now been moved into the new dropdown menu.

Improved evaluation runs

You can now trigger runs against multiple model configs simultaneously.

This improves your ability to compare and evaluate changes across your prompts. We’ve also removed the summary cards. In their place, we’ve added a table that supports sorting and rearranging of columns to help you better interrogate results.

Multiple model configs

To run evaluations against multiple model configs it’s as simple as selecting the targeted model configs in the run dialog, similar to before, but multiple choices are now supported. This will trigger multiple evaluation runs at once, with each model config selected as a target.

Evaluation table

We’ve updated our evaluation runs with a table to help view the outcomes of runs in a more condensed form. It also allows you to sort results and trigger re-runs more easily. As new evaluators are included, a column will be added to the table automatically.

Re-run previous evaluations

We’ve exposed the re-run option in the table to allow you to quickly trigger runs again, or use older runs as a way to preload the dialog and change the parameters such as the target dataset or model config.

New OpenAI turbos

Off the back of OpenAI’s dev day we’ve added support for the new turbo models that were announced:

  • gpt-4-1106-preview
  • gpt-3.5-turbo-1106

Both of these models add a couple of nice capabilities:

  • Better instruction following performance
  • JSON mode that forces the model to return valid JSON
  • Can call multiple tools at once
  • Set a seed for reproducible outputs

You can now access these in your Humanloop Editor and via the API.

Improved logs drawer

You can now resize the message section in the Logs and Session drawers, allowing you to review your logs more easily.

To resize the message section we’ve introduced a resize bar that you can drag up or down to give yourself the space needed. To reset the layout back to default just give the bar a double click.

Local editor history

The Humanloop playground and editor now save history locally as you make edits, giving you complete peace of mind that your precisely-crafted prompts will not be lost due to an accidental page reload or navigating away.

Local history entries will be saved as you use the playground (e.g. as you modify your model config, make generations, or add messages). These will be visible under the Local tab within the history side panel. Local history is saved to your browser and is only visible to you.

Our shared history feature, where all playground generations are saved, has now been moved under the Shared tab in the history side panel.

Project folders

You can now organize your projects into folders!

Logging in to Humanloop will bring you to the new page where you can start arranging your projects.

Navigate into folders and open projects by clicking on the row. To go back to a parent folder, click on the displayed breadcrumbs (e.g. “Projects” or “Development” in the above screenshot).

Searching will give you a list of directories and projects with a matching name.

Moving multiple projects

You can move a group of projects and directories by selecting them and moving them together.

  1. Select the projects you want to move.
    Tip: Put your cursor on a project row and press [x] to select the row.
  2. To move the selected projects into a folder, drag and drop them into the desired folder.

To move projects out of a folder and into a parent folder, you can drag and drop them onto the parent folder breadcrumbs:

To move projects into deeply nested folders, it might be easier to select your target directory manually. To do so, select the projects you wish to move and then click the blue Actions button and then click Move … to bring up a dialog allowing you to move the selected projects.

If you prefer the old view, we’ve kept it around for now. Let us know what you’re missing from the new view so we can improve it.

The [Go to old layout] button will take you to the previous view without folders.


We’ve introduced Datasets to Humanloop. Datasets are collections of Datapoints, which represent input-output pairs for an LLM call.

We recently released Datasets in our Evaluations beta, by the name Evaluation Testsets. We’re now promoting the concept to a first-class citizen within your projects. If you’ve previously been using testsets in the evaluations beta, you’ll see that your testsets have now automatically migrated to datasets.

Datasets can be created via CSV upload, converting from existing Logs in your project, or by API requests.

See our guides on datasets, which show how to upload from CSV and perform a batch generation across the whole dataset.

A single dataset that has been added to a project, with 9 datapoints.

Clicking into a dataset, you can explore its datapoints.

Datapoints are pre-defined input-output pairs.

A dataset contains a collection of prompt variable inputs (the dynamic values which are interpolated into your model config prompt template at generation-time), as well as a collection of messages forming the chat history, and a target output with data representing what we expect the model to produce when it runs on those inputs.
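Put concretely, a single datapoint following that description could be represented like this (a hypothetical example, not taken from a real dataset):

```python
# Hypothetical datapoint: prompt-template inputs, a chat history,
# and a target describing the expected model output.
datapoint = {
    "inputs": {"customer_name": "Ada", "product": "Humanloop"},
    "messages": [{"role": "user", "content": "Can I get a refund?"}],
    "target": {"answer": "Refunds are available within 30 days."},
}

print(sorted(datapoint))  # ['inputs', 'messages', 'target']
```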

Datasets are useful for evaluating the behaviour of your model configs across a well-defined collection of test cases. You can use datasets to check for regressions as you iterate on your model configs, knowing that you are checking behaviour against a deterministic collection of known important examples.

Datasets can also be used as collections of input data for fine-tuning jobs.

GET API tool

We’ve added support for a tool that can make GET calls to an external API.

This can be used to dynamically retrieve context for your prompts. For example, you may wish to get additional information about a user from your system based on their ID, or look up additional information based on a query from a user.

To set up the tool you need to provide the following details for your API:

  • Name: a unique tool name to reference as a call signature in your prompts (e.g. get_api_tool)
  • URL: the URL for your API endpoint
  • API Key Header: the authentication header required by your endpoint (e.g. X-API-KEY)
  • API Key: the API key value to use in the authentication header (e.g. sk_1234567891011121314)
  • Query parameters: a comma-delimited list of the query parameters to set when making requests (e.g. user_query, client_id)

Define your API

First you will need to define your API. For demo purposes, we will create a mock endpoint in Postman. Our mock endpoint simply returns details about a mock user given their user_id.

A call to our mock API in Python is as follows; note the query parameter user_id:

import requests

url = ""  # your mock endpoint URL, including the user_id query parameter
headers = {"X-API-KEY": "<YOUR_API_KEY>"}

response = requests.request("GET", url, headers=headers)

And returns the response:

{
  "user_id": "012345678910",
  "name": "Albert",
  "company": "Humanloop",
  "role": "Engineer"
}

We can now use this tool to inject information for a given user into our prompts.

Set up the tool

Navigate to the tools tab in your organisation and select the Get API Call tool card:

Configure the tool with your API details:

Use the tool

Now that your API tool is set up, you can use it to populate input variables in your prompt templates. Double curly bracket syntax is used to call a tool in the template. The call signature is the unique tool name with arguments for the query parameters defined when the tool was set up.

In our mock example, the signature will be: get_user_api(user_id).

An example prompt template using this tool is:

You are a helpful assistant. Please draft an example job role summary for the following user:
User details: {{ get_user_api(user_id) }}
Keep it short and concise.

The tool requires an input value to be provided for user_id. In our playground environment, the result of the tool will be shown in the top right, above the chat:

What’s next

Explore more complex examples of context stuffing such as defining your own custom RAG service.

Evaluations improvements

We’ve released a couple of minor usability improvements in the evaluations workflow.

Summary statistics for evaluation runs

When reviewing past runs of evaluations, you can now see summary statistics for each evaluator before clicking into the detail view, allowing for easier comparison between runs.

Re-running evaluations

To enable easier re-running of past evaluations, you can now click the Re-run button in the top-right of the evaluation detail view.

Editor - copy tools

Our Editor environment lets users incorporate OpenAI function calling into their prompt engineering workflows by defining tools. Tools are made available to the model as functions to call, using the same universal JSON schema format.

As part of this process it can be helpful to copy the full JSON definition of the tool for quickly iterating on new versions, or copy and pasting it into code. You can now do this directly from the tool definition in Editor:

Selecting the Copy button adds the full JSON definition of the tool to your clipboard:

{
  "name": "get_current_weather",
  "description": "Get the current weather in a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "name": "Location",
        "description": "The city and state, e.g. San Francisco, CA"
      },
      "unit": {
        "type": "string",
        "name": "Unit",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}

Single sign on (SSO)

We’ve added support for SSO to our signup, login and invite flows. By default, users can now use their Gmail accounts to access Humanloop.

For our enterprise customers, this also unlocks the ability for us to more easily support their SAML-based single sign-on (SSO) setups.

Organization slug in URLs

We have altered routes specific to your organization to include the organization slug. The organization slug is a unique value that was derived from your organization name when your organization was created.

For project paths we’ve dropped the projects label in favour of a more specific project label.

An example of what this looks like can be seen below:

Existing bookmarks and links will continue to work

When a request is made to one of the legacy URL paths, we’ll redirect it to the corresponding new path. Although the legacy routes are still supported, we encourage you to update your links and bookmarks to adopt the new naming scheme.

Updating your organization slug

The organization slug can be updated by organization administrators. This can be done by navigating to the general settings page. Please exercise caution when changing this, as it will affect the URLs across the organization.

Allow trusted email domains

You can now add trusted email domains to your organization. Adding trusted email domains allows new users, when creating an account with a matching email, to join your organization without requiring an invite.

Managing trusted domains

Adding and removing trusted email domains is controlled from your organizations General settings page.

Only Admins can manage trusted domains for an organization.

To add a new trusted domain, press the Add domain button and enter the domains trusted by your organization. The domains added here will be checked against new users signing up to Humanloop, and if there is a match, those users will be given the option to join your organization.

Signup for new users

New users signing up to Humanloop will see the following screen when they sign up with an email that matches an organization’s trusted email domain. By pressing Join, they will be added to the matching organization.

Editor - insert new message within existing chat

You can now insert a new message within an existing chat in our Editor. Click the plus button that appears between the rows.

Claude instant 1.2

We’ve added support for Anthropic’s latest model Claude instant 1.2! Claude Instant is the faster and lower-priced yet still very capable model from Anthropic, great for use cases where low latency and high throughput are required.

You can use Claude instant 1.2 directly within the Humanloop playground and deployment workflows.

Read more about the latest Claude instant model here.

Offline evaluations with testsets

We’re continuing to build and release more functionality to Humanloop’s evaluations framework!

Our first release provided the ability to run online evaluators in your projects. Online evaluators allow you to monitor the performance of your live deployments by defining functions which evaluate all new datapoints in real time as they get logged to the project.

Today, to augment online evaluators, we are releasing offline evaluators as the second part of our evaluations framework.

Offline evaluators provide the ability to test your prompt engineering efforts rigorously in development and CI. Offline evaluators test the performance of your model configs against a pre-defined suite of testcases - much like unit testing in traditional programming.

With this framework, you can use test-driven development practices to iterate and improve your model configs, while monitoring for regressions in CI.

To learn more about how to use online and offline evaluators, check out the Evaluate your model section of our guides.

Improved error handling

We’ve unified how errors returned by model providers are handled and enabled error monitoring using eval functions.

A common production pain point we see is that hosted SOTA language models can still be flaky at times, especially at real scale. With this release, Humanloop can help users better understand the extent of the problem and guide them to different model choices to improve reliability.

Unified errors

Our users integrate the Humanloop /chat and /completion API endpoints as a unified interface into all the popular model providers including OpenAI, Anthropic, Azure, Cohere, etc. Their Humanloop projects can then be used to manage model experimentation, versioning, evaluation and deployment.

Errors returned by these endpoints may be raised by the model provider’s system. With this release we’ve updated our API to map all the error behaviours from different model providers to a unified set of error response codes.

We’ve also extended our error responses to include more details of the error with fields for type, message, code and origin. The origin field indicates if the error originated from one of the integrated model providers systems, or directly from Humanloop.

For example, for our /chat endpoint where we attempt to call OpenAI with an invalid setting for max_tokens, the message returned is that raised by OpenAI and the origin is set to OpenAI.

{
  "type": "unprocessable_entity_error",
  "message": "This model's maximum context length is 4097 tokens. However, you requested 10000012 tokens (12 in the messages, 10000000 in the completion). Please reduce the length of the messages or completion.",
  "code": 422,
  "origin": "OpenAI"
}
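One way to consume these unified fields on the client side is to branch on code and origin. The should_retry helper below is a hypothetical sketch (not part of the Humanloop SDK) showing how a retry policy might use them:

```python
def should_retry(error: dict) -> bool:
    # Hypothetical client-side policy: retry only transient provider-side
    # failures; a 4xx validation error like the 422 above is not retryable.
    transient_codes = {429, 500, 502, 503}
    return error.get("origin") != "Humanloop" and error.get("code") in transient_codes

example_error = {
    "type": "unprocessable_entity_error",
    "code": 422,
    "origin": "OpenAI",
}
```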

Monitor model reliability with evals

With this release, all errors returned from the different model providers are now persisted with the corresponding input data as datapoints on Humanloop. Furthermore this error data is made available to use within evaluation functions.

You can now turn on the Errors eval function, which tracks overall error rates of the different model variations in your project. Or you can customise this template to track more specific error behaviour.
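A custom evaluator in this spirit might simply flag datapoints that carry provider error data. This is an illustrative sketch only; the "error" field name is an assumption, not the confirmed datapoint schema:

```python
def has_provider_error(datapoint):
    # Boolean evaluator sketch: True when the logged datapoint carries
    # provider error data. The "error" key is assumed for illustration.
    return bool(datapoint.get("error"))
```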

Errors evaluation function template now available

OpenAI functions in Playground

We’ve added support for OpenAI functions to our playground!

This builds on our API support and allows you to easily experiment with OpenAI functions within our playground UI.

OpenAI functions are implemented as tools on Humanloop. Tools follow the same universal json-schema definition as OpenAI functions. You can now define tools as part of your model configuration in the playground. These tools are sent as OpenAI functions when running the OpenAI chat models that support function calling.

The model can choose to return a JSON object containing the arguments needed to call a function. This object is displayed as a special assistant message within the playground. You can then provide the result of the call in a message back to the model to consider, which simulates the function calling workflow.

Use tools in Playground

Take the following steps to use tools for function calling in the playground:

  1. Find tools: Navigate to the playground and locate the Tools section. This is where you’ll be able to manage your tool definitions.

  2. Create a new tool: Click on the “Add Tool” button. There are two options in the dropdown: create a new tool or start with one of our examples. You define your tool using the json-schema syntax. This represents the function definition sent to OpenAI.

  3. Edit a tool: To edit an existing tool, simply click on the tool in the Tools section and make the necessary changes to its json-schema definition. This will result in a new model configuration.

  4. Run a model with tools: Once you’ve defined your tools, you can run the model by pressing the “Run” button.
    1. If the model chooses to call a function, an assistant message will be displayed with the corresponding tool name and arguments to use.
    2. A subsequent Tool message is then displayed to simulate sending the results of the call back to the model to consider.

  5. Save your model config with tools by using the Save button. Model configs with tools defined can then be deployed to environments as normal.

Coming soon

Provide the runtime for your tool under the existing pre-defined Tools section of your organization on Humanloop.

Llama 2

We’ve added support for Llama 2!

You can now select llama70b-v2 from the model dropdown in the Playground and Editor. You don’t currently need to provide an API key or any other special configuration to get Llama 2 access via Humanloop.

Llama 2 is available in Playground and Editor for all Humanloop users.

Read more about the latest version of Llama here and in the original announcement.


We’ve added Evaluators to Humanloop in beta!

Evaluators allow you to quantitatively define what constitutes a good or bad output from your models. Once set up, you can configure an Evaluator to run automatically across all new datapoints as they appear in your project; or, you can simply run it manually on selected datapoints from the Data tab.

We’re going to be adding lots more functionality to this feature in the coming weeks, so check back for more!

Create an Evaluator

If you’ve been given access to the feature, you’ll see a new Evaluations tab in the Humanloop app. To create your first evaluation function, select + New Evaluator. In the dialog, you’ll be presented with a library of example Evaluators, or you can start from scratch.

We offer a library of example Evaluators to get you started.

We’ll pick Valid JSON for this guide.

Evaluator editor.

In the editor, provide details of your function’s name, description and return type. In the code editor, you can provide a function which accepts a datapoint argument and should return a value of the chosen type.

Currently, the available return types for an Evaluator are number and boolean. You should ensure that your function returns the expected data type - an error will be raised at runtime if not.
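For example, a minimal number-returning Evaluator might score the length of the generated output. This is an illustrative sketch, not one of the in-app templates:

```python
def output_word_count(datapoint):
    # Number evaluator sketch: counts the words in the generated output.
    # Returning an int/float satisfies the "number" return type.
    output = datapoint.get("output") or ""
    return len(output.split())
```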

The Datapoint argument

The datapoint passed into your function will be a Python dict with the following structure.

{
    "id": "data_XXXX",  # Datapoint id
    "model_config": {...},  # Model config used to generate the datapoint
    "inputs": {...},  # Model inputs (interpolated into the prompt)
    "output": "...",  # Generated output from the model
    "provider_latency": 0.6,  # Provider latency in seconds
    "metadata": {...},  # Additional metadata attached to the logged datapoint
    "created_at": "...",  # Creation timestamp
    "feedback": [...]  # Array of feedback provided on the datapoint
}

To inspect datapoint dictionaries in more detail, click Random selection in the debug console at the bottom of the window. This will load a random set of five datapoints from your project, exactly as they will be passed into the Evaluation Function.

The debug console - load datapoints to inspect the argument passed into Evaluators.

For this demo, we’ve created a prompt which asks the model to produce valid JSON as its output. The Evaluator uses a simple json.loads call to determine whether the output is validly formed JSON - if this call raises an exception, it means that the output is not valid JSON, and we return False.

import json

def check_valid_json(datapoint):
    try:
        # json.loads raises a ValueError (JSONDecodeError) on malformed JSON
        return json.loads(datapoint["output"]) is not None
    except ValueError:
        return False


Once you have drafted a Python function, try clicking the run button next to one of the debug datapoints in the debug console. You should shortly see the result of executing your function on that datapoint in the table.

A `True` result from executing the **Valid JSON** Evaluators on the datapoint.

If your Evaluator misbehaves, either by being invalid Python code, raising an unhandled exception or returning the wrong type, an error will appear in the result column. You can hover this error to see more details about what went wrong - the exception string is displayed in the tooltip.

Once you’re happy with your Evaluator, click Create in the bottom left of the dialog.

Activate / Deactivate an Evaluator

Your Evaluators are available across all your projects. When you visit the Evaluations tab from a specific project, you’ll see all Evaluators available in your organisation.

Each Evaluator has a toggle. If you toggle the Evaluator on, it will run on every new datapoint that gets logged to that project. (Switch to another project and you’ll see that the Evaluator is not yet toggled on if you haven’t chosen to do so).

You can deactivate an Evaluator for a project by toggling it back off at any time.

Aggregations and Graphs

At the top of the Dashboard tab, you’ll see new charts for each activated Evaluation Function. These display aggregated Evaluation results through time for datapoints in the project.

At the bottom of the Dashboard tab is a table of all the model configs in your project. That table will display a column for each activated Evaluator in the project. The data displayed in this column is an aggregation of all the Evaluation Results (by model config) for each Evaluator. This allows you to assess the relative performance of your models.

Evaluation Results through time, by model config. In this example, one of the model configs is not producing Valid JSON outputs, while the other is about 99% of the time.


For the purposes of both the charts and the model configs table, aggregations work as follows for the different return types of Evaluators:

  • Boolean: percentage returning True of the total number of evaluated datapoints
  • Number: average value across all evaluated datapoints
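These rules can be sketched in a few lines of Python. The aggregate helper below is illustrative only, not Humanloop’s actual implementation:

```python
def aggregate(results, return_type):
    # Mirrors the aggregation rules above: percentage of True results for
    # boolean Evaluators, mean value for number Evaluators.
    if return_type == "boolean":
        return 100.0 * sum(1 for r in results if r) / len(results)
    return sum(results) / len(results)  # "number": average value
```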

Data logs

In the Data tab, you’ll also see that a column is visible for each activated Evaluator, indicating the result of running the function on each datapoint.

The **Data** tab for a project, showing the **Valid JSON** Evaluation Results for a set of datapoints.

From this tab, you can choose to re-run an Evaluator on a selection of datapoints. Either use the menu at the far right of a single datapoint, or select multiple datapoints and choose Run evals from the Actions menu in the top right.

Available Modules

The following Python modules are available to be imported in your Evaluation Function:

  • math
  • random
  • datetime
  • json (useful for validating JSON grammar as per the example above)
  • jsonschema (useful for more fine-grained validation of JSON output - see the in-app example)
  • sqlglot (useful for validating SQL query grammar)
  • requests (useful to make further LLM calls as part of your evaluation - see the in-app example for a suggestion of how to get started).
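As a sketch of the kind of finer-grained JSON check these modules enable, here is an illustrative Evaluator using only the json module; the required key names are hypothetical, chosen to match the mock user from the GET API tool example:

```python
import json

REQUIRED_KEYS = {"name", "company", "role"}  # hypothetical schema for illustration

def schema_coverage(datapoint):
    # Number evaluator sketch: fraction of required keys present in the
    # JSON output; returns 0.0 when the output is not a JSON object at all.
    try:
        obj = json.loads(datapoint["output"])
    except ValueError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return len(REQUIRED_KEYS & obj.keys()) / len(REQUIRED_KEYS)
```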

Let us know if you would like to see more modules available.

Chain LLM calls

We’ve introduced sessions to Humanloop, allowing you to link multiple calls together when building a chain or agent.

Using sessions with your LLM calls helps you troubleshoot and improve your chains and agents.

Trace of an Agent's steps logged as a session

Adding a datapoint to a session

To log your LLM calls to a session, you just need to define a unique identifier for the session and pass it into your Humanloop calls with session_reference_id.

For example, using uuid4() to generate this ID,

import uuid

session_reference_id = str(uuid.uuid4())

response = humanloop.complete(
    project="sessions_example_assistant",
    model_config={
        "prompt_template": "Question: {{user_request}}\nGoogle result: {{google_answer}}\nAnswer:\n",
        "model": "text-davinci-002",
        "temperature": 0,
    },
    inputs={"user_request": user_request, "google_answer": google_answer},
    session_reference_id=session_reference_id,
)

Similarly, our other methods, such as humanloop.complete_deployed() and humanloop.log(), also support session_reference_id.

If you’re using our API directly, you can pass session_reference_id within the request body in your POST /v4/completion etc. endpoints.
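As an illustrative sketch (not the full request schema; see the API reference for that), the body of such a request might be built like this:

```python
import uuid

def completion_request_body(project: str, inputs: dict, session_reference_id: str) -> dict:
    # Minimal sketch of a POST /v4/completion body that links the call to a
    # session via session_reference_id; other fields omitted for brevity.
    return {
        "project": project,
        "inputs": inputs,
        "session_reference_id": session_reference_id,
    }

body = completion_request_body(
    "sessions_example_assistant",
    {"user_request": "What is the tallest building?"},
    str(uuid.uuid4()),
)
```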

Further details

For a more detailed walkthrough on how to use session_reference_id, check out our guide that runs through how to record datapoints to a session in an example script.

Introducing Tools

Today we’re announcing Tools as a part of Humanloop.

Tools allow you to connect an LLM to any API and to an array of data sources to give it extra capabilities and access to private data. Under your organization settings on Humanloop you can now configure and manage tools in a central place.

Read more on our blog and see an example of setting up a tool for semantic search.

OpenAI functions API

We’ve updated our APIs to support OpenAI function calling.

OpenAI functions are now supported as tools on Humanloop. This allows you to pass tool definitions as part of the model configuration when calling our chat and log endpoints. For the latest OpenAI models gpt-3.5-turbo-0613 and gpt-4-0613, the model can then choose to output a JSON object containing arguments to call these tools.

This unlocks getting more reliable structured data back from the model and makes it easier to create useful agents.

Recap on OpenAI functions

As described in the OpenAI documentation, the basic steps for using functions are:

  1. Call one of the models gpt-3.5-turbo-0613 or gpt-4-0613 with a user query and a set of function definitions described using the universal json-schema syntax.
  2. The model can then choose to call one of the functions provided. If it does, a stringified JSON object adhering to your json-schema definition will be returned.
  3. You can then parse the string into JSON in your code and call the chosen function with the provided arguments (NB: the model may hallucinate or return invalid JSON; be sure to consider these scenarios in your code).
  4. Finally, call the model again, appending the function response as a new message. The model can then use this information to respond to the original user query.

OpenAI have provided a simple example in their docs for a get_current_weather function that we will show how to adapt to use with Humanloop:

import openai
import json

# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    weather_info = {
        "location": location,
        "temperature": "72",
        "unit": unit,
        "forecast": ["sunny", "windy"],
    }
    return json.dumps(weather_info)

def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    functions = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        functions=functions,
        function_call="auto",  # auto is default, but we'll be explicit
    )
    response_message = response["choices"][0]["message"]

    # Step 2: check if GPT wanted to call a function
    if response_message.get("function_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        function_name = response_message["function_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response_message["function_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the info on the function call and function response to GPT
        messages.append(response_message)  # extend conversation with assistant's reply
        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with function response
        second_response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
        )  # get a new response from GPT where it can see the function response
        return second_response

Using with Humanloop tools

OpenAI functions are treated as tools on Humanloop. Tools conveniently follow the same universal json-schema definition as OpenAI functions.

We’ve expanded the definition of our model configuration to also include tool definitions. Historically, the model config was made up of the chat template, choice of base model and any hyper-parameters that change the behaviour of the model.

In the case of OpenAI’s gpt-3.5-turbo-0613 and gpt-4-0613 models, any tools defined as part of the model config are passed through as functions for the model to use.

You can now specify these tools when using the Humanloop chat endpoint (as a replacement for OpenAI’s ChatCompletion), or when using the Humanloop log endpoint in addition to the OpenAI calls:

Chat endpoint

We show here how to update the run_conversation() method from the OpenAI example to instead use the Humanloop chat endpoint with tools:

from humanloop import Humanloop

hl = Humanloop(
    # get your API key here:
    api_key="YOUR_API_KEY",
)

def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    # functions are referred to as tools on Humanloop, but follow the same schema
    tools = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    chat_response = hl.chat(
        project="Assistant",
        model_config={
            "model": "gpt-3.5-turbo-0613",
            "tools": tools,
        },
        messages=messages,
    )
    response =[0]

    # Step 2: check if GPT wanted to call a tool
    if response.get("tool_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        function_name = response["tool_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response["tool_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the response back to the model
        messages.append(response)  # extend conversation with assistant's reply
        messages.append(
            {
                "role": "tool",
                "name": function_name,
                "content": function_response,
            }
        )
        second_response = hl.chat(
            project="Assistant",
            model_config={
                "model": "gpt-3.5-turbo-0613",
                "tools": tools,
            },
            messages=messages,
        )
        return second_response

After running this snippet, the model configuration recorded on your project in Humanloop will now track which tools were provided to the model, and the logged datapoints will include details of the tool call for you to inspect:

Log endpoint

Alternatively, you can also use the explicit Humanloop log alongside your existing OpenAI calls to achieve the same result:

import json

import openai
from humanloop import Humanloop

hl = Humanloop(
    # get your API key here:
    api_key="YOUR_API_KEY",
)

def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    functions = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        functions=functions,
        function_call="auto",  # auto is default, but we'll be explicit
    )
    response_message = response["choices"][0]["message"]

    # log the result to humanloop
    log_response = hl.log(
        project="Assistant",
        model_config={
            "model": "gpt-3.5-turbo-0613",
            "tools": functions,
        },
        messages=messages,
        tool_call=response_message.get("function_call"),
    )

    # Step 2: check if GPT wanted to call a function
    if response_message.get("function_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        function_name = response_message["function_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response_message["function_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the info on the function call and function response to GPT
        messages.append(response_message)  # extend conversation with assistant's reply
        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with function response
        second_response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
        )  # get a new response from GPT where it can see the function response

        log_response = hl.log(
            project="Assistant",
            model_config={
                "model": "gpt-3.5-turbo-0613",
                "tools": functions,
            },
            messages=messages,
            output=second_response["choices"][0]["message"]["content"],
        )
        return second_response

Coming soon

Support for defining tools in the playground!

Deployment environments

We’ve added support for environments to your deployments in Humanloop!

This enables you to deploy your model configurations to specific environments. You’ll no longer have to duplicate your projects to manage the deployment workflow between testing and production. With environments, you’ll have the control required to manage the full LLM deployment lifecycle.

Enabling environments for your organisation

Every organisation automatically receives a default production environment. Any of your existing projects that had active deployments defined have been automatically migrated over to use the default environment, with no change in behaviour for the APIs.

You can create additional environments with custom names by visiting your organisation’s environments page.

Creating an environment

Enter a custom name in the create environment dialog. Names must be unique within an organisation.

The environments you define for your organisation will be available for each project and can be viewed in the project dashboard once created.

The default environment

By default, the production environment is marked as the Default environment. This means that all API calls targeting the “Active Deployment”, such as Get Active Config or Chat Deployed, will use this environment. You can update the default environment on the organisation’s environments page.

Changing the default environment will take immediate effect, so ensure that this change is planned and does not disrupt your production workflows.

The same caution should be exercised when renaming environments. Renaming will affect each project, so plan the change carefully to avoid any impact on your production workflows.

Using environments

Once created on the environments page, environments can be used for each project and are visible in the respective project dashboards.

You can deploy directly to a specific environment by selecting it in the Deployments section.

Alternatively, you can deploy to multiple environments simultaneously by deploying a Model Config from either the Editor or the Model Configs table.

Using environments via API

For v4.0 API endpoints that support Active Deployments, such as Get Active Config or Chat Deployed, you can now optionally point to a model configuration deployed in a specific environment by including an optional additional environment field.

You can find this information in our v4.0 API Documentation or within the environment card in the Project Dashboard under the “Use API” option.

Clicking on the “Use API” option will provide code snippets that demonstrate the usage of the environment variable in practice.
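
As a rough sketch of how the optional environment field composes into a request — note that the path and parameter name below are illustrative assumptions, and the “Use API” snippets in your Project Dashboard show the authoritative forms:

```python
import urllib.parse

def active_config_url(base_url, project_id, environment=None):
    # Build a Get Active Config request URL. The optional `environment`
    # query parameter targets the config deployed to that environment;
    # omitting it falls back to the organisation's default environment.
    url = f"{base_url}/projects/{project_id}/active-config"
    if environment is not None:
        url += "?" + urllib.parse.urlencode({"environment": environment})
    return url

# Default environment:
active_config_url("https://api.humanloop.com/v4", "pr_123")
# Explicit environment:
active_config_url("https://api.humanloop.com/v4", "pr_123", environment="staging")
```

Omitting the field preserves the previous behaviour, so existing integrations continue to work unchanged.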

Improved Python SDK streaming response

We’ve improved our Python SDK’s streaming response to contain the datapoint ID. Using the ID, you can now provide feedback to datapoints created through streaming.

The humanloop.chat_stream() and humanloop.complete_stream() methods now yield a dictionary with output and id.

{'output': '...', 'id': 'data_...'}

Install the updated SDK with

$ pip install --upgrade humanloop

Example snippet

import asyncio
from humanloop import Humanloop

humanloop = Humanloop(api_key="YOUR_API_KEY", openai_api_key="YOUR_OPENAI_API_KEY")

async def main():
    response = await humanloop.chat_stream(
        project="sdk-example",
        messages=[
            {"role": "user", "content": "Explain asynchronous programming."}
        ],
        model_config={
            "model": "gpt-3.5-turbo",
            "max_tokens": -1,
            "temperature": 0.7,
            "chat_template": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant who replies in the style of {{persona}}.",
                }
            ],
        },
        inputs={"persona": "the pirate Blackbeard"},
    )
    async for token in response.content:
        print(token)  # E.g. {'output': 'Ah', 'id': 'data_oun7034jMNpb0uBnb9uYx'}

asyncio.run(main())
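
Building on the stream above, one way to keep hold of the datapoint ID for feedback is to accumulate chunks as they arrive — the feedback call itself is omitted here, since its exact signature belongs to the feedback endpoint docs:

```python
def collect_stream(chunks):
    """Concatenate streamed output and keep the datapoint ID for later feedback."""
    output, datapoint_id = "", None
    for chunk in chunks:
        output += chunk["output"]
        datapoint_id = chunk["id"]
    return output, datapoint_id

# Each streamed chunk repeats the same datapoint ID:
text, data_id = collect_stream([
    {"output": "Ah", "id": "data_abc"},
    {"output": " matey", "id": "data_abc"},
])
# You can then pass `data_id` to the feedback endpoint once the stream completes.
```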

OpenAI Azure support

We’ve just added support for Azure deployments of OpenAI models to Humanloop!

This update adds the ability to target Microsoft Azure deployments of OpenAI models to the playground and your projects. To set this up, visit your organization’s settings.

Enabling Azure OpenAI for your organization

As a prerequisite, you will need to already be set up with Azure OpenAI Service. See the Azure OpenAI docs for more details. At the time of writing, access is granted by application only.

Click the Setup button and provide your Azure OpenAI endpoint and API key.

Your endpoint can be found in the Keys & Endpoint section when examining your resource from the Azure portal. Alternatively, you can find the value in Azure OpenAI Studio > Playground > Code View. Endpoints typically take the form https://YOUR_RESOURCE_NAME.openai.azure.com/.

Your API keys can also be found in the Keys & Endpoint section when examining your resource from the Azure portal. You can use either KEY1 or KEY2.

Working with Azure OpenAI models

Once you’ve successfully enabled Azure OpenAI for your organization, you’ll be able to access it through the playground and in your projects in exactly the same way as your existing OpenAI and/or Anthropic models.

REST API and Python / TypeScript support

As with other model providers, once you’ve set up an Azure OpenAI-backed model config, you can call it with the Humanloop REST API or our SDKs.

import { Humanloop } from "humanloop";

const humanloop = new Humanloop({
  apiKey: "API_KEY",
});

const chatResponse = await humanloop.chat({
  project: "project_example",
  messages: [
    {
      role: "user",
      content: "Write me a song",
    },
  ],
  provider_api_keys: {
    openai_azure: OPENAI_AZURE_API_KEY,
    openai_azure_endpoint: OPENAI_AZURE_ENDPOINT,
  },
  model_config: {
    model: "my-azure-deployed-gpt-4",
    temperature: 1,
  },
});

In the model_config.model field, provide the name of the model that you deployed from the Azure portal (see note below for important naming conventions when setting up your deployment in the Azure portal).

The request will use the stored organization level key and endpoint you configured above, unless you override this on a per-request basis by passing both the endpoint and API key in the provider_api_keys field, as shown in the example above.

Note: Naming Model Deployments

When you deploy a model through the Azure portal, you’ll have the ability to provide your deployment with a unique name. For instance, if you choose to deploy an instance of gpt-35-turbo in your OpenAI Service, you may choose to give this an arbitrary name like my-orgs-llm-model.

In order to use all Humanloop features with your Azure model deployment, you must ensure that your deployments are named either with an unmodified base model name like gpt-35-turbo, or the base model name with a custom prefix like my-org-gpt-35-turbo. If your model deployments use arbitrary names that do not end with a base model name, you may find that certain features, such as setting max_tokens=-1 in your model configs, fail to work as expected.
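
The naming rule amounts to a suffix check, which can be sketched as follows — the base model list here is an illustrative subset for the example, not how Humanloop implements the lookup:

```python
# Illustrative subset of base model names; not an exhaustive list.
BASE_MODELS = {"gpt-35-turbo", "gpt-4"}

def has_base_model_suffix(deployment_name):
    # A deployment name is recognisable when it is a base model name,
    # optionally preceded by a custom prefix, e.g. "my-org-gpt-35-turbo".
    return any(deployment_name.endswith(base) for base in BASE_MODELS)

has_base_model_suffix("gpt-35-turbo")         # unmodified base name: recognised
has_base_model_suffix("my-org-gpt-35-turbo")  # custom prefix + base name: recognised
has_base_model_suffix("my-orgs-llm-model")    # base name not recoverable
```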

Project Editor

We’ve introduced an Editor within each project to make it easier to change prompts and bring in project-specific data.

The Editor will load up the currently active model config, and will save the generations in the project’s data table.

You can now also bring datapoints directly into the Editor. Select any datapoints you want (also via the x shortcut) and choose to open them in the Editor (or press e).

Press `e` while selecting a datapoint to bring it into Editor

We think this significantly improves the workflow of going from an interesting datapoint to an improved model config. As always, let us know if you have other feedback.


Cohere support

We’ve just added support for Cohere to Humanloop!

This update adds Cohere models to the playground and your projects - just add your Cohere API key in your organization’s settings. As with other providers, each user in your organization can also set a personal override API key, stored locally in the browser, for use in Cohere requests from the Playground.

Enabling Cohere for your organization

Add your Cohere API key to your organization settings to start using Cohere models with Humanloop.

Working with Cohere models

Once you’ve successfully enabled Cohere for your organization, you’ll be able to access it through the playground and in your projects, in exactly the same way as your existing OpenAI and/or Anthropic models.

REST API and Python / TypeScript support

As with other model providers, once you’ve set up a Cohere-backed model config, you can call it with the Humanloop REST API or our SDKs.

import { Humanloop } from "humanloop";

const humanloop = new Humanloop({
  apiKey: "API_KEY",
});

const chatResponse = await humanloop.chat({
  project: "project_example",
  messages: [
    {
      role: "user",
      content: "Write me a song",
    },
  ],
  provider_api_keys: {
    cohere: COHERE_API_KEY,
  },
  model_config: {
    model: "command",
    temperature: 1,
  },
});

If you don’t provide a Cohere API key under the provider_api_keys field, the request will fall back on the stored organization level key you configured above.

Improved Python SDK

We’ve just released a new version of our Python SDK supporting our v4 API!

This brings support for:

  • 💬 Chat mode
  • 📥 Streaming support humanloop.chat_stream(...)
  • 🕟 Async methods humanloop.acomplete(...)


pip install --upgrade humanloop

Example usage

complete_response = humanloop.complete(
    project="sdk-example",
    inputs={
        "text": "Llamas that are well-socialized and trained to halter and lead after weaning and are very friendly and pleasant to be around. They are extremely curious and most will approach people easily. However, llamas that are bottle-fed or over-socialized and over-handled as youth will become extremely difficult to handle when mature, when they will begin to treat humans as they treat each other, which is characterized by bouts of spitting, kicking and neck wrestling.[33]",
    },
    model_config={
        "model": "gpt-3.5-turbo",
        "max_tokens": -1,
        "temperature": 0.7,
        "prompt_template": "Summarize this for a second-grade student:\n\nText:\n{{text}}\n\nSummary:\n",
    },
    stream=False,
)

Migration from 0.3.x

For those coming from an older SDK version, this introduces some breaking changes. A brief highlight of the changes:

  • The client initialization step of hl.init(...) is now humanloop = Humanloop(...).
    • Previously provider_api_keys could be provided in hl.init(...). They should now be provided when constructing the Humanloop(...) client.
    • humanloop = Humanloop(
          api_key="YOUR_API_KEY",
          openai_api_key="YOUR_OPENAI_API_KEY",
          anthropic_api_key="YOUR_ANTHROPIC_API_KEY",
      )
  • hl.generate(...)’s various call signatures have now been split into individual methods for clarity. The main ones are:
    • humanloop.complete(project, model_config={...}, ...) for a completion with the specified model config parameters.
    • humanloop.complete_deployed(project, ...) for a completion with the project’s active deployment.

TypeScript SDK

We now have a fully typed TypeScript SDK to make working with Humanloop even easier.

You can use this with your JavaScript, TypeScript or Node projects.


$ npm i humanloop

Example usage

import { Humanloop } from "humanloop";

const humanloop = new Humanloop({
  // Defining the base path is optional and defaults to
  // basePath: "",
  apiKey: "API_KEY",
});

const chatResponse = await humanloop.chat({
  project: "project_example",
  messages: [
    {
      role: "user",
      content: "Write me a song",
    },
  ],
  provider_api_keys: {
    openai: OPENAI_API_KEY,
  },
  model_config: {
    model: "gpt-4",
    temperature: 1,
  },
});

We’ve added keyboard shortcuts to the datapoint viewer

  • g for good
  • b for bad
  • j / k for next/previous

This should help you quickly annotate data within your team.

You can also now link to a specific datapoint via the URL.

ChatGPT support

ChatGPT is here! It’s called ‘gpt-3.5-turbo’. Try it out today in the playground and on the generate endpoint.

Faster and 10x cheaper than text-davinci-003.

Faster datapoints table loading

The initial datapoints table now loads twice as fast! And it will continue to get faster.

Ability to open datapoint in playground

Added a way to go from the datapoint drawer to the playground with that datapoint loaded. Very convenient for trying tweaks to a model config or understanding an issue, without copy pasting.

Markdown view and completed prompt templates

We’ve added a tab to the datapoint drawer so you can see the prompt template filled in with the inputs and output.

We’ve also added a button in the top right-hand corner (or press M) to toggle viewing the text as markdown on and off.