July

Improved error handling

July 30th, 2023

We’ve unified how errors returned by model providers are handled and enabled error monitoring using eval functions.

A common production pain point we see is that hosted SOTA language models can still be flaky at times, especially at real scale. With this release, Humanloop can help users better understand the extent of the problem and guide them to different model choices to improve reliability.

Unified errors

Our users integrate the Humanloop /chat and /completion API endpoints as a unified interface into all the popular model providers including OpenAI, Anthropic, Azure, Cohere, etc. Their Humanloop projects can then be used to manage model experimentation, versioning, evaluation and deployment.

Errors returned by these endpoints may be raised by the model provider’s systems. With this release, we’ve updated our API to map the error behaviours of the different model providers to a unified set of error response codes.

We’ve also extended our error responses to include more details of the error, with fields for type, message, code and origin. The origin field indicates whether the error originated from one of the integrated model providers’ systems, or directly from Humanloop.

For example, when we call our /chat endpoint with an invalid max_tokens setting for OpenAI, the message returned is the one raised by OpenAI and the origin is set to OpenAI.

```json
{
  "type": "unprocessable_entity_error",
  "message": "This model's maximum context length is 4097 tokens. However, you requested 10000012 tokens (12 in the messages, 10000000 in the completion). Please reduce the length of the messages or completion.",
  "code": 422,
  "origin": "OpenAI"
}
```
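
As an illustration (this retry/fallback policy is our own sketch, not part of the API), client code can branch on these unified fields to decide whether to retry, fall back to another model, or fix the request:

```python
# Illustrative handling of a unified Humanloop error payload (a hypothetical
# policy, not an SDK feature). The type/code/origin fields are the ones shown
# in the example response above.
def classify_error(error: dict) -> str:
    if error["origin"] != "Humanloop" and error["code"] in (429, 500, 502, 503):
        # Transient provider-side issues: candidates for retry or model fallback
        return "retry_or_fallback"
    if error["type"] == "unprocessable_entity_error":
        # The request itself is invalid (e.g. max_tokens too large): fix the request
        return "fix_request"
    return "raise"


print(classify_error({
    "type": "unprocessable_entity_error",
    "message": "This model's maximum context length is 4097 tokens. ...",
    "code": 422,
    "origin": "OpenAI",
}))  # -> "fix_request"
```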

Monitor model reliability with evals

With this release, all errors returned from the different model providers are now persisted alongside the corresponding input data as datapoints on Humanloop. Furthermore, this error data is made available for use within evaluation functions.

You can now turn on the Errors eval function, which tracks overall error rates of the different model variations in your project. Or you can customise this template to track more specific error behaviour.

Errors evaluation function template now available
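
As a sketch of the kind of customisation possible, an Evaluator along these lines could track a specific class of provider error. The exact shape of the error data exposed on the datapoint (here assumed to be an error dict with a code field) is illustrative:

```python
# Illustrative custom Evaluator: flag datapoints where the provider returned
# a rate-limit style error. The `error` field on the datapoint is an assumed
# name for the persisted error data described above.
def is_rate_limit_error(datapoint):
    error = datapoint.get("error")
    if not error:
        return False
    return error.get("code") == 429
```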

OpenAI functions in Playground

July 25th, 2023

We’ve added support for OpenAI functions to our playground!

This builds on our API support and allows you to easily experiment with OpenAI functions within our playground UI.

OpenAI functions are implemented as tools on Humanloop. Tools follow the same universal json-schema definition as OpenAI functions. You can now define tools as part of your model configuration in the playground. These tools are sent as OpenAI functions when running the OpenAI chat models that support function calling.

The model can choose to return a JSON object containing the arguments needed to call a function. This object is displayed as a special assistant message within the playground. You can then provide the result of the call in a message back to the model to consider, which simulates the function calling workflow.

Use tools in Playground

Take the following steps to use tools for function calling in the playground:

  1. Find tools: Navigate to the playground and locate the Tools section. This is where you’ll be able to manage your tool definitions.

  2. Create a new tool: Click the “Add Tool” button. The dropdown gives two options: create a new tool or start from one of our examples. You define your tool using the json-schema syntax; this represents the function definition sent to OpenAI.

  3. Edit a tool: To edit an existing tool, simply click on the tool in the Tools section and make the necessary changes to its json-schema definition. This will result in a new model configuration.

  4. Run a model with tools: Once you’ve defined your tools, you can run the model by pressing the “Run” button.
    1. If the model chooses to call a function, an assistant message will be displayed with the corresponding tool name and arguments to use.
    2. A subsequent Tool message is then displayed to simulate sending the results of the call back to the model to consider.

  5. Save your model config with tools by using the Save button. Model configs with tools defined can then be deployed to environments as normal.

Coming soon

Provide the runtime for your tool under the existing pre-defined Tools section of your organization on Humanloop.


Llama 2

July 24th, 2023

We’ve added support for Llama 2!

You can now select llama70b-v2 from the model dropdown in the Playground and Editor. You don’t currently need to provide an API key or any other special configuration to get Llama 2 access via Humanloop.

Llama 2 is available in Playground and Editor for all Humanloop users.

Read more about the latest version of Llama here and in the original announcement.


Claude 2

July 17th, 2023

We’ve added support for Anthropic’s latest model Claude 2.0!

Read more about the latest Claude here.


Evaluators

July 7th, 2023

We’ve added Evaluators to Humanloop in beta!

Evaluators allow you to quantitatively define what constitutes a good or bad output from your models. Once set up, you can configure an Evaluator to run automatically across all new datapoints as they appear in your project; or, you can simply run it manually on selected datapoints from the Data tab.

We’re going to be adding lots more functionality to this feature in the coming weeks, so check back for more!

Create an Evaluator

If you’ve been given access to the feature, you’ll see a new Evaluations tab in the Humanloop app. To create your first evaluation function, select + New Evaluator. In the dialog, you’ll be presented with a library of example Evaluators, or you can start from scratch.

We offer a library of example Evaluators to get you started.

We’ll pick Valid JSON for this guide.

Evaluator editor.

In the editor, give your function a name, description and return type. In the code editor, write a function that accepts a datapoint argument and returns a value of the chosen type.

Currently, the available return types for an Evaluator are number and boolean. You should ensure that your function returns the expected data type - an error will be raised at runtime if not.
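
For example, a minimal number-returning Evaluator might simply measure the length of the generated output (a sketch; output is one of the datapoint fields described in the next section):

```python
def output_length(datapoint):
    # Return type: number - the character length of the generated output
    return len(datapoint["output"] or "")
```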

The Datapoint argument

The datapoint passed into your function will be a Python dict with the following structure.

```python
{
  "id": "data_XXXX",          # Datapoint id
  "model_config": {...},      # Model config used to generate the datapoint
  "inputs": {...},            # Model inputs (interpolated into the prompt)
  "output": "...",            # Generated output from the model
  "provider_latency": 0.6,    # Provider latency in seconds
  "metadata": {...},          # Additional metadata attached to the logged datapoint
  "created_at": "...",        # Creation timestamp
  "feedback": [...]           # Array of feedback provided on the datapoint
}
```

To inspect datapoint dictionaries in more detail, click Random selection in the debug console at the bottom of the window. This will load a random set of five datapoints from your project, exactly as they will be passed into the Evaluation Function.

The debug console - load datapoints to inspect the argument passed into Evaluators.

For this demo, we’ve created a prompt which asks the model to produce valid JSON as its output. The Evaluator uses a simple json.loads call to determine whether the output is well-formed JSON - if the call raises an exception, the output is not valid JSON and we return False.

```python
import json


def check_valid_json(datapoint):
    try:
        return json.loads(datapoint["output"]) is not None
    except Exception:
        return False
```

Debugging

Once you have drafted a Python function, try clicking the run button next to one of the debug datapoints in the debug console. You should shortly see the result of executing your function on that datapoint in the table.

A `True` result from executing the **Valid JSON** Evaluator on the datapoint.

If your Evaluator misbehaves, either by being invalid Python code, raising an unhandled exception, or returning the wrong type, an error will appear in the result column. You can hover over this error to see more details about what went wrong - the exception string is displayed in the tooltip.

Once you’re happy with your Evaluator, click Create in the bottom left of the dialog.

Activate / Deactivate an Evaluator

Your Evaluators are available across all your projects. When you visit the Evaluations tab from a specific project, you’ll see all Evaluators available in your organisation.

Each Evaluator has a toggle. If you toggle the Evaluator on, it will run on every new datapoint that gets logged to that project. (Switch to another project and you’ll see that the Evaluator is not toggled on there unless you’ve chosen to enable it for that project.)

You can deactivate an Evaluator for a project by toggling it back off at any time.

Aggregations and Graphs

At the top of the Dashboard tab, you’ll see new charts for each activated Evaluation Function. These display aggregated Evaluation results through time for datapoints in the project.

At the bottom of the Dashboard tab is a table of all the model configs in your project. That table will display a column for each activated Evaluator in the project. The data displayed in this column is an aggregation of all the Evaluation Results (by model config) for each Evaluator. This allows you to assess the relative performance of your models.

Evaluation Results through time, by model config. In this example, one of the model configs is not producing Valid JSON outputs, while the other does so about 99% of the time.

Aggregation

For the purposes of both the charts and the model configs table, aggregations work as follows for the different return types of Evaluators:

  • Boolean: the percentage of evaluated datapoints for which the Evaluator returned True
  • Number: the average value across all evaluated datapoints
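
As a rough sketch (not Humanloop’s actual implementation), the aggregation logic amounts to the following:

```python
# Illustrative aggregation of Evaluation Results for one Evaluator and one
# model config. `results` is the list of values the Evaluator returned.
def aggregate(results, return_type):
    if not results:
        return None
    if return_type == "boolean":
        return 100 * sum(1 for r in results if r is True) / len(results)  # % True
    if return_type == "number":
        return sum(results) / len(results)  # average value
```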

Data logs

In the Data tab, you’ll also see that a column is visible for each activated Evaluator, indicating the result of running the function on each datapoint.

The **Data** tab for a project, showing the **Valid JSON** Evaluation Results for a set of datapoints.

From this tab, you can choose to re-run an Evaluator on a selection of datapoints. Either use the menu at the far right of a single datapoint, or select multiple datapoints and choose Run evals from the Actions menu in the top right.

Available Modules

The following Python modules are available to be imported in your Evaluation Function:

  • math
  • random
  • datetime
  • json (useful for validating JSON grammar as per the example above)
  • jsonschema (useful for more fine-grained validation of JSON output - see the in-app example)
  • sqlglot (useful for validating SQL query grammar)
  • requests (useful to make further LLM calls as part of your evaluation - see the in-app example for a suggestion of how to get started).

Let us know if you would like to see more modules available.
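
As an illustration, here’s a sketch of an Evaluator that uses jsonschema for more fine-grained validation of the output (the schema itself is a made-up example - replace it with whatever structure your prompt asks for):

```python
import json

import jsonschema

# Illustrative schema - swap in the structure your prompt requests
SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}


def matches_schema(datapoint):
    try:
        jsonschema.validate(json.loads(datapoint["output"]), SCHEMA)
        return True
    except Exception:
        return False
```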


Chain LLM calls

July 5th, 2023

We’ve introduced sessions to Humanloop, allowing you to link multiple calls together when building a chain or agent.

Using sessions with your LLM calls helps you troubleshoot and improve your chains and agents.

Trace of an Agent's steps logged as a session

Adding a datapoint to a session

To log your LLM calls to a session, you just need to define a unique identifier for the session and pass it into your Humanloop calls with session_reference_id.

For example, using uuid4() to generate this ID,

```python
import uuid

from humanloop import Humanloop

humanloop = Humanloop(api_key="YOUR_API_KEY")

session_reference_id = str(uuid.uuid4())

# user_request and google_answer are assumed to be defined earlier in your script
response = humanloop.complete(
    project="sessions_example_assistant",
    model_config={
        "prompt_template": "Question: {{user_request}}\nGoogle result: {{google_answer}}\nAnswer:\n",
        "model": "text-davinci-002",
        "temperature": 0,
    },
    inputs={"user_request": user_request, "google_answer": google_answer},
    session_reference_id=session_reference_id,
)
```

Our other methods, such as humanloop.complete_deployed(), humanloop.chat(), and humanloop.log(), also support session_reference_id.
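
For example, a later step in the same chain could be logged against the same session. This is only a sketch: final_answer is a hypothetical variable for that step’s output, and the fields passed to humanloop.log() mirror the patterns shown elsewhere in these notes.

```python
# Sketch: attach another step of the chain to the same session by reusing
# the same session_reference_id. Field names mirror the complete() call above.
humanloop.log(
    project="sessions_example_assistant",
    inputs={"user_request": user_request, "google_answer": google_answer},
    output=final_answer,  # hypothetical variable holding this step's output
    session_reference_id=session_reference_id,
)
```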

If you’re using our API directly, you can pass session_reference_id in the request body of your POST /v4/completion (and related) endpoints.

Further details

For a more detailed walkthrough on how to use session_reference_id, check out our guide that runs through how to record datapoints to a session in an example script.


Introducing Tools

July 3rd, 2023

Today we’re announcing Tools as a part of Humanloop.

Tools allow you to connect an LLM to any API and to an array of data sources to give it extra capabilities and access to private data. Under your organization settings on Humanloop, you can now configure and manage tools in a central place.

Read more on our blog and see an example of setting up a tool for semantic search.


OpenAI functions API

July 3rd, 2023

We’ve updated our APIs to support OpenAI function calling.

OpenAI functions are now supported as tools on Humanloop. This allows you to pass tool definitions as part of the model configuration when calling our chat and log endpoints. For the latest OpenAI models, gpt-3.5-turbo-0613 and gpt-4-0613, the model can then choose to output a JSON object containing arguments to call these tools.

This makes it easier to get reliable structured data back from the model and to create useful agents.

Recap on OpenAI functions

As described in the OpenAI documentation, the basic steps for using functions are:

  1. Call one of the models gpt-3.5-turbo-0613 or gpt-4-0613 with a user query and a set of function definitions described using the universal json-schema syntax.
  2. The model can then choose to call one of the functions provided. If it does, a stringified JSON object adhering to your json-schema definition will be returned.
  3. You can then parse the string into JSON in your code and call the chosen function with the provided arguments (NB: the model may hallucinate arguments or return invalid JSON, so be sure to handle these scenarios in your code).
  4. Finally, call the model again, appending the function response as a new message. The model can then use this information to respond to the original user query.

OpenAI provide a simple example in their docs for a get_current_weather function; below we show how to adapt it for use with Humanloop:

```python
import openai
import json


# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    weather_info = {
        "location": location,
        "temperature": "72",
        "unit": unit,
        "forecast": ["sunny", "windy"],
    }
    return json.dumps(weather_info)


def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    functions = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        functions=functions,
        function_call="auto",  # auto is default, but we'll be explicit
    )
    response_message = response["choices"][0]["message"]

    # Step 2: check if GPT wanted to call a function
    if response_message.get("function_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        function_name = response_message["function_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response_message["function_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the info on the function call and function response to GPT
        messages.append(response_message)  # extend conversation with assistant's reply
        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with function response
        second_response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
        )  # get a new response from GPT where it can see the function response
        return second_response


print(run_conversation())
```

Using with Humanloop tools

OpenAI functions are treated as tools on Humanloop. Tools conveniently follow the same universal json-schema definition as OpenAI functions.

We’ve expanded the definition of our model configuration to also include tool definitions. Historically, the model config has been made up of the chat template, the choice of base model, and any hyper-parameters that change the behaviour of the model.

In the case of OpenAI’s gpt-3.5-turbo-0613 and gpt-4-0613 models, any tools defined as part of the model config are passed through as functions for the model to use.

You can now specify these tools when using the Humanloop chat endpoint (as a replacement for OpenAI’s ChatCompletion), or when using the Humanloop log endpoint in addition to the OpenAI calls:

Chat endpoint

We show here how to update the run_conversation() method from the OpenAI example to instead use the Humanloop chat endpoint with tools:

```python
import json

from humanloop import Humanloop

hl = Humanloop(
    # get your API key here: https://app.humanloop.com/account/api-keys
    api_key="YOUR_API_KEY",
)


def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    # functions are referred to as tools on Humanloop, but follow the same schema
    tools = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    response = hl.chat(
        project="Assistant",
        model_config={
            "model": "gpt-3.5-turbo-0613",
            "tools": tools,
        },
        messages=messages,
    )
    response = response.body.data[0]

    # Step 2: check if GPT wanted to call a tool
    if response.get("tool_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,  # as defined in the OpenAI example above
        }  # only one function in this example, but you can have multiple
        function_name = response["tool_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response["tool_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the response back to the model
        messages.append(
            {"role": "assistant", "content": None, "tool_call": response["tool_call"]}
        )  # extend conversation with the assistant's tool call
        messages.append(
            {
                "role": "tool",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with the tool response
        second_response = hl.chat(
            project="Assistant",
            model_config={
                "model": "gpt-3.5-turbo-0613",
                "tools": tools,
            },
            messages=messages,
        )
        return second_response
```

After running this snippet, the model configuration recorded on your project in Humanloop will track which tools were provided to the model, and the logged datapoints will include details of the tool call for you to inspect:

Log endpoint

Alternatively, you can use the Humanloop log endpoint explicitly alongside your existing OpenAI calls to achieve the same result:

```python
import json

import openai
from humanloop import Humanloop

hl = Humanloop(
    # get your API key here: https://app.humanloop.com/account/api-keys
    api_key="YOUR_API_KEY",
)


def run_conversation():
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
    functions = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        functions=functions,
        function_call="auto",  # auto is default, but we'll be explicit
    )
    response_message = response["choices"][0]["message"]

    # log the result to humanloop
    # (OpenAI functions are passed as tools, which follow the same schema)
    log_response = hl.log(
        project="Assistant",
        model_config={
            "model": "gpt-3.5-turbo-0613",
            "tools": functions,
        },
        messages=messages,
        tool_call=response_message.get("function_call"),
    )

    # Step 2: check if GPT wanted to call a function
    if response_message.get("function_call"):
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,  # as defined in the first example above
        }  # only one function in this example, but you can have multiple
        function_name = response_message["function_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(response_message["function_call"]["arguments"])
        function_response = function_to_call(
            location=function_args.get("location"),
            unit=function_args.get("unit"),
        )

        # Step 4: send the info on the function call and function response to GPT
        messages.append(response_message)  # extend conversation with assistant's reply
        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with function response
        second_response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
        )  # get a new response from GPT where it can see the function response

        log_response = hl.log(
            project="Assistant",
            model_config={
                "model": "gpt-3.5-turbo-0613",
                "tools": functions,
            },
            messages=messages,
            output=second_response["choices"][0]["message"]["content"],
        )
        return second_response


print(run_conversation())
```

Coming soon

Support for defining tools in the playground!