October 3, 2024

Evals SDK Improvements

We’ve added a new run method for evals to our SDK. It provides a simpler entry point for evaluating your existing pipelines, both in CI/CD and in experimentation workflows. It is currently available in beta in the Python SDK and will soon be added to the major versions of both the Python and TypeScript SDKs.

In order to run an eval via the SDK, you need to provide:

  1. A callable function that takes your inputs/messages and returns a string
  2. A Dataset of inputs/messages to evaluate the function against
  3. A set of Evaluators to provide judgments

Here is a toy example using a simple OpenAI call as the function to evaluate.

from humanloop import Humanloop
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
hl = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))


# First define the app you're evaluating
def call_digital_twin(person: str, messages: list) -> str:
    system_message = {
        "role": "system",
        "content": f"You are {person}"
    }
    chat_completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[system_message] + messages,
    )
    answer = chat_completion.choices[0].message.content
    return answer


# Then run an eval specifying the file location on Humanloop
checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin
    },
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI"}
    ],
    dataset={
        "path": "questions",
        "datapoints": [
            {
                "inputs": {"person": "Albert Einstein"},
                "messages": [{"role": "user", "content": "What is your most famous theory?"}]
            },
            {
                "inputs": {"person": "Serena Williams"},
                "messages": [{"role": "user", "content": "What trophy did you win most recently?"}]
            },
            {
                "inputs": {"person": "Marie Curie"},
                "messages": [{"role": "user", "content": "What was your greatest scientific achievement?"}]
            },
            {
                "inputs": {"person": "Leonardo da Vinci"},
                "messages": [{"role": "user", "content": "Which of your inventions are you most proud of?"}]
            },
            {
                "inputs": {"person": "Rosa Parks"},
                "messages": [{"role": "user", "content": "What motivated you to refuse giving up your seat?"}]
            }
        ]
    },
)

Running this will print status info and an eval summary in your CLI, and a new eval will appear on Humanloop at the displayed URL. Running it again under the same name will add runs to the existing eval.

Navigate to your evals:
https://app.humanloop.com/project/fl_euUV4BHoXqKWqFyZ1YD1o/evaluations/evr_6WhFaHdkbWH8ZaoddzyRD/stats
Running digital-twin flow callable over the Dataset questions using 4 workers
[########################################] 5/5 (100.00%) | DONE
⏳ Evaluation Progress
Total Logs: 10/10
Total Judgments: 30/30
📊 Eval results for digital-twin
+----------------+----------------------------------------+----------------------------------------+
| Version ID     | flv_VIP1eiemqbpWmlsr84BwN (eb37773f39) | flv_VIP1eiemqbpWmlsr84BwN (9de378a165) |
+----------------+----------------------------------------+----------------------------------------+
| Added          | 2024-10-08 03:46:11                    | 2024-10-08 03:51:52                    |
+----------------+----------------------------------------+----------------------------------------+
| Evaluators     |                                        |                                        |
+----------------+----------------------------------------+----------------------------------------+
| Latency        | 0.02                                   | 0.015                                  |
| Correctness-AI | 1.0                                    | 1.0                                    |
+----------------+----------------------------------------+----------------------------------------+

It returns a set of checks you can use to determine whether the eval passed or failed.
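
For example, you can gate a CI/CD job on these checks. The sketch below continues from the example above; the attribute names on each check object (path and improvement_check) are illustrative assumptions rather than fields confirmed by this changelog, so inspect the objects returned by your SDK version to confirm them.

# Minimal CI/CD sketch, continuing from the example above.
# NOTE: the attribute names used here ("path", "improvement_check") are
# illustrative assumptions -- print a check object to confirm the fields
# available in your SDK version.
failing = [c.path for c in checks if getattr(c, "improvement_check", True) is False]
if failing:
    raise SystemExit(f"Eval checks failed for: {failing}")
print("All eval checks passed.")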

Introduce versioning

So far, the only thing distinguishing different eval runs under the same eval name is the timestamp at which they were run.

It can also be helpful to record the configuration of your system at the time the eval was run.

You can include arbitrary config within the version field of the file. If this version has been used before, Humanloop will automatically associate it with your run. If the config is new, we will automatically create a new version of your file for future reference.

import inspect

checks = hl.evaluations.run(
    name="Demo eval",
    file={
        "path": "digital-twin",
        "callable": call_digital_twin,
        "version": {
            "version": "0.2.4",
            "code": inspect.getsource(call_digital_twin)
        }
    },
    dataset={...},
    evaluators=[...],
)

Leverage native Prompts

Using hl.evaluations.run(...) will by default create a Flow on Humanloop. Flows have the advantage of being able to represent more complex traces, but can’t be run natively within the Humanloop Editor.

It’s also possible to adapt the run call to evaluate Prompts instead, by setting the type to prompt and providing valid Prompt params in the version field.

checks = hl.evaluations.run(
    file={
        "path": "digital-twin-prompt",
        "type": "prompt",
        "version": {
            "model": "gpt-4o-mini",
            "template": [{"role": "system", "content": "You are {{person}}"}]
        }
    },
    dataset={...},
    evaluators=[...],
)

Add Evaluator thresholds

You can also now provide a threshold value for each of your Evaluators. If provided, the returned checks will indicate whether the average performance of each Evaluator met its threshold.

checks = hl.evaluations.run(
    file={...},
    dataset={...},
    evaluators=[
        {"path": "Latency"},
        {"path": "Cost"},
        {"path": "Correctness-AI", "threshold": 0.5}
    ],
)
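
With a threshold in place, the returned checks can be used to fail a pipeline when an Evaluator's average score falls short. As above, this is a minimal sketch: the path and threshold_check attribute names are assumptions for illustration, not fields confirmed by this changelog, so verify them against the objects your SDK version returns.

# Hypothetical CI gate on Evaluator thresholds. The "path" and
# "threshold_check" attribute names are assumed for illustration;
# verify them against the check objects returned by your SDK version.
below_threshold = [
    c.path for c in checks if getattr(c, "threshold_check", None) is False
]
if below_threshold:
    raise SystemExit(f"Evaluators below threshold: {below_threshold}")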