Evals SDK Improvements
We’ve added a new `run` method for evals to our SDK. This provides a simpler entry point for evaluating your existing pipelines, both in your CI/CD and experimentation workflows. It’s currently available in Beta on Python and will soon be added to the major versions of both our Python and TypeScript SDKs.
In order to run an eval via the SDK, you need to provide:
- A callable function that takes your inputs/messages and returns a string
- A Dataset of inputs/messages to evaluate the function against
- A set of Evaluators to provide judgements
Here is a toy example using a simple OpenAI call as the function to evaluate.
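The sketch below shows what such a call could look like; the argument names, File/Dataset paths, and datapoint shape are illustrative assumptions rather than the SDK’s exact signature.

```python
from humanloop import Humanloop
from openai import OpenAI

hl = Humanloop(api_key="<YOUR_HUMANLOOP_API_KEY>")
client = OpenAI()


def answer_question(messages: list[dict]) -> str:
    """The pipeline under evaluation: a single OpenAI chat call returning a string."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content


checks = hl.evaluations.run(
    name="Capitals quiz",  # re-running under this name adds runs to the same eval
    file={
        "path": "Demos/Capitals pipeline",  # hypothetical File path on Humanloop
        "callable": answer_question,
    },
    dataset={
        "path": "Demos/Capitals dataset",  # hypothetical Dataset path
        "datapoints": [
            {"messages": [{"role": "user", "content": "What is the capital of France?"}]},
            {"messages": [{"role": "user", "content": "What is the capital of Japan?"}]},
        ],
    },
    evaluators=[
        {"path": "Example Evaluators/AI/Semantic similarity"},  # hypothetical Evaluator path
    ],
)
```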
Running this will display status info and an eval summary in your CLI, and a new eval will appear on Humanloop at the displayed URL. Running it again under the same name will add runs to the existing eval. The `run` call also returns a set of checks you can use to determine whether the eval passed or failed.
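For example, assuming each check exposes a `passed` flag (an assumption about the return shape), a CI job could gate on the result:

```python
# Sketch: fail the CI job if any Evaluator check did not pass.
# The attribute name `passed` is an assumption for illustration.
if not all(check.passed for check in checks):
    raise SystemExit("Eval did not meet its targets")
```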
Introduce versioning
So far, the only thing distinguishing different eval runs under the same eval name is the timestamp at which they were run. It can also be helpful to record the configuration of your system when running the eval. You can include arbitrary config within the `version` field of the `file`. If this `version` has been used before, Humanloop will automatically associate it with your run. If the config is new, we will automatically create a new version of your file for future reference.
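As a sketch, continuing from the toy example above (the keys inside `version` are arbitrary; the surrounding argument names remain assumptions):

```python
# `hl` and `answer_question` as defined in the toy example above.
checks = hl.evaluations.run(
    name="Capitals quiz",
    file={
        "path": "Demos/Capitals pipeline",
        "callable": answer_question,
        # Arbitrary config describing this version of your system (illustrative keys).
        "version": {
            "model": "gpt-4o-mini",
            "temperature": 0.2,
            "retrieval_k": 5,
        },
    },
    dataset={"path": "Demos/Capitals dataset"},
    evaluators=[{"path": "Example Evaluators/AI/Semantic similarity"}],
)
```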
Leverage native Prompts
Using `hl.evaluations.run(...)` will by default create a Flow on Humanloop. Flows have the advantage of being able to represent more complex traces, but can’t be run natively within the Humanloop Editor.
It’s also possible to adapt the `run` call to instead evaluate Prompts by defining the `type` as `prompt` and providing valid Prompt params in the `version` field.
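A sketch under the same assumed argument names as above; the Prompt params shown inside `version` are illustrative:

```python
# `hl` and `answer_question` as defined in the toy example above.
checks = hl.evaluations.run(
    name="Capitals quiz (prompt)",
    file={
        "path": "Demos/Capitals prompt",  # hypothetical Prompt path
        "type": "prompt",                 # evaluate a Prompt rather than a Flow
        "callable": answer_question,
        "version": {
            # Valid Prompt params (illustrative values).
            "model": "gpt-4o-mini",
            "temperature": 0.2,
            "template": [
                {"role": "system", "content": "Answer geography questions concisely."},
            ],
        },
    },
    dataset={"path": "Demos/Capitals dataset"},
    evaluators=[{"path": "Example Evaluators/AI/Semantic similarity"}],
)
```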
Add Evaluator thresholds
You can also now provide a threshold value for each of your Evaluators. If provided, the returned checks will indicate whether the average performance of the Evaluator met the threshold.
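For instance, assuming each Evaluator entry accepts a `threshold` key (the key name is an assumption), the call might look like this:

```python
# `hl` and `answer_question` as defined in the toy example above.
checks = hl.evaluations.run(
    name="Capitals quiz",
    file={"path": "Demos/Capitals pipeline", "callable": answer_question},
    dataset={"path": "Demos/Capitals dataset"},
    evaluators=[
        # Hypothetical: the check fails if average semantic similarity is below 0.8.
        {"path": "Example Evaluators/AI/Semantic similarity", "threshold": 0.8},
    ],
)
```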