November 8, 2023
Improved evaluation runs
You can now trigger runs against multiple model configs simultaneously.
This makes it easier to compare and evaluate changes across your prompts. We’ve also removed the summary cards and replaced them with a table that supports sorting and rearranging columns, so you can interrogate results more effectively.
Multiple model configs
To run evaluations against multiple model configs, select the target model configs in the run dialog just as before; multiple selections are now supported. This triggers an evaluation run for each selected model config, all at once.
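If you are scripting evaluations rather than using the dialog, the pattern is the same: one run per selected model config. The snippet below is only a minimal sketch of that idea; the endpoint path, header, environment variable, and payload fields shown are assumptions for illustration, not the documented Humanloop API, so check the API reference for the exact shape.

```python
# Hypothetical sketch: trigger one evaluation run per selected model config.
# The endpoint, header, and payload fields below are assumptions, not the documented API.
import os
import requests

API_KEY = os.environ["HUMANLOOP_API_KEY"]   # assumed env var name
PROJECT_ID = "pr_example123"                # hypothetical project ID
DATASET_ID = "ds_example456"                # hypothetical dataset ID
MODEL_CONFIG_IDS = ["mc_aaa", "mc_bbb"]     # the model configs selected as targets

for config_id in MODEL_CONFIG_IDS:
    # One evaluation run is created for each selected model config.
    response = requests.post(
        "https://api.humanloop.com/v4/evaluations",  # hypothetical endpoint
        headers={"X-API-KEY": API_KEY},
        json={
            "project_id": PROJECT_ID,
            "config_id": config_id,
            "dataset_id": DATASET_ID,
        },
    )
    response.raise_for_status()
    print(f"Triggered run for {config_id}")
```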
Evaluation table
We’ve updated evaluation runs with a table that shows run outcomes in a more condensed form. It also lets you sort results and trigger re-runs more easily. As new evaluators are included, a column is automatically added to the table for each one.
Re-run previous evaluations
We’ve exposed a re-run option in the table so you can quickly trigger a run again, or use an older run to preload the dialog and change parameters such as the target dataset or model config.
New OpenAI turbos
Off the back of OpenAI’s dev day, we’ve added support for the newly announced turbo models:
- gpt-4-1106-preview
- gpt-3.5-turbo-1106
Both of these models add several nice capabilities:
- Better instruction-following performance
- JSON mode, which forces the model to return valid JSON
- Parallel tool calling, so multiple tools can be called in a single response
- A seed parameter for reproducible outputs (see the example below)
You can now access these in your Humanloop Editor and via the API.
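For illustration, here is a minimal sketch of JSON mode and the seed parameter using the OpenAI Python SDK directly; the model, prompt, and seed value are just example choices.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",               # or "gpt-3.5-turbo-1106"
    response_format={"type": "json_object"},  # JSON mode: the reply is guaranteed to be valid JSON
    seed=42,                                  # best-effort reproducible outputs across calls
    messages=[
        {"role": "system", "content": "You are a helpful assistant that replies in JSON."},
        {"role": "user", "content": "List three colours under the key 'colours'."},
    ],
)

print(response.choices[0].message.content)
```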