January
Aggregate statistics for Eval Runs
January 21st, 2025
We’ve added aggregate statistics to the Runs table to help you quickly compare performance across different Evaluators. You can view these statistics in the Runs tab of any Evaluation that contains Evaluators.
For boolean Evaluators, we show the percentage of true judgments. For number Evaluators, we display the average value.
For select and multi-select Evaluators, we display a bar chart showing the distribution of the judgments.
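The three aggregation rules above can be sketched in a few lines of Python. This is a minimal illustration of how such statistics could be computed, not Humanloop's actual implementation; the function name and argument shapes are hypothetical.

```python
from collections import Counter

def aggregate_judgments(evaluator_type, judgments):
    """Illustrative sketch: summarize judgments per Evaluator type.

    evaluator_type: "boolean", "number", or "select" / "multi-select"
    judgments: booleans, numbers, or option labels respectively.
    """
    if evaluator_type == "boolean":
        # Percentage of true judgments
        return 100.0 * sum(judgments) / len(judgments)
    if evaluator_type == "number":
        # Average value
        return sum(judgments) / len(judgments)
    # select / multi-select: distribution of judgments (bar-chart data)
    return dict(Counter(judgments))

aggregate_judgments("boolean", [True, True, False, True])  # 75.0
```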
Additional icons indicate the status of the Run as it relates to the aggregate statistic:
- A spinning icon indicates that not all Logs have judgments, and the Run is currently being executed. The displayed aggregate statistic may not be final.
- A clock icon shows that not all Logs have judgments, though the Run is not currently being executed.
- A red warning icon indicates that errors occurred while running the Evaluator.
Hover over these icons or aggregate statistics to view more details in the tooltip, such as the number of judgments and the number of errors (if any).
Select Eval Runs for comparison
January 21st, 2025
You can now more easily compare your relevant Runs by selecting them in the Runs tab.
To filter to a subset of Runs, go to the Runs tab and select them by clicking the checkbox or by pressing x with your cursor on the row.
Then, go to the Stats or Review tab to see the comparison between the selected Runs. Your control Run will always be included in the comparison.
Judgment filters in Review view
January 17th, 2025
You can now filter Logs by judgments in the Review tab of an Evaluation. This feature allows you to quickly retrieve specific Logs, such as those marked as “Good” or “Bad” by a subject-matter expert, or those with latency below a certain threshold.
To get started, click the Filter button in the Review tab to set up your first filter.