September 5, 2024

Evals CI/CD Improvements

We’ve expanded our Evals API with new fields that make it easier to check progress and render summaries of your Evals directly in your deployment logs.

The stats response now contains a status field you can poll, plus progress and report fields that you can print (see the sketch after the example output below):

⏳ Evaluation Progress
Total Logs: 40/40
Total Judgments: 120/120
📊 Evaluation Results for evals_demo/answer-flow
+------------------------+---------------------------+---------------------------+
| Version id             | flv_xo7ZxnkkvcFcDJ9pwSrA9 | flv_foxO18ZHEgxQmwYJO4bR1 |
+------------------------+---------------------------+---------------------------+
| Created                | 2024-09-01 14:50:28       | 2024-09-02 14:53:24       |
+------------------------+---------------------------+---------------------------+
| Evaluators             |                           |                           |
+------------------------+---------------------------+---------------------------+
| evals_demo/exact_match | 0.8                       | 0.65                      |
| evals_demo/levenshtein | 7.5                       | 33.5                      |
| evals_demo/reasoning   | 0.3                       | 0.05                      |
+------------------------+---------------------------+---------------------------+
Navigate to Evaluation: https://app.humanloop.com/evaluations/evr_vXjRgufGzwuX37UY83Lr8
❌ Latest score [0.05] below the threshold [0.5] for evaluator evals_demo/reasoning.
❌ Regression of [-0.25] for evaluator evals_demo/reasoning.
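
In practice, the polling loop is only a few lines. Below is a minimal sketch in Python, assuming the requests library, an X-API-KEY header, a stats route of the form .../evaluations/{id}/stats, and a terminal status value of "completed" — these details are assumptions for illustration, so check the API reference for the exact endpoint and response schema. The status, progress, and report fields are the new additions:

import os
import time

import requests  # assumed HTTP client for this sketch

# Hypothetical route and auth header for illustration only; the
# evaluation ID is the one from the example output above.
API_KEY = os.environ["HUMANLOOP_API_KEY"]
STATS_URL = (
    "https://api.humanloop.com/v5/evaluations/evr_vXjRgufGzwuX37UY83Lr8/stats"
)

while True:
    stats = requests.get(
        STATS_URL, headers={"X-API-KEY": API_KEY}, timeout=30
    ).json()
    # progress is a printable summary of logs/judgments completed so far.
    print(stats["progress"])
    # Poll status until the evaluation has finished ("completed" is an
    # assumed terminal value here).
    if stats["status"] == "completed":
        break
    time.sleep(5)

# report is the ready-to-print summary table shown above.
print(stats["report"])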

See our reference example for how to leverage Evals in your CI/CD pipeline to test for regressions in your AI apps.
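
To gate a deployment on these results, the same checks can be reduced to a process exit code for your pipeline. The sketch below is illustrative rather than part of the API: the threshold, evaluator name, and scores mirror the example output above, and the check function is a hypothetical helper:

import sys

# Illustrative CI gate: fail the job when an evaluator's latest score
# falls below its threshold or regresses against the previous version.
THRESHOLDS = {"evals_demo/reasoning": 0.5}

def check(latest: dict, previous: dict) -> int:
    """Return a process exit code: 0 if all checks pass, 1 otherwise."""
    failed = False
    for evaluator, threshold in THRESHOLDS.items():
        score = latest[evaluator]
        if score < threshold:
            print(
                f"❌ Latest score [{score}] below the threshold "
                f"[{threshold}] for evaluator {evaluator}."
            )
            failed = True
        delta = round(score - previous[evaluator], 4)  # avoid float noise
        if delta < 0:
            print(f"❌ Regression of [{delta}] for evaluator {evaluator}.")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    # Scores taken from the example table above.
    latest = {"evals_demo/reasoning": 0.05}
    previous = {"evals_demo/reasoning": 0.3}
    sys.exit(check(latest, previous))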