LLM Observability

Humanloop allows you to monitor your LLMs. This extends beyond simple logging: it also lets you track and police the high-level behavior of your LLMs in production.

At the core of Humanloop’s monitoring system are evaluators - functions you define that analyze LLM-generated logs and produce evaluations. These evaluations can be boolean flags or numerical scores, providing insights into how well your model is performing based on criteria specific to your use case.

Evaluators in the monitoring context act as continuous checks on your deployed models, helping you maintain quality, detect anomalies, and ensure your LLMs are behaving as expected in the production environment.
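To make this concrete, a code-based evaluator (see Types below) is simply a function that receives a log and returns a boolean flag or a numerical score. Here is a minimal sketch, assuming each log is passed in as a dictionary with an output field; the field name and the specific checks are illustrative:

```python
import re


def no_email_leak(log: dict) -> bool:
    """Boolean flag: passes if the output does not appear to contain an email address."""
    output = log.get("output") or ""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None


def conciseness_score(log: dict) -> float:
    """Numerical score: 1.0 for short outputs, tapering towards 0.0 for very long ones."""
    output = log.get("output") or ""
    return max(0.0, 1.0 - len(output) / 2000)
```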

Types

Humanloop supports three types of evaluators for monitoring:

  • Code-based - Using our in-browser editor, define simple Python functions to act as evaluators (like the sketch above). These run automatically on your logs.
  • LLM as judge - Use LLMs to evaluate the outputs of other Prompts or Tools. Our editor lets you create prompts that pass log data to a model for assessment (see the prompt sketch after this list). This is ideal for qualitative criteria such as tone or factual accuracy. These also run automatically.
  • Human - Collect feedback through our feedback API. This allows you to incorporate human judgment or in-app actions into your monitoring process.
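To illustrate the LLM-as-judge case, the evaluator is essentially a prompt that interpolates fields from the log being assessed and asks a model for a verdict. A minimal sketch; the template variables are illustrative placeholders for whichever log fields you choose to pass in:

```python
# Illustrative LLM-as-judge prompt. The {{...}} template variables are
# placeholders for fields interpolated from the log being evaluated.
JUDGE_PROMPT = """You are grading the factual accuracy of an assistant's answer.

Question:
{{log.inputs.question}}

Answer:
{{log.output}}

Reply with a single integer from 1 (completely inaccurate) to 5 (fully accurate),
and nothing else."""
```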

Both code-based and LLM-based evaluators run automatically on your logs, while human evaluators provide a way to incorporate manual feedback when needed.
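For the human case, feedback is typically captured in your application and attached to the relevant log via the feedback API. The sketch below is hypothetical: the endpoint path, header, and payload fields are placeholders, and the exact names should be taken from the Humanloop API reference or SDK for your version.

```python
# Hypothetical sketch of attaching human feedback to a log over HTTP.
# The endpoint path and payload fields are placeholders; consult the
# Humanloop API reference (or SDK) for the exact names in your version.
import os

import requests

API_KEY = os.environ["HUMANLOOP_API_KEY"]
LOG_ID = "log_..."  # id of the log the user is reacting to

response = requests.post(
    "https://api.humanloop.com/v5/feedback",  # placeholder path
    headers={"X-API-KEY": API_KEY},
    json={
        "log_id": LOG_ID,
        "type": "rating",  # e.g. a thumbs-up / thumbs-down control in your app
        "value": "good",
    },
)
response.raise_for_status()
```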

Monitoring vs Evaluation

While monitoring and evaluation are closely related, they serve different purposes in the lifecycle of your LLM-powered applications:

  • Monitoring is the continuous assessment of your deployed models in production environments. It involves real-time analysis of logs generated by your live system, providing immediate insights into performance and behavior.

  • Evaluation, on the other hand, typically refers to offline testing and assessment during the development phase or for periodic performance checks.

Humanloop’s monitoring capabilities allow you to set up evaluators that automatically run on logs from your production environment, giving you real-time insights into your model’s performance.

For detailed information on offline evaluation and testing during development, please refer to our Evaluation guide.