Integrating Humanloop

How to integrate Humanloop into your development workflow, while facilitating collaboration with non-technical team members.

Humanloop is the central registry for your AI features. It versions the key artifacts (Prompts, Tools, Flows, and Datasets) while capturing logs, enabling you to run evals and monitor performance so you can build reliable AI applications.

Humanloop is flexible in how you integrate it into your applications. You can:

  • Store the artifacts in your codebase or within Humanloop
  • Execute AI logic in code or on Humanloop
  • Run evals in your runtime or on Humanloop

This facilitates collaboration on AI features with non-technical subject matter experts while maintaining your software development practices.

Choosing an integration approach

Your integration approach will depend on the complexity of your project, the structure of your team, and your deployment requirements.

UI vs Code Prompt Management

You should choose one of two approaches to managing prompts and AI artifacts: “UI first” or “code first”. You’ll likely want the UI-first approach if your domain experts are involved in prompt development.

UI first

You create and iterate on your Prompts, Tools, Flows and Datasets in the Humanloop UI. Prompts can be synced back to your codebase or called directly via the API.

This is ideal for situations where you have subject matter experts or project managers who are primarily responsible for the “how it works” of your AI features.

Benefits:

  • ✅ Faster iteration which can be decoupled from code releases
  • ✅ Easier collaboration with subject matter experts

Drawbacks:

  • ❌ Does not support complex prompt chains or flows

Code first

You define and maintain prompts, tools and agents in code and then log to Humanloop.

Benefits:

  • ✅ Full flexibility of code for developing your prompts or complex agents
  • ✅ Maintains a point-in-time snapshot of both code and prompts
  • ✅ Maintains your existing CI/CD practices

Drawbacks:

  • ❌ Harder for domain experts to collaborate
  • ❌ Less convenient for iterating on prompts and running evaluations
  • ❌ Tightly coupled to code releases, which means slower iteration

Proxy vs Direct Calls

You have two options for calling your Prompts: “Proxy via Humanloop” or “Direct Calls + Logging”. You’ll likely want the proxy approach if you want to centralize your logging and versioning, and the direct-call approach if you want to minimize latency.

Proxy via Humanloop

Your calls to large language models go via Humanloop, which forwards requests to model providers and logs the results automatically.

Benefits:

  • ✅ Simplifies integration
  • ✅ Centralized logging and versioning
  • ✅ Easier hot-swapping of Prompts

Drawbacks:

  • ❌ Adds a slight latency overhead
  • ❌ Adds Humanloop as a dependency to your critical path

Direct Calls + Logging

Your app calls the model provider directly, then logs results to Humanloop.

Benefits:

  • ✅ No extra latency
  • ✅ Can call custom providers and services

Drawbacks:

  • ❌ Requires additional logging logic

The rest of this guide gives more details on the different approaches.

Prompt Management

Prompt management is the process of creating, updating, and versioning your Prompts, Tools, Flows and Datasets. AI artifacts should be version-controlled just like application code. However, unlike application code, the people who best understand how they should work are often domain experts, not developers. Humanloop provides a UI-based environment for domain expert involvement, while still letting you sync the artifacts back to your codebase and version control.

Versioning

Humanloop automatically tracks all versions of your Prompts, Tools, Flows, Agents and Datasets, whether created in the UI or in code. Each version is uniquely identified by hashing its parameters; for a Prompt, this hash covers the template, the temperature and other parameters. Every time a Prompt is called or an LLM response is logged, the resulting Log is associated with a specific version.

This versioning allows you to:

  • Evaluation: compare the performance of different versions
  • Auditability: review changes over time
  • Deployment control: manage rollbacks and staged releases
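To make the version hashing idea concrete, here is a purely illustrative sketch (this is not Humanloop's actual hashing scheme): two Prompts with identical parameters resolve to the same version, while any change produces a new one.

```python
import hashlib
import json

def version_hash(prompt_params: dict) -> str:
    """Illustrative only: a deterministic hash over a Prompt's parameters."""
    canonical = json.dumps(prompt_params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

v1 = version_hash({"model": "gpt-4o", "temperature": 0.7, "template": "Answer: {{question}}"})
v2 = version_hash({"model": "gpt-4o", "temperature": 0.2, "template": "Answer: {{question}}"})
assert v1 != v2  # changing the temperature produces a new version identity
```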

Create and manage your Prompts in code or in the UI

Humanloop supports both UI and code-based prompt management, giving you flexibility for your team’s needs. You can always mix both approaches — starting in the UI and syncing to code, or vice versa.

Use code-first management if:

  • You are creating a complex agent or workflow

    If an LLM call is embedded within a structured workflow with dependencies on the schemas before and after the step, managing it in code is usually more practical.

  • You are creating your Prompts dynamically

    For example, if you are dynamically setting the template or available tools based on app state or user inputs, you’ll find it easier to do this in code.

Note that with template variables ({{ variable }}) and Jinja syntax, you should pass the dynamic state as inputs to the template, as in the sketch below.
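A minimal sketch with the Python SDK, assuming a prompts.call method that accepts a path, inputs and messages (the Prompt path and input names here are hypothetical; check the SDK reference for the exact parameters):

```python
from humanloop import Humanloop

hl = Humanloop(api_key="YOUR_API_KEY")  # placeholder key

# The template itself stays static; dynamic app state is passed via `inputs`
# and interpolated into the {{ variable }} placeholders at call time.
response = hl.prompts.call(
    path="support/answer-question",              # hypothetical Prompt path
    inputs={"product_name": "Acme CRM"},         # dynamic state as template inputs
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```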

Use the UI for AI development if:

  • Your domain experts are responsible for prompt iteration.

    The UI enables non-technical team members to contribute effectively.

  • You want an interactive environment for prompt engineering

    Your domain experts can load in examples, replay interesting datapoints, and run evaluations with ease.

  • You’re iterating on existing Prompts

    Even if the Prompt was initially developed in code, the UI is a more convenient place to iterate on it.

Syncing Humanloop to your codebase

If Humanloop is the source of truth, you can still sync the latest versions of your Prompts back to your codebase.

Each file can be serialised and stored in version control. The .prompt format is a human-readable representation of your Prompts that can be stored in git, and all other files can be serialised as JSON.

You can use the List Files API to fetch the latest versions of your files and store them in your codebase.
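A rough sketch of pulling the latest versions into version control, assuming a GET /files endpoint at the v5 API base URL and a paginated records response (check the List Files API reference for the exact path, parameters and response shape):

```python
import json
import pathlib

import requests

API_KEY = "YOUR_API_KEY"                    # placeholder
BASE_URL = "https://api.humanloop.com/v5"   # assumed API base URL

resp = requests.get(
    f"{BASE_URL}/files",
    headers={"X-API-KEY": API_KEY},
    params={"page": 1, "size": 100},
)
resp.raise_for_status()

out_dir = pathlib.Path("humanloop")
out_dir.mkdir(exist_ok=True)

# Serialise each file's latest version so it can be committed to git.
for file in resp.json().get("records", []):
    target = out_dir / (file["path"].replace("/", "__") + ".json")
    target.write_text(json.dumps(file, indent=2))
```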

Decoupling AI updates from your software lifecycle

For many applications you want to update your AI behavior faster than your traditional software release process. Humanloop allows you to iterate on prompts separately while maintaining control over which versions are used in production.

Environment labels

Humanloop supports environment labels (e.g., "development", "staging", "production") that allow you to deploy specific prompt versions without modifying code.

How it works:

  1. When calling or fetching your Prompts in code, specify an environment label (e.g. production) instead of a fixed version ID (see the sketch below).
  2. In the UI (or via API), promote a new version to the desired environment.
  3. Your application now uses the updated Prompt without any code changes ✨
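For example, a minimal sketch with the Python SDK (the Prompt path is hypothetical and the parameter names are assumptions; the key point is referencing an environment label rather than a pinned version ID):

```python
from humanloop import Humanloop

hl = Humanloop(api_key="YOUR_API_KEY")  # placeholder key

# Reference the environment label, not a specific version ID.
# Promoting a new version to "production" in the UI changes what this call
# resolves to, with no code change or redeploy.
response = hl.prompts.call(
    path="support/answer-question",     # hypothetical Prompt path
    environment="production",
    messages=[{"role": "user", "content": "Where is my order?"}],
)
```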

Calling your Prompts, Agents and Flows

You have two options for calling LLMs with your prompts in production:

1. Use the Humanloop Proxy for simplicity

Proxying your LLM calls via Humanloop simplifies logging, versioning, and provider management. Humanloop provides a unified API that forwards requests to the LLM provider and automatically logs the results.

How it works:

  1. You call the Humanloop API (e.g. Prompts.call()) instead of each model provider.
  2. Humanloop fetches or creates the Prompt, calls the provider, then logs and returns the result (see the sketch below).
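A minimal sketch of the proxied flow, assuming prompts.call accepts an inline prompt definition alongside inputs (the path and field names are assumptions; see the SDK reference for the exact shape):

```python
from humanloop import Humanloop

hl = Humanloop(api_key="YOUR_API_KEY")  # placeholder key

# One call: Humanloop resolves (or creates) the Prompt version, forwards the
# request to the model provider, logs the interaction and returns the result.
response = hl.prompts.call(
    path="support/summarise-ticket",    # hypothetical Prompt path
    prompt={                            # inline version definition (assumed fields)
        "model": "gpt-4o",
        "template": [{"role": "system", "content": "Summarise the ticket: {{ticket}}"}],
    },
    inputs={"ticket": "Customer cannot log in after a password reset..."},
)
```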

✅ Easier integration – fetches latest version, calls provider and logs result in one call
✅ Unified interface – easy swapping between model providers
✅ Automatic versioning – ensures all interactions are versioned
⚠️ Adds slight latency – due to the extra network hop
⚠️ Adds to the critical path – for uptime-critical applications you may want an enterprise SLA

Using the Humanloop proxy can accelerate development since you can swap out models or adjust prompt templates on the platform side without changing code.

2. Direct call and logging

With this approach, you call the model provider directly and log the results to Humanloop yourself. This minimizes latency and supports custom models.

How it works:

  1. You fetch the latest version of the Prompt from Humanloop (or have it in code).
  2. You send the requests to OpenAI, Anthropic, etc.
  3. You log the responses separately to Humanloop.

✅ No extra network hop – ideal for latency-sensitive use cases
✅ Supports proprietary/custom models
⚠️ Requires manual logging – you must track metadata yourself
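A minimal sketch of this pattern, assuming the SDK exposes a prompts.log method for recording the provider call (the Prompt path and log fields are assumptions; see the logging reference for exact parameters):

```python
from humanloop import Humanloop
from openai import OpenAI

hl = Humanloop(api_key="YOUR_HUMANLOOP_API_KEY")       # placeholder keys
openai_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

messages = [{"role": "user", "content": "Summarise this support ticket: ..."}]

# 1. Call the provider directly: no extra network hop through Humanloop.
completion = openai_client.chat.completions.create(model="gpt-4o", messages=messages)
output = completion.choices[0].message.content

# 2. Log the request and response back to Humanloop against a Prompt path,
#    so the Log is versioned and available for evals and monitoring.
hl.prompts.log(
    path="support/summarise-ticket",                   # hypothetical Prompt path
    prompt={"model": "gpt-4o", "template": messages},  # assumed version fields
    messages=messages,
    output=output,
)
```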

Choosing the Right Approach

For most teams, proxying via Humanloop is the easiest choice. If performance is critical, use direct calls but ensure proper logging.

Use Case                             | Best Approach
Fast iteration & centralized logging | Proxy via Humanloop
Low-latency, mission-critical apps   | Direct calls + logging
Using a custom/proprietary model     | Direct calls + logging

Running Evals

Given the non-deterministic nature of LLMs and the challenging, subjective tasks they’re applied to, evaluation is integral to the development of AI features.

Evaluations are where Evaluators are run on the Logs of your Prompts, Flows and Agents. The Logs can be production data, or created by running a Dataset through the application you’re testing.

Evaluators in Humanloop runtime versus your runtime

Evaluators are used to evaluate the performance of your Prompts, Flows and Agents. They can be defined on Humanloop as a Python function, an LLM call, or as human feedback.

Your Evaluators and Datasets can be stored in code, or in Humanloop. Evaluations can be run in code or through the UI.

Evaluators which are defined within the Humanloop UI can be executed in the Humanloop runtime, whereas Evaluators defined in your code can be executed in your runtime and the results posted back to Humanloop. This provides flexibility for supporting more complex evaluation workflows.
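For illustration, a minimal sketch of a code-based Evaluator that runs in your own runtime; the log fields and return type here are assumptions, and the resulting judgment would then be posted back to Humanloop via the SDK or API:

```python
def concise_answer_evaluator(log: dict) -> bool:
    """Code-based Evaluator: passes if the generated answer is under 100 words.

    Runs in your own runtime against a Log produced by your Prompt or Flow;
    the judgment is posted back to Humanloop so it appears alongside the Log.
    """
    output = log.get("output") or ""
    return len(output.split()) < 100

# Example usage against a locally held Log payload (hypothetical fields).
print(concise_answer_evaluator({"output": "You can reset your password from the settings page."}))
```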

Our recommendation is to store your Evaluators and Datasets on Humanloop where possible so that you can benefit from:

  • ✅ Parallel execution
  • ✅ Automatic logging of results
  • ✅ Versioning of evaluators
  • ✅ Integration with your existing CI/CD
  • ✅ Easy triggering of evals from the UI or from your CI/CD
  • ✅ Re-use of your evaluators as online monitoring evaluators

What’s Next?

You should now have an understanding of Humanloop’s flexible integration patterns, which let subject matter experts and developers collaborate on AI features.