LLM Evals Done Right - Lessons from Bryan Bischof of Hex AI

Raza Habib

When it comes to evaluating AI products built with large language models (LLMs), most teams know it's important, but many approach it the wrong way. They rely too heavily on eyeballing a few examples, fixate on building one "god metric" that tries to capture everything, or gate releases on the wrong criteria. As a result, they miss critical failure modes or simply fail to get performance to the level needed for a compelling product.

Bryan Bischof, who leads AI at Hex, believes there's a better way. Hex has successfully deployed AI agents that help data scientists automate analyses by generating SQL queries, transforming data, and creating visualizations based on natural language questions. A key part of their success has been their approach to evaluation that breaks the problem down into granular, user-centric components.

I recently sat down with Bischof to understand how Hex approaches evals and what lessons other teams can take from their experience.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts.

Before diving into the details of their evaluation strategy, it's important to understand the context in which they're operating. Hex is building complex AI agents that are not just spitting out one-off responses to user queries. They're engaging in multi-step workflows, piecing together SQL queries, data transformations, and visualizations to deliver a complete analysis. This introduces challenges beyond those of a typical question-answering system.

To get these complex agents to work reliably, the Hex team had to make careful choices about system design:

1. Carefully choose the right set of "tools" to expose to the agent

AI agents are LLMs augmented with the ability to call APIs or “tools”. There's a tradeoff between the generality of each tool and the total number of tools: make tools too general and the agent becomes unreliable; make them too narrow and the number of tools you need explodes. And if there are too many tools to choose from, it becomes nearly impossible for the LLM to pick the right one. Hex solved this by mapping the tools directly onto the workflows data scientists were already doing in their product (see the sketch after the quote below).

"All of the agent capabilities that we have built out, they have started as individual capabilities that the user can do. Before we build a sub agent who's able to generate the relevant SQL on the way to answering your question, we first build a capability that generates appropriate SQL. We see how users use it."

2. Make it easy for users to correct AI agents whilst they are acting

AI agents spawn multiple subprocesses that are eventually brought together, and later steps may depend on the results of earlier ones. When Hex's agent comes up with a multi-step plan, Hex turns that plan into a directed acyclic graph (DAG) to track dependencies between steps. Crucially, this DAG is reactive: if a human spots an issue and corrects an upstream step, the change automatically propagates to the downstream steps (a minimal sketch follows the quote below).

"If you want to make a change and you go upstream in your agent DAG and you make a change to this agent's response, that should percolate through. And so we also have reactive agent flows. So you can make a change and it percolates all the way through. It re-prompts and it rebuilds."

3. Keep the human in the loop

Rather than fire-and-forget agents that go off and do complex tasks unsupervised, Hex focuses on keeping the human closely involved. Each step of the agent's plan corresponds to a cell in the Hex data science notebook that the user can see and edit. This makes it easy to jump in and course-correct if needed.

With this foundation in place, the Hex team then built a robust evaluation framework to ensure their agents were delivering reliable, high-quality results. Here's where they diverge from the typical approach:

4. Build a suite of granular evaluators, not one overall metric

Many teams try to oversimplify evals into a single number. Bischof argues this is misguided. No one metric can capture all the nuances of what makes an AI agent's output "good". Instead, Hex has many binary evaluators that each assess specific aspects of the output, providing a holistic picture of quality without over-simplifying.
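
For concreteness, here are some hypothetical granular evaluators for a generated SQL answer; they are not Hex's actual checks, just an illustration of what a suite of narrow, binary judgments can look like.

```python
import re

def sql_parses(sql: str) -> bool:
    """Does the output at least look like a SELECT statement?"""
    return bool(re.match(r"\s*select\b", sql, re.IGNORECASE))

def references_expected_table(sql: str, table: str) -> bool:
    """Did the query use the table the question was actually about?"""
    return table.lower() in sql.lower()

def no_select_star(sql: str) -> bool:
    """Avoid SELECT * so downstream cells see a predictable schema."""
    return "select *" not in sql.lower()

def bounded_preview(sql: str) -> bool:
    """Preview queries should include a LIMIT."""
    return "limit" in sql.lower()

def evaluate(sql: str, table: str) -> dict[str, bool]:
    """Run every check and keep the results separate rather than averaging them."""
    return {
        "sql_parses": sql_parses(sql),
        "references_expected_table": references_expected_table(sql, table),
        "no_select_star": no_select_star(sql),
        "bounded_preview": bounded_preview(sql),
    }

print(evaluate("SELECT region, SUM(revenue) FROM sales GROUP BY region LIMIT 100", "sales"))
# {'sql_parses': True, 'references_expected_table': True, 'no_select_star': True, 'bounded_preview': True}
```

Each check is cheap and unambiguous, and a failure points directly at what to fix, which a single blended score cannot do.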

5. Evaluators should mirror the ideal user experience

Too often, teams choose eval metrics that are convenient to measure but don't actually align with what users care about. Hex takes the opposite approach. They work backwards from the ideal user experience to define a set of evaluators that codify what really matters. This ensures they're gating releases on metrics that translate to real user value, not just gaming a score.
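
Hex hasn't described its exact release process, but a minimal sketch of gating on user-derived criteria, reusing evaluator results like those in the sketch above, could give each evaluator its own threshold based on how visible that failure is to users, instead of requiring one blended number to go up.

```python
from statistics import mean

# Hypothetical per-evaluator thresholds, set by how much users care about each failure.
THRESHOLDS = {
    "sql_parses": 0.99,                  # a broken query is immediately visible
    "references_expected_table": 0.95,   # the wrong table means a wrong answer
    "no_select_star": 0.90,              # annoying but recoverable, so a softer bar
}

def release_gate(results_per_example: list[dict[str, bool]]) -> dict[str, bool]:
    """Return, per evaluator, whether its pass rate clears its threshold."""
    return {
        name: mean(ex[name] for ex in results_per_example) >= minimum
        for name, minimum in THRESHOLDS.items()
    }

batch = [
    {"sql_parses": True, "references_expected_table": True, "no_select_star": False},
    {"sql_parses": True, "references_expected_table": False, "no_select_star": True},
]
print(release_gate(batch))
# {'sql_parses': True, 'references_expected_table': False, 'no_select_star': False}
```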

6. Immerse yourself in the data

Many teams look at their eval results in aggregate but fail to dig into the specifics of where the model is failing and why. Bischof advises teams to constantly scrutinize the inputs and outputs flowing through the system to uncover opportunities for improvement, and to schedule dedicated time to analyze the data as a team every week (a small sketch of preparing such a session follows the quote below). The answers are in the data.

"When I talk to AI engineers who come from data science, they're like, I look at the data so much. I'm looking at the inputs and looking at the outputs. I'm trying to figure out like what's going on all the time. And I think that is the sort of alpha that people with a lot of experience in data science and machine learning come at this problem with is: you will answer most of the hard product questions by looking at the data."

Hex's journey demonstrates that with careful system design and a user-centric approach to evaluation, it is possible to build reliable AI agents on top of LLMs today. By choosing the right abstractions, keeping humans in the loop, and ruthlessly evaluating quality against metrics that matter, teams can tame the complexity of these models and ship products with confidence.

Those that stick with the old paradigm of chasing a single number or relying on gut feel are leaving tremendous value on the table. It's time to rethink how we evaluate AI products in the age of LLMs. Hex is showing us the way.

If you're looking for tools to help implement this approach, Humanloop can help. Our platform makes it easy to build granular evaluators and dig into the data to uncover insights. We've helped companies like Filevine, Vanta, and Duolingo ship high-quality AI products with confidence. If you'd like to learn more, we'd love to chat. Book a demo today at humanloop.com.

Chapters

[00:00:00] Introduction

Raza Habib: I'm delighted to be joined today by Bryan Bischof, who has an extraordinary CV when it comes to AI, starting from a hardcore maths PhD coming as a physicist, and then data science roles at some of the most interesting companies out there: Stitch Fix, Blue Bottle. And now, the reason he's on the show today is he leads AI at Hex, where he's been bringing LLMs and AI to data science notebooks and to the workflows of data analysts and data scientists. So Bryan, it's a pleasure to have you on the show.

Bryan Bischof: Thanks a lot. I'm really excited to be here and yeah, excited to chat.

[00:01:20] The challenges of evaluating AI agents

Raza Habib: Thanks very much. So I'm going to dive straight in, because my first question to you is related to Hex's AI product. Most companies have failed to get AI agents into production, and you're one of the only ones that I know of who have succeeded here. So to start with, what have you guys figured out that other people got wrong?

Bryan Bischof: So I think one thing I would say is, like, we also failed. We've just succeeded and failed at getting agents into production. And maybe that was the first thing that was kind of noteworthy: our very first attempt at agents in prod didn't go super well. We were getting a lot of unexpected behavior and, not quite the death loops that you hear about with some people's agent applications, but we were getting too high of entropy, as I sometimes talk about.

[00:03:45] How Hex's AI agents work

Bryan Bischof: Let me set the stage, actually. So you ask a question, and that's a question about data science, and we want to dispatch a collection of agents to help with different stages of that work. One thing that is quite common at this point (there's even a recent Claude blog post discussing this) is having one agent make a plan, having that plan consist of individual steps, and then having agents go and work on those individual steps, somewhat in parallel, somewhat in sequence, but regardless. That's the agent planning paradigm at this point, and we too started with that paradigm. The challenge is that we allowed for quite high diversity in the plan, and right off the bat it seemed like it was going really well. We were getting exciting results.

[00:06:30] The importance of choosing the right tools for agents

Bryan Bischof: One of the ways that we constrained that, and started getting things a little bit more under control, and what I believe led us to success in getting agents into prod, was this precise part here: we were more prescriptive about the types of plans that could be generated. That doesn't mean that we tightened down how much planning could be made up by the agent, but what we did tighten down is all of the specific types of steps that could be executed in the plan.

[00:10:00] Building a reactive DAG for agent plans

Bryan Bischof: Our agent results, we tend to have associated with explicit Hex functionality. So a lot of these agents are responsible for one part of the magic analysis, as we call it. The magic analysis is a chain of cells, and each of those individual cells is an agent. They're operating both independently and together. And this raises another technical challenge that we had to work through, and this is more on the engineering side: how do you infer the state that other agents are going to need access to, to be able to do their work? Some of the prompts in downstream agent requests require information from the original prompt or from an agent response upstream. So we actually construct a DAG under the hood of what we think the plan implies, and which steps in that DAG require upstream response values. And then we also do things like reactivity. And this is the other thing that was pretty interesting: if things don't go well, if you want to make a change and you go upstream in your agent DAG and you make a change to this agent's response, that should percolate through. And so we also have reactive agent flows. So you can make a change and it percolates all the way through. It re-prompts and it rebuilds.

[00:12:15] Keeping humans in the loop

Bryan Bischof: I am not super bullish on the Devin style: click go, come back in two days, and it's all done. I'm more bullish on the user interacting with the agent pipeline. The agents are only there to do more things simultaneously and reactively based on one another, but they're all still fundamentally coming from the interactivity paradigm. The user is right there to observe and to reflect. People love to talk about agent reflection. They love to talk about, oh, you have an agent who's supervising, just reflecting on the other agents and constantly doing this feedback loop. Well, if you keep those loops tighter and you make it really, really easy for the user to get in there and make the adjustments necessary to get things back on the rails, you have less of these, like, death loops, and you have a lot more, frankly, interaction from the user. That's great. We've put a lot of effort into making our UX feel very interactive and feel like editing and iterating is super organic.

[00:15:30] Why you need granular evaluators, not one metric

Bryan Bischof: I think every data scientist or ML practitioner feels like evals are a little bit of a new label for an old idea, and I think there can be a little bit of frustration from ML people around evals, because we've been writing evals and building datasets to offline-evaluate our models forever.

Raza Habib: Yeah, although I guess one thing I feel is different now compared to before, and tell me if you disagree, is that at least before we were generally doing tasks where it was easier to say there was a right answer, or the generality was sufficiently confined that I could calculate an F1 or a precision, or, okay, maybe in some cases I had to do a ROUGE or a BLEU score or something, but rarely did I really feel like the only way to score this was with a subjective judgment. Post-LLMs, I feel like that much more often.

Bryan Bischof: Well, I worked in fashion recommendations, and in fashion recommendations, I can promise you the one thing you don't know is: is this a good recommendation for a person? There is no easy way to sit down and say, like, yeah, this is clearly a good recommendation. I mean, personally, I'll have people send me a piece of clothing and say, isn't this cool? And I'm like, no, actually I don't like that. And they're like, oh, I'm so surprised you don't like that, I thought I really understood your style. And you're like, nobody understands my style. And the reality is, it's hard. It's really hard. How do you build evals for fashion recommenders? How do you build evals for coffee recommenders? What about YouTube recommenders? And so I think my rejoinder to people that say, no, it's fundamentally harder, is: yeah, it was always hard. You oversimplified, or you worked downstream of the hard part. And so that's my general feeling on the topic.

[00:20:00] Aligning evaluators with user experience

Bryan Bischof: I think my more specific feeling on the topic is that this isn't necessarily a bad thing. You always have to bite off what you can chew. And so you always have to say, okay, you know what? I can't capture Bryan's style perfectly, but what I can do is understand if I'm in the right neighborhood. Did he click on this item? Maybe he didn't buy it, but did he click it? I got some attention. Did he buy something that's visually similar? Did he buy something that has this color pattern? What about this silhouette? This is an oversized shirt; if he owns this shirt, maybe he's open to oversized. So we've always been making hedges where we say we will understand if things are good or bad based on analogy, and that is A-okay. I claim that evals these days are the same, and they should be approached very similarly. I'm really concerned when I hear people trying to think too holistically about the output, instead of trying to break off little things and assess the outcome.

[00:24:11] The power of immersing yourself in the data

Bryan Bischof: When I talk to AI engineers who come from data science, they're like, I look at the data so much. I'm looking at the inputs and looking at the outputs. I'm trying to figure out like what's going on all the time. And I think that is the sort of alpha that people with a lot of experience in data science and machine learning come at this problem with is: you will answer most of the hard product questions by looking at the data. And the data is going to be generated as you go. It is going to be generated by the users, and it's gonna be generated by the agent. And you will generate so much alpha for your own product by staring at that data.

[00:28:00] Lessons for other teams building with LLMs

Bryan Bischof: I have a friend, Eugene, who spends hours every week just looking at the data of the application that they've built. My team, we now have a weekly recurring evals party, an evals slash looking-at-the-data party. This was inspired by talking to one of the AI managers at Notion. Like, this is the way: you have to immerse yourself and you have to just get absolutely filthy in that data. That's the alpha, in my opinion, from people like me.

[00:32:00] How Humanloop can help

Raza Habib: Fantastic. Yeah, that really resonates. I mean, it was literally the first thing that we built at Humanloop, logging and observability, right? Because, until you can... a lot of people dump the logs of their inputs and outputs somewhere with the view that at some point we'll make them useful, but they don't serve it up in a way that actually allows them to frequently incorporate that into their day-to-day work.

Raza Habib: Cool. Well, Bryan, it's been an absolute pleasure. There are a ton of gold nuggets in there for people who are trying to think about building AI products to take away. I'm sure we could have filled a second hour, but I really enjoyed it. So thanks very much.

Bryan Bischof: Thanks a lot. Thanks for having me.

About the author

Raza Habib
Cofounder and CEO
Twitter: @RazRazcle
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.