Why Your AI Product Needs Evals with Hamel Husain

Raza Habib

Hamel Husain is a seasoned AI consultant and engineer with experience at companies like GitHub, DataRobot, and Airbnb. He is a trailblazer in AI development, known for his innovative work in literate programming and AI-assisted development tools. Shawn Wang (aka Swyx) is the host of the Latent Space podcast, the author of the essay "Rise of the AI Engineer", and the founder of the AI Engineer World's Fair. In this episode, Hamel and Swyx share their insights on building effective AI products, the critical importance of evaluations, and their vision for the future of AI engineering.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts

Chapters

00:00 - Introduction and recent AI advancements
06:14 - The critical role of evals in AI product development
15:33 - Common pitfalls in AI product development
26:33 - Literate programming: A new paradigm for AI development
39:58 - Answer AI and innovative approaches to software development
51:56 - Integrating AI with literate programming environments
58:47 - The importance of understanding AI prompts
01:00:37 - Assessing the current state of AI adoption
01:07:10 - Challenges in evaluating AI models

Podcast:

Raza Habib (00:00:55): I'm Raza Habib, so delighted to be joined today by Hamel Husain, who has recently been working as an AI consultant to a lot of companies building with LLMs. He has a long history as a machine learning engineer at GitHub, DataRobot, Airbnb, and elsewhere, and he's also working at Answer AI, a kind of AI lab that I'm hoping he'll tell us about. So I'm just excited to dig in and learn from your experience.

Hamel Husain (00:01:18): Awesome. Yeah, really glad to be here.

Raza Habib (00:01:20): So I thought one place we could start, before we dive into the details of your consulting experience and elsewhere, is that just this week OpenAI released a new model, o1. It seems to be making a lot of waves and it's now sitting at the top of the LMSYS leaderboard, so I just wanted to get both of your initial takes on it. What do you think this means for people building with models? Is it going to change what people should be doing? Have you had a chance to play with it yet?

Hamel Husain (00:01:45): I've had a little bit of chance to play with it. I haven't thought about it too much.

Swyx (00:01:49): Yeah, maybe I'm just desensitized at this point. My snarky comment is: how are you going to be an AI influencer if you don't say this changes everything? Every single model release has to be the most groundbreaking thing ever.

Raza Habib (00:02:03): But I actually think this is the first interesting one in a while. It changes everything. I don't think it changed everything, but I think for two years we've all been sitting around being like, when's the next big thing? And I do think there is something interesting about a small model that hopefully is using test time compute rather than just scale to improve things. Now we have maybe a second vector of model improvement, but I haven't had time since it came out yet to properly sit down and play with it and run it against evals and benchmarks. And so I thought that Swyx might have done this and so I'm sort of curious to hear your first takes. Is it going to change everything?

Swyx (00:02:41): It changes some things for sure. Actually, I've been running this Twitter poll where basically the only thing I say is "o1 vibe check, week one", and I let people rate it game-changing, good, or meh and see the results. At last check, only 37% of the responses were actually positive and 40-something percent were "no opinion yet". So I think a lot of people are more measured than maybe the influencers, and I think that's pretty good. But then you also see the people who are at the top of the scale in terms of being skeptical but informed and who actually made an effort to try it, which is not everybody. That would be Terence Tao, George Hotz, and I forget the third person I listed; you could say their opinion cannot be bought, they're objective and they really just care about model performance, and all of them are impressed today. I'll put myself in that category as well. What I do is: first, I run the Latent Space Discord, where a lot of people share their personal impressions, and I get the general sense from them. Nathan Lambert, who is a previous guest of ours and is now working at AI2 as a research scientist, posted a really interesting evaluation where he posed that riddle about, I think, the bushel of wheat, the chicken, and the fox crossing the river.

Raza Habib (00:03:45): Yeah. So I think I've seen this. This is sort of a riddle that's commonly used as a gotcha for GPT-4 because it gets it wrong in a kind of very consistent way.

Swyx (00:03:53): It could be a gotcha, it could be not, it could just be a straight riddle and it might be memorizable, but the steps that were laid out were pretty convincing. So I thought that was pretty interesting.

Raza Habib (00:04:02): The one that for me was both exciting and a little disappointing was that Matt Clifford posted a couple of examples on Twitter of cryptic crossword clues that I think his grandmother had written, so they're definitely not in the training set, and they were hard cryptic crossword clues. For people who are not familiar with cryptic crosswords, it's a puzzle-type clue: it's not just finding a synonymous word, which you would expect to be easy for an LLM, but a complicated bit of wordplay. And I've enjoyed the fact for the last year that GPT-4 kind of sucks at cryptic crosswords, because I like to use it as a collaborative friend to do them with. If it were really good at them it would just solve them and there'd be no fun, but it was good enough that I'd get some benefit without it solving the whole thing for me.

(00:04:41) And o1 just smashed them. GPT-4 and the previous models couldn't do them, and o1 could. Did he run mini versus preview? I didn't run this eval myself, he ran it, so you'd have to ask him. But that, for me, was at least something that made me sit up and want to pay attention, that plus the fact that this is a new axis along which things might improve. The reason I thought it might be an interesting place to start our conversation today, though, is that I think it gets at something I know is a topic dear to your heart, Hamel, which is how hard it is to actually evaluate these systems. As the models get better and better, finding test cases that let you differentiate between them and test base capability becomes harder, and that's just testing at the level of the models themselves.

Hamel Husain (00:05:27): Yeah. It's because when you're building an AI product, after you have your MVP and you get started, you can quickly get to a place where you don't know if you're making progress or not. You can vibe check it and see, okay, it's failing here, it's failing there, but it's really hard to iterate past a certain point with just vibe checks. And I saw a lot of people really struggling with that. At first I was hesitant to even talk about it, because I thought, oh, people must know that you need to have evals or look at your data. I first started talking about it maybe on some podcasts, and then I decided, okay, might as well just write about it, because I'm repeating the same thing over and over again. But yeah, I think people struggle with that, and they struggle with how to get started: what do you measure, how do you write the evals, how do you develop a nose for looking at data?

(00:06:22) There's a certain instinct you can develop if you practice looking at data a lot and doing error analysis on your logs to find where things are going wrong, and refining that over time. There's a bit of a skill to it that's really helpful. I developed that skill in classic machine learning, pre-LLMs: if you're building any kind of machine learning system, it's absolutely critical. It's one of those skills, though, that's not really taught in an academic sense. There aren't many places where it's taught; it's definitely been written about in some applied machine learning books and things like that. There is one place where it is kind of taught, which is Kaggle, actually.

(00:07:09) Kaggle is not really teaching you about evals; you just lose if you don't do it. There's no way you can make any progress at all otherwise. And that's a microcosm of a specific type of machine learning problem and workflow where the data is given to you. But it's really interesting. I think that's where people struggle: okay, where do I develop this intuition, how do I begin? It's a little bit scary for people when you have a stochastic system. It's different from writing unit tests for code, where it's deterministic and you have an idea of exactly what you want. And with AI, I think it's counterintuitive to a lot of people that you should look at data, because they live in a regime where everything is automated. It's magic, and what do you mean I need to look at the data, like a crazy person? This is manual work. Can't something automate looking at my data for me? Why do I need to do this almost clerical-feeling job of opening a spreadsheet and looking at data or logs? That sounds like the wrong thing to do. People have that intuition, and so I think it just gets skipped for whatever reason, and people need to be told: actually, you should be doing this, and writing evaluations and things like that based upon what you find.

Raza Habib (00:08:29): To what extent is this just an upshot of the fact that LLMs have expanded who's able to build AI products? I feel like what you're saying would probably not be news to a machine learning engineer or someone who's been building ML models for a long time, but part of what's changed is who's building with AI. You ran a very viral LLM course earlier this year, and one of the things I was curious about is who's coming to you, who's in that audience, and who you're seeing build with LLMs now. Is it generalist software engineers? Mostly hobbyists? In large companies, who are the people you see taking on this work?

Hamel Husain (00:09:08): Yeah, it's mostly software engineers. It's not machine learning people, it's mostly software engineers, full stack engineers, not machine learning engineers, which is great actually. That's why the whole field has taken off because it's really opened up the persona and it's opened up the market to who can build AI stuff.

Raza Habib (00:09:29): Do you buy Swyx's thesis of the AI engineer as a new role emerging for this or how do you imagine this playing out in terms of does this become the skillset that every software engineer needs to have? Is there a new type of engineer that's going to emerge?

Hamel Husain (00:09:42): Yeah, I do buy into it. It's not even a question of whether I buy into it; it's already there. And it's not only that. It's actually really important for machine learning people to become versed in building applications too. So personas like me, who know nothing about the JavaScript ecosystem, and it's not even about language, forget language for a second. I think it's very important for even people like me to figure out how to develop good application development skills, otherwise we're leaving a lot on the table.

Swyx (00:10:23): I always want to push back on that: you can't learn everything, right?

Hamel Husain (00:10:27): True. Yes.

Swyx (00:10:28): So you've got to pick your battles and specialize. You have a very valuable skillset that maybe requires working at Airbnb and GitHub and what have you to really flex, and you should make use of that, and let the commoditized stuff be done by other software engineers who can spend six months going through freeCodeCamp and do that stuff.

Hamel Husain (00:10:51): That's a fair point. There's a lot of surface area, and the application development layer is very complex; it takes a lot of craftsmanship to be good at it. There's something to be said for specialization. But in the AI field we've been chasing the one-person unicorn for a really long time, and it's not just chasing unicorns for hype. The idea of being truly full stack, or at least trying to be, is a goal worth aspiring to, because if you can do it, it's really valuable. And at least the idea is: hey, because of AI, maybe you can do it. Maybe you can span a bigger surface area than you were able to before.

Raza Habib (00:11:39): Something you said before also really resonated with me, which is that it's not a machine learning engineer or a traditional software engineer; it's neither of those two things, because it bridges those gaps in interesting ways. We see this at Humanloop, where people need to come in and build evals. Looking at the data, building an evaluation, and running it on a test set has a lot of the flavor of traditional machine learning engineering work, but the things being evaluated are very different. Whereas a traditional ML engineer is looking at the accuracy of the model or at a training loss curve or something like that, these people care about end-product metrics. They're looking at: did the user click on the thing, was the tone of voice correct, did the model hallucinate? It's much more product-focused. So it's bridging gaps between people who are thinking about product and people who are thinking about models in a way that neither role did before.

Hamel Husain (00:12:32): Yeah, I think so. And there's some intuition around it. People have said, hey, we expect there to be unicorns that are just one or two people because of AI. Working backwards, what does that mean? If you believe in that, or buy into that possibility, then spanning more of that space, or the whole space, is what you'll have to do.

Raza Habib (00:12:56): But do you buy into that possibility?

Hamel Husain (00:12:58): Yeah, I think so. At least I want to try really hard. And I might try in my own really strange, esoteric ways that don't make sense to most people, through Answer AI, which is an example of a really interesting research lab that is taking...

Raza Habib (00:13:16): Can you tell us about it? I think a lot of people listening won't have heard of it. What's the background, and then maybe dive into your piece of it.

Hamel Husain (00:13:23): So Answer AI is an industrial research lab, and that's a little bit of a vague term on purpose, I think. Essentially the goal is to build lots of AI products really fast and push the envelope. It's basically R&D, but actually doing lots of the D: innovating not just on the AI itself, but on the entire software development stack, with the goal of companies that can be run by one or two people, and maybe many of them.

Raza Habib (00:13:59): So what are some of the early products that you guys are working on?

Hamel Husain (00:14:01): So I can talk about some aspects of it. If you look at the AI engineer spectrum that Swyx has articulated, there's the machine learning engineer on one end and the front-end or full-stack application engineer on the other. Machine learning engineers are stuck on the left-hand side of that spectrum, and we're like, well, we want to do application development too. We want to build products and create companies, but that's a whole different surface area that takes a lot of craftsmanship and different languages. So one really bold approach is to unify that whole stack in one language, one set of tools, and even innovate on the whole development workflow from the ground up: really do things like literate programming, think about coding with AI a little bit differently and really dial it in, make it as efficient as possible, and then even the application development layer and the front-end frameworks; you've seen FastHTML. All of that is asking: can we create an end-to-end stack where a Python developer or machine learning engineer can go end to end? It's an interesting experiment. It sounds crazy, but that's why it's interesting.

Hamel Husain (00:15:32): The current state: it's early days, but there's definitely lots of progress. There are already lots of things being built on this substrate, this new development environment. FastHTML, for example, has been built with nbdev, the literate programming environment, and that in turn has been built with these new ways of infusing AI into the literate programming environment, which we can talk about, because that sounds vague.

Raza Habib (00:16:01): I definitely want to talk about that. Let's close the loop on this LLM evals discussion that we started, and then I'd love to dive into literate programming and what it might mean for the future of this stuff as well. We've established why we think LLM evals are important, and I think you have more practical experience here than most. What would be your nuts-and-bolts advice to someone trying to figure this out for the first time? What works, what doesn't? Can you talk us through that a little?

Hamel Husain (00:16:26): It's fairly straightforward to get started with evals. What you want to do is error analysis on your AI. So what is error analysis? It's looking at your data and starting to categorize the different kinds of errors you might be experiencing, and basically, when you encounter an error, thinking about a test you can write. Sometimes that's a code-based test, like an assertion; sometimes it's LLM as a judge. Now, when you get into LLM as a judge, there's a specific way you need to do it so that you can trust it, which is a deeper topic, but you should definitely start with code-based assertions. You can get really far with them: if you're looking at your data closely enough, I find that you'll find enough errors to write code-based assertions in most cases, though not all the time.
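
To make this concrete, here is a minimal sketch of the workflow Hamel is describing: register one code-based assertion per error category you find, run the whole set over logged LLM outputs, and report pass rates. The function names and the example check are illustrative assumptions, not from any particular product.

```python
from typing import Callable

# Each assertion takes a logged LLM output and returns True if it passes.
# New checks get added as error analysis surfaces new failure categories.
Check = Callable[[str], bool]

def output_is_nonempty(output: str) -> bool:
    """Illustrative check: the assistant should always say something."""
    return bool(output.strip())

CHECKS: list[Check] = [output_is_nonempty]

def run_assertions(logged_outputs: list[str]) -> dict[str, float]:
    """Return the pass rate of every registered check over a batch of logs."""
    return {
        check.__name__: sum(check(o) for o in logged_outputs) / len(logged_outputs)
        for check in CHECKS
    }
```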

Raza Habib (00:17:15): So step one: log the data, look at it by hand, start to classify different error cases, and before you do anything else, the simplest error cases can just be code-based assertions, the system should not do X. Okay, so let's say I've done that. I've graduated past my baby steps on evals. What should I do next?

Hamel Husain (00:17:32): Yeah. So after you categorize your errors, you should do some basic analytics on them. What part of my product has the most errors, and why? Develop some intuition around what kind of problem it is: is it a prompting problem? Is it a RAG optimization problem? Is it something else you're not sure about? Then see what the lowest-hanging fruit is and iterate on that. Iterating on it serves two purposes. One is to make progress, but the other is to test your eval system more generally, to be able to say, hey, I made a change, I was able to very quickly see the result of that and understand the impact on my system.
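
As a sketch of that basic analytics step, assume each logged trace has been annotated during error analysis with the feature it exercised and an error category (the field names and categories here are hypothetical). A simple tally is often enough to find the lowest-hanging fruit:

```python
from collections import Counter

# Hypothetical shape: one dict per annotated trace from error analysis.
traces = [
    {"feature": "listing_finder", "error": "hallucinated_address"},
    {"feature": "listing_finder", "error": None},
    {"feature": "contact_creator", "error": "missing_field"},
]

errors_by_feature = Counter(
    (t["feature"], t["error"]) for t in traces if t["error"] is not None
)

# Most frequent (feature, error) pairs first: these are where to iterate.
for (feature, error), count in errors_by_feature.most_common():
    print(f"{feature}: {error} -> {count}")
```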

Raza Habib (00:18:18): And would you be able to give some examples, maybe a specific product you helped work on, if you're able to talk about it, and the types of evals you created? There's a big spectrum of tests you could do, so a concrete example would help people.

Hamel Husain (00:18:31): So one example is a product I work on for a company called Rechat. Rechat is a real estate CRM product, and the name of their AI is Lucy, so I'll just refer to it as Lucy. If you start doing error analysis on Lucy, you might categorize it by different areas, because Lucy has different features: there's a listing finder, there's a contact creator, there's a social media campaign creator, all kinds of different things. The social media campaign creator sounds fancy, but it's more of an entry point to doing that. So you can start categorizing the product by its different types of functionality. Looking at that data, we found lots of dumb errors: like, hey, it's emitting the user's UUID from the system prompt into the message, and we don't want that. Or, in Lucy's case, responses interleave prose and UI elements that the front end has to render, and it wasn't quite doing that the way we wanted; it's more nuanced than plain structured data. Or it was repeating some phrases, or a whole host of other problems. So by looking at the data carefully, we found all of these annoying issues and were able to write tests against them, like, hey, you shouldn't be doing this stupid thing or that stupid thing, and started making a lot of progress that way.
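
Two of the failure modes described here translate directly into the kind of code-based assertions sketched earlier; the exact patterns below are illustrative, not Rechat's actual tests:

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def no_uuid_leak(output: str) -> bool:
    """The assistant should never echo internal UUIDs from the system prompt."""
    return UUID_RE.search(output) is None

def no_repeated_sentences(output: str) -> bool:
    """The assistant should not repeat the same sentence verbatim."""
    sentences = [s.strip().lower() for s in output.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))
```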

Swyx (00:19:59): So a quick follow up on this, when you worked on it, was it already deployed to customers?

Hamel Husain (00:20:05): It wasn't deployed to customers, actually not fully. More of an alpha with a couple real estate agencies, so some customers but not enough to have coverage across all the features quite as we wanted. So we had to use a lot of synthetic inputs.

Raza Habib (00:20:23): That was literally going to be my next question: it's all well and good writing these tests, but you typically need a validation or test set to run them over. How did you go about creating that?

Hamel Husain (00:20:33): Okay, yeah, let's take a step back. So you're like, okay, you have this data, how did you get this data? I don't have data, which is the case for a lot of people. In this case we had an LLM simulate a real estate agent: ask the same question in many ways and perturb the inputs to the system as much as possible, so we could generate as much synthetic data as possible to fill out all of these features, and get really creative with it. Let's think about how we can break the system and just generate lots of different questions and data. It's a little bit tricky because, for example, if we're pretending to be a real estate agent, we want to ground it in reality, so we don't want to ask for addresses that don't exist.

(00:21:18) Well, sometimes you do, but you also want to ask about addresses that do exist, so you have to bake that into your synthetic data generation system. I actually spent a long time on the synthetic data generation system; it's like a simulator, a little world simulator. A lot of machine learning engineers working on LLM-based products spend quite a bit of time on these world simulators. If they don't have good user data yet and they want to bootstrap themselves, they try to do that with LLMs, which is reasonable, at least when you're getting started, to get a sense of what's happening.
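
A heavily simplified sketch of that idea, using the OpenAI Python client as an example backend. The prompt, the model choice, and the notion of grounding questions in a list of real listings are assumptions for illustration, not Rechat's actual simulator:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical grounding data: real listings, so the simulated agent asks
# about addresses that actually exist (you can mix in fake ones deliberately).
REAL_ADDRESSES = ["123 Main St, Austin, TX", "9 Elm Ave, Denver, CO"]
FEATURES = ["listing finder", "contact creator", "social media campaign"]

def simulate_agent_request(feature: str, address: str) -> str:
    """Have an LLM role-play a real estate agent exercising one feature."""
    prompt = (
        "You are role-playing a real estate agent using a CRM assistant. "
        f"Write one realistic request that exercises the '{feature}' feature "
        f"and refers to the listing at {address}. Vary tone and phrasing."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return response.choices[0].message.content

synthetic_inputs = [
    simulate_agent_request(feature, address)
    for feature in FEATURES
    for address in REAL_ADDRESSES
]
```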

Raza Habib (00:21:58): So then you've got a test data set and some tests you've written, so hopefully you're getting back some numbers that tell you how well it's working every time you make a change, and presumably you then iterate against that until you get something you consider good enough to deploy. There are a few different types of evaluation we've mentioned. There are the hardcoded test cases or assertions. There's obviously human review, where you're manually annotating things or getting domain experts to do it. But you mentioned LLM as judge as well, and whenever I talk to people about this, I'm often met with skepticism, especially because people feel like it's turtles all the way down: I've got a messy stochastic thing and I'm going to evaluate it with another messy stochastic thing, and then I need to evaluate that thing as well. So a common problem is: how do I trust my LLM as judge? How do I make it good? I have opinions about this, but I'd be curious to know how you do it.

Hamel Husain (00:22:47): Yeah, this is a really great question, and everyone has this question about how to trust LLM as a judge. There is a systematic way you should use LLM as a judge: you need to measure its agreement with a human being. That's the only way you can know whether you can trust it. Concretely, the way that works, and I write about this in the blog post "Your AI Product Needs Evals", is that you go through several iterations where you measure the agreement between the human and the LLM judge. You have to make sure your human is also writing critiques of the LLM judge's verdicts, so that you can use those critiques to improve the judge.

Raza Habib (00:23:30): And why not let that human adjust the prompt themselves?

Hamel Husain (00:23:35): Oh, you can, but you need to make a human or domain expert go through the process of annotating the judge's outputs and making their own assessment, so that you can measure the agreement between the human and the judge. And you need to do this several times; in every iteration you try to bring the agreement closer. A very interesting thing happens when you do this. One is that your LLM judge gets better, because you get better at prompting it: you figure out where it's failing, and from the critiques the human is writing you're able to incorporate aspects of that into the prompt, either as examples or otherwise. But also, as much as the LLM becomes aligned with the human (and I know I'm using the word "aligned"; it's not RLHF I'm talking about here, just agreement), the human also becomes more aligned with the LLM. They're like, oh, you know what, this is actually a really interesting, really hard problem, and I think what the LLM is doing is reasonable. It's fine. There's some empathy built the other way, which is counterintuitive. You don't think about that, but it kind of makes the human accept the judge.
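
A minimal sketch of that agreement loop, under an assumed data shape: a domain expert labels each example pass or fail and writes a critique, the LLM judge labels the same examples, and you track agreement and study disagreements before revising the judge prompt for the next round.

```python
from dataclasses import dataclass

@dataclass
class JudgedExample:
    output: str          # the LLM output being evaluated
    human_label: bool    # domain expert's pass/fail verdict
    human_critique: str  # the "why", folded back into the judge prompt as examples
    judge_label: bool    # LLM-as-judge verdict on the same output

def agreement_rate(examples: list[JudgedExample]) -> float:
    """Fraction of examples where the LLM judge matches the human."""
    return sum(e.human_label == e.judge_label for e in examples) / len(examples)

def disagreements(examples: list[JudgedExample]) -> list[JudgedExample]:
    """The cases worth studying when revising the judge prompt."""
    return [e for e in examples if e.human_label != e.judge_label]

# Repeat for a few rounds: label a few hundred examples, measure agreement,
# incorporate the critiques into the judge prompt, and re-measure.
```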

Raza Habib (00:24:45): And in your experience, how long does this process take? How many rounds of manual labeling am I having to do? How hard is it actually to align an LLM as judge?

Hamel Husain (00:24:56): Only three or four iteration rounds. And "make" is a bad term, there are various ways to encourage this, but have them label a few hundred examples, maybe.

Raza Habib (00:25:07): So we're not talking about masses of work.

Hamel Husain (00:25:08): It's not that bad.

Raza Habib (00:25:10): And then once you've got that LLM judge, that's a scalable eval. You can also use it for monitoring; you've got an artifact. Do you also fine-tune them to make them quicker and cheaper? I guess the other reason people are reticent about using LLM as judge is that it's expensive.

Hamel Husain (00:25:24): It depends. If it's for something simple like a guard, or more of a classification thing, then I might use fine-tuning. But I try not to go there with LLM as a judge, because it just becomes a meta-problem that I have to worry about, and I try really hard not to. It adds to the complexity of the system, because then I'm also thinking about whether I should fine-tune that LLM as well. So usually I steer clear of fine-tuning the LLM judge. Sometimes it's more like, let's fine-tune the actual model and try to make it better. You can get crazy with the judge; it can become almost an academic exercise.

Swyx (00:26:04): Maybe there's a tick-tock process where you go back and forth between your judge and your producer. I wanted to offer my contribution to this. I've lived this, in the sense that I produce a daily AI newsletter that is 99% AI generated, and I am the human judge annotating it every time. I'm nine months in today, actually, and literally today was the first time I had a hundred percent agreement between my product's output and what I would have picked. So I was surprised when you said three to four rounds, because it takes a lot. I think it also depends on the open-endedness of your domain and the complexity of the task, in the sense that I'm summarizing all of AI Twitter, Reddit, and Discord.

Hamel Husain (00:26:40): I really like AI News. (Swyx: Oh, thank you.) Have you talked anywhere about how you make it?

Swyx (00:26:46): Yeah, ask me anything you want, but it's literally a data pipeline. We filter, we group, we summarize, we do entity...

Raza Habib (00:26:52): And for anyone listening who doesn't know, AI News is the daily newsletter that basically summarizes all the Discords, Twitter...

Swyx (00:27:01): This stuff that you're supposed to keep up on, but you can't.

Raza Habib (00:27:03): Yeah, and it tells you what you should actually pay attention to. I enjoy it: I get it every day and read the headlines, and it's like, okay, I know what, if anything, I should pay attention to.

Swyx (00:27:11): Correct. So today the surprise was o1-preview, and it spat out exactly what I would have written as a human. It was the first time it's done that. I've always had to do a manual override, me as a layer on top of it going, okay, this is actually the story, this is not. But today, right at the top, o1-preview was in the top three stories I wanted to run.

Raza Habib (00:27:29): I think that's a really good point, though, because that task is a lot bigger and more open-ended than what I would typically encourage people to do for an eval. One of the things we advise people to do is break their evals down into small, atomic things that are independent of each other, whereas when you're doing that review on what AI News has spat out, you're essentially assessing the whole thing.

Swyx (00:27:54): A little bit, yeah. There's the whole thing and the micro things. We split it into the summarizer prompt, the filter prompt, and the style prompt. Style is the smallest and probably fine, but even I have very specific requirements for style that are weird sometimes, and often tuning that style prompt ends up leaking the few-shot examples I have into the data, which I then maybe need a fourth evaluation for, and it just becomes a whole mess. So I don't have that part worked out yet. I just kind of rely on people to self-correct whenever I hallucinate a link, but it's not great.
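
For readers trying to picture the pipeline being described, here is a rough structural sketch. The stage names come from the conversation, but the prompts and the `call_llm` helper are placeholders, not AI News's actual code:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM call the real pipeline uses."""
    raise NotImplementedError

def filter_stories(candidates: list[str], k: int = 10) -> list[str]:
    """Filter prompt: pick the k stories worth covering out of ~400 candidates."""
    ranked = call_llm(
        "Rank these candidate stories by importance to AI engineers:\n"
        + "\n".join(candidates)
    )
    return ranked.splitlines()[:k]

def summarize(story: str) -> str:
    """Summarizer prompt: condense one story."""
    return call_llm(f"Summarize this story for a daily AI newsletter:\n{story}")

def apply_style(summaries: list[str]) -> str:
    """Style prompt: enforce the newsletter's voice on the assembled issue."""
    return call_llm("Rewrite in the house style:\n\n" + "\n\n".join(summaries))

def build_issue(candidates: list[str]) -> str:
    return apply_style([summarize(s) for s in filter_stories(candidates)])
```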

Hamel Husain (00:28:28): How much time do you spend as a human on each news blast that goes out?

Swyx (00:28:33): I've actually been recording it for my team; we're working on handing it off so that I'm no longer in the loop. It's between 30 and 45 minutes a day. The goal is that we eventually build a newspaper with no journalists that scales to every vertical. So it's a product that I think could benefit from the process you're describing. I think I differ by an order of magnitude from what you said, which is kind of interesting, and I haven't written unit-test-style evals, because we care about different metrics, like topic selection: I have maybe 400 candidate stories every day and I need 10. So it's actually a recommendation system rather than...

Raza Habib (00:29:07): And it's also a relatively low volume application in terms of the number of LLM outputs you have compared to a lot of what I've seen at least.

Hamel Husain (00:29:14): Yeah, that's true. The things I tend to work on are more contained in scope, I would say. And a lot of times people come to me with chatbots: "I want to put a chatbot over my SaaS." I try as hard as I can to convince them not to do that. It's the idea a lot of people start with, but it tends to be highly correlated with not being thoughtful about the product. If you're just trying to put a chatbot over your SaaS, like, hey, instead of clicking menus or doing anything, you just talk to this chatbot to do anything in my SaaS, versus, okay, let's be more thoughtful: can you integrate AI into the user workflow in a more deliberate way without making them go to chat? It's a bit of a smell in a lot of cases when you just put a chatbot on it. It can also be very hard to evaluate; the surface area is enormous and it's a moving target. And then from a product perspective, it's sometimes not...

Swyx (00:30:16): ...good. I agree with this as what most people should do, but I still aspire to the chatbot. No, no, let me make the bull case. I go into Linear and sometimes I want to do a thing and I don't know how to do it. I hit command-K and figure it out, but if I type the wrong letter, or I don't know the name of a piece of functionality, I can't do it. So at minimum we should be doing semantic matching for functions. But actually what I want is a little mini agent, if we call it that, that calls tools, and the tools are all the functionality. There are so many dashboards with 5,000 settings. Zoom, fricking Zoom: I can't figure out how to turn some of my permissions on and off. And of course they don't have search... sorry, they do have search, but there are 500 matches for the thing I'm searching for and I don't know which one it is. A chatbot would be nice.

Hamel Husain (00:31:09): Yeah, I think it could work. If your SaaS has a really good, cleanly defined API surface area, then it would probably do really well. A lot of SaaS doesn't have that.

Raza Habib (00:31:21): But I agree with what you say about the smell, in terms of not necessarily being thoughtful. This has come up with a few of my other podcast guests who have spoken about the projects that did or didn't work well. Post-ChatGPT there was a rush to slap LLMs on everything, chat-with-PDF or whatever, and then people took a step back and asked, well, where are the actual bottlenecks? A great example of this was speaking to the founder of Gusto: he was saying how they realized report building was something people in Gusto spent hours on, and they could actually automate the entire process of report building down to a couple of minutes with an LLM call. So rather than building just chat over the SaaS, they went and spent a lot of time building that feature first.

(00:32:00) That's where the amount of leverage a customer gets is enormous. I think their story was that they actually started off distributed: everyone should build LLM features. They got a lot of incremental improvements because every team did small things, and then they pulled back and said, actually, if we could make one or two really big bets, what would really move the needle for people? I think not enough people are doing that, so it definitely resonates. Okay, last thing on this area before we move on. Obviously, working as an AI consultant, you get brought in when people are struggling, when things aren't working, or when people have been disappointed trying to build with LLMs. You've mentioned evals as one missing piece that people often get wrong. What other challenges are people facing, and what are the common pitfalls? What's the advice you find yourself giving again and again?

Hamel Husain (00:32:50): Yeah, a lot of times people come to me and they're like, hey, can you help us? Here's a ten-page PDF of all of our architecture and tools; can you take a look and tell us what we could be doing better? And there's no looking at data, no evals. It's a very, very common mistake: a focus on thinking you can move the needle with better frameworks and tools when you don't even know what's broken. You don't know what's wrong; you just have a sense from vibe checks that you want it to be better. So the most common advice I give is, hey, we have to first instrument your system, then look at your data and see what's wrong. And every single time we instrument and start looking at the data, we find, okay, there are some really dumb things happening here that we can fix.

(00:33:51) It's just a blind spot for so many people. Once you do it, it's fairly obvious: hey, I should have been doing this for a long time. For some reason people just don't think of it. That's one thing. The second thing, along the same thread of tools and frameworks: when I talk about evals or looking at your data, the first question people ask is, what's the tool? What tool do you use to do that? Tools can help you, but people have this idea that buying a tool is going to solve the problem. Inherent in it is a process: you have to follow a process of looking at your data. The tool is not going to look at the data for you, it's not going to debug for you, it's not going to do all that stuff.

Raza Habib (00:34:40): I'm obviously biased toward suggesting tools, right? Because I build one of these tools. But actually I agree with you, in that the customers who get to value with us tend to be problem-aware. They don't come to us with nothing in place saying, we need evals. Typically what's happened is they've built some jerry-rigged version with spreadsheets, human labeling, and Jupyter notebooks, and they've hit a scale issue where it no longer works with the collaboration challenges they have and the frequency at which they're trying to deploy. Very rarely does someone show up cold. We did have this very early on, where people would show up and say, we're building an LLM product, help us, we need evals, or we need a tool. We now DQ them in the sales process. They're not people we consider high priority, because we know they're not going to get to value, and if they don't get to value, they're not going to buy the product. So we really only have success with people who are somewhat problem-aware.

Swyx (00:35:32): What's a clarifying question you can ask to tell whether someone is problem-aware? This is a new term for me.

Raza Habib (00:35:36): So, okay, this is now very Humanloop-specific, but very briefly: when someone comes in on the first call, usually I just have a big chat with them about the current stage of what they're building. Who's involved, where they've got to, are they in production, where are they? And in that process it becomes very clear whether this is a company that has been told they need to do something with AI and doesn't yet fully know what that is. There's a lot of that.

Raza Habib (00:36:00): Versus: you've got a process in place that's just not built on a good framework or a good set of tooling. That's very common; I would say the vast majority of the customers who come through to us have already built something, oftentimes something quite sophisticated. Whereas, as you say, the people who are much earlier than that tend not to get there. We've tried to help them, and the tool is never going to solve it for you; you need to understand the framework and the process first.

Hamel Husain (00:36:28): That's really smart. Yeah, this is like the book The Mom Test. It's asking: are you already investing? Can you see that the person is investing in the problem? If they're not investing in the problem already, they're probably not serious about solving it.

Swyx (00:36:42): I've often thought about this. Basically we have a consultant and a SaaS founder here. Have you ever thought about bundling services with software? So you sell the tool and then you sell the labor to implement the tool, because they don't have the labor; that's why they're coming to buy the tool.

Raza Habib (00:36:58): So the way we're typically starting to do this: firstly, we don't like to sell consulting services ourselves, although we end up doing a lot of that. We have insanely hands-on customer support, which sometimes veers into free consulting, pre-sales consulting. But the reason we don't want to do it as a consulting contract is that I don't want to be on the hook for delivering consulting. Actually, I'm getting a lot of value from working closely with the customer, and we're still selling them a product; ultimately we go away. Where we have started to do it is with a few customers who are consultants themselves, large consultancies and systems integrators rather than individuals. We have a couple who are using it themselves, but increasingly they're the ones taking it to a customer. It's an early channel for us, but that seems much more likely to me than us doing the consulting ourselves.

Hamel Husain (00:37:47): Give me a call.

Swyx (00:37:50): Absolutely. I don't know if this came up earlier, but I'm interested in what kind of work you're interested in. What is interesting to you? What is boring now? I dunno.

Hamel Husain (00:37:59): Yeah. So consulting was, or is, really interesting to me, because from a personal perspective it was a way to explore entrepreneurship in a very lightweight way. I don't have to raise capital or anything; I can just see what it's like to basically hunt and kill, to figure out how to provide value, what it's like to have a sales process, what it's like to do marketing, how to think about pricing, all these little entrepreneurship things. So that was really interesting, and I wanted to see a lot of different problems, because I wasn't necessarily interested in one specific area, so it was a good way to do that. But at the same time, I don't want to do consulting forever, because there are a lot of interesting problems to solve out there, and at the end of the day I do want to build stuff. So I've pared back a lot of my consulting and still do it for select cases, but that's why I joined Answer AI part-time: to build stuff. Also, I don't feel like I can do good consulting unless I'm building; otherwise, in the limit, I'll just become some idea guy, and I want to build stuff. That's kind of where I've landed. It's constantly evolving.

Raza Habib (00:39:28): I think this is a nice natural segue into the other half of what you do, or have recently been doing. AI consulting has obviously been a big part of it, but you've also been involved with fast.ai, with Answer AI and nbdev and others. One topic that has come up a couple of times in the conversation that we haven't had a chance to go deep on yet is literate programming and these other paradigms you're interested in. Can you tell us a little bit about that? For people who don't know: what is literate programming, why should they care, and how might it be connected to the other stuff you work on?

Hamel Husain (00:39:58): Literate programming was coined by Donald Knuth. Most people know Donald Knuth from his treatise on algorithms and data structures, The Art of Computer Programming...

Raza Habib (00:40:10): Or they've had to suffer through LaTeX.

Hamel Husain (00:40:11): Yeah, I would say you can draw a straight line between LeetCode interviews and Donald Knuth, in a way. It's not his fault, but there's definitely a connection between those two things. So most people know him from that, but even Donald Knuth is way more excited about literate programming than about The Art of Computer Programming. If you listen to any of his talks recently, he'll get way more excited about literate programming; he wants to talk about it. But it's kind of languished in obscurity for a lot of reasons. So what is literate programming? Right now, when you write code, you write your code in one place, you write your documentation in another place, and you write your tests in another place. It's kind of a pain in the ass.

(00:41:01) And a lot of people don't write documentation. They're like, what the hell is this? Now I have to go write documentation. You might have AI write the documentation and you might have AI help you write the tests, but the insight is that code is meant to be read by people, not just written. The process of writing the documentation as you write the code inherently leads to much better code. And having that together in one artifact, where the code and the documentation are weaved together, is more natural for a human. It's more digestible to work with and leads to a higher-quality program. That's Donald Knuth's claim as well. Now again, it's something that has stayed in obscurity.

Raza Habib (00:41:48): So just to make sure I've understood correctly, and to summarize: literate programming is writing, simultaneously and in one place, the documentation for your code, the code itself, and the tests.

Hamel Husain (00:42:01): Yes. Donald Knuth has a framework called WEB, which is essentially his own programming language that he invented: you write your documentation and code in one place, and it compiles to Pascal code plus separate documentation. But essentially it's writing your documentation and code in the same location. Now, there have been a lot of nods to literate programming since, like notebooks. Look at R, or Mathematica, or Jupyter. Those are things that have...

Raza Habib (00:42:34): And those are things typically used by people doing very iterative work. They're rarely used to write code that needs to be reliable and scalable and secure. If you just take those examples, a notebook or Mathematica, it's scientists or machine learning engineers or people who want to do analysis. But this, I guess, would be trying to bridge all the way through to: I should be doing all of this for my production code too.

Hamel Husain (00:42:59): So those tools have mainly been centered on analytical and exploratory workflows, not on writing software you can deploy; they're not really meant for that per se. So that's one area. A very related idea is exploratory programming, made popular by Bret Victor. If you've watched some of his talks, it's the idea that there are certain kinds of creations that can only be made if you have a tangible connection to the thing: if you can fiddle with the code and immediately see the output and play with what it's doing. That immediate connection is inherently very important. Again, notebooks have elements of that, but I think literate programming has been misunderstood to a large extent, because people say, oh, literate programming is just notebooks. It's not. If you draw a Venn diagram there's some intersection, but notebooks don't take you all the way there. And so the reason this is all interesting... you're like, okay, well, who cares?

Raza Habib (00:44:11): I'd say connect the dots for me: firstly, why should people care generally, but also how does this link up with your work at Answer AI, or with what you're doing as an AI engineer more broadly?

Hamel Husain (00:44:20): So I first got turned on to literate programming six or seven years ago, when I started working with Jeremy Howard. And it was really interesting. I was just really fascinated by how one person could build so many things so fast.

Raza Habib (00:44:35): And does he just work insanely hard? Is that the secret to this, or is there something else going on?

Hamel Husain (00:44:38): No, no. I would say he works maybe three hours a day. He does a lot of stuff. He's been on your podcast twice. Very interesting person.

Swyx (00:44:47): I wouldn't put that as an example of doing a lot of stuff.

Hamel Husain (00:44:52): Sorry, I just meant you should go watch those episodes to learn more.

Raza Habib (00:44:57): But for people who don't know: Kaggle, multiple startups, fast.ai, some of the most popular courses on deep learning (not DeepLearning.AI, but courses on deep learning similar to that). We've all probably benefited from something Jeremy's done in terms of learning.

Hamel Husain (00:45:13): He was wildly productive: building lots of software, innovating on machine learning, beating benchmarks, but also creating very innovative software.

Raza Habib (00:45:24): And to be fair, I think he doesn't get enough credit for the current LLM wave, so whilst we have a moment here I just want to plug this. GPT-3 is usually remembered as the first big in-context-learning pre-trained LLM, and it wasn't the first; even before that there was T5 from Google and others. But I think the first good example in the literature of being able to pre-train language models was ULMFiT from Sebastian Ruder and Jeremy. For some reason it rarely gets mentioned as the genesis point in the history. It's also the only paper he's ever written: he popped into NLP to write one paper, figured out how to do LLM pre-training, dropped the mic, and left.

Hamel Husain (00:46:02): And if you were paying attention at the time, it was like, wow, what is going on here? It's like one person writing all this code.

Raza Habib (00:46:09): And sorry, just before we move on, to be fair to the whole community: whilst that paper was super influential, I can see why it didn't catch on, because, much like when Geoffrey Hinton did layer-wise pre-training for autoencoders, it was just really fiddly to get the method to work. It was a multilayer RNN that required stage-wise fine-tuning and it was complicated to get working. But it was, I think, the first demonstration that you could get a pre-trained LM to do useful things on downstream tasks by first training it on a big unlabeled corpus.

Hamel Husain (00:46:42): It was really impressive, actually. And I'm inherently interested in people, so I wanted to know: how does he do it? What's going on here? So I started working with him, contributing to his libraries and things like that. Very quickly I saw: okay, why are you using your own software development stack, and what are all these language constructs? If you look at the code, there are very interesting APIs and patterns of using Python that are very different.

Swyx (00:47:24): So you're referencing fastcore, if you want to name it?

Hamel Husain (00:47:26): Yeah, fastcore is an extension of the Python programming language that really leans into pushing a lot of functional programming paradigms into Python, as well as a whole bunch of other stuff inspired by other languages, just trying to make code more expressive and more succinct. And then he had his own software development stack where he uses literate programming: nbdev. It was a really deep, insane rabbit hole. I was like, oh, I'm just going to check out what he's doing. What does he do? Does he meditate, or does he have some workout shake that he drinks? No, it was this massive buildup of tools he'd built over so many years to optimize his workflow, and that led to this massive productivity. So I really dove into it. I was skeptical at first: okay, how come I haven't heard of this? Why isn't everyone doing this? What is going on? Let me try it myself, let me contribute to these tools, let me understand them. And I quickly saw that, okay, I can write software ten times faster using this approach.

(00:48:46) There's definitely a learning curve. And it's not just the fastai library that's written with nbdev. There's all kinds of stuff written with nbdev: CLI tools, API clients, various different kinds of software, FastHTML, all kinds of stuff written with the benefit of this literate programming approach.

Swyx (00:49:11): I would basically say it's a misnomer, a misconception, to even question whether it's suitable for production. It's just code at the end of the day, just a different presentation of code.

Raza Habib (00:49:22): Yeah, I can see that. But there's a reason we don't all do our production software development in bare plain-text files, even though we could with tests and CI/CD. If I think about a notebook, there are lots of reasons people are scared about using notebooks to write things that might be production-level code. And I know we shouldn't conflate literate programming and notebooks, but just as a concrete example, it's one place where there are good reasons why that's not how we write production software. You could say it's just code, but it's code that's difficult to maintain and difficult to check into git, and it has a lot of issues that would make me reluctant to trust it as a production code base. So in some sense, yes, code could just be a plain text file, but we've built a very complicated set of systems and processes to make it reliable.

Hamel Husain (00:50:09): And so with the literate programming system, there's a way to write tests, there's a way to write documentation. You write your software in that environment and it compiles to these artifacts: plain text code, documentation, and tests. The end user doesn't know it's written with literate programming; they just pip install something and it works. So I thought it was really powerful and I got really excited about it. I was like, wow, this is insane. The only issue is there's a tremendous learning curve to get to the point of being productive with it, so you have to be quite open-minded to try something like that. And it's also extremely experiential: no amount of talking gets it across. You have to experience it to understand what it is, because otherwise it's just this vague idea, and you don't understand why it's better.
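
For readers who haven't seen it, here is a rough sketch of what an nbdev-style literate source looks like: prose, exported code, and an inline test living side by side in notebook cells. The `#| default_exp` and `#| export` directives are nbdev's; the cell markers and the example function are made up for illustration:

```python
# %% [markdown]
# ## Price utilities
# `fmt_price` renders listing prices for the UI. We round to whole dollars
# because the CRM never shows cents.

# %%
#| default_exp pricing
#| export
def fmt_price(amount: float) -> str:
    "Format a listing price, e.g. 1250000 -> '$1,250,000'."
    return f"${round(amount):,}"

# %%
# Tests live right next to the code and run as part of the notebook.
assert fmt_price(1250000) == "$1,250,000"
assert fmt_price(999.4) == "$999"
```

Exporting this roughly produces a plain Python module plus rendered documentation from the same source, which is the "compiles to these artifacts" step Hamel describes.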

Raza Habib (00:51:08): It's funny you say this, because when you were talking about it before, and you spoke about exploratory programming and being able to discover something through creativity, the immediate thing I thought of was the early part of the LLM wave, pre-ChatGPT. I felt like the difference between the people who were excited and got it and the people who hadn't was just whether they had interacted with the models through something like a playground environment where they could actually test them, versus having read the papers or knowing about it secondhand. Because you do learn something viscerally from being able to probe the edges of these interactive systems. So it definitely resonates, even in the AI version of it. I think unless you've played with the models, it's hard to know what's possible.

Hamel Husain (00:51:51): And so that's where it becomes interesting...

Raza Habib (00:51:55): So now we get to the intersection.

Hamel Husain (00:51:56): So the nbdev wave of things: it was good, but it wasn't quite enough to get people to switch. It's really hard to get people to change their developer workflow, insanely hard, especially at a company, where people are like, what the hell, I'm not going to use this whole new development stack. Fine. But it turns out literate programming works really well with AI, because you are explaining things; it's like the ultimate prompt for an AI. If you have a narrative about your code that's interwoven with the code, it's an insane prompt that works really well. And so if you can bring AI into your literate programming environment in the right way, it becomes this new paradigm that allows a different way of doing things. To give you a concrete perspective: I use Cursor, I love Cursor, I think it's really great.

(00:52:51) In fact, I've been using early versions of this nbdev-infused AI environment, and I would say with Cursor I can get started insanely fast and get a simple project going really quickly. But once the complexity starts to climb, the AI becomes less and less helpful. As I build out a more complex system, this kind of literate programming environment keeps going, because of that advantage of the richer narrative in the prompt. This idea that you're explaining everything carefully to a human, how it works, with your prose and code and tests inline in a narrative, is something that just works really well with AI. It's able to understand a narrative. Once you start building, you're able to really unlock that.

Raza Habib (00:53:49): And is this now your default code environment? Is this how you're writing code these days?

Hamel Husain (00:53:54): I'm actually still using everything I can; it's important to keep an open mind, so I still use Cursor. Sometimes I use Cursor to begin with, then I switch over.

Raza Habib (00:54:05): Is there a name to this new tool, by the way?

Hamel Husain (00:54:07): It's not public. I don't know if it's public yet. He said it's called AI Magic. Okay, AI Magic. I'm using it for a lot of stuff, and it's really interesting. There's an aspect of engineering to it. We have prompt engineering, for example, as a discipline. There's this term that Jeremy coined, dialogue engineering, which is essentially being very thoughtful about what the dialogue is and engineering that dialogue, to the point where you kind of clean up the dialogue.

Raza Habib (00:54:36): The dialogue as in between...

Hamel Husain (00:54:39): Between the AI and the human. So with what you ask the AI to do, the code and the explanations that are being written out, you have to be really thoughtful about how that's going to appear to the AI, which is also aligned with how it should appear to a human. So it's actually pretty intuitive, and that's what you need to do to write code: think with this dialogue engineering mindset about, okay, what is the right history of questions and answers that I should build up and always have in context. So it's different, but it is very interesting because of literate programming. The intersection between large language models and literate programming is very interesting.
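To make the "dialogue engineering" idea concrete, here is a minimal sketch of curating the conversation history before each model call. The helper names, the system prompt, and the model id are illustrative assumptions, not taken from AI Magic or any real tool.

```python
# A minimal sketch of "dialogue engineering": instead of appending every exchange
# verbatim, you curate the history the model sees on each turn. All names here
# (curate, ask, the "dead_end" flag, the model id) are illustrative.
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "system",
     "content": "You are helping write a small Python library, literate-programming style."},
]

def curate(history, max_turns=8):
    """Keep only the turns that still matter: drop marked dead ends and stale
    drafts, keep the narrative (prose, code, tests) that explains the current state."""
    kept = [m for m in history if not m.get("dead_end")]
    system, rest = kept[0], kept[1:]
    return [system] + rest[-max_turns:]

def ask(prompt):
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": m["role"], "content": m["content"]} for m in curate(history)],
    )
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```

The design choice being illustrated is simply that the history sent to the model is something you deliberately shape, rather than a log you passively accumulate.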

Raza Habib (00:55:24): So, you said one of the reasons why literate programming hasn't become more mainstream is just that it's different: there's a steep learning curve, you kind of have to experience it to get the benefits, and so not that many people have been willing to climb that curve, and it's not currently diffused in professional environments; you go to work and learn a particular way of doing things. Are you saying that LLMs reduce the learning curve, or is it just that they increase the benefit?

Hamel Husain (00:55:52): So there's no reason why we can't decrease the learning curve. I think nbdev was an experiment in literate programming. It's like, hey, let's do literate programming in a popular programming language, not Pascal, which no one's using. Let's be practical and not try to rebuild an IDE from the ground up; let's not rebuild Jupyter. What can we do, and does it work? And that was part of why there was a steep learning curve: you had to hack around a development environment that wasn't inherently built from the ground up for nbdev. So now that we know it works, one, we can reduce the learning curve, but more importantly, I think instead of a 10x improvement we can maybe see a 100x improvement, especially in areas that lend themselves to more complex problems, where the IDE way of doing things doesn't quite work.

Raza Habib (00:56:50): Will this integrate nicely with existing IDEs, or am I going to have to change how I do everything? Because one objection I would have is that this sounds like it would be great, and I would love to use it in my hobbyist or side projects, but if I was going to do this at work, I'd have to persuade my whole team to switch from the old way to the new way. Behavior change is often a lot harder than technology change.

Hamel Husain (00:57:12): Yeah, that's a good question. I don't know; that's an open question. If I were to guess, I would say that's not currently the focus. One thing is, it's not a developer tool for the sake of having a developer tool. The idea is: you know what, it doesn't matter whether anyone believes in it or not, we are going to write software with it. It's kind of like Paul Graham and his belief in Lisp, that he's going to outcompete everyone based on Lisp. That might be controversial, maybe it's the one thing he's wrong about, but essentially the question is: can you have a much more productive workflow? We should be able to prove it out by actually shipping lots of stuff. And if you ship lots of stuff, then people will notice. Not this academic thing of, hey, look, it's a hundred times better, who cares? Once you ship a lot of stuff, then there's pressure to figure out what the hell is going on.

Swyx (00:58:11): But you've been open about it, I mean, you and Jeremy and the rest of the team. One observation I'll make is that when you mentioned dialogue engineering, it suddenly clicked for me that Cursor hides the prompts, and I think you don't hide the prompts. I want to use that as a way to highlight, or recommend to people, your "Show Me the Prompts" post, and this is another example of that, right? Cursor hides the prompts, and presumably AI Magic doesn't. I haven't seen it, but you're talking about dialogue engineering, caring about what goes into the history, and looking at the data of what goes in. I don't do that with Cursor. I just kind of add stuff and hope that it works.

Hamel Husain (00:58:47): Yeah, that's how I use it too, and that's probably the right way to use it. Looking at your prompts, what a novel concept. When I wrote that blog post, I was using LLM frameworks and I wasn't understanding what was happening.

Raza Habib (00:59:04): Frustrating,

Hamel Husain (00:59:05): I wasn't understanding what was happening.

Raza Habib (00:59:07): I actually think that was the source of the success of some of those frameworks early on: people thought the framework was doing a lot for them. If only they could have seen the string being sent to the model.

Swyx (00:59:16): I mean, people knew LangChain was a class method with a...

Raza Habib (00:59:21): It wasn't just LangChain. I feel like every framework that was built early on suffered from this.

Swyx (00:59:26): I discovered this with Claude 3. I use Instructor, and I was like, Instructor is a low-abstraction thing.

Raza Habib (00:59:32): The system prompt?

Swyx (00:59:33): Instructor uses a system prompt and injects it in. I didn't know it failed until I figured it out. Yeah, yeah.

Hamel Husain (00:59:39): It was just that some of the code in these frameworks was quite complex and didn't have logging, so I was like, okay, you know what, I'm just really curious. Let me do a man-in-the-middle attack to intercept the requests. Surely it's making a request to some API endpoint, and at the very least I can inspect that and see what's happening. And I was shocked at what was going on. I was like, wow, I need to write about this. Did you know that this is happening?

Swyx (01:00:05): He knows, because he runs...

Raza Habib (01:00:07): I did know, but I was really glad you wrote the post
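If you want to do this kind of inspection yourself, here is a minimal sketch along the lines Hamel describes, assuming you route the framework's traffic through mitmproxy (for example by setting HTTPS_PROXY=http://localhost:8080 and trusting the mitmproxy CA). The filename and the hostname filter are illustrative.

```python
# log_llm_prompts.py -- a minimal mitmproxy addon that prints the exact prompt
# an LLM framework actually sends to the provider.
# Run with: mitmdump -s log_llm_prompts.py
# The hostname filter below assumes OpenAI; adjust it for your provider.
import json
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    if "api.openai.com" not in flow.request.pretty_host:
        return
    try:
        body = json.loads(flow.request.get_text())
    except (json.JSONDecodeError, TypeError):
        return
    # "messages" for chat endpoints, "prompt" for legacy completion endpoints.
    print(json.dumps(body.get("messages", body.get("prompt", body)), indent=2))
```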

Swyx (01:00:10): People want magic. But at the end of the day, engineering is kind of pushing away the magic and actually thinking logically about how things work. And you can't really have both: either you have a suspension of disbelief and stay, effectively, non-technical, or you stay technical and dive in.

Raza Habib (01:00:26): You heard it here first. We're stripping away the magic,

Swyx (01:00:30): Which is ironic, given they call their system AI Magic. I'm like, maybe that's not what you want to... Naming is definitely a hard problem.

Raza Habib (01:00:37): Okay. So one question I've asked almost everyone else, and I'd be curious to get your take on it as well: if you think about AI now, do you consider it to be overhyped, underhyped, something else?

Hamel Husain (01:00:48): I think it's really, really underhyped, because if I go to the grocery store or the library, just step out of my bubble and ask people about AI, hey, have you used ChatGPT, have you tried this or that, not many people are using it. Not many people are even aware of it. It's completely different from what's going on inside our AI bubble. I'm using it constantly, every day, and normies are not, and that's really interesting and surprising to me. In a way, that makes me feel like it's underhyped: a lot of people who obviously could be using it aren't.

Raza Habib (01:01:37): Interesting. I could imagine that argument cutting both ways. You could say, well, hey, there's so much attention and so many dollars being spent on it, and no one's using it yet: massively overhyped. But I think what you're saying is both that other people aren't using it and that you're using it a lot. That second part is probably critical to the argument: you can see the value, it's already tangible for you. Is that fair?

Hamel Husain (01:01:59): And when I talk to these people, various people in different office jobs, and ask them about what they do, every time I show someone a little bit of AI they're like, wow, I didn't know it could do that. As though I'm the first person to show them, as though it just came out yesterday, which is still really surprising to me. It hasn't permeated all the way into the public as much as...

Raza Habib (01:02:23): And part of the reason that surprises me is that someone on my team who's not technical got a WhatsApp from his mom last week recommending he try Cursor. I feel like something's really reached escape velocity when your non-technical, non-engineering parent has somehow come across it and is like, I think this might help you with your work. And ChatGPT obviously has broken through, but yeah, I take your point that maybe most people don't realize how much potential there is yet. For what it's worth, I've been asking this overhyped-or-underhyped question of almost every guest so far, and it's interesting: the people who are closest to actually building with it almost all consistently say it's underhyped, and the people I interact with who hear about it through the news or other forums mostly think it's overhyped. I find that an interesting fact, that the closer you are to it, the more potential you seem to think it has. Maybe we're biased, because otherwise we wouldn't be working on it.

Hamel Husain (01:03:22): Maybe there's some small bias. It could be some bias introduced because it's so good at developer stuff that it's...

Raza Habib (01:03:30): Yes, that's true.

Hamel Husain (01:03:31): Totally changed development, 100%.

Raza Habib (01:03:34): I think there's a good reason for that, and it actually links back to what we said about evals at the beginning. Because you can write tests for software, it's easier to have something to iterate against than it is in lots of other domains. So I think it's easier to make progress quickly, including in fine-tuning and improving the models and the systems around them, versus domains where it's harder to define a correct answer and therefore harder to actually improve.

Swyx (01:03:57): I'll be the token Claude 3 fan here and call out: watch the videos that shipped with the Claude 3 announcement; there were four or five of them. One of them was coding, and watch how boring the coding one is compared to the others, because we are used to coding tools being really good and getting better with Copilot and all that stuff. So that's no longer super interesting; it's mostly, can you get off the waitlist of whatever the hot AI coding tool is at the moment. But they also had solving new genetics problems, new quantum physics problems, writing a very complicated economics essay for a college professor, and it starts nailing that and doing very complicated math. These are the new domains. We're sitting in the easy corner with coding examples: we know how to write tests, these things run really fast, and we have a lot of data on them. We don't have data on the other stuff. So it's underhyped to the quantum physicist in your life. And I think your wife's a doctor, right? My wife's a doctor. Have her run Claude 3 on some medical issue and see: she will be so much more impressed at what she runs and the results she gets than you will be, because we don't have the ability to ask those kinds of questions.

Hamel Husain (01:05:13): Yeah, my wife's a doctor as well. She's an electrophysiologist and she... can you get me one? No, no, no. But she spends a lot of time writing notes and filling out charts and things like that, and the technology just hasn't permeated to her.

Raza Habib (01:05:28): Yeah. I don't know what the reason is in the US, but I feel like in the UK, at least, the barriers to being able to build and sell technology into healthcare are extremely high because it's so regulated.

Swyx (01:05:40): There's currently an open bet on when an open-source reproduction of Claude 3 will be done. Currently it's at like 63% for

Raza Habib (01:05:47): Next year. My hope is, and this is maybe more wishful thinking than analysis, that if all it is is scale, then you just need the money, and if you have the money, you can do it. If it's a genuinely novel idea, then it might just be really hard for someone else to replicate. So maybe, actually...

Swyx (01:06:09): We may never know. That's the hardest.

Hamel Husain (01:06:10): Yeah. I'm also hoping for an open model.

Swyx (01:06:13): Yeah, people want it. And obviously it'll be great for healthcare; that's exactly why I brought it up, because for deploying in those sensitive environments you just need an open model. We'll see. We'll still have to eval it.

Raza Habib (01:06:26): Yeah, I mean, I think evaluating for specific applications remains pretty straightforward, but evaluating the models as general artifacts, I feel like it's getting harder and harder to meaningfully compare them.

Hamel Husain (01:06:37): Yeah. Benchmarks are kind of saturated too, and sometimes the benchmarks don't really translate into domain-specific performance.

Raza Habib (01:06:43): And also it's like comparing people. The models now have such wide general applicability that, okay, model X is better than model Y on this task, but is it better at all tasks? Having a wide enough range of evals becomes harder and harder, and the more complicated the things the model can do become, the harder it is for a human to actually do the eval. So I still think "is it good for my use case" is relatively easy to answer; "is this model just smarter in some weird abstract sense" may just not be well-defined enough to answer.

Hamel Husain (01:07:10): I think people struggle with evals. Another way they struggle is that they use a lot of off-the-shelf evals for their domain, like conciseness score, helpfulness score, whatever. They import these things...

Raza Habib (01:07:21): This is something I actually struggle with as a provider, because a question people ask us frequently in demos is, does it come with a library of evals? And I'm always like, no,

Swyx (01:07:32): That's the point

Raza Habib (01:07:33): Because we don't want to give you a library of evals that's going to be worse for you. And they don't like that answer,

Swyx (01:07:39): But why can't you give them a decent starting point that they

Raza Habib (01:07:42): Can fork? We do give them examples, but they still don't like the idea that they're going to have to do a lot of work to build custom evals for their application.

Hamel Husain (01:07:52): It's a very slippery slope. Usually when people come to me, they're like, hey, we have these evals, we listened to you, we have evals. And I'm like, but all these evals are generic, they don't really track what you're doing, and you haven't looked at your data. When you provide off-the-shelf stuff, it can quickly give someone a comfort blanket, like, hey, I checked the evals box, without going through the process of evals. They just apply the tool, and that's the failure mode it can encourage.
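As a contrast to generic off-the-shelf scores, here is a minimal sketch of what a domain-specific eval can look like: a handful of assertions derived from failure modes you actually observed in your own traces. The pipeline stub and the example checks are illustrative placeholders; the point is that they come from reading your data, not from an imported library.

```python
# A minimal sketch of domain-specific evals: simple assertions written after
# looking at your own traces, rather than imported generic scores.
# The pipeline stub and the example checks are illustrative placeholders.
import re

def run_pipeline(query: str) -> str:
    """Stand-in for your actual LLM application."""
    return f"Our reply to '{query}': the total comes to $42, thanks!"

# Each check encodes a failure mode you actually saw in your data.
CHECKS = {
    "no unfilled template slots": lambda out: "{{" not in out and "[insert" not in out.lower(),
    "quotes a concrete price":    lambda out: re.search(r"\$\d", out) is not None,
    "stays under 120 words":      lambda out: len(out.split()) <= 120,
}

def evaluate(queries):
    results = []
    for q in queries:
        out = run_pipeline(q)
        results.append({name: check(out) for name, check in CHECKS.items()})
    return results

print(evaluate(["How much is shipping to London?"]))
```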

Raza Habib (01:08:30): It's a balance to be struck. Cool, I think that's probably a nice place to wrap. Thanks so much for coming on; it was a really interesting conversation, and I think we managed to deliver on Swyx's prompt to us at the beginning to produce some new tokens. All right, that's it for today's conversation on High Agency. I'm Raza Habib and I hope you enjoyed our conversation. If you did enjoy the episode, please take a moment to rate and review us on your favorite podcast platform, like Spotify or Apple Podcasts or wherever you listen, and subscribe. It really helps us reach more AI builders like you.

About the author

Raza Habib — Cofounder and CEO — @RazRazcle on X
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.