
The Problem with RAG Terminology

By Raza Habib, Cofounder and CEO

This week on High Agency, Raza Habib is joined by Chroma founder Jeff Huber. They cover the evolution of vector databases in AI engineering, challenge common assumptions about RAG, and share lessons from Chroma's journey, including its focus on developer experience and observations about real-world usage patterns. They also get into whether we can expect a super AI any time soon and what is over- and under-hyped in the industry today.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts

Chapters

00:00 - Introduction
02:30 - Why vector databases matter for AI
06:00 - Understanding embeddings and similarity search
12:00 - Chroma early days
15:45 - Problems with existing vector database solutions
19:30 - Workload patterns in AI applications
23:40 - Real-world use cases and search applications
27:15 - The problem with RAG terminology
31:45 - Dynamic retrieval and model interactions
35:30 - Email processing and instruction management
39:15 - Context windows vs vector databases
42:30 - Enterprise adoption and production systems
45:45 - The journey from GPT-3 to production AI
48:15 - Internal vs customer-facing applications
51:00 - Advice for AI engineers
28:10 - Underappreciated AI developments

Podcast:

[00:00:30] Raza Habib: So I'm delighted today to be joined by Jeff Huber, who's the founder of Chroma, which is an AI native vector database, an open source AI native vector database. And we're going to chat a little bit about retrieval and vector databases and how that fits in for AI engineers. Pleasure to have you.

[00:00:45] Jeff Huber: Happy to be on. Thanks.

[00:00:47] Raza Habib: As we discussed just before, the goal of the podcast really is to try and help people who are themselves building AI products to understand better the best practices and learn from each other. I think you get to see across a wide range of companies what people are up to. So I thought what would be great to do is help people understand this very critical piece of the stack in vector databases especially, and then also just share some of the lessons that you've picked up from working with so many companies and some of the trends you've seen. So maybe to start with - why do we need a vector database as part of the AI stack? What's it doing for us? Why is it important to an AI engineer?

[00:01:30] Jeff Huber: There's a bunch of different ways to think about that, and obviously for different people different things click. Generally the way that people come to a vector database in the first place is they're building something with AI, specifically with an AI API, right? And they want to bring their data to the LLM. They want to figure out: this AI is really smart, but how can I make it know about my data?

[00:02:00] Jeff Huber: So the first ask might be "maybe I should fine-tune the model to make it learn about my data." And they pretty quickly realize that's not a good idea for those applications. Then they learn about this thing called a vector database that you can embed your data in and make it retrievable at inference time on a per query basis.

[00:02:30] Jeff Huber: In some sense, if you think about an LLM application as having both reasoning and memorization capabilities, the LLM is the reasoning - it's like the prefrontal cortex - and memorization is like storage. It's a simple analogy. For people who are more technical, one way to think about retrieval is that it's like semantic "if this, then that." So you can say "if this context, then these instructions."

[00:03:00] Jeff Huber: In our view, an LLM is not the techno machine god which is gonna solve all of these problems. It is in fact simply an information processing system that can process unstructured data in a common-sense fashion. You know, the real world is a messy place, right? And it's sort of very easy to get bugged by reality.

[00:03:30] Jeff Huber: The distribution of inputs to LLM systems can be very broad as a result. So retrieval becomes actually a very useful production tool as well because it allows you to cover for like more and more and more cases to kind of improve the reliability of your system.

[00:04:00] Raza Habib: If I was to kind of summarize it back to you, it's as you said, sort of providing a memory layer or a way to kind of augment LLMs with your own private information. So yeah, anything that's not in the public training data set, obviously, the model can't know about and you can't have much control over it.

[00:04:30] Raza Habib: And I guess something we've seen as well is like one of the reasons why retrieval augmented kind of LLM systems have become so popular is it's also a way to mitigate some of the downsides of LLMs. So things like hallucinations or models going wrong because you can make them more factual.

[00:05:00] Raza Habib: I think that the most basic 101 workflow of this is people come with some data, they embed it, and now at test time they're going to retrieve it and add it to the context of the model. I know that one of the selling points of Chroma is you guys made the developer experience super easy.

[00:05:30] Jeff Huber: The goal of retrieval ultimately is information retrieval. You're trying to figure out what is the most relevant information for this query at this point in time. And there are a variety of tools that give you different signals and these tools have different pros and cons.

[00:06:00] Jeff Huber: One tool which has been used for a very long time for information retrieval is lexical search, full text search - you know, find documents with the term "New York" in them and that's actually a good signal. So Chroma does full text search.

[00:06:30] Jeff Huber: Another tool that developers find helpful is the ability to do metadata filtering. So you could say, give me documents where the term New York shows up and where the type is PDF, for example. And then the third, newer tool is unsupervised, internet-scale, contrastively trained similarity models, called embedding models.

[00:07:00] Jeff Huber: In short, the new capability those allow is this sort of vibe-based similarity, right? You know, do these things share the same vibe? Two quick examples. One would be a user asking about the company's time off policy when we don't have any documents that mention the term "time off policy" - let's say you're European, so your HR knowledge wiki talks about the holiday policy. Those would actually be pretty close in embedding space, but of course full text search is going to fail on that kind of query.
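In Chroma's Python client, combining those three signals looks roughly like the sketch below. The documents, metadata, and query here are invented purely for illustration:

```python
import chromadb

client = chromadb.Client()  # in-memory; swap for a persistent or hosted client in practice
docs = client.get_or_create_collection("docs")

docs.add(
    ids=["doc-1", "doc-2"],
    documents=["New York office handbook ...", "London office handbook ..."],
    metadatas=[{"type": "PDF"}, {"type": "HTML"}],
)

# All three signals at once: semantic query, metadata filter, full-text filter
results = docs.query(
    query_texts=["where is the office?"],
    n_results=2,
    where={"type": "PDF"},                     # metadata filtering
    where_document={"$contains": "New York"},  # lexical / full-text search
)
print(results["documents"])
```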

[00:07:30] Jeff Huber: Another simple example, which we sometimes use in our demos: you load some documents, and these are just strings, right? You load the strings "apples", "oranges", and "pineapple", and then you query with the string "Hawaii" - what do you think is going to match?

[00:07:45] Raza Habib: I don't know. Pineapple?

[00:07:50] Jeff Huber: Yes, great! So this is like how most embedding models work, right? You query Hawaii, you get back pineapple. And that's a fun example of kind of like an associative leap of how terms are related. And of course, this all comes from web pages, you know what terms sit close to each other on web pages ultimately trains these models on what terms are related.
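Readers who want to reproduce that demo can do so in a few lines. This sketch relies on Chroma's default embedding model; the exact result depends on whichever model you use:

```python
import chromadb

client = chromadb.Client()
fruit = client.get_or_create_collection("fruit")  # uses Chroma's default embedding model

fruit.add(ids=["1", "2", "3"], documents=["apples", "oranges", "pineapple"])

print(fruit.query(query_texts=["Hawaii"], n_results=1)["documents"])
# Most general-purpose embedding models will surface [['pineapple']] here,
# because "Hawaii" and "pineapple" sit close together in web text.
```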

[00:08:15] Jeff Huber: Chroma does a little bit of magic. We don't want to do too much magic, and we strongly believe in convention over configuration. But we also want to give developers the tools to easily break out of the box when they don't want to use the defaults they get out of the box.

[00:08:40] Jeff Huber: So Chroma comes prepackaged with a default embedding model for both Python and JavaScript. The default embedding model is the ONNX build of the Sentence Transformers all-MiniLM-L6-v2 model. It's like 70 megabytes, it downloads really fast, it's pretty good on the benchmarks, and it kind of gets you going.

[00:09:05] Jeff Huber: I think that for most production applications, they probably don't use that model. But again, our goal is to help people ramp up the ladder of complexity and abstractions. And so not having to think about the embedding model when you're getting started is really fun.

[00:09:30] Jeff Huber: And I think quickly you realize like, "OK, I need to get more sophisticated about these different components and upgrade them relative to my application goals."
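Upgrading one of those components looks roughly like the sketch below. The embedding functions shown ship with the Chroma client, but the specific model names are only examples, and the hosted option assumes you have an OpenAI API key:

```python
import chromadb
from chromadb.utils import embedding_functions

# The built-in default: a small local model, great for getting started
default_ef = embedding_functions.DefaultEmbeddingFunction()

# A hosted alternative you might move to in production (model name is just an example)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",                      # placeholder
    model_name="text-embedding-3-small",
)

client = chromadb.Client()
collection = client.get_or_create_collection("docs", embedding_function=openai_ef)
```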

[00:09:45] Raza Habib: Let's sort of increasingly ramp the complexity across this conversation, but let's cover the basics just so we make sure everyone comes along with us. Let's just do some quick 101 things - what is an embedding? Where does it come from?

[00:10:00] Jeff Huber: An embedding is fundamentally an array of floats or an array of numbers. And it is the output of an embedding model which is a machine learning model. Most of these models that you will colloquially hear about take text in on one end and output the array of numbers on the other end.

[00:10:20] Raza Habib: And I guess they might be taking in other modalities as well?

[00:10:25] Jeff Huber: That's right. There are models that do images, and there are embedding models that do both images and text in the same domain. But most people will use this first kind - you input text and you output an array of numbers, and that array of numbers encodes, again, the vibe of that text.
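As a small illustration of "text in, array of floats out", here is a sketch using the open source sentence-transformers library; the model choice and example strings are arbitrary:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # text in -> array of floats out

vecs = model.encode(["our holiday policy", "time off policy", "quarterly revenue report"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vecs[0], vecs[1]))  # relatively high: same "vibe"
print(cosine_similarity(vecs[0], vecs[2]))  # lower: different topic
```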

[00:10:45] Raza Habib: And I think something that's helpful for people to understand is how those embeddings are trained, in some sense. When you're training the model to take in text and spit out numbers, the way the process works is you start off with an untrained model - so text comes in, random numbers come out - and then it's iteratively refined through learning to make the embeddings meaningful in some way.

[00:11:15] Raza Habib: And I think a little bit of an understanding of that process is helpful because it gives you a mental model of what to expect. You know, when I put in "pineapple", what would come back? There are reasons why some things end up close and others far away.

[00:11:35] Jeff Huber: Yeah, I think the intuition here is for people that have looked at Karpathy videos on transformers or whatever, right? Like you know, what are these large models doing? They're slurping in like an internet scale of tokens and then they're using the internet scale of tokens in the LLM case to do like next token prediction.

[00:12:00] Jeff Huber: Embedding models, you know, it's not the same thing, right? But it is similar in the sense that it is slurping in like internet scale of data and in some ways it's looking to associate like which tokens are likely to sit close to each other on a web page.

[00:12:25] Jeff Huber: The very earliest versions are just like counting things or like N-gram based embedding or something like that. But then people start to do things like, hey, look at the context that surrounds a word and words that appear in similar context should have high cosine similarity.

[00:12:50] Jeff Huber: And I mention these things just because I noticed there's a whole bunch of design decisions that people are making when they're using these vector databases. They need to know about like what distance measures to use, like how to retrieve things, like how nearest neighbor search works. There's a bunch of concepts that are probably new to people.

[00:13:15] Raza Habib: And actually, if you don't know how the embedding model was trained, it's actually quite hard to know which one of these you should use. And knowing which embedding model is being used also kind of gives you a mental model for what things might end up close to each other.

[00:13:40] Jeff Huber: Yeah, yeah. I think we're in this interesting period where like, it is an unfortunate fact that like most people need to learn eventually, at least mental models to reason about these things. You know, sometimes we like to relate it back to like kind of the origins of like the relational database and SQL.

[00:14:05] Jeff Huber: And, you know, in the early days, you know, maybe you sort of had to know how B-tree worked in order to build a good application, right? Or you had to have a good sense of like the physical layout of the query plan in order to like write a good application.

[00:14:30] Jeff Huber: And, you know, eventually that stuff was essentially abstracted away from the majority of people. I think that's kind of already happening with these systems as well. It's getting there, it's not there yet, but I think it will get there.

[00:14:55] Jeff Huber: And so it's always interesting, when laying out that educational roadmap for developers coming into the space: what are you telling them they need to learn to be productive? And obviously each additional thing is not free, right? With each additional thing you also have attrition of people who drop out of the process because it becomes too much.

[00:15:20] Jeff Huber: And so where to draw that line is really interesting, and how that changes over time is really interesting. It's not much anymore, right? If all you remember is that text comes in and an array of floats comes out, and that if the distance between those arrays is small for some distance measure, then hopefully the things mean similar stuff - I feel like that's the core of it, and the details will change.

[00:15:45] Raza Habib: 100% agree. OK. So we've covered like the fundamentals of just like what are embeddings? Like where does it come from? Why we might want vector databases for AI? Now I understand a little bit more about Chroma specifically.

[00:16:10] Raza Habib: So it wasn't the first vector database to exist, there's a landscape out there. Can you help me understand - why Chroma and how does it compare? I think a lot of people are trying to make these decisions right now about doing their stack, what are the tradeoffs and choices they might be making?

[00:16:35] Jeff Huber: The reason we built Chroma was, you know, we were working on some ideas related to embedding space specifically - actually, at the time, analytics in embedding space to help you do better data selection for ML training workflows.

[00:17:00] Jeff Huber: Right after ChatGPT launched in November 2022, we saw how developers were using this like vector search thing to build useful things with AI. We pretty quickly called our shot that this was going to be really important and that this was going to be like the kind of technology which goes from zero people using it to every technology company on earth using it in the matter of a few short years.

[00:17:30] Jeff Huber: I actually have an email written to one of our investors at the beginning of January '23 laying this out. We looked at the landscape, and it wasn't just our own opinion - we actually talked to like 30 or 40 developers. One of the story points here was that at the end of December '22, I DM'ed every single developer who had tweeted about either embeddings or LangChain in the entire month of December and who had their DMs open on Twitter.

[00:18:00] Jeff Huber: And it was like 150 people and like today, it would be like 4 million, but it was 150 people. And then in talking to those users and learning about like what are you using? Like, do you like it? Do you not like it? It became very obvious, very quickly that the tools that existed to date were designed for different use cases.

[00:18:30] Jeff Huber: Existing solutions were usually built for web-scale semantic search or for recommendation systems. They were built for a use case where you have one large index which is always online and which you batch-update once a day. They were very complicated to use, right? It's hard to set up all the different features and flags and knobs to tune.

[00:19:00] Jeff Huber: And I think the thing that we observed was: that's fine if you are a search engineer, an infrastructure search engineer potentially. But if you're an application developer, that's just not the right shape of product. And also, as we would later learn, AI applications typically don't have one large index - they very commonly have one small index, or many small indexes.

[00:19:30] Jeff Huber: So that was why we built Chroma - we started the company to really help developers and engineers bridge the gap between demo and production in building applied AI systems. And we realized that this was, again, a really important thing that didn't exist and needed to exist.

[00:20:00] Raza Habib: So the primary differences from what was there already were, one, the developer experience - trying to strip out some of the configuration that maybe isn't relevant for AI - and you mentioned also the workload being a little bit different. Can you talk to me a little bit more about that and help me understand it in more detail?

[00:20:25] Jeff Huber: Yeah, so I think like developer experience commonly is underrated by like data professionals and there's sort of an academic overhang and there's reasons that this is the case but like, you know, if you've worked in application development software before, right? You're like, why did React win? It's not because React is better but it had better developer experience. Why is Ruby on Rails really popular? It's not because it's the fastest. It's because it has the best developer experience.

[00:21:00] Jeff Huber: I think people commonly get tripped up on what is the fastest thing or what is the most cost efficient thing. Those things are very important - you do have to think about them - but people always somewhat underrate ease of use. And we really cared, again, about people building valuable things with AI and machine learning.

[00:21:30] Jeff Huber: The thing that we quickly learned in the last year - so, Chroma collects anonymous telemetry; it's very easy to turn off, but it is on by default - is that studying that anonymous telemetry taught us about the workload shape for AI applications. Specifically, AI applications almost always scale quite high in the number of collections or indexes that they have.

[00:22:00] Jeff Huber: So there is a logical sharding, or a logical separation, between this user's data and that team's data. Secondly, these indexes were not billion scale, they were not 10 billion scale - they were typically 10,000 or 100,000 or a couple million, but they were not getting to billion scale.

[00:22:30] Jeff Huber: And then thirdly like these applications ended up becoming quite like cost sensitive. And so you've heard people complaining about their bills from various vendors on the internet before. And sort of fundamentally because vector search happens in memory, it has to be in memory and memory is very expensive.

[00:23:00] Jeff Huber: And so to keep everyone's data always in memory is an insanely expensive proposition. All of these insights led up to us designing this new open source version of Chroma - it doesn't replace the single node version, and it has the same API as the single node version - which is a fully distributed and serverless version of Chroma designed for this retrieval workload shape specific to AI applications.
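The "many small indexes" workload Jeff describes maps naturally onto one collection per user or team. A hypothetical sketch of that pattern with the Chroma client, with made-up users and documents:

```python
import chromadb

client = chromadb.Client()

def collection_for(user_id: str):
    # One small collection per user or team: the logical sharding described above
    return client.get_or_create_collection(f"user_{user_id}")

collection_for("alice").add(ids=["n1"], documents=["Alice's meeting notes about the launch"])
collection_for("bob").add(ids=["n1"], documents=["Bob's notes on the billing migration"])

# Each query only touches that user's own index
print(collection_for("alice").query(query_texts=["launch plan"], n_results=1)["documents"])
```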

[00:23:30] Raza Habib: And what does that mean for me as an end user? Like what would I experience differently?

[00:23:40] Jeff Huber: Yeah. So what we hear from users is the following. Number one, users are struggling with their caching story. If they're using another provider where all users' data is always in memory, that becomes insanely expensive. So either you just eat the bill - but it's really painful, and maybe it prevents your company from scaling because it's so expensive - or you start to devise a bunch of complicated caching technology that you invent yourself to decide where data lives, when, and how. Chroma solves all those problems for you.

[00:24:15] Jeff Huber: The other thing we hear from customers is that a lot of alternatives to Chroma are intrinsically single node systems. And so like that's fine until you want a bigger box and then that's painful and then you want two boxes and then it's like 100 times more painful. The single node to distributed story for a lot of these alternatives to Chroma is - we're just hearing about the operational hell that people are going through trying to make that work.

[00:24:45] Jeff Huber: And so like, what's great about a system which is like natively distributed is that it handles all the segment placement for you automatically. And so like, if you want to go from one node to 10 to 100 to 1000, you don't have to do anything differently. It just seamlessly scales up, which is good for companies that don't want to have like four engineers specialized only in making this one thing not fall over.

[00:25:15] Raza Habib: OK, that makes a lot of sense. OK. So we've covered a little bit about vector databases, why Chroma specifically. You probably get to see a lot of different AI use cases out there. What are some of the more interesting ones that you're seeing? Like what are something that have caught your eye? I think everyone knows about RAG and we'll probably go into RAG in a bit more detail. But are people using these in more creative ways?

[00:25:45] Jeff Huber: Yeah, I mean, we don't like to use the term RAG. We think it doesn't sort of mean anything.

[00:25:50] Raza Habib: Tell me, tell me more. I want to know why you don't like the term.

[00:26:00] Jeff Huber: Yeah. So like retrieval augmented generation is, you know what RAG stands for and if you sort of reason about that from first principles and you sort of compare it to traditional software, it's sort of equivalent to saying like database augmented programming, which like, that's just programming. Like, why are we giving it an acronym?

[00:26:30] Jeff Huber: You know, in some ways it's just like, it's mixing things together that should be separate. It's not a useful concept. It should be just, there's the retrieval and there's the generation - we don't need to put them together into an acronym as like kind of architecture.

[00:26:50] Raza Habib: OK, it's interesting. Yeah. The origins of that are that when RAG was first introduced, it was a single system. And in fact, people do often train systems like this - oftentimes they're actually training the model jointly with the retriever. But I take your point.

[00:27:15] Jeff Huber: Yeah. And maybe even in-context learning, I think, is a better term than RAG. In-context learning helps a developer reason about what this thing is doing - although even the "learning" part is, as you know, a bit of a misdirection; it's a bit misleading, if you will.

[00:27:40] Jeff Huber: Sometimes we use the analogy of like a punched card computer. We're in like the punched card computer era of this stuff and like literally what was happening is you have like a sheet with a bunch of holes on it and you put your instructions in there and you put your data in there onto this punch card and you put it into the magic box and the magic box computes over it and then spits out the other side, a new card.

[00:28:10] Jeff Huber: And like, that's literally like in some ways what a context window and what an LLM is doing. Obviously computing evolved, sort of away from that paradigm and became much more like dynamic in real time so that like the compute can itself go and like, get instructions from other parts of the system.

[00:28:35] Jeff Huber: Quite a few of our customers do search type applications. So we have people like Blue Pool doing code search, or they used to be doing code search. Now they actually do something which is pretty cool, which they take legacy FORTRAN and COBOL code bases and they rewrite them in modern languages automatically.

[00:29:00] Jeff Huber: The traditional search architecture is I have a retrieval step and then I have a generation step, and they've gone much closer to what you were describing and they're not the only ones I've seen do this. Many have done this. But I think they're a great example of making it work really well, which is that they just make the retrieval API a tool that can be called by the model in the way that OpenAI does function calling.

[00:29:30] Jeff Huber: So instead of there being a fixed retrieval step where something is just retrieved, the model gets a query and, on the basis of that query, the model decides what to embed, right? It's not just embedding the user query, because the user query may not be worded in a way that's semantically similar to anything in the embedding space. And if you're doing code search, it's definitely not, because the user didn't ask the question in code.

[00:30:00] Jeff Huber: And then it's able to do as much retrieval as it wants, across whatever it wants, because the model is making those decisions, and then they finally generate. That's, I guess, closer to what you were suggesting - a kind of dynamic lookup rather than retrieval followed by generation.
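A rough sketch of that pattern - retrieval exposed as a tool the model can call, rather than a fixed retrieve-then-generate pipeline. This is not Blue Pool's actual implementation; the tool name, model, and collection are made up, and it assumes the OpenAI Python SDK plus an already populated Chroma collection:

```python
import json
import chromadb
from openai import OpenAI

oai = OpenAI()
code_index = chromadb.Client().get_or_create_collection("code_chunks")  # assume already populated

# Expose retrieval as a tool: the model, not the developer, decides what to embed
tools = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Semantic search over the code base. Pass whatever query text "
                       "you think will embed closest to the relevant code.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Where do we validate billing addresses?"}]

while True:
    resp = oai.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # final answer, produced after as much retrieval as the model wanted
        break
    messages.append(msg)
    for call in msg.tool_calls:
        query = json.loads(call.function.arguments)["query"]
        hits = code_index.query(query_texts=[query], n_results=5)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": "\n\n".join(hits["documents"][0]),
        })
```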

[00:30:25] Raza Habib: The linchpin there is like LLMs obviously have this current Achilles heel, which will be solved, which is they don't exactly know what they don't know.

[00:30:40] Jeff Huber: Yeah, but if they were good at knowing what they didn't know, then that sort of LLM-directed retrieval would be, I'm thinking, a much more viable prospect. You asked about use cases. You know, I think the chat use case - the ability to have a natural language conversation over unstructured data - is global, right? It is broadly useful; it can be applied to almost any application.

[00:31:10] Raza Habib: Chat with exactly your documents, exactly.

[00:31:15] Jeff Huber: And so obviously that is just everywhere, and I think for good reason, right? People love to make fun of it as being kind of simplistic, almost caveman AI or something like that, but it's useful, and it's great because it's also such an easy starting point for so many organizations to build expertise with the technology.

[00:31:45] Jeff Huber: I think like one use case which I think I would love to see a lot more of is, you know, again, going back to how I described retrieval earlier in the conversation - there's both like kind of just like bring your data to the LLM and make it aware of the data, that's the memorization part, whatever, right? That pretty much encompasses the chat related use case.

[00:32:15] Jeff Huber: The other use case that I think is interesting is like the ability to iteratively improve LLM systems. Some people would use the term agent - I'll try to avoid it here - iteratively improve an LLM system by giving it natural language feedback over what you want to do more of or less of.

[00:32:45] Jeff Huber: So the canonical example that I could pull from here is you want to create - I'll use the term agent - an agent to process your email, right? It's gonna take in your email - here's a new message - and then it needs to decide what to do with that message. Do you archive it? Does it skip the inbox? Do you apply a label? Do you reply to it? Do you draft a reply for the human to send? Do you automatically send the reply? Do you go out and grab calendar information from somewhere else? Right? Scheduling.

[00:33:15] Jeff Huber: And I think like, humans have a pretty complex rule set in their brain of like what they want to happen for, especially their inbox - kind of a personal thing. And so like, I think the ability for a human to give that agent feedback on like, "hey, next time do this" - you know, if you get an email from my mom, star it and don't reply to it.

[00:33:45] Jeff Huber: You know, if you get like an email asking me if I want to buy a lead list for the dental association - you know, as a co-founder you probably get similar spam emails, right? You know, instantly archive that. I don't care about that.

[00:34:15] Jeff Huber: And that is, in some sense, building up a corpus of poor man's rough instruction sets, right? You're saying, if this context happens again, then do this thing. I think that's a very powerful use case, and you can kind of think about it as analogous to the agents you've seen running around in Minecraft. So, where am I using the vector DB here?

[00:34:45] Jeff Huber: Well, the question is, how do you store those instructions in a way that is retrievable, right? You can just pack them all into the context window as one option, and your context will get larger and larger and larger. But we find that developers appreciate having a little bit more control - fine-grained control - over what instructions are pulled for which context.

[00:35:15] Jeff Huber: And having this human-interpretable space that you can actually iterate on - a retrieval system gives you that ability. Versus if it's all in the context window, how do you edit that, right? Do you just say "no, really, this instruction, I promise" in all caps, or, you know, "grandma is going to die unless you pay attention to this one"?
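One way to sketch that instruction store with Chroma - the rules and email below are invented, and in a real system the retrieved instructions would feed an LLM call that decides what to do with the message:

```python
import chromadb

client = chromadb.Client()
instructions = client.get_or_create_collection("email_instructions")

# Feedback stored as semantic "if this context, then these instructions"
instructions.add(
    ids=["rule-1", "rule-2"],
    documents=[
        "If the email is from my mom, star it and do not reply.",
        "If the email is a cold pitch selling a lead list, archive it immediately.",
    ],
)

def relevant_instructions(email_text: str, k: int = 2) -> list[str]:
    hits = instructions.query(query_texts=[email_text], n_results=k)
    return hits["documents"][0]

incoming = "Hi! Interested in a verified lead list for the dental association?"

# Only the rules that match this email's "vibe" go into the prompt,
# instead of packing every instruction into the context window.
prompt = "Instructions:\n" + "\n".join(relevant_instructions(incoming)) + "\n\nEmail:\n" + incoming
print(prompt)
```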

[00:35:45] Raza Habib: You know, it is all still just going into the context window, right? You're retrieving it, so it's still just text in the context. It's the ability to distill down the instructions to just those that are relevant.

[00:36:15] Jeff Huber: Language models today - this could change - also commonly get distracted by things, especially over long context windows. So the ability to pare down the instruction set to only the instructions that are relevant to that context is faster, cheaper, more controllable, more steerable.

[00:36:45] Raza Habib: Maybe that's something we can double click on for a little bit because I think this is something that people who are architecting systems today who are thinking about stuff for the future - like a certain question that comes up often, which is like, how should they think about increasing context lengths versus, you know, putting more things into a database?

[00:37:15] Raza Habib: So, you know, the latest generation of AI models might have millions of tokens of context. I was actually at a conference a couple of weeks ago with one of the DeepMind researchers, and they've started just putting what is effectively an enormous index, as free text with some metadata, straight into the context window and caching it with prompt caching - which is effectively now a vector DB, because once it's cached it's stored as vectors, right in the memory of the model.

[00:37:45] Jeff Huber: Yeah. So, you know, we use large context windows for stuff internally. For example, take all of our GitHub issues, feed them all into Gemini's 2 million token context window, and ask it to summarize the issues that are important - that kind of stuff.

[00:38:15] Jeff Huber: So like we'll use large window stuff and it's, you know, it is pretty powerful. Like one way to reason - my analogy is like, OK, well, like we know what the von Neumann architecture looks like, you know, maybe there's an equivalent here. And so there's sort of like the CPU, the memory and the disk and they're good at different things, they had different like cost profiles.

[00:38:45] Jeff Huber: And so that analogy extending to AI is like, you know, if you don't have to put everything into memory, then like, maybe you shouldn't just because again, it's going to be like very slow and costly. I think those are relevant. Obviously, the models will get faster and cheaper.

[00:39:15] Jeff Huber: And so maybe those concerns become an order of magnitude less relevant every year, and maybe caching makes it matter less as well. But even if caching makes it matter less, we're also going to want a lot more context, right? You know, it's kind of the Bill Gates quote which I can never quite get right - nobody ever wanted more than 128 kilobytes of RAM, or whatever his quote is.

[00:39:45] Jeff Huber: But I still think the steerability part is the most important part - fundamentally, the control over these systems. There's still a gap between demo and production, right? And so to get to 99 or 99.9% reliability, you kind of want to use all the tools at your disposal to get more and more reliability.

[00:40:15] Jeff Huber: And removing distracting information from the context window and having surgical control over what information goes in is, I think, a very useful tool for getting to the level of reliability that most companies want to get to. I'd be curious to hear your thoughts on this as well - what we hear is that a lot of enterprises are getting stuck and not making it to production with applications; they're getting stuck in kind of a POC, demo phase.

[00:40:45] Jeff Huber: And of course, this is a tale as old as time in machine learning: if you have clean data, a demo is easy, but production is still insanely hard. It's been this way for 15 years; that's not a new thing. I think the part that we really worry about is enterprises never getting into production. But what are your thoughts?

[00:41:15] Raza Habib: I've gone back and forth on this a little bit and I, you know, I've spoken to a lot of people smarter than me to try and kind of understand from them. And so the arguments that I've heard in favor of, like, why we'll still need retrieval even as contexts get a lot longer - some of it is what you just said, which is like as the contexts get longer, our ability to just fill them with more stuff... there's almost an inelastic...

[00:41:45] Raza Habib: The hedonic treadmill of elastic context windows. Yeah, like make it bigger and now I can put an entire video in there. I can put my entire code base in there. I can put books in there, you know, like people keep filling it.

[00:42:00] Raza Habib: The other argument that I've heard is that, you know, people often want to do a lot of like filtering and reranking on the things that they're retrieving before they put them in context. So if I've got a big document store and I'm searching over it, like I maybe want to get like the documents for a particular date range or I want the documents that, you know, like give me the legal contract from last week.

[00:42:30] Raza Habib: There may be four different versions of the document with the same title, and so if this is just in context, am I guaranteed to be getting the right one? So there's filtering and metadata on top of everything, and a large fraction of search queries are much more like that than they are just a string being matched.

[00:43:00] Raza Habib: They have some metadata that you want to retrieve over. The thing that's most surprising to me from this conversation with the researcher at Google was the extent to which they feel that that's being solved in context by just putting the metadata into the text. And that was surprising to me and I'm still skeptical that it's reliable enough that people would prefer to do that.

[00:43:30] Raza Habib: And so I think if you're building today, like I would still 100% like, think you're much better off like putting it into a database and having a guarantee that you're filtering appropriately and putting the right thing into the context. Like for me, it still remains an open question as the models get better and the context windows get longer and their ability to follow instructions get better.

[00:44:00] Raza Habib: Do I still actually need to store this? Can I just use plain text? Plain text is a very cheap way of storing stuff. I can manipulate it quite easily, and I have tools to work with it. So I honestly feel uncertain. It depends on how quickly the models improve. Today, definitely. How many years before that changes, or does it ever change? I'm less certain.

[00:44:30] Raza Habib: And then on the question of whether enterprises are getting to production - absolutely yes. We were similar to you guys, I think, a little early to the space; the company is four years old and we had our first LLM product four months before ChatGPT came out.

[00:45:00] Raza Habib: We've always been focused on evaluations and then helping people find the right prompts and the right components once they have evaluation. So kind of an iterative development cycle for building with LLMs or building with AI. And when we started like in the beginning, it was similar to what you said at the beginning where you messaged everyone and there were only 150 people on Twitter or whatever who had mentioned embedding or LangChain.

[00:45:30] Raza Habib: We tried to find like every company who was building on GPT-3. And we reckon it was about 300 companies at the time. And obviously, now, as you say, it would be millions. And early on everyone was a startup, there was like almost no one that we were speaking to who was like a very large company at that point in time who was building on GPT-3 - just wasn't reliable enough, it was too new and people didn't know about it.

[00:46:00] Raza Habib: And then the story of the last year and a half for us has been gradually seeing larger and more serious companies first experiment - there was a wave of POCs - and now actually start to get things to production. And the things that are getting to production are still pretty sane for most companies.

[00:46:30] Raza Habib: So people are still just touching the very beginning of what's possible. The first things people are building are good customer support automation and tools augmenting their workers with those things, and they're building better search systems that are augmented.

[00:47:00] Raza Habib: But we're also seeing people build more interesting stuff. People like Ironclad have started to build automatic negotiation systems. For some of Ironclad's top customers now - and you'd think of legal tech as a high-stakes place - something like 50% of their contracts are auto-negotiated.

[00:47:30] Raza Habib: So there's some kind of playbook: a contract comes in, the AI suggests all the redlines. Yes, a lawyer will review it and sign it off, but the core work is pretty much fully automated there. Legal tech is an area where we see a lot of applications, along with a lot of educational tech applications - language learning, helping students with homework.

[00:48:00] Raza Habib: You know, that's an area where it's actually kind of existential for those businesses - if they don't adopt these technologies, they may not be able to stick around. But yeah, these are large enterprises in production. Macmillan is another one in the ed tech space and one of our customers; they have an AI tutor. So yes, people are getting real things into production with real users.

[00:48:30] Jeff Huber: I guess, not to turn this around and ask you questions, but do you find that the majority of use cases... what is the percentage breakdown of internal use cases, where there's a human in the loop, versus end-user facing use cases, where there is no human in the loop?

[00:49:00] Raza Habib: We are much more end-user focused. Most of our customers are building something end-user or customer facing. And the reason it works out like that is, as I said, we're LLM evaluation for enterprises. So when does evaluation become critical? When people are building things that are internal, it's more of a nice-to-have than a hair-on-fire problem.

[00:49:30] Raza Habib: I see. So they might get evaluations in, but they're also more happy - if it's only going to be consumed by people within the company - to learn a little bit more by YOLOing something into production there. But as soon as it's customer facing, you really care about understanding: is this actually good? How do you improve it?

[00:50:00] Raza Habib: So we have a sample bias towards people who are shipping things into production. So, so I don't know whether that's representative of like the true reality of usage. My guess would be probably there's more internal usage than there's production usage, but our customers are very much production focused.

[00:50:30] Raza Habib: Great. We discussed this just before the podcast - AI engineering is actually a pretty new field. Machine learning has been around for a long time, but AI has changed the way people are building, and I think people are still evolving best practices. You're seeing your own kind of corner of it. What advice and lessons have you learned that you think are general, that an engineer trying to start building with AI today might adopt?

[00:51:00] Jeff Huber: Obviously, it sort of depends on where the engineer is in the lifecycle. I think first, you know, build a small thing where you understand what every line of code does. It should be like no more than 50 lines of Python - that's a very good start. The way that most people learn is not by reading theory, but by getting hands-on experience.

[00:51:30] Jeff Huber: Applying the frame that what AI is good at is processing unstructured data in a common-sense way: what unstructured data do you need processed in a common-sense way, either at your workplace or at home? That could be processing the logs from your tech system to help you identify what's going wrong with a certain bug, or it could just be a way to semantically query your diary.

[00:52:00] Jeff Huber: Now you have a small thing, I hope you're getting some utility out of it, and then you're observing when it doesn't work. Keep a log of those things - either literally write it down in a notebook or automatically keep a log of "this input did not yield the output that I desired."

[00:52:30] Jeff Huber: And then from there, look at your data, right? The "look at your data" meme is viral for a reason. It's because you should look at your data. Look at the spectrum of like, well, what inputs work well and what inputs do not work well. And then reason about how to improve the inputs that are not working well.

[00:53:00] Jeff Huber: And it's probably going to be something about either changing the prompt, or - if you're using retrieval - changing some element or hyperparameter of your retrieval system, changing the embedding model, changing the chunking strategy. And, you know, get a feel for how you would even evaluate whether something is better or worse.
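Putting that advice together, here is one possible shape for such a small project - a diary you can query semantically, plus a crude failure log to review later. Everything here (the path, the entries, the logging scheme) is illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="./diary_db")
diary = client.get_or_create_collection("diary")

diary.add(
    ids=["2024-05-01", "2024-05-02"],
    documents=[
        "Long bike ride along the river, then fixed the flaky CI pipeline.",
        "Dinner with Sam; we argued about whether RAG is a useful term.",
    ],
)

def ask(query: str) -> str:
    # Return the single most similar diary entry
    return diary.query(query_texts=[query], n_results=1)["documents"][0][0]

question = "when did I do some exercise?"
answer = ask(question)
print(answer)

# "Look at your data": keep a log of the cases that didn't work, so you can
# later inspect them and tweak the prompt, chunking, or embedding model.
if input("Was this the entry you wanted? (y/n) ").strip().lower() != "y":
    with open("failure_log.txt", "a") as f:
        f.write(f"QUERY: {question}\nGOT: {answer}\n\n")
```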

[00:53:30] Jeff Huber: Today, at least like AI is not this deus ex machina, you know, it is just software. And so you should expect that you should build, test, deploy, monitor and iterate AI-based systems because it is just software. The same loop that you've been doing for a long time, you will do now as well.

[00:54:00] Raza Habib: Alright, that's it for today's conversation on High Agency. I'm Raza Habib, and I hope you enjoyed our conversation. If you did enjoy the episode, please take a moment to rate and review us on Spotify, Apple Podcasts, or wherever you listen, and subscribe. It really helps us reach more AI builders like you. Thanks for listening!

About the author

Raza Habib
Cofounder and CEO
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as "the most transformative technology in our lifetimes" after studying under Prof David MacKay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI - applying AI to mechanical engineering - and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.
Twitter/𝕏: @RazRazcle

