Building the First LLM Search Engine

Raza Habib

In this episode, we sit down with Michael Royzen, CEO and co-founder of Phind. Michael shares insights from his journey in building the first LLM-based search engine for developers, the challenges of creating reliable AI models, and his vision for how AI will transform the work of developers in the near future.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts

Podcast:

Raza Habib (00:39):

This is High Agency, the podcast for AI Builders. I’m delighted to be joined today by Michael Royzen, who’s both a close friend and also a bit of a prodigy when it comes to machine learning. He was one of the younger interns at Microsoft Research as a 17-year-old, started his first machine learning research company straight after that, and he’s now the CEO and Co-founder of Phind, which is an answer engine for software engineers. They were the first company in the world to have an LLM-based search engine available, and they’re backed by Paul Graham and Y Combinator. Now, over 90 million answers have gone through Phind, and they have millions of active users. So, Michael, it’s a pleasure to have you on the show.

Michael Royzen (01:17):

Thank you, Raza. It’s great to be here.

Raza Habib (01:19):

So Michael, to start with, for our audience, just help them understand what is Phind, how do you describe it to someone new?

Michael Royzen (01:23):

Yeah, so Phind is a tool that helps developers get from an idea in their head to a working product. That’s the core of what we do. And so we started really as an answer engine, like you described, when we were actually the very first LLM-powered answer engine. A user puts in, like, “Hey, how do I do this?” or “Hey, this isn’t working,” and we’ll do an internet search. We’ll figure out what’s relevant to answering your question. Then we’ll use a large language model to synthesize all that information into a concise and hopefully correct answer to your question. Now we’re moving more broadly into how can we integrate that into your workflow as a developer as deeply as possible to really help you go from, “Hey, I want to build this thing,” to actually having it built.

Raza Habib (02:08):

Okay, so idea to product could mean a lot of different things. Give me an example. Walk me through a particular use case.

Michael Royzen (02:14):

So you say that, “Hey, I want to build a Next.js app that takes in a user’s question, and I want it to go out and get all these sources, synthesize them, and send them to an LLM.” One of our internal tests is actually being able to build Phind using Phind. And so we kind of enjoy that one because it’s a measure of how usable the product is.

Raza Habib (02:37):

So the idea is that developers should be able to come in with an idea, say to Phind, “Hey, I want to build X,” and then it’s going to try and instruct them on how to do that. It’s not going to build it for them necessarily, but it’ll maybe write some of the code and talk them through the process.

Michael Royzen (02:50):

So the current product that’s available online is really kind of more chat-based, but we’re working on something. By the time this is released, I think we’ll have something to say about it publicly. We’re working on something that will actually write the code for you, actually create the files and test them and orchestrate all of that. And so we started with this more chat-based UX. It was very natural, and that was kind of what the technology was limited to at the time. But now we’re very quickly moving towards having the LLM, kind of like a Tesla autopilot, take over more and more of the dirty work and the stuff that we as programmers don’t really find fun—fixing things that aren’t working—and instead shifting the fun things like design and figuring out how this is going to work over to the human so the human can still do the things that they find fun. Then it’ll kind of be like the Tesla autopilot for getting your project working, rather than, say, a Waymo type analogy that just completely doesn’t ask you for feedback at all.

Raza Habib (03:46):

Okay. So it’s a coding assistant that’s going to help me go from idea A to developed product. Obviously, there’s been an explosion of different AI coding tools recently starting from GitHub Copilot, which I think was the first really big commercially successful LLM product. But now there’s Cursor, there’s Cody, there’s a whole bunch of things out there. How is Phind different from those? What are you trying to do that sets you apart?

Michael Royzen (04:09):

So what we’re doing right now is we’re really reimagining what coding should look like from the ground up. So I think those other companies have done things that are pretty interesting, but I think they’re fundamentally limited by being shoehorned into an existing IDE. What we’re working on right now is really taking the Phind magic and having a native app on your computer built from the ground up for this AI-native idea-to-product workflow. Rather than being limited by kind of what people currently think an IDE is, we have the ability to completely reimagine all of this from an AI-first perspective because our hypothesis here is that what we’re really kind of limited by is the AI and AI progress, and it’s on this very steep improvement curve, whereas the IDE has already been invented. We don’t want to reinvent the IDE as it is. We want to be able to just kind of take the first principles experience that we want to provide and then rethink it completely from the ground up. So that’s what we’ve been working on.

Raza Habib (05:06):

So if you are going to build an IDE from the ground up for developers, so kind of an AI-first-from-scratch IDE, obviously that’s going to come with a lot of cost in the sense that developers are already familiar with their existing IDEs. There are a ton of keyboard shortcuts. They understand how to work within them. What are you imagining with first principles from AI that you can do differently, or that’s going to be fundamentally different as a product experience that makes you think it’s justified to reinvent the whole product rather than integrate the AI into the existing workflow?

Michael Royzen (05:37):

The biggest idea here is that developers will be writing code less, and IDEs are basically solely optimized for writing code. An IDE is designed purely for like, okay, here’s this code I want to write, how can I write it? And what we’re a lot more interested in, and what we think will make developers superhuman, is helping them think better and focusing on that, focusing on helping developers fully unlock their thoughts and go from an idea to an understanding much faster. And then we’ll write a lot of the code for them. So the big insight here is that we think the actual code writing should not be the focus of this product. The focus should be on how can we help pull all of the creative juices out of you and help you understand your own ideas better and formalize them and do the algorithm design and all of that?

Because the part that we find fun, really as developers, is still really hard. You can’t code anyway if you can’t have that. And it’s something that AI currently isn’t good at either. So that’s kind of what we think will stay true in the long term, is that humans will continue to be better than AI at these high-level “how do I do this” type tasks. And sure, the LLM can help pull it out of the human, and that’s what we’re really working on. But if the LLM can truly invent new products as well as humans can, then once we get there, then there are all sorts of other wide-ranging implications for society. So we won’t get there for quite a bit, is one of our bets.

Raza Habib (07:14):

So I’m kind of coming along with you, but it sounds like the argument is that coding is only part of what you’re doing when you’re doing software development. The IDE is built around helping you write code. It’s a text editor with a bunch of extra features, and you’re saying, “Hey, a lot of the actual producing the text of the code, we can automate a lot of that. The hard part is the thinking and the coming up with the product and solving problems.” So assume that I grant that premise. I just believe you on that. How will the product be different from a traditional IDE such that it enables this?

Michael Royzen (07:44):

So, sure, it’s focused a lot on a conversation with the AI and also real-time previews that are not really possible with an IDE because so much space and focus is on the code itself. And so what we want to put disproportionate focus on versus a traditional IDE is that the user themselves has a hypothesis. They’re building this app, they’re trying to test their own hypotheses as quickly as they can. How can we facilitate that through a product? So we have these user interfaces that make these very fast UI and functional mocks that the user can very quickly play with, interact with, test, and get feedback on. And that’s a very big part of the UX when using the new Phind experience versus solely this IDE that’s mostly code. Maybe it has a chat bar on the side. I think what those products are really missing is kind of a more real-time interactive experience that lets you really feel it and really use the product and tighten that iteration loop.

Raza Habib (08:35):

Now I’m beginning to be able to picture it. So it’s actually saying there should be something much more interactive here. I think Anthropic recently had the Artifacts product, and it sounds like it’s got bits of that kind of intuition in it where I want to be able to have an idea, see a render of it, and interact with it.

Michael Royzen (08:50):

Exactly. And Claude Artifacts, I think, is actually a brilliant product. I use it myself for things, but I think it’s clear by now that what developers really want is something that’s deeply integrated into their workflow. Time and time again, we hear from our users that they don’t like going to the web, they don’t like switching tabs, and that’s why we have a VS Code extension.

Raza Habib (09:09):

So Michael, one of the reasons I wanted to interview you in particular was how early you were to the space. You guys built the first quote-unquote answer engine, the first system that used an LLM to do question answering that was widely publicly available, well ahead of Perplexity, well ahead of others, and you did it while still a college student and in a world where the technology was very different. We didn’t have LLM APIs yet. So I’m kind of curious, when you started, what did it take to build a product then? How has that changed?

Michael Royzen (09:34):

Yeah, it was like the Wild West back then. I started on this in the fall of 2020, shortly after I had played around with the Hugging Face Transformers library, which I think had come out a couple of months earlier. And so I was playing around with all of these BERT models, and I was building an invoice unstructured extraction tool with all these BERT models. And I was really fascinated by the fact that these models can classify text super well, but they can’t really write text. And the state of models that could write text at the time, like GPT-2, wrote, in my opinion, complete gibberish. I think GPT-3 had also come out at the time, and I tried it out. I wasn’t really impressed by it. I was like, this isn’t that much better than GPT-2, which I know is a bit of a hot take because obviously…

Raza Habib (10:23):

Definitely a hot take.

Michael Royzen (10:23):

I understand all the research and the hard work that went into it, and I definitely appreciated that. And I noticed that the text that it generated was a lot more coherent, but it was still fundamentally as useless as GPT-2 because it couldn’t answer questions, which was the use case that I was really excited about. And so I went really deep down this rabbit hole figuring out, well, how can we have these large language models answer questions?

Raza Habib (10:47):

And when you say couldn’t answer questions, you mean factual questions because the model only knows what’s in its pre-training dataset. So if I ask a question outside of that, and also I guess this was before instruction tuning, so there was no reason for the model to try to be accurate. They were just next word prediction machines at the time.

Michael Royzen (11:04):

Exactly. There was no widespread notion of instruction tuning, at least outside of the research labs at the time. And so even when you few-shotted the model, when you provided a couple of examples of here’s what I want a sample response to look like, it was still very iffy. It would go off the rails all the time. What those models really wanted to do, and what they were trained to do, was just tell stories and repeat internet text. And they were quite good at that if you could kind of prompt it in such a way that it could generate a story or generate even an email. It was decent at those things. But even with a lot of tuning, it was very difficult for it to be good at answering any sort of question, especially reliably. And so I was kind of intrigued by this, but what really made me obsessed with this space is when I saw a demo by this ex-Meta researcher who I think was at Hugging Face at the time, his name was Yacine.
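As a rough illustration of the few-shot prompting Michael mentions, here is a minimal sketch of how a prompt for a base (non-instruction-tuned) model might be assembled. The function name `build_few_shot_prompt`, the "Question:/Answer:" layout, and the example pairs are assumptions made for illustration; they are not taken from the conversation.

```python
def build_few_shot_prompt(examples, question):
    """Format worked Q/A pairs followed by the new question.

    A base model only continues text, so the prompt ends with an open
    "Answer:" line that the model is expected to complete.
    """
    parts = []
    for q, a in examples:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)


# Hypothetical example pairs, purely for illustration.
examples = [
    ("How do I reverse a list in Python?",
     "Use my_list[::-1] or my_list.reverse()."),
    ("How do I read a file line by line?",
     "Iterate over the file object inside a with-block."),
]

prompt = build_few_shot_prompt(examples, "How do I sort a dictionary by value?")
print(prompt)
```

Even with a prompt shaped like this, nothing forces a base model to stop after one answer or to stay factual, which matches the "off the rails" behavior described above.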

Raza Habib (11:59):

Yacine?

Michael Royzen (11:59):

Yes. And he made this demo of using a BART model, which was this encoder-decoder model released by Meta, 500 million parameters, so very small by today’s standards. And he fine-tuned it on Reddit to basically follow this instruction format of wanting to answer in response to a question. And then he also connected it to a RAG system. So he set it up so that it would, given a question, perform a lookup on Wikipedia, give the 10 most highly ranked results back. And the funny thing is it was super well implemented. He implemented it both using sparse retrieval using something like Elasticsearch and also dense retrieval using Meta’s dense vector retrieval database at the time. And it still flew off the rails a lot, but I was intrigued enough by this problem to take that as a starting point and kind of see how far I could push this.
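To make the sparse-plus-dense idea concrete, here is a minimal sketch using only the Python standard library. It is not the demo's implementation: the keyword scorer is a toy stand-in for what a system like Elasticsearch does, and `toy_embed` is a hashed bag-of-words stand-in for a learned dense encoder. All names and the three sample documents are assumptions for illustration.

```python
import math
import zlib
from collections import Counter

DOCS = [
    "Python lists can be reversed with slicing or the reverse method",
    "The Eiffel Tower is located in Paris France",
    "Elasticsearch is a search engine built on Lucene",
]
DIM = 64  # size of the toy embedding space (assumption)


def tokenize(text):
    return text.lower().split()


def sparse_scores(query, docs):
    """Keyword overlap weighted by inverse document frequency (toy sparse retrieval)."""
    n = len(docs)
    doc_tokens = [set(tokenize(d)) for d in docs]
    df = Counter(t for toks in doc_tokens for t in toks)
    return [
        sum(math.log(1 + n / df[t]) for t in tokenize(query) if t in toks)
        for toks in doc_tokens
    ]


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalized (stand-in for a learned encoder)."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dense_scores(query, docs):
    """Cosine similarity between the query vector and each document vector."""
    q = toy_embed(query)
    return [sum(a * b for a, b in zip(q, toy_embed(d))) for d in docs]


def top_k(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def retrieve(query, docs, k=2):
    """Union of the top-k hits from each retriever, sparse hits first, no duplicates."""
    hits = top_k(sparse_scores(query, docs), k) + top_k(dense_scores(query, docs), k)
    merged = []
    for i in hits:
        if i not in merged:
            merged.append(i)
    return [docs[i] for i in merged]


results = retrieve("where is the Eiffel Tower", DOCS)
print(results)
```

The retrieved passages would then be placed in front of the question before it is handed to the generator model.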

Michael Royzen (12:53):

And so a couple of months go by, and I tune the model myself. I do some work on the document retrieval system. I set this up end-to-end and not that much progress. I couldn’t make it that much better. I made it a little bit better, but not that much better. And so I shelved this idea for almost a year. I still have a year and a half left in college at this point. I started working with this professor at UT named Greg Durrett. I came to him in the fall of 2021, almost a year later, and I was like, “Hey, there’s been some development in the space. I worked on it kind of previously like a year ago, but now there are these new models called T0, which is a derivative of the T5 model released by Google in 2020. But this model is fine-tuned from the get-go to be a lot better at following instructions for answering different kinds of questions.”

Raza Habib (13:46):

And this was one of the first papers, if I recall correctly, that was trying to frame all NLP tasks, all natural language AI tasks, as just text in, text out tasks. The previous paradigm, you would have custom models for every use case. For listeners who have joined us in the post-LLM world where we have these very general-purpose models, it’s maybe difficult to appreciate that even just four years ago, that was a very new idea. So previously, if you wanted to do classification, if you wanted to do NER, if you wanted to do text extraction, you would build a custom dataset for every task and a custom model for it. And I think T5 was one of the first papers, if not the first, where they said, “Hey, why don’t we take this mindset where all of these tasks, whether it’s classification or NER or question answering or summarization, they’re actually all just tasks that take some form of text in and they spit text out, and we can frame all of these different problems with one model.” Again, it’s very easy to take that for granted today because it’s become what everyone does. But I think at the time it was extremely novel.
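The text-in, text-out framing described here can be sketched in a few lines: every task becomes a string for the model to read, and the target is always a string too. The task prefixes and the helper name `to_text_to_text` below are illustrative assumptions, not the exact prompts used in any paper.

```python
# Illustrative templates: one model, many tasks, all expressed as plain text.
TEMPLATES = {
    "classification": "classify sentiment: {text}",
    "ner": "list the named entities: {text}",
    "summarization": "summarize: {text}",
    "qa": "question: {question} context: {context}",
}


def to_text_to_text(task, **fields):
    """Render one task instance as a single input string for a text-to-text model."""
    return TEMPLATES[task].format(**fields)


inputs = [
    to_text_to_text("classification", text="I loved this library."),
    to_text_to_text("ner", text="Greg Durrett teaches at UT."),
    to_text_to_text("qa", question="Who teaches at UT?",
                    context="Greg Durrett teaches at UT."),
]
for line in inputs:
    print(line)
```

Under the older paradigm each of these three inputs would have gone to a separately trained model with its own output head; here they all share one input and output format.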

Michael Royzen (14:43):

Yes. And this paper is kind of what made me realize that now it’s actually possible, I think, to build a useful product of this type. And what was so new about T5 was the fact that, kind of like you said, Raza, they created a singular dataset that had many, many different kinds of prompts across many different kinds of tasks. So it was one model that was fine-tuned for a very diverse set of question-answering tasks and even text extraction tasks. And I think some other tasks were in there as well. And it was also larger than BART significantly. So unlike BART, which was roughly like a 500 million parameter model in the largest size, this model was… there were two models, there was a 3 billion parameter model and there was an 11 billion parameter model. And so the larger size, I think, helped a lot as well with making the model more coherent and staying on the rails even though it’s tiny by today’s standards.

Michael Royzen (15:36):

And so, because the model was less likely to go off the rails to begin with, I took inspiration from Yacine’s strategy of really kind of focusing on Reddit. And so we focused it on two categories. I focused it on Reddit and Stack Overflow and Stack Exchange questions to try to have just a large dataset of a more singular format on top of all of the work that they already did with T0. And so I fine-tuned it for that very specific format to improve the performance specifically just for general question-answering tasks.

And then the other thing I did was really on the retrieval system. So I wanted this to be able to retrieve information from basically the entire internet. And I didn’t want to cop out by just using the Bing API. So I started with a Common Crawl dump, and I downloaded all of Common Crawl, which is… Common Crawl is basically a breadth-first kind of dump of the internet. So for any given website, they might not necessarily have the whole website, but they include most of the publicly accessible websites on the internet in this dump. It’s like three terabytes worth of text or something ridiculous like that. And so I wanted to keep only, say, the top 10,000 websites. I guess that’s where most questions would be answered from. And then I got rid of the long tail and focused just on keeping the Common Crawl from the first 10,000 websites and then indexing those using an Elasticsearch cluster. And I was able to get it running on a relatively small EC2 instance on AWS, one instance where you ask it a question and it’s able to retrieve relevant documents from the most popular websites on the internet that contain the answers to that question.

And when I put that system together end-to-end for the first time, at the end of 2021 and the beginning of 2022, I did a Show HN on Hacker News in January 2022 being like, “Hey, I put this thing together,” and we got some mixed feedback. People who tried it and it worked were like, “Wow, this is game-changing.” And then other people didn’t get what they wanted, and they were like, “Ah, yeah, I don’t know about this.” But when it did work, it was so magical that it gave me… I was obsessed with this idea. I couldn’t sleep, I couldn’t stop thinking about it.

And so I applied to YC. My app was basically like, “Hey, look, this is how people are going to find information.” And the central question that I was struggling with and what people were asking me too, I was having coffee chats with some friends, some who were in the VC business, and they’re like, “Yeah, who’s this for?” And I’m like, “I don’t know, it’s for me.” I kind of do Wikipedia-style searches and stuff, but I don’t really know if that’s a long-term winnable business versus Google. And so I decided around that time to focus this on software developers.
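The Common Crawl filtering step described in this answer (keep only pages from a fixed list of popular sites, drop the long tail) can be sketched as follows. This is a minimal illustration under stated assumptions: records are taken to be simple `(url, text)` pairs rather than real Common Crawl archive records, and `filter_to_top_sites`, `host_of`, and the sample site list are hypothetical names and data, not the original code.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the hostname of a URL, with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_to_top_sites(records, top_sites):
    """Yield only the (url, text) records whose host is in the allowlist."""
    allowed = set(top_sites)
    for url, text in records:
        if host_of(url) in allowed:
            yield url, text


# Hypothetical inputs: a tiny allowlist and three crawled pages.
top_sites = ["stackoverflow.com", "reddit.com"]
records = [
    ("https://stackoverflow.com/questions/1", "How to reverse a list"),
    ("https://www.reddit.com/r/python/", "Discussion thread"),
    ("https://tiny-personal-blog.example/post", "Long-tail page"),
]

kept = list(filter_to_top_sites(records, top_sites))
print(kept)
```

The surviving records are what would then be sent to the search index; in the conversation that index was an Elasticsearch cluster.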

Michael Royzen (18:18):

And this was kind of like a twofold hypothesis. The first one was that we can help software engineers, at the time by basically being a better Stack Overflow: they have a question that can be answered from looking at something on the internet, boom, we can answer it. And the second hypothesis was kind of the Paul Graham hypothesis that he wrote an essay about all the way back in 2012, which was that perhaps it is possible to displace Google. And the way to do that would be to first make a search engine that all of the developers use. So that’s what Google actually did. Google got all the hackers to use Google, and then they kind of opened the scope later rather than going with this over-broad “we’re everything to everyone” all at once. And so we focused on developers, got into YC, and immediately after college, two weeks later, we started that. We relaunched as initially Hello, and then we rebranded to Phind. But ever since the summer of ‘22, we’re basically doing what we’re doing now. So that’s a long-winded tour of our origin story.

Raza Habib (19:22):

Before we dive into the technical details of how Phind works, one thing I’m curious about: Paul Graham, the founder of YC, PG, doesn’t do that many active investments anymore. He’s not a partner at YC most of the time. You guys are one of the rare companies that he’s personally invested in. How did that happen? How did you convince him to invest? I’m kind of personally just very curious about this.

Michael Royzen (19:44):

Well, we basically told him about the vision, and we showed it to him and we were like, “Hey, people don’t want links. People want answers. This is the future.” And looking at him, I could see that the gears were turning in his head. He was like, yes, I think he was happily surprised by it. And the way that it worked out, I chose the last office hour spot that he had that day that he was doing in person. And so he got so excited talking about this with us. First of all, I just asked him to invest on the spot. I could see the gears were turning. I was like, “Hey, do you want to invest?” And he’s like, “Yes, of course I want to invest. This is so cool.” And then he was like, “Hey,” he checked his watch, it was like seven or seven-thirty. He was like, “Hey, I have to go home to Jessica. I have to cook dinner, but do you guys want to come with?” And so we’re like, of course, we want to come with. And so he takes us to his lovely home, and we kind of sit down, and Jessica’s there too. So we got to meet Jessica, and Jessica’s absolutely wonderful. I think she does not get nearly enough credit for her role as literally the co-founder of YC, but she’s amazing. And we’re sitting at their backyard table, and Paul Graham just kind of looks at Jessica and he’s like, “Our grandkids aren’t going to know what a search result is. They’re just going to have answers.” And this is kind of a mind-blowing moment for everybody because this was pre-ChatGPT. This was long before LLMs had been kind of known to answer questions in a way that people were consuming. And so we all just got super excited about this idea of, hey, we can answer questions directly instead of just sending people to links.

Raza Habib (21:23):

Since then, it feels like you guys have shifted the vision somewhat from being a search engine for developers with a view to taking on Google. I’m kind of curious, what do you think of Perplexity? It’s got a huge valuation, it’s been growing really fast. It is closer maybe to the original Phind thesis than you guys are now. Do you think it’s going to be successful?

Michael Royzen (21:42):

There’s a couple of things that happened that I think shifted my perspective on the whole industry. The first thing is the release of ChatGPT. And I think that’s really the main catalyst that made it a lot more difficult for a startup to compete in this space because it was so much better than everything else for so long that ChatGPT is kind of what became known as the original kind of AI answer engine. And what that caused was the famous code red alert at Google. It forced everyone to be like, we have to get on this. One of our original hypotheses was that Google was going to be very slow to market, not just because they have all these safety bureaucracy issues inside the company, but also because there are political reasons why they wouldn’t want their AI saying things that could be wrong or offensive.

Raza Habib (22:28):

And I guess for Google also, it’s a classic innovator’s dilemma as well. They get a huge amount of their revenue through search results.

Michael Royzen (22:36):

ChatGPT’s success blew all of that up immediately. And so we see now Google is building generative LLM results directly into search. And yes, they still have a lot of work to do on improving the quality of that, but for fundamentally building a generic answer engine that just synthesizes stuff from the internet Perplexity style or Google AI overview style and then kind of gives that answer in the general way, everything to everybody, that’s not a game I think a startup can win because Google, in their AI overview announcement, even they themselves are like, yeah, we were able to lower the cost 80% over the last year. It’s like all of the blockers to them doing this well, cost mainly, have just basically gone away. And as models have gotten better, it’s very technically simple now to have a small, efficient, cheap fast model that just synthesizes all this data and produces a basic generic answer in an instant.

Raza Habib (23:35):

So you would predict that Google beats Perplexity in this race?

Michael Royzen (23:39):

Yeah, I don’t think it’s winnable by a startup.

Raza Habib (23:40):

Interesting. So what do you think is going to happen to Perplexity?

Michael Royzen (23:44):

I mean, I don’t have a crystal ball. It depends on what they want to be. My general prediction is that because of this rat race that was caused by ChatGPT, LLMs themselves have been commoditized a lot faster than I would have predicted pre-ChatGPT. And what that’s resulted in is like, I don’t think, as a startup, you can be everything to everybody. I think you have to choose who you want to serve. I think that there’s a lot of valid use cases in, say, serving researchers, serving people looking for products, and building beautiful bespoke, highly optimized experiences just for those verticals. So I’m a huge believer that there will be vertical search and vertical AI chat applications that take a domain, do it extremely well, and then they can be basically untouchable because then other companies have innovators’ dilemmas basically for competing with them. And also, it’s a matter of focus.

As a startup, you really need to be laser-focused on one specific thing, or you’re just going to kind of build a mid-generic thing for a lot of people. And so I think that Perplexity has raised kind of enough money where they’re able to sustain this burn where they’re offering these kind of slightly more advanced models than Google is currently offering in their AI overviews product. And they also enable follow-up chat, which Google AI overviews currently does not. But I just don’t see how this is sustainable in the long run. The problem is Google is already entrenched on Chrome and on iOS, and they have all these deals, and Google’s already the default everywhere. And this is a problem that they’re clearly working on. And all of the previous innovators’ dilemmas that they had, they don’t exist anymore. They’re clearly trying to compete in this space.

Michael Royzen (25:25):

So all of it comes down to, will they execute? And of course, that is the ultimate Perplexity question, but I think at the end of the day, the unfortunate reality is that Perplexity can still be better, and Google can still win just because of the platform advantage. The delta of the improvement has to be significant for the general population to be willing to go through platform switching costs and switch over en masse. And that’s the fundamental problem. The single biggest reason why I started working on Phind and did the YC leap and all of that is because the delta between AI-generated answers and no answers is a massive, massive delta, at least a 10x improvement when done well. But a slightly better LLM answer versus a slightly worse LLM answer, I don’t know. Especially when the slightly worse LLM answer is literally everywhere and it’s already on all of the tools you use. I’m almost ashamed to admit that I myself will sometimes just open Safari and type something in, and boom, it’s there. It’s instant. I don’t even use my product sometimes; I don’t even use, say, Perplexity sometimes. It’s just there. It’s just the fastest. And so, long story short, I think it comes down to a matter of focus. Those who are focused will have a higher chance of succeeding. And for the record, I wish them the best. I think it’s a very interesting problem. It’s a problem worth solving. I wish them the best, and I hope that there’s an angle here that will be interesting.

Raza Habib (26:54):

I’d love to get into the technical details now. So, how does Phind work? What’s going on under the hood?

Michael Royzen (26:59):

Yeah, the fundamental architecture is very simple, and it’s very similar to what it used to be back in the day where we have this retrieval step where we get web results that we think are relevant. And of course, there’s some stuff that goes into that. So we do some intelligent query rewriting. So we have a very fast and small LLM that takes your query, reformulates it in a way that we think might be relevant, and it’s also optimized specifically for technical searches. So that’s how we use it to help make sure that the best technical sources are in the search results. And then we also do some other pre-computation on the fly. So we also decide, is one search sufficient or do we need to do multiple searches? So we have this auto multi-search mode that basically runs from this classifier.

And so, if it does, then we do multiple searches, and then we aggregate all of that information. We feed all that information to an embeddings model that we’ve customized over the years. And this embeddings model takes in this technical information. It’s tuned specifically to be able to perform well with code, and it’s also tuned for high throughput and speed. So we actually, for every single request that comes in, we actually do up to eight-way parallelization for the embeddings to make it significantly faster. So our goal is for the embedding step to be able to complete in 100 milliseconds basically with quite a bit of text coming through it. We pull sometimes a lot of sources, and we need to organize it in 100 milliseconds or less. So that was a really fun technical challenge, figuring out how to get that timing down. And then we form the context that we send to the model, and then we send it to the model.
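The "up to eight-way parallelization" of the embedding step can be sketched with a thread pool that splits the retrieved text into shards and embeds them concurrently. This is a minimal sketch, not Phind's implementation: `embed_batch` is a hypothetical stand-in for a call to an embeddings model, the 10 ms sleep simulates latency, and the 100 ms figure appears only as the target mentioned in the conversation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # "up to eight-way" parallelism, per the conversation


def embed_batch(texts):
    """Hypothetical stand-in for one request to an embeddings model."""
    time.sleep(0.01)  # simulate model or network latency
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def shard(items, n):
    """Split items into at most n contiguous, near-equal, non-empty shards."""
    n = max(1, min(n, len(items)))
    size, extra = divmod(len(items), n)
    shards, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def embed_parallel(texts):
    """Embed texts in up to MAX_WORKERS concurrent batches, preserving order."""
    if not texts:
        return []
    shards = shard(texts, MAX_WORKERS)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        batches = list(pool.map(embed_batch, shards))  # map keeps input order
    return [vec for batch in batches for vec in batch]


texts = [f"retrieved passage number {i}" for i in range(20)]
start = time.perf_counter()
vectors = embed_parallel(texts)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"embedded {len(vectors)} passages in {elapsed_ms:.1f} ms")
```

Threads suit this sketch because the simulated work is waiting on I/O; if the embedding model ran in-process on a CPU, a different form of parallelism would likely be needed.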

And we run both GPT models and Claude Sonnet now as well as our own custom models that serve the vast majority of the traffic on the platform. And so our own custom models, that’s been quite a journey developing those as well. We started out with our own models way back in the day just because OpenAI didn’t have anything that would work. Yeah, there were just no API models that could serve our needs in mid-2022. And after GPT-4 came out, everyone kind of lost their minds, including us, and we were like, okay, we have to put this into the product. And for a while, the product was primarily GPT-4 based, particularly as a technical product. And actually, us putting GPT-4 into the product resulted in, I think, one of the highest upvoted Hacker News posts of all time. We got something like 1,400 upvotes on a product launch, which is very high.

That was very exciting to see. And for a while, we were GPT-4 only, but we had people always asking us, why is it so slow? I want it to be faster. I don’t want to have to grab a cup of coffee every time I ask a question. And also, it was very expensive for us to run. And so running our own models was a way for us to figure out, hey, how can we make a custom coding model? We don’t have to worry about other cases. So that focus really kind of simplified this for us. How can we make just a coding model that will run fast and that people can use most of their time as the default? And that’s why we ended up building out the Phind models, which are based on Llama derivatives. And so today we have two different Phind models. We have a tune of the Llama 3 8 billion parameter model, which is designed just to serve the original vision of Phind, which is how quickly can we just pull everything and write a Stack Overflow or documentation-style summary of web text. And then we have a larger 70 billion parameter model whose goal is to be the best model for most things, generally speaking, as a balance between speed and answer quality. And of course, now we’re very excited about Llama 400B on the horizon. And so I think that’s kind of just a general summary of how everything works end-to-end. But yeah, I think there’s also a lot of interesting stuff happening in models that I think could be interesting to discuss as well.

Raza Habib (30:42):

Yeah, let me just make sure I’ve understood it correctly, and then I would love to chat about the model part of it. But it’s fundamentally a RAG pipeline. There’s a search index. Search query comes in, you’re doing a search, retrieving that, embedding it, summarizing the context, and then providing that to the model. But there’s a lot of nuance, it sounds like, in order to make it work well. So actually, you’re rewriting the question after the question comes in to make it better for technical questions and to make the search better. You’re doing both dense and non-dense search. You’re optimizing the speed of the embeddings so that you can do fast queries, and you have intermediate models that are figuring out how to do some of those things. And then you’re making a choice about which model is finally used to generate the final answer. One thing that I would love to hear a little bit more about is, you said we retrieve this context and then we put it in the context window and then we generate you an answer. How much prompt engineering goes on? How much do you have to iterate over how you structure the context, or are you just fine-tuning? How do you make good answers come out of this system?

Michael Royzen (31:40):

Funny enough, not that much prompt tuning is required. And particularly for our own models. They’re not trained to be diverse models. They’re only trained on a handful of prompts, mostly technical.

Raza Habib (31:53):

So they’re fine-tuned on input-output pairs for your use case.

Michael Royzen (31:57):

Right, exactly. So we did a lot of work.

Raza Habib (31:58):

So you’re fine-tuning rather than prompt engineering.

Michael Royzen (32:01):

Yeah, we did a lot of work at the fine-tuning level to get these models to respond the way that we wanted. And we did, I think, some clever things at the fine-tuning stage as well. We don’t really want to train on other models’ synthetic outputs because they’re frequently wrong. Code produced by models will have bugs in it. And so our strategy for producing training data was to start with the code and use that as the label, as the gold standard that we’re going to train the model to predict the human-written code that we already know is right, and then generate a synthetic input to create an input-output example that we can train on. So rather than having an input and then kind of generating synthetic data for the output, rather we start with the output that we know is correct, and then we generate synthetic input to create the pair. And it turns out that also generating synthetic data for the input is a much simpler task than generating synthetic data for the output. Because today’s models are fantastic. They do summarization very well with very little hallucination. So giving it a piece of code, and this takes some prompt tuning obviously, but saying, “Hey, write a sample input that has all the information necessary to recreate this piece of code,” that actually works surprisingly well.
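The data-generation strategy described above (treat known-good, human-written code as the label and synthesize only the input) can be sketched like this. It is an illustrative outline under stated assumptions, not Phind's pipeline: `TrainingPair`, `build_pairs`, and `fake_generate` are hypothetical names, and `fake_generate` replaces a real LLM call so the sketch runs offline. The instruction text is adapted from the prompt quoted in the interview.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TrainingPair:
    prompt: str      # synthetic input, generated by a model
    completion: str  # human-written code, used as the trusted label


INSTRUCTION = (
    "Write a sample input that has all the information necessary "
    "to recreate this piece of code:\n\n{code}"
)


def build_pairs(snippets: list[str], generate: Callable[[str], str]) -> list[TrainingPair]:
    """Start from trusted outputs and synthesize only the inputs."""
    pairs = []
    for code in snippets:
        prompt = generate(INSTRUCTION.format(code=code)).strip()
        if prompt:  # skip snippets for which the model returned nothing
            pairs.append(TrainingPair(prompt=prompt, completion=code))
    return pairs


def fake_generate(instruction: str) -> str:
    """Hypothetical stand-in for an LLM call: describes the first code line."""
    code = instruction.split("\n\n", 1)[1]
    first_line = (code.splitlines() or [""])[0]
    return f"Write code that begins with: {first_line}"
```

The point of the design is that the `completion` field is never model-generated, so any hallucination is confined to the prompt side of the pair.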

Raza Habib (33:18):

So it feels like there’s maybe a generalizable lesson there, which is that if you are trying to use models to create synthetic datasets, it’s much easier to start from a dataset of answers and generate accurate questions than it is to start from a dataset of questions and generate the answers.

Michael Royzen (33:35):

Absolutely. And in our experiments, trying to get LLMs to generate high-quality questions, that’s very difficult. Doing this approach where you say, “Generate me a hundred topics,” and then taking each of those hundred topics and generating another hundred subtopics, that just does not work super well in our experience.

Raza Habib (33:52):

What other generalizable lessons have you come across in the process of building Phind? So this strategy for generating synthetic data feels like one. Are there any other things that you’ve discovered or had to figure out along the way that you think anyone building with AI products could potentially use or that they should try and adopt?

Michael Royzen (34:09):

So I think that at the product level, minimize the chances of inherent non-determinism of the models screwing up a user experience that needs to be deterministic. I think that is kind of the most generic way to describe the way that we think every day at Phind.

Raza Habib (34:28):

Wait. So what does that mean? I don’t think I fully follow. So where do I want to try and remove the non-determinism? What’s an example?

Michael Royzen (34:35):

So part of it is, I think, really using test-driven development in creating AI applications at the engineering level. So for every single AI invocation that you have in your product, particularly when it absolutely needs to be reliable within kind of a range of acceptable values, first of all, make the AI as self-contained as possible. Focus is very important for prompt tuning as well. So keep each of the different sub-tasks that you might deploy an AI on as constrained in purpose as possible, and keep the message as unpolluted as you can. And then for each one of those kind of submodules, make sure that it works 90-plus, 99% of the time specifically for that thing. And so something that we kind of had to learn the hard way was how do we actually engineer a product built from the ground up where we have to make sure these modules actually work. And so we’ve written these automated tests that need to pass. The tests are run not once but 20 times on a single kind of unit every time the tests run. And we calculate the percentage that it passes, and if it falls below the minimum threshold, then we know it’s bad. So I think that’s kind of a fundamental principle in software engineering that I think we kind of had to expand on a little bit: how do we actually make reliable components using LLMs? And that’s kind of what I meant by the whole non-deterministic component.
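The repeated-run testing described above could look something like the following sketch. The 20 runs come from the interview; the 0.9 default threshold and the names `pass_rate` and `assert_reliable` are assumptions made for illustration, and `check` stands for any zero-argument test that calls one LLM-backed component and returns whether its output was acceptable.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 20) -> float:
    """Run one check repeatedly and return the fraction of passing runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check())
    return passed / runs


def assert_reliable(check: Callable[[], bool], runs: int = 20, min_rate: float = 0.9) -> float:
    """Fail if the component passes less often than the minimum threshold."""
    rate = pass_rate(check, runs)
    if rate < min_rate:
        raise AssertionError(
            f"pass rate {rate:.0%} is below the required {min_rate:.0%} over {runs} runs"
        )
    return rate
```

In a test suite, each LLM-backed submodule would get its own `check`, so a drop in reliability points at one narrowly scoped component rather than at the whole pipeline.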

Raza Habib (36:06):

Okay. So you make heavy use of evaluation and testing for every sub-component, enough that they’re reaching some minimum performance threshold individually as well as joined together, to have confidence in what is otherwise a question essentially of systems engineering: how do I make a reliable whole out of unreliable parts?

Michael Royzen (36:26):

Exactly. And I think it also comes down to focus, just like how you need to be focused as a startup, as a team, etc. The AI benefits tremendously from focus. And I think something that we’ve done is we started thinking as the AI. We were like, okay, if some human came to me and they asked me what I’m asking of this model right now with all the information I’m giving it, could I as a human reliably produce the right answer? And sometimes there’s been a couple of cases where we’re like, okay, we’re just feeding the AI everything, but I as a human have no idea what’s happening. I have no idea what’s going on. I couldn’t answer it myself. At minimum, make sure that it’s something that you can do as a human, because if you can’t, it’s definitely not going to be able to do it.

Raza Habib (37:10):

What’s your workflow for improving Phind? How do you make it better over time?

Michael Royzen (37:13):

Yeah, so we have a data flywheel where data is coming in from what worked, what didn’t work. We have all sorts of real-time feedback signals coming in from the website. We have the thumbs up, thumbs down. We have AI classifiers that we can also use to determine offline after the fact whether a given answer actually answered the question appropriately or if it didn’t. And a lot of self-reflection I think works as well. We’ve built datasets where we’ve had the model reflect on, “Hey, what did it get wrong? How can we fix it?” and use that to continuously improve the models.
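One way to picture the first stage of such a flywheel is a filter that combines the explicit thumbs signal with an offline classifier verdict and queues the misses for later reflection. The sketch below is an assumption-laden illustration: `AnswerRecord`, `needs_reflection`, and `reflection_queue` are hypothetical names, and the rule that either negative signal is enough to flag an answer is a guess, not something stated in the interview.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnswerRecord:
    question: str
    answer: str
    thumbs: Optional[bool] = None         # True = up, False = down, None = no vote
    classifier_ok: Optional[bool] = None  # offline AI classifier verdict, if any


def needs_reflection(record: AnswerRecord) -> bool:
    """Flag an answer if the user or the offline classifier judged it a miss."""
    return record.thumbs is False or record.classifier_ok is False


def reflection_queue(records: list[AnswerRecord]) -> list[AnswerRecord]:
    """Keep only the answers worth sending to an offline self-reflection pass."""
    return [r for r in records if needs_reflection(r)]
```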

Raza Habib (37:47):

How do you use that to improve the models? So you gather this feedback data. How do you use it?

Michael Royzen (37:50):

So we create new training data by having the models self-reflect, thinking step by step about what they got wrong, and try to do more complex tree-of-thought style reasoning to try to correct their answers. And that’s not perfect. That’s still a noisy training method. It’s not always possible, but it does help. Giving the models more time to think offline and then reincorporating that back into real-time inference usage, where they have less time to think, is an effective strategy. Still, something that’s also been really effective is just at the product level, seeing people use it in person, inviting people over or hosting in-person meetups and actually seeing how people use it. It’s remarkable how differently people use it sometimes than how you use it. We have one of our top users who has this trailing Google Doc of all of the searches that he tried and all of the different combinations of queries that he tried as well as their answers.

Michael Royzen (38:55):

And we’re like, wow, that’s very interesting. And that gives us ideas on how can we better structure your thoughts and help you manage your thoughts and explore that further. And the reason why that was so important is because that feeds directly into our hypothesis about how very often programmers don’t even know themselves what they’re trying to do. And so that’s very clearly a problem when someone has to have this Google Doc organizing their thoughts. It’s like they’re trying to use Phind to help them organize it, but clearly, it’s not enough. So I think that focusing on product-level experiences like that is also very important. And then finally, realizing that the model is not the product. For a while, I think that was a very kind of blurry line for us to walk, where the model was the most important part of the product. And I think that was certainly true in a pre-commodification era.

Michael Royzen (39:45):

So in an era where there’s only one model that is good at programming or answering this specific type of technical question, then yes, the model is in some ways really the star of the product. But now there are so many models roughly all approximately the same capability, and OpenAI hasn’t really fundamentally made their models better in over a year. My hot take here is, at least for programming, OpenAI models have not fundamentally gotten better since the first GPT-4 release. What they have gotten better at is following instructions. So GPT-4 Turbo and GPT-4o are a lot better at responding in exactly the kind of format that the user wants them to respond. But these models are actually more prone to fundamental errors or syntax errors in writing correct code than the original GPT-4. And so what’s happened is there’s so many models now from OpenAI, now from Claude, even from Meta as well, and the models that we build fine-tuned from Llama, that the model is essentially a commodity. And you have to really, I think, think harder about the product experience as a whole rather than saying, oh, okay, I just got the best answer, I’m done. I think that now that the delta between models is so little, it’s really more about how can we help you as a developer get from idea to product end-to-end in the least frustrating way possible.

Raza Habib (41:23):

What advice do you have for people building AI products today? If I’m a product manager or an engineer and I’m building an LLM product for the first time, what lessons would you give someone or a friend who asks you, “Hey, I’m just starting out on this.” What would be your tips?

Michael Royzen (41:39):

I would say always work from first principles because I think whenever you’re in an AI hype cycle, it’s very easy to be like, “Oh, this is so cool. Let me build it.” And that’s a very dangerous thing because what is cool is sometimes not what you need to do to make the product functional for what it needs to do. And so ideally, there’s an overlap, but I think people working on AI need to stay really focused and really disciplined on how every single day, how am I end-to-end actually helping the user with their problem. Because sometimes people ask me for advice on how they should integrate LLMs, and it turns out they shouldn’t use LLMs. They should use classical programming techniques. It’ll be cheaper, it’ll be faster, it’ll be deterministic. I think that you still have to kind of really reason from first principles about, is this really necessary for this very specific objective that I’m trying to achieve? And working backwards from there.

And then of course, to reiterate, if you are working with LLMs, start by prototyping with GPT-4 or Claude. Don’t start by fine-tuning your own models in 2024. Start with a great product like Humanloop that can help you figure out which prompts you should use and help you prototype basically as quickly as possible to figure out, can I actually solve this problem using GPT-4? And then over time, if your product really kind of takes off and you have a lot of volume, then you can kind of get into scaling. And scaling is a whole beast in itself where you figure out, okay, how can we lower the unit economics of running these models on a per-inference level? And at that point, fine-tuning really starts to make a lot of sense because it’s possible to run a fine-tuned model on your own hardware in a very high throughput setting and save 80-90% over GPT-4, even GPT-4 Turbo. But yeah, don’t kind of get ahead of yourself with fine-tuning models. Always focus on how can I actually solve this specific problem for my users, for my customers, and stay obsessed with that all the time.

Raza Habib (43:45):

In every technology wave, there’s a distribution of value that accrues to incumbents, to startups, to different parts of the stack. I think people have written about how in mobile it probably went more to incumbents than to startups, and maybe in the first internet wave, it went more to startups than to incumbents. Obviously, you are running an AI startup. If you think about this, you’re in a space where you do have competition that comes from the larger companies. So there’s GitHub Copilot, or if we take the search engine example, like Perplexity is going up against Google. What’s your view on how the value will split for different applications and where it’ll accrue between startups and larger companies?

Michael Royzen (44:24):

Great question. I think that as a recap of the Perplexity discussion, I think that in cases where the scope of the product is more broad, the value will accrue disproportionately to the incumbents because the incumbents already have the platform advantage. So like Google, for example, for generic search, I think that that is something where they have a significant advantage. However, for vertical-specific applications, that is where startups, I think, have a much better advantage because you’re competing against the big companies more or less on even footing. If it’s a very vertical-specific application, chances are the big incumbent doesn’t have a version of that yet. So it’s like you the startup building this new specific thing from the ground up versus the incumbent. And that’s where all of the old rules about startup versus incumbents come into play, where the inherent bureaucratic disadvantages and political infighting and lack of focus of the big companies really hurt the big companies compared to startups.

So what I expect to see happen is that companies that are really focused on a particular vertical solving a very specific need for specific customers who keep their eye on that ball will accrue disproportionate returns versus incumbents. But if it’s more general, the incumbents are going to win. And I think it’s even more exacerbated by the fact that unique sources of data are an insane advantage in the AI landscape, particularly when these models are commoditized. So take Harvey, for example, the legal AI assistant. Their lawyer network that collects their own bespoke data for their use case is not something the big companies can collect as easily. It’s just like they have that data, they have that focus, kudos to them. And even, say, with Cursor and VS Code, I think it’s very impressive that Cursor has built this product that GitHub has still not been able to beat despite it being this VS Code-type tool and Microsoft owning VS Code. I just don’t know how they have not been able to match them on that product in that niche. And I think what that really shows is that knowing your users, having focus, having that tight iteration cycle, it really helps.

Raza Habib (46:44):

If you picture the world five years from now, two-part question: broadly, what do you think will be different because of AI, and more narrowly, how will the work of a developer or someone building a software product be different?

Michael Royzen (46:59):

So I think more broadly, I think funny enough, we will think about it less because it’ll be more integrated than it is today. Because today, to use AI, you kind of have to think about it. You have to be like, “Oh, I have to open ChatGPT. Oh, I have to open Phind. Oh, I have to open Claude.” I think that it’s going to be integrated natively into many of today’s existing platforms that already exist, that have already won, and it’s just going to work. You’re not even going to think about it as a user because it’s just going to be a natural acceleration of what you were already trying to do. And of course, it’ll also open new doors, it’ll open new verticals, but I really think that that is still further away. And I think that I’m a little bit more pessimistic on it opening completely new verticals than I was before, just because the rate of progress in AI has both been faster and slower than what I expected. I know that sounds like a contradiction, but what I mean is that everyone who’s not OpenAI has moved very, very quickly, but the absolute state of the art has not budged that much in the last 16 months or so.

Raza Habib (48:06):

I’ve discussed this with a few people on the podcast and I’ve had mixed opinions. Do you think that’s because it’s become fundamentally harder to push the frontier and breakthroughs have just been less forthcoming than we thought? Or is it just that large model training runs take a long time? The model’s in the oven, it hasn’t come out yet. And so we might have large step changes in capabilities. It could feel for long periods of time like not much is happening because we’re not actually seeing that progress. And then a new model gets released, right? Between text-davinci-002, GPT-3 coming out, the model which you said was effectively useless for answering questions, and GPT-4, from an outside observer, not much happened. That’s almost a year’s gap or more. And if you weren’t sitting inside OpenAI, it was easy to have no clue how much progress was being made. And then GPT-4 came out and you said yourself that you guys had a kind of red alert moment and switched which model you were working with. Do you think that that could be what’s going on here, or why do you rule out that hypothesis?

Michael Royzen (49:10):

I think it’s very difficult to be on the outside and speculate about what’s going on on the inside, as fun as it may be. But I think at the end of the day, the question comes down to how much capability is there left to be unlocked in these models? I think that’s the fundamental question. So you brought up text-davinci-002 to ChatGPT, GPT-3.5. The insight there was that these models were fundamentally capable. It just needed, in retrospect, a relatively basic unlocking step to make that ingrained information useful.

Raza Habib (49:40):

So GPT-3 to 3.5, from text-davinci-002 to ChatGPT, they figured out how to do instruction tuning and reinforcement learning from human feedback. And you’re saying that was the real unlock, more so than it was scale and capabilities gain. And so if you buy that hypothesis, that the main difference was the instruction tuning and not the capabilities of the base model, then you might sort of buy the argument that actually, unless there’s another change of that magnitude coming potentially.

Michael Royzen (50:11):

Right. I mean, it’s really an open question. I don’t know. So for example, GPT-4 basically scaled up this instruction tuning approach to really what was at the time the maximum. I think my hypothesis is they chose that model size based on the largest model that would be runnable in a production environment and tolerable by users. So something approximately 15 to 20 tokens a second. Fast forward to today, I think one of the most interesting breakthroughs that we’ve seen recently is the interpretability paper from Anthropic. The one where they talk about Golden Gate Claude and how they were able to basically figure out what various features inside the model do and clamp them in such a way to make them more reliable. So I think the breakthrough that I’m most excited about that I think we’re going to see in the next generation of models is training for correctness as opposed to training solely for next token prediction.

Michael Royzen (51:11):

So training for process correctness, for generating correct code, for generating correct reasoning, for doing correct math. I think that will be very interesting. It’ll be very helpful. What I’m less bullish about is will that really allow us to unlock a whole range of verticals that are not possible today? I’m not sure. I think that what’s more likely is that it will just enable products in verticals that already exist to be significantly more effective for their end users. And I’m concerned that we’re going to hit a real diminishing return situation where we fully unlock transformers and the rate at which we can make them bigger is really kind of hardware dependent on how quickly we can figure out lower precisions. There’s all sorts of optimistic stuff here. The paper that showed that basic one-bit precision for models is possible without degrading them too significantly. But I don’t know if it’s going to fundamentally make the model go from being able to write a program when given specific instructions to invent this thing with creativity and be kind of this creative tool as a service. I don’t know. I don’t know if we’ll get there without another significant architectural leap.

Raza Habib (52:29):

Yeah. I’m surprised by, or maybe I’m not surprised, but essentially I think what you’re saying is that overall you don’t think that the impact of AI in society is going to be that big and that actually, roughly speaking, we’ve got most of what we’re going to get from this current wave, and it’s just going to be a question of it being distributed more evenly. If I ask you how the world looks five years from now, it’s like you’re saying more or less the same, but with things a bit more ubiquitous.

Michael Royzen (52:54):

I never meant to make a statement about the magnitude of the change or the second-order effects, which I’m sure will be massive. I think what my biggest point there was that your average person, I think, will experience the benefits of AI without paying too much attention to the fact that it was AI. I think that’s my point. My point is that AI will be seamlessly integrated into many facets of daily life. That for sure, I think, will have profound effects, but I think for the average person, it will just kind of happen and it will feel normal. And so this is my point living in San Francisco here. I looked at my Waymo app the other day. I’ve taken 60 Waymo rides now, and the first couple of times this was the craziest thing that I’ve ever experienced, and now I’ll just use it as a practical transportation mechanism and it just feels normal.

Michael Royzen (54:56):

And so I think my bigger point here is that, yeah, I definitely think society will be transformed in many ways. I think people will become vastly more productive. I think it’ll be a lot easier to create businesses. I think there’s going to be a lot more creative output. GDP will increase dramatically. That will all happen. But the craziest thing about all of that is that your average person will still think about their lives, I think, the same way we do today, which is like, oh, my life is cool. There’s things that are going well, there’s things that are not going well. I don’t think that it’ll change most people’s thinking on a day-to-day basis. And so I think that’s the very important kind of point that I want to drive home, and that’s almost a testament to its success rather than its failure. It’s going to be integrated so seamlessly that it will feel magical at first, and then it’ll all just feel totally normal.

Raza Habib (55:43):

Alright, Michael, well, it’s been a pleasure chatting to you and looking forward to doing it again in the future.

Michael Royzen (55:48):

Thank you. Thank you for having me. This was great.

Raza Habib (55:52):

Alright, that’s it for today’s conversation on High Agency. I’m Raza Habib and I hope you enjoyed our conversation. If you did enjoy the episode, please take a moment to rate and review us on your favorite podcast platform like Spotify, Apple Podcasts, or wherever you listen and subscribe. It really helps us reach more AI builders like you. For extras, show notes, and more episodes of High Agency, check out humanloop.com/podcast. If today’s conversation sparked any new ideas or insights, I’d really love to hear from you. Your feedback means a lot and helps us create the content that matters most to you.

About the author

Name: Raza Habib
Role: Cofounder and CEO
Twitter: @RazRazcle
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.