Building the First LLM Search Engine
In this episode, we sit down with Michael Royzen, CEO and co-founder of Phind. Michael shares insights from his journey in building the first LLM-based search engine for developers, the challenges of creating reliable AI models, and his vision for how AI will transform the work of developers in the near future.
Subscribe to Humanloop's new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts
Podcast:
Raza Habib (00:39):
This is High Agency, the podcast for AI Builders. I'm delighted to be joined today by Michael Royzen, who's both a close friend and also a bit of a prodigy when it comes to machine learning. He was one of the younger interns at Microsoft Research as a 17-year-old, started his first machine learning research company straight after that, and he's now the CEO and Co-founder of Phind, which is an answer engine for software engineers. They were the first company in the world to have an LLM-based search engine available, and they're backed by Paul Graham and Y Combinator. Now, over 90 million answers have gone through Phind, and they have millions of active users. So, Michael, it's a pleasure to have you on the show.
Michael Royzen (01:17):
Thank you, Raza. It's great to be here.
Raza Habib (01:19):
So Michael, to start with, for our audience, just help them understand what is Phind, how do you describe it to someone new?
Michael Royzen (01:23):
Yeah, so Phind is a tool that helps developers get from an idea in their head to a working product. That's the core of what we do. And so we started really as an answer engine, like you described, when we were actually the very first LLM-powered answer engine. A user puts in, like, "Hey, how do I do this?" or "Hey, this isn't working," and we'll do an internet search. We'll figure out what's relevant to answering your question. Then we'll use a large language model to synthesize all that information into a concise and hopefully correct answer to your question. Now we're moving more broadly into how can we integrate that into your workflow as a developer as deeply as possible to really help you go from, "Hey, I want to build this thing," to actually having it built.
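The loop Michael describes (search the web, keep what looks relevant, have an LLM synthesize an answer) can be sketched roughly as below. Everything here is a stand-in: `web_search`, `is_relevant`, and `llm` are hypothetical stubs for illustration, not Phind's actual components.

```python
# Minimal sketch of an answer-engine loop: search -> filter -> synthesize.
# All three components are placeholder stubs, not any real product's internals.

def web_search(query: str) -> list[dict]:
    # Placeholder: a real system would call a search index or search API.
    return [
        {"url": "https://example.com/a", "text": "Use `venv` to create isolated Python environments."},
        {"url": "https://example.com/b", "text": "Unrelated page about cooking."},
    ]

def is_relevant(query: str, doc: dict) -> bool:
    # Crude keyword check; a real system would use ranking models or embeddings.
    return any(word in doc["text"].lower() for word in query.lower().split())

def llm(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return "Synthesized answer based on: " + prompt[:60]

def answer(query: str) -> str:
    docs = [d for d in web_search(query) if is_relevant(query, d)]
    context = "\n".join(d["text"] for d in docs)
    return llm(f"Question: {query}\nSources:\n{context}\nAnswer concisely:")
```

The point is the shape of the pipeline: retrieval and synthesis are separate steps, so each can be improved independently.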
Raza Habib (02:08):
Okay, so idea to product could mean a lot of different things. Give me an example. Walk me through a particular use case.
Michael Royzen (02:14):
So you say that, "Hey, I want to build a Next.js app that takes in a user's question, and I want it to go out and get all these sources, synthesize them, and send them to an LLM." One of our internal tests is actually being able to build Phind using Phind. And so we kind of enjoy that one because it's a measure of how usable the product is.
Raza Habib (02:37):
So the idea is that developers should be able to come in with an idea, say to Phind, "Hey, I want to build X," and then it's going to try and instruct them on how to do that. It's not going to build it for them necessarily, but it'll maybe write some of the code and talk them through the process.
Michael Royzen (02:50):
So the current product that's available online is really kind of more chat-based, but we're working on something. By the time this is released, I think we'll have something to say about it publicly. We're working on something that will actually write the code for you, actually create the files and test them and orchestrate all of that. And so we started with this more chat-based UX. It was very natural, and that was kind of what the technology was limited to at the time. But now we're very quickly moving towards having the LLM, kind of like a Tesla autopilot, take over more and more of the dirty work and the stuff that we as programmers don't really find fun, like fixing things that aren't working, and instead shifting the fun things, like design and figuring out how this is going to work, over to the human so the human can still do the things that they find fun. Then it'll kind of be like the Tesla autopilot for getting your project working, rather than, say, a Waymo-type analogy that just completely doesn't ask you for feedback at all.
Raza Habib (03:46):
Okay. So it's a coding assistant that's going to help me go from idea A to developed product. Obviously, there's been an explosion of different AI coding tools recently, starting from GitHub Copilot, which I think was the first really big commercially successful LLM product. But now there's Cursor, there's Cody, there's a whole bunch of things out there. How is Phind different from those? What are you trying to do that sets you apart?
Michael Royzen (04:09):
So what we're doing right now is really reimagining what coding should look like from the ground up. I think those other companies have done things that are pretty interesting, but I think they're fundamentally limited by being shoehorned into an existing IDE. What we're working on right now is taking the Phind magic and building a native app on your computer from the ground up for this AI-native idea-to-product workflow. Rather than being limited by what people currently think an IDE is, we have the ability to completely reimagine all of this from an AI-first perspective, because our hypothesis here is that what we're really limited by is the AI and AI progress, and it's on this very steep improvement curve, whereas the IDE has already been invented. We don't want to reinvent the IDE as it is. We want to take the first-principles experience that we want to provide and then rethink it completely from the ground up. So that's what we've been working on.
Raza Habib (05:06):
So if you are going to build an IDE from the ground up for developers, kind of an AI-first-from-scratch IDE, obviously that's going to come with a lot of cost in the sense that developers are already familiar with their existing IDEs. There are a ton of keyboard shortcuts. They understand how to work within them. What are you imagining with first principles from AI that you can do differently, or that's going to be fundamentally different as a product experience, that makes you think it's justified to reinvent the whole product rather than integrate the AI into the existing workflow?
Michael Royzen (05:37):
The biggest idea here is that developers will be writing code less, and IDEs are basically solely optimized for writing code. An IDE is designed purely for, okay, here's this code I want to write, how can I write it? What we're a lot more interested in, and what we think will make developers superhuman, is helping them think better: focusing on helping developers fully unlock their thoughts and go from an idea to an understanding much faster. And then we'll write a lot of the code for them. So the big insight here is that we think the actual code writing should not be the focus of this product. The focus should be on how can we help pull all of the creative juices out of you and help you understand your own ideas better and formalize them and do the algorithm design and all of that?
Because the part that we find fun as developers is still really hard. You can't code anyway if you don't have that. And it's something that AI currently isn't good at either. So that's what we think will stay true in the long term: humans will continue to be better than AI at these high-level "how do I do this" type tasks. And sure, the LLM can help pull it out of the human, and that's what we're really working on. But if the LLM can truly invent new products as well as humans can, then once we get there, there are all sorts of other wide-ranging implications for society. So one of our bets is that we won't get there for quite a while.
Raza Habib (07:14):
So I'm kind of coming along with you, but it sounds like the argument is that coding is only part of what you're doing when you're doing software development. The IDE is built around helping you write code. It's a text editor with a bunch of extra features, and you're saying, "Hey, a lot of the actual producing the text of the code, we can automate a lot of that. The hard part is the thinking and the coming up with the product and solving problems." So assume that I grant that premise. I just believe you on that. How will the product be different from a traditional IDE such that it enables this?
Michael Royzen (07:44):
So, sure, it's focused a lot on a conversation with the AI and also real-time previews that are not really possible within an IDE, because so much space and focus is on the code itself. What we want to put disproportionate focus on, versus a traditional IDE, is that the user themselves has a hypothesis. They're building this app, and they're trying to test their own hypotheses as quickly as they can. How can we facilitate that through a product? So we have these user interfaces that produce very fast UI and functional mocks that the user can very quickly play with, interact with, test, and get feedback on. And that's a very big part of the UX when using the new Phind experience, versus solely this IDE that's mostly code, maybe with a chat bar on the side. I think what those products are really missing is a more real-time interactive experience that lets you really feel it, really use the product, and tighten that iteration loop.
Raza Habib (08:35):
Now I'm beginning to be able to picture it. So it's actually saying there should be something much more interactive here. I think Anthropic recently had the Artifacts product, and it sounds like it's got bits of that kind of intuition in it, where I want to be able to have an idea, see a render of it, and interact with it.
Michael Royzen (08:50):
Exactly. And Claude Artifacts, I think, is actually a brilliant product. I use it myself for things, but I think it's clear by now that what developers really want is something that's deeply integrated into their workflow. Time and time again, we hear from our users that they don't like going to the web, they don't like switching tabs, and that's why we have a VS Code extension.
Raza Habib (09:09):
So Michael, one of the reasons I wanted to interview you in particular was how early you were to the space. You guys built the first quote-unquote answer engine, the first system that used an LLM to do question answering that was widely publicly available, well ahead of Perplexity, well ahead of others, and you did it while still a college student and in a world where the technology was very different. We didn't have LLM APIs yet. So I'm kind of curious, when you started, what did it take to build a product then? How has that changed?
Michael Royzen (09:34):
Yeah, it was like the Wild West back then. I started on this in the fall of 2020, shortly after I had played around with the Hugging Face Transformers library, which I think had come out a couple of months earlier. And so I was playing around with all of these BERT models, building an unstructured invoice extraction tool with them. And I was really fascinated by the fact that these models could classify text super well, but they couldn't really write text. And the state of models that could write text at the time, like GPT-2, wrote, in my opinion, complete gibberish. GPT-3 had also come out by then, and I tried it out. I wasn't really impressed by it. I was like, this isn't that much better than GPT-2, which I know is a bit of a hot take because obviously…
Raza Habib (10:23):
Definitely a hot take.
Michael Royzen (10:23):
I understand all the research and the hard work that went into it, and I definitely appreciated that. And I noticed that the text that it generated was a lot more coherent, but it was still fundamentally as useless as GPT-2 because it couldn't answer questions, which was the use case that I was really excited about. And so I went really deep down this rabbit hole figuring out, well, how can we have these large language models answer questions?
Raza Habib (10:47):
And when you say couldn't answer questions, you mean factual questions, because the model only knows what's in its pre-training dataset. So if I ask a question outside of that, and also I guess this was before instruction tuning, so there was no reason for the model to try to be accurate. They were just next-word prediction machines at the time.
Michael Royzen (11:04):
Exactly. There was no widespread notion of instruction tuning, at least outside of the research labs at the time. And so even when you few-shotted the model, when you provided a couple of examples of "here's what I want a sample response to look like," it was still very iffy. It would go off the rails all the time. What those models really wanted to do, and what they were trained to do, was just tell stories and repeat internet text. And they were quite good at that if you could prompt them in such a way that they could generate a story or even an email. They were decent at those things. But even with a lot of tuning, it was very difficult for them to be good at answering any sort of question, especially reliably. And so I was kind of intrigued by this, but what really made me obsessed with this space is when I saw a demo by this ex-Meta researcher who I think was at Hugging Face at the time. His name was Yacine.
Raza Habib (11:59):
Yacine?
Michael Royzen (11:59):
Yes. And he made this demo using a BART model, which was this encoder-decoder model released by Meta, 500 million parameters, so very small by today's standards. And he fine-tuned it on Reddit to basically follow this instruction format of wanting to answer in response to a question. And then he also connected it to a RAG system. So he set it up so that, given a question, it would perform a lookup on Wikipedia and give the 10 most highly ranked results back. And the funny thing is it was super well implemented. He implemented it both using sparse retrieval, with something like Elasticsearch, and also dense retrieval, using Meta's dense vector retrieval database at the time. And it still flew off the rails a lot, but I was intrigued enough by this problem to take that as a starting point and see how far I could push it.
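The two retrieval paths Michael mentions, sparse (keyword-based, as Elasticsearch provides via BM25) and dense (vector similarity), can be sketched with toy in-memory stand-ins. The scoring functions below are illustrative only; the real demo used BM25 and a learned dense retriever, not character counts.

```python
# Toy sketch of hybrid retrieval: a keyword-overlap score standing in for
# sparse (BM25-style) retrieval, and a character-frequency "embedding"
# standing in for dense retrieval. Both are deliberately simplistic.
import math

DOCS = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]

def sparse_score(query: str, doc: str) -> float:
    # Keyword overlap, length-normalized: a stand-in for BM25 scoring.
    q = set(query.lower().split())
    d = set(doc.lower().rstrip(".").split())
    return len(q & d) / math.sqrt(len(d))

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized letter-frequency vector. A real dense
    # retriever would use a learned model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dense_score(query: str, doc: str) -> float:
    # Cosine similarity (vectors are already unit-normalized).
    return sum(a * b for a, b in zip(embed(query), embed(doc)))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Blend both signals; real systems rank and merge more carefully.
    ranked = sorted(
        DOCS,
        key=lambda d: sparse_score(query, d) + dense_score(query, d),
        reverse=True,
    )
    return ranked[:k]
```

Hybrid setups like this are common because sparse retrieval catches exact terms while dense retrieval catches paraphrases; the demo Michael saw implemented both paths separately.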
Michael Royzen (12:53):
And so a couple of months go by, and I tune the model myself. I do some work on the document retrieval system. I set this up end-to-end, and not that much progress. I couldn't make it that much better. I made it a little bit better, but not that much. And so I shelved this idea for almost a year. I still had a year and a half left in college at this point. I started working with this professor at UT named Greg Durrett. I came to him in the fall of 2021, almost a year later, and I was like, "Hey, there's been some development in the space. I worked on this about a year ago, but now there's this new model called T0, which is a derivative of the T5 model released by Google in 2020. But this model is fine-tuned from the get-go to be a lot better at following instructions for answering different kinds of questions."
Raza Habib (13:46):
And this was one of the first papers, if I recall correctly, that was trying to frame all NLP tasks, all natural language AI tasks, as just text-in, text-out tasks. In the previous paradigm, you would have custom models for every use case. For listeners who have joined us in the post-LLM world, where we have these very general-purpose models, it's maybe difficult to appreciate that even just four years ago, that was a very new idea. So previously, if you wanted to do classification, if you wanted to do NER, if you wanted to do text extraction, you would build a custom dataset and a custom model for every task. And I think T5 was one of the first papers, if not the first, where they said, "Hey, why don't we take the mindset that all of these tasks, whether it's classification or NER or question answering or summarization, are actually all just tasks that take some form of text in and spit text out, and we can frame all of these different problems with one model." Again, it's very easy to take that for granted today because it's become what everyone does. But I think at the time it was extremely novel.
Michael Royzen (14:43):
Yes. And this paper is kind of what made me realize that it was now actually possible, I think, to build a useful product of this type. And what was so new about T0 was the fact that, kind of like you said, Raza, they created a singular dataset that had many, many different kinds of prompts across many different kinds of tasks. So it was one model that was fine-tuned for a very diverse set of question-answering tasks and even text extraction tasks. And I think some other tasks were in there as well. And it was also significantly larger than BART. So unlike BART, which was roughly a 500 million parameter model in its largest size, there were two models: a 3 billion parameter model and an 11 billion parameter model. And so the larger size, I think, also helped a lot with making the model more coherent and staying on the rails, even though it's tiny by today's standards.
Michael Royzen (15:36):
And so, because the model was less likely to go off the rails to begin with, I took inspiration from Yacine's strategy of really focusing on Reddit. And so we focused it on two categories: I focused it on Reddit, and on Stack Overflow and Stack Exchange questions, to have a large dataset in a more singular format on top of all of the work that they had already done with T0. And so I fine-tuned it for that very specific format to improve the performance specifically for general question-answering tasks. And then the other thing I did was really on the retrieval system. So I wanted this to be able to retrieve information from basically the entire internet. And I didn't want to cop out by just using the Bing API. So I started with a Common Crawl dump, and I downloaded all of Common Crawl. Common Crawl is basically a breadth-first dump of the internet. So for any given website, they might not necessarily have the whole website, but they include most of the publicly accessible websites on the internet in this dump. It's like three terabytes worth of text or something ridiculous like that. And so I wanted to keep only, say, the top 10,000 websites, which I guessed is where most questions would be answered from. So I got rid of the long tail, focused just on keeping the Common Crawl pages from the first 10,000 websites, and indexed those using an Elasticsearch cluster. And I was able to get it running on a relatively small EC2 instance on AWS, one instance where you ask it a question and it's able to retrieve relevant documents from the most popular websites on the internet that contain the answers to that question. And when I put that system together end-to-end for the first time at the end of 2021, at the beginning of 2022, I did a Show HN on Hacker News in January 2022 being like, "Hey, I put this thing together," and we got some mixed feedback.
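The domain-filtering step described above can be sketched as follows. The domain list and record format are illustrative stand-ins; the real pipeline parsed Common Crawl's archive files and indexed the survivors into Elasticsearch.

```python
# Sketch of filtering a crawl down to a whitelist of popular domains.
# TOP_DOMAINS stands in for a real top-10,000 list; each dict stands in
# for one crawl record (URL plus extracted text).
from urllib.parse import urlparse

TOP_DOMAINS = {"stackoverflow.com", "en.wikipedia.org", "github.com"}

def keep_page(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    # Keep the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in TOP_DOMAINS)

def filter_crawl(pages: list[dict]) -> list[dict]:
    # Drop the long tail; only whitelisted domains survive to indexing.
    return [p for p in pages if keep_page(p["url"])]
```

Filtering by domain before indexing is what makes the corpus small enough to serve from a single modest Elasticsearch node, at the cost of losing long-tail sites.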
People who tried it and it worked were like, "Wow, this is game-changing." And then other people didn't get what they wanted, and they were like, "Ah, yeah, I don't know about this." But when it did work, it was so magical that I was obsessed with this idea. I couldn't sleep; I couldn't stop thinking about it. And so I applied to YC. My application was basically like, "Hey, look, this is how people are going to find information." And the central question that I was struggling with, and what people were asking me too, when I was having coffee chats with some friends, some of whom were in the VC business, was, "Yeah, who's this for?" And I'm like, "I don't know, it's for me." I used it for kind of Wikipedia-type searches and stuff, but I didn't really know if that was a long-term winnable business versus Google. And so I decided around that time to focus on software developers.
Michael Royzen (18:18):
And this was kind of a twofold hypothesis. The first one was that we can help software engineers, and at the time that meant basically being a better Stack Overflow: they have a question that can be answered by looking at something on the internet, boom, we can answer it. And the second hypothesis was the Paul Graham hypothesis, which he wrote an essay about all the way back in 2012: that perhaps it is possible to displace Google, and the way to do that would be to first make a search engine that all of the developers use. That's what Google actually did. Google got all the hackers to use Google, and then they opened up the scope later, rather than going with this over-broad "we're everything to everyone" approach all at once. And so we focused on developers, got into YC, and immediately after college, two weeks later, we started. We launched initially as Hello, and then we rebranded to Phind. But ever since the summer of '22, we've basically been doing what we're doing now. So that's a long-winded tour of our origin story.
Raza Habib (19:22):
Before we dive into the technical details of how Phind works, one thing I'm curious about: Paul Graham, the founder of YC. PG doesn't do that many active investments anymore; he's not an active partner at YC these days. You guys are one of the rare companies that he's personally invested in. How did that happen? How did you convince him to invest? I'm kind of personally just very curious about this.
Michael Royzen (19:44):
Well, we basically told him about the vision. We showed it to him and we were like, "Hey, people don't want links. People want answers. This is the future." And as he was looking at it, I could see that the gears were turning in his head. I think he was happily surprised by it. And the way that it worked out, I had chosen the last office hour spot that he was doing in person that day. And he got so excited talking about this with us. First of all, I just asked him to invest on the spot. I could see the gears were turning. I was like, "Hey, do you want to invest?" And he's like, "Yes, of course I want to invest. This is so cool." And then he checked his watch; it was like seven or seven-thirty. He was like, "Hey, I have to go home to Jessica. I have to cook dinner, but do you guys want to come with?" And so we're like, of course we want to come with. And so he takes us to his lovely home, and we sit down, and Jessica's there too. So we got to meet Jessica, and Jessica's absolutely wonderful. I think she does not get nearly enough credit for her role as literally the co-founder of YC, but she's amazing. And we're sitting at their backyard table, and Paul Graham just looks at Jessica and he's like, "Our grandkids aren't going to know what a search result is. They're just going to have answers." And this was kind of a mind-blowing moment for everybody, because this was pre-ChatGPT. This was long before LLMs had become known for answering questions in a way that people were consuming. And so we all just got super excited about this idea of, hey, we can answer questions directly instead of just sending people to links.
Raza Habib (21:23):
Since then, it feels like you guys have shifted the vision somewhat from being a search engine for developers with a view to taking on Google. I'm kind of curious, what do you think of Perplexity? It's got a huge valuation, it's been growing really fast. It is closer maybe to the original Phind thesis than you guys are now. Do you think it's going to be successful?
Michael Royzen (21:42):
There's a couple of things that happened that I think shifted my perspective on the whole industry. The first thing is the release of ChatGPT. And I think that's really the main catalyst that made it a lot more difficult for a startup to compete in this space, because it was so much better than everything else for so long that ChatGPT is what became known as the original AI answer engine. And what that caused was the famous code red alert at Google. It forced everyone to be like, we have to get on this. One of our original hypotheses was that Google was going to be very slow to market, not just because they have all these safety bureaucracy issues inside the company, but also because there are political reasons why they wouldn't want their AI saying things that could be wrong or offensive.
Raza Habib (22:28):
And I guess for Google, it's a classic innovator's dilemma as well. They get a huge amount of their revenue through search results.
Michael Royzen (22:36):
ChatGPT's success blew all of that up immediately. And so we see now that Google is building generative LLM results directly into search. And yes, they still have a lot of work to do on improving the quality of that. But fundamentally, building a generic answer engine that just synthesizes stuff from the internet, Perplexity-style or Google-AI-Overview-style, and then gives a general answer, everything to everybody, that's not a game I think a startup can win. Because in their AI Overview announcement, even Google themselves said, yeah, we were able to lower the cost 80% over the last year. All of the blockers to them doing this well, cost mainly, have basically gone away. And as models have gotten better, it's very technically simple now to have a small, efficient, cheap, fast model that just synthesizes all this data and produces a basic generic answer in an instant.
Raza Habib (23:35):
So you would predict that Google beats Perplexity in this race?
Michael Royzen (23:39):
Yeah, I don't think it's winnable by a startup.
Raza Habib (23:40):
Interesting. So what do you think is going to happen to Perplexity?
Michael Royzen (23:44):
I mean, I don't have a crystal ball. It depends on what they want to be. My general prediction is that because of this rat race that was caused by ChatGPT, LLMs themselves have been commoditized a lot faster than I would have predicted pre-ChatGPT. And what that's resulted in is that I don't think a startup can be everything to everybody. I think you have to choose who you want to serve. I think that there are a lot of valid use cases in, say, serving researchers, serving people looking for products, and building beautiful, bespoke, highly optimized experiences just for those verticals. So I'm a huge believer that there will be vertical search and vertical AI chat applications that take a domain, do it extremely well, and then be basically untouchable, because other companies then have innovator's dilemmas of their own in competing with them. And also, it's a matter of focus.
As a startup, you really need to be laser-focused on one specific thing, or you're just going to build a mid, generic thing for a lot of people. And so I think that Perplexity has raised enough money to sustain this burn, where they're offering slightly more advanced models than Google is currently offering in their AI Overviews product. And they also enable follow-up chat, which Google AI Overviews currently does not. But I just don't see how this is sustainable in the long run. The problem is Google is already entrenched in Chrome and on iOS, and they have all these deals; Google's already the default everywhere. And this is a problem that they're clearly working on. All of the previous innovator's dilemmas that they had don't exist anymore. They're clearly trying to compete in this space.
Michael Royzen (25:25):
So all of it comes down to, will they execute? And of course, that is the ultimate Perplexity question. But I think at the end of the day, the unfortunate reality is that Perplexity can still be better, and Google can still win just because of the platform advantage. The delta of improvement has to be significant for the mass general population to be willing to go through the switching costs and move to another platform. And that's the fundamental problem. The single biggest reason why I started working on Phind and took the YC leap and all of that is because the delta of AI-generated answers versus no answers is massive; that's at least a 10x improvement when done well. But a slightly better LLM answer versus a slightly worse LLM answer? I don't know. Especially when the slightly worse LLM answer is literally everywhere and it's already on all of the tools you use. I'm almost ashamed to admit that I myself will sometimes just open Safari and type something in, and boom, it's there. It's instant. I don't even use my own product sometimes; I don't even use, say, Perplexity sometimes. It's just there. It's just the fastest. And so, long story short, I think it comes down to a matter of focus. Those who are focused will have a higher chance of succeeding. And for the record, I wish them the best. I think it's a very interesting problem, and a problem worth solving, and I hope that there's an angle here that will be interesting.
Raza Habib (26:54):
I'd love to get into the technical details now. So, how does Phind work? What's going on under the hood?
Michael Royzen (26:59):
Yeah, the fundamental architecture is very simple, and it's very similar to what it used to be back in the day: we have a retrieval step where we get web results that we think are relevant. And of course, there's some stuff that goes into that. So we do some intelligent query rewriting: we have a very fast and small LLM that takes your query and reformulates it in a way that we think might be more relevant, and it's optimized specifically for technical searches. That's how we help make sure that the best technical sources are in the search results. And then we also do a lot of other pre-computation on the fly. For example, we decide, is one search sufficient or do we need to do multiple searches? So we have this auto multi-search mode that basically runs from a classifier.
And so, if it does, then we do multiple searches and aggregate all of that information. We feed all that information to an embeddings model that we've customized over the years. This embeddings model takes in the technical information; it's tuned specifically to perform well with code, and it's also tuned for high throughput and speed. So for every single request that comes in, we actually do up to eight-way parallelization for the embeddings to make them significantly faster. Our goal is for the embedding step to complete in 100 milliseconds, basically, with quite a bit of text coming through it. We sometimes pull a lot of sources, and we need to organize it all in 100 milliseconds or less. So that was a really fun technical challenge, figuring out how to get that timing down. And then we form the context that we send to the model, and then we send it to the model.
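The parallel embedding step can be sketched like this. The embedding call is a stub; the eight-way fan-out and the order-preserving gather are the point, not the stub itself.

```python
# Sketch of fanning out embedding calls across up to 8 workers to fit a
# tight latency budget. `embed_chunk` is a stand-in for a call to a fast
# embeddings model; real systems would also batch and cap chunk sizes.
from concurrent.futures import ThreadPoolExecutor

def embed_chunk(chunk: str) -> list[float]:
    # Placeholder "vector": deterministic features of the text.
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 997)]

def embed_all(chunks: list[str], parallelism: int = 8) -> list[list[float]]:
    # pool.map preserves input order, so results line up with chunks
    # even though they were embedded concurrently.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(embed_chunk, chunks))
```

With a network-bound embedding service, this kind of fan-out cuts wall-clock time roughly by the parallelism factor, which is what makes a 100-millisecond budget plausible for many sources.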
And we run both GPT models and Claude Sonnet now, as well as our own custom models that serve the vast majority of the traffic on the platform. And our own custom models, that's been quite a journey, developing those as well. We started out with our own models way back in the day just because OpenAI didn't have anything that would work; there were just no API models that could serve our needs in mid-2022. And after GPT-4 came out, everyone kind of lost their minds, including us, and we were like, okay, we have to put this into the product. And for a while, the product was primarily GPT-4-based, particularly as a technical product. And actually, us putting GPT-4 into the product resulted in, I think, one of the highest-upvoted Hacker News posts of all time. We got something like 1,400 upvotes on a product launch, which is very high.
That was very exciting to see. And for a while, we were GPT-4-only, but we had people always asking us, why is it so slow? I want it to be faster. I don't want to have to grab a cup of coffee every time I ask a question. And also, it was very expensive for us to run. And so running our own models was a way for us to figure out, hey, how can we make a custom coding model? We don't have to worry about other cases, so that focus really simplified things for us. How can we make just a coding model that will run fast and that people can use most of the time as the default? And that's why we ended up building out the Phind models, which are Llama derivatives. And so today we have two different Phind models. We have a tune of the Llama 3 8 billion parameter model, which is designed to just serve the original vision of Phind: how quickly can we pull everything and write a Stack Overflow or documentation-style summary of web text. And then we have a larger 70 billion parameter model whose goal is to be the best model for most things, generally speaking, as a balance between speed and answer quality. And of course, now we're very excited about Llama 400B on the horizon. So I think that's a general summary of how everything works end-to-end. But yeah, I think there's also a lot of interesting stuff happening in models that could be interesting to discuss as well.
Raza Habib (30:42):
Yeah, let me just make sure I've understood it correctly, and then I would love to chat about the model part of it. But it's fundamentally a RAG pipeline. There's a search index. A search query comes in, you're doing a search, retrieving the results, embedding them, summarizing the context, and then providing that to the model. But there's a lot of nuance, it sounds like, in making it work well. So actually, you're rewriting the question after it comes in to make it better suited to technical questions and to improve the search. You're doing both dense and non-dense search. You're optimizing the speed of the embeddings so that you can do fast queries, and you have intermediate models figuring out how to do some of those things. And then you're making a choice about which model is finally used to generate the final answer. One thing I would love to hear a little more about: you said we retrieved this context, we put it in the context window, and then we generate an answer. How much prompt engineering goes on? How much do you have to iterate on how you structure the context, or are you just fine-tuning? How do you make good answers come out of this system?
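As a rough illustration of the pipeline Raza summarizes here (not Phind's actual code), the flow can be sketched in a few lines of Python. Every component below is a stand-in stub: the query rewriter, the sparse and dense retrievers, and the final generation step would all be real models or indexes in production.

```python
def rewrite_query(question: str) -> str:
    # In production a small, fast model rewrites the question into a
    # search-friendly form; here we just normalize it as a placeholder.
    return question.lower().rstrip("?")

def keyword_search(query: str, corpus: dict[str, str]) -> list[str]:
    # Sparse ("non-dense") retrieval: naive term overlap stands in for BM25.
    terms = set(query.split())
    scored = [(len(terms & set(text.lower().split())), doc_id)
              for doc_id, text in corpus.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def dense_search(query: str, corpus: dict[str, str]) -> list[str]:
    # Dense retrieval would embed query and docs into vectors;
    # substring matching is only a stand-in here.
    return [doc_id for doc_id, text in corpus.items() if query in text.lower()]

def answer(question: str, corpus: dict[str, str]) -> str:
    query = rewrite_query(question)
    # Merge both retrieval paths, keyword hits first, deduplicated.
    doc_ids = list(dict.fromkeys(keyword_search(query, corpus) + dense_search(query, corpus)))
    context = "\n".join(corpus[d] for d in doc_ids)
    # A real system would feed `context` to an LLM; we return it for inspection.
    return f"CONTEXT:\n{context}\nQUESTION: {question}"
```

A production system would replace the stubs with a real rewriting model, BM25, and a vector index, and pass the assembled context to the answering LLM.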
Michael Royzen (31:40):
Funny enough, not that much prompt tuning is required, particularly for our own models. They're not trained to be diverse models. They're only trained on a handful of prompts, mostly technical.
Raza Habib (31:53):
So they're fine-tuned on input-output pairs for your use case.
Michael Royzen (31:57):
Right, exactly. So we did a lot of work.
Raza Habib (31:58):
So you're fine-tuning rather than prompt engineering.
Michael Royzen (32:01):
Yeah, we did a lot of work at the fine-tuning level to get these models to respond the way that we wanted. And we did, I think, some clever things at the fine-tuning stage as well. We don't really want to train on other models' synthetic outputs because they're frequently wrong. Code produced by models will have bugs in it. So our strategy for producing training data was to start with the code and use that as the label, as the gold standard that we're going to train the model to predict: the human-written code that we already know is right. Then we generate a synthetic input to create an input-output example that we can train on. So rather than having an input and then generating synthetic data for the output, we start with the output that we know is correct, and then we generate a synthetic input to create the pair. And it turns out that generating synthetic data for the input is a much simpler task than generating synthetic data for the output, because today's models are fantastic at summarization, with very little hallucination. So giving the model a piece of code, and this takes some prompt tuning obviously, and saying, "Hey, write a sample input that has all the information necessary to recreate this piece of code," that actually works surprisingly well.
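The backwards data-generation idea Michael describes can be sketched as follows. `llm_summarize_to_prompt` is a hypothetical stand-in for a real LLM call, not Phind's actual tooling; the key point is that only the input is synthetic, while the trusted human-written code stays as the label.

```python
def llm_summarize_to_prompt(code: str) -> str:
    # Placeholder for the real step: asking a model to "write a sample
    # input that has all the information necessary to recreate this
    # piece of code". Summarizing code into a request is much easier
    # for a model than writing correct code from a request.
    first_line = code.strip().splitlines()[0]
    return f"Write a Python function like: {first_line}"

def make_training_pair(gold_code: str) -> dict[str, str]:
    # The known-correct code is the output label; only the input is synthetic.
    return {"input": llm_summarize_to_prompt(gold_code), "output": gold_code}

pair = make_training_pair("def add(a, b):\n    return a + b")
```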
Raza Habib (33:18):
So it feels like there's maybe a generalizable lesson there, which is that if you're trying to use models to create synthetic datasets, it's much easier to start from a dataset of answers and generate accurate questions than it is to start from a dataset of questions and generate the answers.
Michael Royzen (33:35):
Absolutely. And in our experiments, trying to get LLMs to generate high-quality questions is very difficult. The approach where you say, "Generate me a hundred topics," and then take each of those hundred topics and generate another hundred subtopics, that just does not work very well in our experience.
Raza Habib (33:52):
What other generalizable lessons have you come across in the process of building Phind? This strategy for generating synthetic data feels like one. Are there any other things you've discovered or had to figure out along the way that you think anyone building AI products could use, or that they should try to adopt?
Michael Royzen (34:09):
So I think that at the product level, minimize the chances of the models' inherent non-determinism screwing up a user experience that needs to be deterministic. That is probably the most general way to describe how we think every day at Phind.
Raza Habib (34:28):
Wait. So what does that mean? I don't think I fully follow. Where do I want to try to remove the non-determinism? What's an example?
Michael Royzen (34:35):
So part of it is really using test-driven development in creating AI applications at the engineering level. For every single AI invocation in your product, particularly when it absolutely needs to be reliable within a range of acceptable values, first of all, make the AI as self-contained as possible. Focus is very important for prompt tuning as well. So keep each of the sub-tasks that you might deploy an AI on to as constrained a purpose as possible; keep the message as unpolluted as you can. And then for each of those submodules, make sure that it works 90-plus, 99% of the time specifically for that thing. Something we had to learn the hard way was how to actually engineer a product from the ground up where we have to make sure these modules actually work. So we've written automated tests that need to pass. The tests are run not once but 20 times on a single unit every time the tests run. We calculate the percentage that passes, and if it falls below the minimum threshold, then we know it's bad. So that's a fundamental principle in software engineering that we had to expand on a little: how do we actually make reliable components using LLMs? That's what I meant by the non-deterministic component.
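A minimal sketch of the repeated-run testing idea Michael describes, assuming a hypothetical flaky `classify` component in place of a real model call: because the component is non-deterministic, a single pass/fail tells you little, so each test runs many times and must clear a minimum pass rate.

```python
import random

def pass_rate(test_fn, runs: int = 20) -> float:
    # Run the same test repeatedly and report the fraction that passed.
    passes = sum(1 for _ in range(runs) if test_fn())
    return passes / runs

def classify(text: str) -> str:
    # Stand-in for a non-deterministic model call that is right most
    # of the time (~95% here).
    return "code" if random.random() < 0.95 else "prose"

def test_classifier() -> bool:
    return classify("def f(): pass") == "code"

random.seed(0)  # seeded only so this illustration is reproducible
rate = pass_rate(test_classifier, runs=20)
print(f"pass rate: {rate:.0%}")  # gate deploys on e.g. rate >= 0.9
```

The same harness can wrap any AI sub-module; the threshold per module is a product decision, not a fixed number.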
Raza Habib (36:06):
Okay. So you make heavy use of evaluation and testing for every sub-component, such that each reaches some minimum performance threshold individually as well as joined together, to have confidence in what is otherwise essentially a question of systems engineering: how do I make a reliable whole out of unreliable parts?
Michael Royzen (36:26):
Exactly. And I think it also comes down to focus, just like how you need to be focused as a startup, as a team, etc. The AI benefits tremendously from focus. And something we've done is start thinking as the AI: okay, if some human came to me and asked me what I'm asking of this model right now, with all the information I'm giving it, could I as a human reliably produce the right answer? There have been a couple of cases where we're like, okay, we're just feeding the AI everything, but I as a human have no idea what's happening. I have no idea what's going on. I couldn't answer it myself. At minimum, make sure it's something you could do as a human, because if you can't, the model definitely isn't going to be able to do it.
Raza Habib (37:10):
What's your workflow for improving Phind? How do you make it better over time?
Michael Royzen (37:13):
Yeah, so we have a data flywheel where data is coming in on what worked and what didn't work. We have all sorts of real-time feedback signals coming in from the website. We have the thumbs up, thumbs down. We have AI classifiers that we can use to determine offline, after the fact, whether a given answer actually answered the question appropriately or not. And a lot of self-reflection works as well. We've built datasets where we've had the model reflect on "Hey, what did it get wrong? How can we fix it?" and used that to continuously improve the models.
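One hedged sketch of such a feedback triage step: the `judge` function below is a crude term-overlap stand-in for the AI classifier Michael mentions, and the log format is invented for illustration. The idea is simply to combine explicit user signals with an offline judge to collect the interactions worth fixing.

```python
def judge(question: str, answer: str) -> bool:
    # Placeholder for an AI classifier scoring whether the answer
    # addressed the question; crude term overlap stands in here.
    stop = {"how", "do", "i", "a", "the", "in", "to", "my"}
    q_terms = set(question.lower().split()) - stop
    return bool(q_terms & set(answer.lower().split()))

def triage(logs: list[dict]) -> list[dict]:
    # Keep interactions flagged bad by users (thumbs down) or by the
    # offline judge; these become candidates for improving the models.
    return [log for log in logs
            if log.get("thumbs") == "down" or not judge(log["q"], log["a"])]
```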
Raza Habib (37:47):
How do you use that to improve the models? So you gather this feedback data. How do you use it?
Michael Royzen (37:50):
So we create new training data by having the models self-reflect, thinking step by step about what they got wrong, and trying more complex tree-of-thought-style reasoning to correct their answers. And that's not perfect. That's still a noisy training method, and it's not always possible, but it does help. Giving the models more time to think offline and then reincorporating that back into real-time inference usage, where they have less time to think, is an effective strategy. Something else that's been really effective, at the product level, is seeing people use it in person, inviting people over or hosting in-person meetups and actually watching how people use it. It's remarkable how differently people sometimes use it than you do. One of our top users keeps a trailing Google Doc of all the searches he tried and all the different combinations of queries he tried, as well as their answers.
Michael Royzen (38:55):
And we're like, wow, that's very interesting. It gives us ideas about how we can better structure your thoughts, help you manage them, and let you explore further. And the reason that was so important is that it feeds directly into our hypothesis that very often programmers don't even know themselves what they're trying to do. It's very clearly a problem when someone has to keep a Google Doc to organize their thoughts. They're trying to use Phind to help them organize it, but clearly it's not enough. So I think that focusing on product-level experiences like that is also very important. And then finally, realizing that the model is not the product. For a while, that was a very blurry line for us to walk, where the model was the most important part of the product. And I think that was certainly true in a pre-commoditization era.
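The offline self-reflection step Michael described a moment earlier could look roughly like this. `llm` is a placeholder for any chat-completion call (no real API is assumed), and the prompts are illustrative only: the model critiques a known-bad answer with extra time to think, and the revised answer is stored as new training data for the faster online model.

```python
def llm(prompt: str) -> str:
    # Stub: a real system would call a large model here, with generous
    # offline compute (chain/tree-of-thought, multiple samples, etc.).
    return f"[model response to: {prompt[:40]}...]"

def reflect_and_fix(question: str, wrong_answer: str) -> dict[str, str]:
    # Step 1: ask the model what went wrong, step by step.
    critique = llm(f"Question: {question}\nAnswer: {wrong_answer}\n"
                   "Think step by step: what is wrong with this answer?")
    # Step 2: ask for a corrected answer conditioned on the critique.
    revised = llm(f"Question: {question}\nCritique: {critique}\n"
                  "Write a corrected answer.")
    # The (question, revised) pair becomes offline training data.
    return {"input": question, "output": revised}
```

As Michael notes, this is a noisy labeling method, so in practice the revised answers would themselves be filtered before training.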
Michael Royzen (39:45):
So in the era when there was only one model that was good at programming or at answering this specific type of technical question, then yes, the model was in some ways really the star of the product. But now there are so many models of roughly the same capability, and OpenAI hasn't really fundamentally made their models better in over a year. My hot take here is that, at least for programming, OpenAI's models have not fundamentally gotten better since the first GPT-4 release. What they have gotten better at is following instructions. So GPT-4 Turbo and GPT-4o are a lot better at responding in exactly the format the user wants them to respond in. But these models are actually more prone to fundamental errors or syntax errors when writing code than the original GPT-4. And so what's happened, with so many models now from OpenAI, from Anthropic with Claude, from Meta as well, and the models we build fine-tuned from Llama, is that the model is essentially a commodity. And you have to think harder, I believe, about the product experience as a whole rather than saying, oh, okay, I just got the best answer, I'm done. Now that the delta between models is so small, it's really more about how we can help you as a developer get from idea to product end-to-end in the least frustrating way possible.
Raza Habib (41:23):
What advice do you have for people building AI products today? If I'm a product manager or an engineer and I'm building an LLM product for the first time, what lessons would you give someone, or a friend who asks you, "Hey, I'm just starting out on this"? What would be your tips?
Michael Royzen (41:39):
I would say always work from first principles, because whenever you're in an AI hype cycle, it's very easy to be like, "Oh, this is so cool, let me build it." And that's a very dangerous thing, because what is cool is sometimes not what you need to do to make the product functional for its purpose. Ideally there's an overlap, but I think people working on AI need to stay really focused and really disciplined about how, every single day, they are actually helping the user end-to-end with their problem. Sometimes people ask me for advice on how they should integrate LLMs, and it turns out they shouldn't use LLMs. They should use classical programming techniques. It'll be cheaper, it'll be faster, it'll be deterministic. You still have to reason from first principles about: is this really necessary for the very specific objective I'm trying to achieve? And work backwards from there.
And then of course, to reiterate, if you are working with LLMs, start by prototyping with GPT-4 or Claude. Don't start by fine-tuning your own models in 2024. Start with a great product like Humanloop that can help you figure out which prompts you should use and help you prototype as quickly as possible, to figure out: can I actually solve this problem using GPT-4? And then over time, if your product really takes off and you have a lot of volume, you can get into scaling. And scaling is a whole beast of itself, where you figure out, okay, how can we lower the unit economics of running these models on a per-inference level? At that point, fine-tuning really starts to make a lot of sense, because it's possible to run a fine-tuned model on your own hardware in a very high-throughput setting and save 80-90% over GPT-4, even GPT-4 Turbo. But don't get ahead of yourself with fine-tuning models. Always focus on how you can actually solve this specific problem for your users, for your customers, and stay obsessed with that all the time.
Raza Habib (43:45):
In every technology wave, there's a distribution of value that accrues to incumbents, to startups, to different parts of the stack. People have written about how in mobile it probably went more to incumbents than to startups, and maybe in the first internet wave it went more to startups than to incumbents. Obviously, you are running an AI startup, in a space where you do have competition from larger companies. There's GitHub Copilot, or if we take the search engine example, Perplexity is going up against Google. What's your view on how the value will split for different applications and where it'll accrue between startups and larger companies?
Michael Royzen (44:24):
Great question. As a recap of the Perplexity discussion, I think that in cases where the scope of the product is broader, the value will accrue disproportionately to the incumbents, because the incumbents already have the platform advantage. Google, for example, for generic search: that is something where they have a significant advantage. However, for vertical-specific applications, startups are in a much stronger position, because you're competing against the big companies more or less on even footing. If it's a very vertical-specific application, chances are the big incumbent doesn't have a version of it yet. So it's you, the startup, building this new specific thing from the ground up versus the incumbent. And that's where all of the old rules about startups versus incumbents come into play, where the inherent bureaucratic disadvantages, political infighting, and lack of focus of the big companies really hurt them compared to startups.
So what I expect to see is that companies that are really focused on a particular vertical, solving a very specific need for specific customers, and who keep their eye on that ball, will accrue disproportionate returns versus incumbents. But if it's more general, the incumbents are going to win. And I think it's even more exacerbated by the fact that unique sources of data are an insane advantage in the AI landscape, particularly when the models are commoditized. Take Harvey, for example, the legal AI assistant. Their lawyer network that collects bespoke data for their use case is not something the big companies can replicate as easily. They have that data, they have that focus, kudos to them. And even with Cursor and VS Code, I think it's very impressive that Cursor has built a product that GitHub has still not been able to beat, despite it being a VS Code-style tool and Microsoft owning VS Code. I just don't know how they have not been able to match them on that product in that niche. And I think what that really shows is that knowing your users, having focus, having that tight iteration cycle, it really helps.
Raza Habib (46:44):
If you picture the world five years from now, two-part question: broadly, what do you think will be different because of AI, and more narrowly, how will the work of a developer or someone building a software product be different?
Michael Royzen (46:59):
So more broadly, I think, funny enough, we will think about it less, because it'll be more integrated than it is today. Today, to use AI, you have to think about it. You have to be like, "Oh, I have to open ChatGPT. Oh, I have to open Phind. Oh, I have to open Claude." I think it's going to be integrated natively into many of today's existing platforms, the ones that have already won, and it's just going to work. You're not even going to think about it as a user, because it's just going to be a natural acceleration of what you were already trying to do. And of course it'll also open new doors, it'll open new verticals, but I really think that is still further away. And I'm a little more pessimistic than I was before about it opening completely new verticals, just because the rate of progress in AI has been both faster and slower than I expected. I know that sounds like a contradiction, but what I mean is that everyone who's not OpenAI has moved very, very quickly, but the absolute state of the art has not budged that much in the last 16 months or so.
Raza Habib (48:06):
I've discussed this with a few people on the podcast and I've had mixed opinions. Do you think that's because it's become fundamentally harder to push the frontier and breakthroughs have been less forthcoming than we thought? Or is it just that large model training runs take a long time? The model's in the oven; it hasn't come out yet, and so we might still see large step changes in capabilities. It could feel for long periods of time like not much is happening because we're not actually seeing the progress. And then a new model gets released, right? Between text-davinci-002, the GPT-3 model which you said was effectively useless for answering questions, and GPT-4, from an outside observer, not much happened. That's almost a year's gap or more. And if you weren't sitting inside OpenAI, it was easy to have no clue how much progress was being made. And then GPT-4 came out, and you said yourself that you guys had a kind of red alert moment and switched which model you were working with. Do you think that could be what's going on here, or why do you rule out that hypothesis?
Michael Royzen (49:10):
I think it's very difficult to be on the outside and speculate about what's going on on the inside, as fun as it may be. But at the end of the day, the question comes down to: how much capability is there left to be unlocked in these models? I think that's the fundamental question. So you brought up text-davinci-002 to ChatGPT, GPT-3.5. The insight there was that these models were fundamentally capable; it just took, in retrospect, a relatively basic unlocking step to make that ingrained information useful.
Raza Habib (49:40):
So from GPT-3 to 3.5, from text-davinci-002 to ChatGPT, they figured out how to do instruction tuning and reinforcement learning from human feedback. And you're saying that was the real unlock, more so than a gain in scale and capabilities. And if you buy that hypothesis, that the main difference was the instruction tuning and not the capabilities of the base model, then you might buy the argument that progress stalls unless there's another change of that magnitude coming.
Michael Royzen (50:11):
Right. I mean, it's really an open question. I don't know. For example, GPT-4 basically scaled up this instruction tuning approach to what was at the time really the maximum. My hypothesis is that they chose that model size as the largest model that would be runnable in a production environment and tolerable by users, so something on the order of 15 to 20 tokens a second. Fast forward to today, and I think one of the most interesting breakthroughs we've seen recently is the interpretability paper from Anthropic, the one where they talk about Golden Gate Claude and how they were able to figure out what various features inside the model do and clamp them in a way that makes the models more reliable. So the breakthrough I'm most excited about, which I think we're going to see in the next generation of models, is training for correctness as opposed to training solely for next token prediction.
Michael Royzen (51:11):
So training for process correctness, for generating correct code, for generating correct reasoning, for doing correct math. I think that will be very interesting and very helpful. What I'm less bullish about is whether that will really allow us to unlock a whole range of verticals that are not possible today. I'm not sure. I think what's more likely is that it will enable products in verticals that already exist to be significantly more effective for their end users. And I'm concerned that we're going to hit real diminishing returns once we fully unlock transformers, where the rate at which we can make them bigger becomes hardware-dependent, a question of how quickly we can figure out lower precisions. There's all sorts of optimistic stuff here, like the paper that showed that basic one-bit precision for models is possible without degrading them too significantly. But I don't know if it's going to fundamentally take the model from being able to write a program when given specific instructions to inventing things with creativity, being a kind of creative tool as a service. I don't know if we'll get there without another significant architectural leap.
Raza Habib (52:29):
Yeah. I'm surprised by, or maybe I'm not surprised, but essentially I think what you're saying is that overall you don't think the impact of AI on society is going to be that big, and that actually, roughly speaking, we've got most of what we're going to get from this current wave, and it's just going to be a question of it being distributed more evenly. If I ask you how the world looks five years from now, it's almost like you're saying more or less the same, but with things a bit more ubiquitous.
Michael Royzen (52:54):
I never meant to make a statement about the magnitude of the change or the second-order effects, which I'm sure will be massive. My biggest point was that your average person will experience the benefits of AI without paying too much attention to the fact that it was AI. My point is that AI will be seamlessly integrated into many facets of daily life. That, for sure, will have profound effects, but for the average person it will just kind of happen, and it will feel normal. This is my point living here in San Francisco. I looked at my Waymo app the other day; I've taken 60 Waymo rides now. The first couple of times, it was the craziest thing I've ever experienced, and now I just use it as a practical transportation mechanism and it feels normal.
Michael Royzen (54:56):
And so my bigger point here is that, yeah, I definitely think society will be transformed in many ways. I think people will become vastly more productive. It'll be a lot easier to create businesses. There's going to be a lot more creative output. GDP will increase dramatically. That will all happen. But the craziest thing about all of that is that your average person will still think about their lives, I believe, the same way we do today: oh, my life is cool, there are things that are going well and things that are not going well. I don't think it'll change most people's thinking on a day-to-day basis. And that's the very important point I want to drive home, which is almost a testament to its success rather than its failure: it's going to be integrated so seamlessly that it will feel magical at first, and then it'll all just feel totally normal.
Raza Habib (55:43):
Alright, Michael, well, it's been a pleasure chatting with you, and I'm looking forward to doing it again in the future.
Michael Royzen (55:48):
Thank you. Thank you for having me. This was great.
Raza Habib (55:52):
Alright, that's it for today's conversation on High Agency. I'm Raza Habib and I hope you enjoyed our conversation. If you did enjoy the episode, please take a moment to rate and review us on your favorite podcast platform like Spotify or Apple Podcasts, or wherever you listen, and subscribe. It really helps us reach more AI builders like you. For extras, show notes, and more episodes of High Agency, check out humanloop.com/podcast. If today's conversation sparked any new ideas or insights, I'd really love to hear from you. Your feedback means a lot and helps us create the content that matters most to you.
About the author
- @RazRazcle