The Contrarian's Guide to AI

Raza Habib

Jason Liu has consulted for dozens of enterprises on RAG and LLM projects. You may know Jason from his snarky yet spot-on hot takes on X, or from his viral presentation at last year’s AI Engineer conference, “Pydantic is all you need.”

In 2024, Jason has continued to diversify his expertise, launching a new consulting practice while seeing massive adoption of his side project Instructor, a Python library that helps AI engineers easily structure LLM outputs and is on track to hit 1M monthly downloads by the end of the year.

In this week’s episode of the High Agency Podcast, I sat down with Jason to cut through the AI hype and identify the most reliable best practices to build LLM products today.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts

If you want to break into AI engineering, then this conversation is for you.

Here are Jason’s golden rules for building scalable AI products.

1. Set Evaluation Metrics that are Fast to Calculate

In the world of AI product development, speed-to-market is critical. The faster you can calculate LLM product evaluations, the faster you can iterate. When prototyping new features, don’t get stuck in decision paralysis – just get started. Instead of wringing your hands over prompting vs. fine-tuning, run an experiment on both. If your metrics are fast to calculate then you can try more things quickly.

2. No Prompt will be Good for all Data

Understand that different hyperparameters and prompts will perform optimally in different regions of the input space. For example, if you’re using an LLM to analyze LinkedIn bios, recognize that more junior employees typically provide more detailed information than C-suite executives, so you’ll have to adjust. You'll likely need different prompts and hyperparameters for different input types.
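
As a rough illustration, routing to a different prompt per segment might look like this (a hypothetical sketch; the heuristic and prompt text are illustrative, not from the episode):

```python
# Hypothetical sketch: pick a different prompt per input segment, since no
# single prompt covers the whole input space.
def classify_bio(bio: str) -> str:
    """Crude segmenter; a real system might use a trained classifier instead."""
    senior_titles = ("ceo", "founder", "investor", "board member")
    return "senior" if any(title in bio.lower() for title in senior_titles) else "junior"


PROMPTS = {
    # Sparse senior bios: ask the model to reason about missing context first.
    "senior": (
        "The bio below is terse. Think step by step about what this person "
        "likely does before answering.\n\nBio: {bio}\n\nQuestion: {question}"
    ),
    # Verbose junior bios: a direct prompt is usually enough.
    "junior": "Answer the question using only the bio.\n\nBio: {bio}\n\nQuestion: {question}",
}


def build_prompt(bio: str, question: str) -> str:
    return PROMPTS[classify_bio(bio)].format(bio=bio, question=question)
```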

3. The Secret Recipe for Fine-Tuning GPT Models

Jason sees an average 10% improvement on classification tasks from adding chain-of-thought prompting. "We use chain of thought to generate a ground truth data set. Then, when we fine-tune, we can fine-tune those same examples without the CoT." (Check out the 12:55 mark for a great example.)
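
A minimal sketch of that recipe, assuming the OpenAI Python client and a made-up ticket-classification task (prompts, model names, and file paths are illustrative):

```python
# Hypothetical sketch: label data with a chain-of-thought prompt, then build a
# fine-tuning set that keeps only the final answer, so the fine-tuned model
# answers directly without generating the reasoning tokens.
import json
from openai import OpenAI

client = OpenAI()

def cot_label(text: str) -> str:
    """Ask the model to reason step by step, then emit a final 'LABEL: ...' line."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed labeling model
        messages=[{
            "role": "user",
            "content": (
                "Classify the ticket as BUG or FEATURE. "
                "Think step by step, then end with 'LABEL: <answer>'.\n\n" + text
            ),
        }],
    )
    return resp.choices[0].message.content.rsplit("LABEL:", 1)[-1].strip()

def to_finetune_row(text: str, label: str) -> dict:
    # The fine-tuning example drops the reasoning: plain input -> label only.
    return {"messages": [
        {"role": "user", "content": "Classify the ticket as BUG or FEATURE.\n\n" + text},
        {"role": "assistant", "content": label},
    ]}

with open("finetune.jsonl", "w") as f:
    for ticket in ["App crashes on login.", "Please add dark mode."]:
        f.write(json.dumps(to_finetune_row(ticket, cot_label(ticket))) + "\n")
```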

4. Ban Adjectives from Your Discussion of AI Performance

“It seems better” is not a good enough benchmark of success when prototyping AI products. Strive to diligently gather and report on actual quantitative benchmarks for your AI projects, and rigorously track changes over time.

5. Segment Your Evaluation Criteria

LLM evaluations should never be tied to a single metric like an F1 score. Instead, report your benchmarks in clusters aligned to product use case and user outcomes.
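
For example, instead of one global score you might report per-segment accuracy along these lines (a hypothetical sketch with illustrative segment names):

```python
# Hypothetical sketch: report eval results per segment instead of one number.
from collections import defaultdict

# Each eval record notes its segment and whether the system got it right.
results = [
    {"segment": "skinny_tables", "correct": True},
    {"segment": "skinny_tables", "correct": False},
    {"segment": "wide_tables", "correct": True},
]

by_segment = defaultdict(list)
for record in results:
    by_segment[record["segment"]].append(record["correct"])

for segment, outcomes in by_segment.items():
    accuracy = sum(outcomes) / len(outcomes)
    print(f"{segment}: {accuracy:.0%} over {len(outcomes)} examples")
```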

6. Domain Experts are Critical

When building AI product teams, optimize for full-stack engineers that specialize in breaking down complex tasks into smaller steps. And make sure to have domain experts involved early and often in your human evaluations to ensure you’re building a highly-crafted user experience. Think long and hard about what it means to do a task well and then build a system that tackles it.

7. Agents Won't Save You

"If I could short agents, I would right now." Instead of buying into the latest, most hyped LLM use case, start with the basics. In Jason’s opinion, modifying and evaluating workflows is far more important than leveraging agents when building AI products. Reliability > everything else.

8. AI Will Never Replace Humans and Hubris

“At the end of the day, the people who excel are still the ones that just can describe what they want clearly.” AI won't change our ability to articulate the vision we want for the future or what we want to get done. However, it will certainly augment and scale those bold visions.

Jason Liu's insights provide valuable guidance for anyone looking to build scalable and reliable AI products. By focusing on measurable metrics and prioritizing reliability over hype, AI engineers can create products that deliver real value to users. To hear more of Jason's insights and examples, be sure to listen to the full episode of the High Agency Podcast.


Jason Liu and Dan Becker are offering a course on Maven titled "Systematically Improving RAG Applications". This course provides a repeatable process for evaluating and enhancing your RAG applications, building on the insights shared in this interview. You can find more information and enroll at maven.com.

Interview with Jason Liu on AI Engineering and Structured Prompting

Chapters

00:00 - Introduction and Background
08:55 - The Role of Iterative Development and Metrics
10:43 - The Importance of Hyperparameters and Experimentation
18:22 - Introducing Instructor: Ensuring Structured Outputs
20:26 - Use Cases for Instructor: Reports, Memos, and More
28:13 - Automating Research, Due Diligence, and Decision-Making
31:12 - Challenges and Limitations of Language Models
32:50 - Aligning Evaluation Metrics with Business Outcomes
35:09 - Improving Recommendation Systems and Search Algorithms
46:05 - The Future of AI and the Role of Engineers and Product Leaders
51:45 - The Raptor Paper: Organizing and Summarizing Text Chunks

Transcript

Raza: I'm delighted to be joined today by Jason Liu, who has over eight years of experience in machine learning. He's been working as a consultant recently to a lot of leading AI companies. He's also the author of the Instructor Library for Structured Prompting and has a new course coming out about RAG that I'm sure we'll get into details on. So Jason, it's a pleasure to have you.

Jason Liu: Thanks for having me.

Raza: Jason, to start with, I've seen you argue that AI applications shouldn't be treated so differently from normal software. And something I heard you say was we've put too much magic on the AI part and not enough emphasis on software quality. Do you still feel that way? And why?

Jason Liu: Yeah, I think a lot of it has been too focused on some of the features that AI presents as a product rather than the outcomes folks are really trying to drive with some of these things. If you think about some of the latest AI hardware companies, it's actually hard to figure out when I want to pull some of these up. What's really important is figuring out what the benefit to the user is. When do we think of using some of these products?

Rather than just the features that we build out, earlier we talked about RAG being question answering versus report generation for decision making. A lot of it ends up becoming good product rather than playing around with AI. Because AI has existed in underwriting and insurance, it's existing in lead generation and customer service. We're still doing the same thing over and over again, but now it's just a little bit more.

Raza: Yeah, I guess the reason I wanted to start here and I chose this question to begin with is because I found myself both agreeing and disagreeing with it and I wanted to tease apart that tension. The bit that I agree with is that obviously we want to be focused on end user outcomes and actually building a useful product and not getting overexcited about technology for its own sake. But where I was maybe not as much on the same page and where I feel like we might have some genuine disagreement is whether building with AI and building with LLMs genuinely needs new systems and approaches and different forms of approaching building applications. So I'm curious, am I misreading you here? Or do you think no, we should be using the same tools, like you know, Pydantic is all you need, get out your IDE, screw all this new stuff. You know, there's nothing new to see here.

Jason Liu: Yeah. I think for the most part, I say this primarily because I've been solely building with systems that are fuzzy and systems that are probabilistic and have uncertainties. So for me, not much has changed. But in the way that I've been doing some of my consulting and seeing other engineering teams run their standup versus a research machine learning team run their standup, I definitely recognize there are some differences.

I think some people want to ship products when 100% of the tests pass. Because when you build regular CI, if there's any failure in tests, you cannot deploy your product. Whereas when I come in, usually I give them permission to say no and give them permission to ship when some of these things are 85-90% successful. The solution ends up being, how can we communicate expectations? How can we educate the user? And how can we turn down certain requests?

For example, if something goes 90% correct, is it that one in 10 fail? Or is there just 10% of the input space that we can't serve well? Maybe there we should just pop up a modal or put something in the prompt that says, "Hey, I can't really access your calendar. It's not my fault. I can't book something." Let me educate that with some copywriting instead.

So in that sense, I definitely agree that there is a big difference in how people are working on probabilities and ratios and thinking about systems that they need to iterate over time. But in terms of development cycle, it's still very much, you know, either way, get feedback.

Raza: Yeah, maybe there's two cultures here. And if you've come from traditional machine learning, you're used to building stochastic systems. You have accuracy on a test set, you iterate against it. That's more normal to you. And potentially what's really changed - tell me if you agree with this summarization - what you're saying is that there's a new group of people who are now building ML software kind of through the back door, because they've not had to go through the model training side. They've not come through conventional machine learning.

You now get access via API to these super powerful models. But you're thinking about it as if it's traditional software and it's...

Jason Liu: Yeah, yeah. I think the way that I've described it before was, if you want to build a machine learning driven product, even five, six years ago, you first have to build your non-machine learning driven product, use that as a sensor to collect data about the users, about the world. And then you can use that data that you've collected, and then you've earned the right to sort of inspect this data, build a model, deploy it.

And the second time around, you have this feedback loop between what the model does and what it recommends and how the user interacts with that model. And then you have this flywheel where you kind of recognize that there is some iteration process. Like I'm showing you ads and there's a probability that you click. And just because the model says that it's better doesn't mean I actually generate more revenue.

Whereas with LLMs, we've skipped the first model that has no machine learning. We don't need to collect any data. And now you just get something and it becomes that because it comes so freely, you sort of forget the fact that this is actually machine learning that has to learn. It's not default intelligent. And I think AI assumes the intelligence has already been baked into the system, whereas the machine learning stuff has to sort of cut the data, teach the model and really iterate.

Raza: Yeah. And I guess also if you're coming to this from traditional software engineering, and this is your first foray into AI and machine learning, then you're also probably not used to the expectation that your product is going to be able to kind of improve that much over time. Like how much deterministic software just gets better through more user interactions, through regular fine tuning. So yeah, that idea of the benefit of that feedback would be obviously much bigger here than it might be in other circumstances.

Jason Liu: Yeah. Even with things like fine tuning, you know, a lot of companies use default embeddings, but a company like Netflix or Amazon, their item embeddings when they do recommendation systems are updating on a daily basis because every checkout can improve.

Raza: So to that extent, I agree with you. I think one place where it feels genuinely different, at least to me, compared to just traditional machine learning, even as someone who was in the machine learning space, is that because you can do natural language prompting, everything feels like it's increasingly more accessible to non-technical people. And something that I felt is, or a thesis I have that I'd be curious to see whether or not it resonates with you, is that as the models get better, the barriers to a product manager or subject matter expert being able to be directly involved in product development should go down. And there's so much valuable knowledge that they have in their heads about actually what the end outcome should be that traditionally gets filtered through all of these systems and PRDs and specs or whatever to try and get the right outcome. Whereas now actually, if you want to, you can have them go and do stuff directly. And that feels genuinely new to me and probably will require a change in how we do things.

Jason Liu: Yeah, when I was doing a lot more data science work, I always felt like my job was just to translate what the PM wants or what leadership wants into some sort of dashboards that I can then use to capture some notion of improvement. And now, through all this hard work, I get a line and I just want that line to go up and to the right.

The issue is, like you're right in the sense that even though the barrier to entry is very low, what ends up happening I think in practice is the non-technical folks just think the AI will work now. And the problem is still the same. So instead of having metrics that I build dashboards against that say, you know what, if this line goes up and to the right, it's been successful, and I've accounted for some kind of growth and I've accounted for novelty effects in my metrics, now I know this works.

By the same token, the job of the AI engineer or the machine learning engineer is to sort of induce some set of evaluations from the PM to actually figure out: what do you want the model to actually do? I understand there's a question answering system, but what kinds of questions do we want to answer? What are the shapes of answers that we care about? What are the trade-offs that we need to make?

And again, even if this is sort of almost like a debate-style format, those are still the core skills that need to be learned: how do we distill these desires that we have into, simply put, the number that goes up and to the right?

Raza: You've obviously been involved in machine learning for a while, like before LLMs were around. I think you were at Stitch Fix before and elsewhere. You built recommender systems. I actually think from what I've heard, you were skeptical of LLMs in the beginning and sort of maybe came around to them later. Is that true?

Jason Liu: I've written apology letters to like everyone.

Raza: You weren't the only one, for sure. And so the question is just what ideas and concepts from traditional machine learning do you think people who are new to the space should still be aware of? Like what are the things that we should be porting over if you're an AI engineer or a product leader that you need to learn about from Core ML?

Jason Liu: Yeah, I think the first one is iteration speed. The faster you can test your hypotheses, the more you can try to move the goal metrics. And so the real idea is around leading metrics versus lagging metrics.

Raza: Can we try and couch this in a concrete example?

Jason Liu: Yeah. So lagging metric is like, you know, you want to put on some muscle every day. I can weigh myself, but just because I weigh myself doesn't mean I can actually figure out what I should do next. And so, you know, when I talk to a personal trainer, I want to gain weight. They say, okay, really what you need to track is how many calories you burn and how many calories you eat. And every day at any single moment in time, you can ask yourself, are you doing enough? And then immediately take action.

And so by having fast iteration cycles in your product, you do the same thing. Instead of saying, make my AI app good, which is easy to check but hard to impact, we should think about experimentation. A simple example would be in retrieval, if you have precision recall metrics, you can retest small hypotheses in the order of minutes. But if you have 200 examples and you use an LLM as a judge, now it's something that takes like 40 minutes.

Making sure you're testing against the right metrics at the right experimental phase really helps you improve how quickly you can iterate. That's one example. The second example is really around this idea of having hyperparameters and doing grid search. This comes up a lot when companies will say, "Hey, Jason, I saw these three papers. Which one do you think is the best one?" And they say, should we use CoT or should we use step back prompting? Or should we fine tune? And then the communication is actually like, oh, these are all just three experiments. If you have your fast retrieval data set and you've guaranteed that because of the way you've built the system, each one takes 30 minutes, it's actually more cost effective if you just run those experiments.

Raza: Yeah, this reminds me of kind of the 101 lesson I remember getting at the start of my PhD or my master's thesis from my supervisor, which was like, implement the dumbest baseline first and then iterate up from that, which I'd even forgotten was something that I learned through ML because it's just become almost like a life principle for me now. But yeah, start with whatever is the easiest, fastest thing, get a baseline and then iterate.

Jason Liu: Yeah, even one of the biggest surprises I had earlier this year was I did some synthetic data generation to evaluate how good BM25 and semantic search did. And the first data set I used was the Paul Graham essays. And so what you do is you just take a random text chunk that you pull out, you ask an LLM to generate the question, and then you check whether or not your search methodology gets you that chunk again. So for Paul Graham essays, it was like 95% recall for BM25 and 96% recall for semantic search. I was like, oh wow, they're basically the same. It turns out it's because I've read a bunch of Paul Graham essays and also the way you generate questions is very complementary to BM25 and semantic search.

And then I told myself, okay, this is like too easy of a problem. Let me just do the same thing with GitHub issues. You do the same process with GitHub issues and you just get like 62% and 63%. I was like, wow, even something as simple as just asking an AI to generate one question to find an issue is very hard in GitHub. And you need to add a bunch of more filters around repos and date filters to even begin to answer some of these questions. But those are experiments you can run in like 30 minutes.
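
In code, that loop might look roughly like this (a sketch assuming the OpenAI Python client; `retrieve` stands in for your own BM25 or embedding search, and the model name is illustrative):

```python
# Sketch of a synthetic retrieval eval: pull a chunk, have an LLM write a
# question for it, then check whether retrieval brings that chunk back.
from typing import Callable
from openai import OpenAI

client = OpenAI()

def synthetic_question(chunk: str) -> str:
    """Have the model write a question that only this chunk can answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed question-generation model
        messages=[{
            "role": "user",
            "content": "Write one question answerable only from this text:\n\n" + chunk,
        }],
    )
    return resp.choices[0].message.content

def recall_at_k(chunks: list[str], retrieve: Callable[[str, int], list[str]], k: int = 5) -> float:
    hits = 0
    for chunk in chunks:
        question = synthetic_question(chunk)
        hits += chunk in retrieve(question, k)  # did the source chunk come back?
    return hits / len(chunks)

# Usage: compare recall_at_k(corpus, my_bm25_search) vs recall_at_k(corpus, my_embedding_search)
```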

Raza: Exactly the same setup.

Yeah. Okay. So iteration cycle being one thing, anything else that you think people should be pulling over from machine learning? Are there any core, you know, okay. So metrics people should know about. We've mentioned precision and recall a few times and they're assumed knowledge. Obviously those metrics are things we should port over. Anything else?

Jason Liu: Yeah, like understanding what a hyperparameter is and understanding that you need to experiment with different combinations of these things.

The way to find hyperparameters isn't necessarily just like the temperature and the top P and penalty. I think, you know, the kinds of few-shot examples I want to include is a hyperparameter. Different system messages become hyperparameters. Do I use a summarization of the data or do I use the raw data? And then you realize, wow, I have like 60 different variables and I have a budget and all I need to figure out is under my time and capital and resource constraints, how do I make the most effective use of my resources?
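
Concretely, treating those choices as a grid to sweep might look like this (a hypothetical sketch; `run_eval` is a stand-in for whatever fast eval you have):

```python
# Hypothetical sketch: treat prompt-level choices as hyperparameters and sweep
# them like any other grid search, scoring each configuration on a fast eval.
from itertools import product

system_messages = ["You are a terse analyst.", "You are a thorough analyst."]
few_shot_counts = [0, 2, 5]
use_summary = [True, False]  # pass a summary of the context, or the raw text

def run_eval(config: dict) -> float:
    """Stand-in for a fast eval (e.g. retrieval recall on a small dataset)."""
    return 0.0  # plug in your own metric here

results = []
for system, shots, summarize in product(system_messages, few_shot_counts, use_summary):
    config = {"system": system, "few_shot": shots, "summarize": summarize}
    results.append((run_eval(config), config))

best_score, best_config = max(results, key=lambda pair: pair[0])
print(best_score, best_config)
```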

You find that like sometimes what you'll notice is at least in the RAG context, different hyperparameters work best within certain regions of the problem space. I got a funny example, which is we were doing RAG over LinkedIn bios and it has basically worked very well for junior engineers and junior developers and worked terribly for senior folks because the junior developer has mountains and mountains of descriptions in their job titles.

But the highest status, highest level person on LinkedIn is just "investor, investor, CEO, CEO, board member," and there's no additional context. And it becomes incredibly hard to retrieve some of this data. It turns out that, in those situations, it's really useful for the LLM to augment some of that data with something like CoT, whereas for junior folks we don't need that. Now, by adding those rules, you kind of figure out what the right hyperparameters are for certain use cases.

Raza: Okay, let's talk about Instructor. So you've given, I think, one of the more popular talks at the AI Engineering World Fair twice now, "Pydantic is all you need." It's a very popular library. Just to start with, what's the elevator pitch? Like, why should people be using it? What is it?

Jason Liu: Yep. The elevator pitch is if I had an intern that wrote a REST endpoint for me and all they returned was a string that I had to then parse into a JSON object, I would be very unhappy. But it seems to be the case that everyone's very happy with doing that for LLMs. You don't know if the keys are correct. You don't know if the types are correct. It just gives you a lot of anxiety and gives me a lot of anxiety. And what Instructor does is it hooks into whatever data validation library is popular in your language and provides you some additional hooks.

So you can pass in a struct, you can pass in an Ecto schema in Elixir, you can pass in a Pydantic object, you can pass in a Zod schema, and then we guarantee that the output is validated. If you want things to be an integer, it's an integer. If you want it to pass some regular expression, it'll pass.

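
In Python, that pattern looks roughly like this (a minimal sketch based on my reading of Instructor's patched-client interface; the model and field names are illustrative, so check the current docs):

```python
# Sketch: ask for a typed object instead of a raw string to parse.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str
    age: int = Field(ge=0)  # must come back as a non-negative integer


client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserDetail,  # Instructor validates (and retries) against this
    messages=[{"role": "user", "content": "Jason is 30 years old."}],
)
print(user.name, user.age)  # typed fields, not a raw string you have to parse
```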
Raza: And how do you do that under the hood? So I specify some desired output structure from an LLM. I then query the model. What are you doing to guarantee that I get the right structure?

Jason Liu: Yeah, so it really depends on what backend you use. The general idea is that we pass in the JSON schema version of that object into the language model. And for closed source models, a lot of them support something called tool calling. And we basically overload tool calling to return structured outputs. If you use some of the open source models through something like Ollama or LlamaCPP, we also convert that JSON schema into grammar.

And what this does is we actually sample the tokens differently based on what we know about the current structure. And so if we know that we just got squiggly bracket quote, we kind of expect an understanding of what is the next thing we're going to generate. But to me, these are all kind of implementation details. That's like, what is the delimiter in the CSV file? But yeah, the idea is that if it fails, we throw error messages and the language model will heal itself.

Raza: Yeah, as the user, you don't need to worry about this.

So high level, essentially what Instructor is allowing me to do is have guarantees about the output types that I'm getting back from an LLM. So I can specify a structure and be confident that I'll get that. Do you want to just talk through some example use cases? At least intuitively, like when I first heard this idea, I was like, yeah, obviously that sounds great, but I don't think I fully appreciated the power of it until I saw you work through a couple of real world use cases.

Jason Liu: Yeah, a really simple one is just the fact that when you ask for structured outputs, usually you'll get the language model saying things like, "Here's the additional information" or something like that. And then if you use that to render something in the UI as a React component, or to render a table, there's a bit of risk in knowing whether or not that was done correctly. Whereas if I had a transcript, and I had action items, which is a list of strings, and then a summary, which is a string with a minimum 200-character limit, now you're designing the outputs a little bit more, right?

If you want to go even further, you can say, okay, action item isn't a list of strings. It's a list of items. And an item has a title, a description, and an assignee. And the assignee has to be one of the speakers in the meeting. Now you can build very safe and robust objects that might map directly to a database. It might map directly to a React component. And you can even build validators that make sure that the folks that are assigned are actually participants of the meeting. Now, your code boundary is a little bit smoother.
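
A sketch of that meeting-notes shape in plain Pydantic (field names are illustrative; with Instructor, a failed validator turns into an error message the model can retry against):

```python
# Sketch: action items must be assigned to actual meeting participants.
from pydantic import BaseModel, Field, model_validator


class ActionItem(BaseModel):
    title: str
    description: str
    assignee: str


class MeetingNotes(BaseModel):
    speakers: list[str]
    summary: str = Field(min_length=200)  # force a real summary, not a one-liner
    action_items: list[ActionItem]

    @model_validator(mode="after")
    def assignees_must_be_speakers(self) -> "MeetingNotes":
        for item in self.action_items:
            if item.assignee not in self.speakers:
                raise ValueError(f"{item.assignee!r} is not a meeting participant")
        return self
```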

Raza: And how is Instructor going? Like what's the current state? It feels like it's, you know, at some point I think I saw 150,000 downloads a month or something. Is it still growing?

Jason Liu: Yeah, I think right now we're at like 320,000 a month. We're seeing like 20% growth month over month. And so I think my goal is by January 2025, we'll have a million.

Raza: Okay, wow. So it's becoming an established protocol then. So people should really be going up and they can trust that this is gonna be maintained and long lasting.

Jason Liu: Just in Python.

Yeah, I think, you know, the joke that I really tell is like Instructor is about 600 lines of code that you'll never want to write yourself and like 20,000 lines of markdown to tell you how to think about prompting. And so one of the things that's coming up soon is we've actually converted the 58 prompting tips, the white paper that came out recently, and we've implemented every single one as an Instructor example. And so you can sort of see more clearly how you can implement some of these things in your own work.

Raza: I haven't read this 58 Prompting Tips paper. This is new to me. What are some of the highlights? Is there anything non-obvious in there? Like what are the ones that stood out?

Jason Liu: I think a lot of it has been just around different ways of doing reflection. So obviously we have chain of thought, chain of thought with a plan. They talk about some ensembling methods. They talked about different ways of doing, you can basically have a chain of thought. You could generate the answer, then you can have a critique reflect on a chain of thought to generate a new answer. And a lot of that ends up looking like things like validators.

Raza: And from your experience, how effective are those more complicated methods? Because you're obviously adding both complexity and cost because you're running more tokens and also there's just like more stuff going on under the hood. Is it actually worth it? Like, do I need to do these things? Because, yeah, why can't I just have a validator basically? Or how much performance benefit am I getting?

Jason Liu: So I think you'll be surprised. I have some numbers I love to share. One of them is that in classification tasks, adding chain of thought has been able to improve accuracy by around 10% on average.

Raza: And that's still true even after the model providers have been fine-tuning on chain of thought?

Jason Liu: Yeah, and so what we do in practice is we'll actually use chain of thought to generate a ground truth dataset. Then when we fine tune, we fine tune those same examples without the chain of thought. And now we can do a lot better.

Raza: Okay, yeah, that makes a lot more sense to me. 'Cause I'm like, why am I generating all of these extra tokens every time? If I can, if I can fine tune on it.

Jason Liu: And then for validations, one of the things that Instructor does is it allows you to distill Python functions. So by adding any kind of decorator to a Python function, you can fine tune a 3.5 model that basically implements... Like just like student-teacher networks have done that.

Raza: Let me just say that a little more explicitly. 'Cause again, I think this is one of those places where we might lose someone. So the idea here is that I like decorate a Python function and this is going to then automatically generate a dataset of input output pairs, fine tune a smaller model on that. And so I ultimately get the ability to like get this faster, cheaper model for the same task?

Jason Liu: Exactly, exactly. So maybe the version one implementation uses two OpenAI GPT-4.0 calls. And once you fine tune, it becomes a single 3.5 call. So we've seen this work in cases where we have a 96% correctness. And then we have one retry, and then we get that to 100%. And then by fine tuning, we can just do that in a single shot with 3.5, and it'll work every time. And so those are some pretty good results, in my opinion.

Raza: And so then you're getting a big cost and latency reduction by fine tuning the smaller model. And it's relatively straightforward to do. It's just decorator function. And now you've got faster, cheaper version.

Jason Liu: Exactly. Yeah, so day one you decorate the function, you point it to like a JSON lines path. Day two, you replace it with the OpenAI 3.5 model name and we'll just swap out the implementation.
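
Instructor ships its own distillation helpers for this; the decorator below is a hand-rolled, hypothetical stand-in that only illustrates the pattern of logging each call as a fine-tuning example:

```python
# Hypothetical sketch: wrap a function so every call is logged as an
# input/output pair in OpenAI fine-tuning format. Day one, the wrapped
# function is the expensive pipeline; day two, you fine-tune on the log
# and swap in the smaller model.
import json
import functools

def log_for_finetune(path: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt: str) -> str:
            output = fn(prompt)  # the expensive GPT-4-class workflow
            row = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": output},
            ]}
            with open(path, "a") as f:
                f.write(json.dumps(row) + "\n")
            return output
        return wrapper
    return decorator

@log_for_finetune("distill.jsonl")
def extract_action_items(prompt: str) -> str:
    # Placeholder for the real two-call pipeline; later, point this at the
    # fine-tuned smaller model instead.
    return "…"
```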

Raza: Interesting. And whenever something's like that, where it's a bit black box, I always worry about being able to measure performance and interpretability. Like I personally like to be able to see those intermediate steps and go in and look at the eval graphs and have confidence that the fine tuning is actually doing what I expect, etc.

Jason Liu: That will come from like any other product provider you use. So I don't really try to touch any of the observability aspects. I just add some deep debugging in place. But generally we've just sort of solely focused on having experts evaluate our data rather than having an LLM as a judge. Like it's either a validator that has passed/fail or it's a human.

Raza: Okay, we can hopefully come back to that. Changing subjects slightly. Obviously you do a lot of consulting work for AI startups and a bunch of companies that are on the frontier versus the early stage. What are some of the use cases that excite you the most that you've seen be successful? How should people think about getting real ROI? Are there any trap ideas that you've seen people try a lot and just consistently fail?

Jason Liu: Ooh, I have not thought too much about the trap ideas, just because the teams trying those are often too lost to even find the time to find a consultant. I think the ones that I am really excited about are kind of boring, because I'm mostly working with enterprise companies generating reports and generating memos. Like, so much of my time in trying to be a good consultant is doing a lot of research beforehand, and then during these calls doing a lot more research and preparing these research memos.

And so if we have a series of calls over next couple of weeks, I don't really want to do RAG over those call transcripts. Instead, what I want to do is I want to say, identify all the different pain points we've discussed, identify all the different research opportunities we've discussed, and prepare a two-page memo of everything. And then let's use that as something that syncs. Same thing with job interviews. I can't do like six job interviews. I don't want to do RAG over the job interviews. I want a report of like strengths and weaknesses and all that stuff.

And again, it brings it to like, instead of doing percentages of wages like we talked about, it becomes how do we make better decisions? Those are the things that really excite me. And that can happen with call transcripts. It can be done in terms of financial due diligence and research. It's basically RAG, but all the questions are done ahead of time rather than in real time. And by doing that, you have standard operating procedures that you can introduce to your employees and to your AI systems. And the big challenge is how do you figure out what those standard operating procedures are? Because asking questions, I think, has just been a terrible, terrible problem to solve. And maybe the answer is, I think, QA RAG is kind of the trap idea.

Raza: Yeah, maybe. It sounds like there are a few use cases that I've seen be really successful. And actually this is one I hadn't thought about that much before, but it's kind of one of these obvious in retrospect ideas, which is like a lot of expensive white collar knowledge work is researching things, pulling stuff together, summarizing it, presenting that to someone else. And so if you can automate that, especially for very expensive people, financial analysts, or some of the others you mentioned, then that's both a very high ROI activity.

Jason Liu: Yeah.

Raza: Very well aligned to the LLM skills and something where it's actually augmenting human decision making. And so kind of, I like that that ticks a lot of boxes on the ROI side.

Jason Liu: Yes.

Yeah, the example I give is oftentimes I'll do like two or three 30-minute meetings with a client before I generate a sales proposal. But I would not just do RAG. Instead, what I do is I will read two books about MBAs, and then I'll read a book on sales and a book on how to write better offers, and I use all of that knowledge to produce a report. Like, the value is in having done the research to figure out what the right report shape is.

It turns out if you put pricing at the end, people are much happier being presented the metrics and objectives and the benefits they're going to get first. That's all the hard work of generating the prompt. And I would probably pay like a hundred dollars if someone could write that better than me, because that's negligible compared to the contract size. But if I put all three phone calls into ChatGPT, I still have no idea how to write a high-conversion report. I think that's kind of the simplest way of framing it.

Raza: At the moment you still have to do it yourself, you can't get the models to do it.

Jason Liu: Yeah, because I have to think really hard about how to write a high conversion report. Like if the proposal is for $40,000 there's no way I'm spending 10 minutes with ChatGPT. It's worth me spending like three hours if it means having higher success.

Raza: And do you think that's just a limitation of the models today? They're not smart enough yet? Or what's stopping you being able to do it?

Jason Liu: I think a lot of it is just distilling certain regions of the LLM space. If I prompt the language model and say, give me six different pieces of advice on how to write a good report. Here's the template I want you to use. And then I want you to write a report. It gets me a lot closer. But I think right now, if you go to the sales example, the objective of these language models is not to be exceptionally persuasive. It's kind of to be happy and helpful and harmless.

Raza: And when you do these consulting projects for AI startups, is there any advice you find yourself giving them again and again? Like what are the most common things?

Jason Liu: The first one is like, your engineering team should not be allowed to use adjectives. We should try as hard as we can to come up with numbers and report on the number itself or changes in that number. Usually the first week I join, it's: we want to make the AI better, I had some examples, I looked at them, they look better. Actually, I think the model was worse, or a lot worse.

And I think just in terms of how you communicate as a data scientist, as a mathematician, a physicist, an experimentalist, you really want to be much better at presenting numbers and really thinking about doing better data visualization. That's comically been the number one thing I've noticed: I'll come in and everything is like better or worse and there's not a single number. And then you change it and all of a sudden it becomes, I did three experiments this week and I improved recall from 0.4 to 0.9 because I recognized that there was a mistake in how we indexed the file name. You're like, well, that feels like way more impact.

And the second thing is really producing evals that are in alignment with business outcomes. I think a lot of times, the feedback we collect from the usage might not actually align with what a good business outcome might look like. The things that we're measuring or the evaluation that we have might not align with these business outcomes. And we just got to make sure that when these numbers go up, we should at least hope that those numbers go up too.

Raza: In my experience, that's actually been a place where it's been maybe a bit more positive, at least for me with LLMs compared to traditional machine learning, because the people involved tend to be closer to product. So when I've been on pure machine learning teams, it's like, let's push up the F1 score, versus people on product teams who are like, I want to improve this product experience. I want this end user metric to change in a particular way. But it's interesting that maybe that's not been your experience.

Anything else, any other bits of common advice? You know, if you're a product leader or an engineer listening to this podcast, what do you want them to take away?

Jason Liu: I think one other thing to think about is segmenting the evaluation that you have. I think right now everyone just has a single number, and they think that moving the single number will improve these systems. But in practice, there are actually probably a lot of different kinds of use cases that we care about. And when we report some of these metrics, we should consider clusters of them. For a simple one, take text to SQL. For skinny tables, we do a lot of joins. And for wide, denormalized tables, we have to really worry about how we select the right columns and whether or not we can interpret the number of columns given the context. It's really useful to see that when we improve one, maybe the other falls, and then decide whether or not it's worth that trade-off. Or, at what point, if one degrades, should we have two separate prompts and a classifier to detect which way to route it?

Rather than just reporting a single number. Maybe we can call it like the Simpson's paradox. But generally, like specialization is very, very important and very relevant because once you decide to specialize, you can build routers to do the right things.

Raza: Okay, Simpson's Paradox is a rabbit hole I don't wanna go down right now. I feel like just trying to explain that was gonna take us the next 30 minutes, with you and I air-miming diagrams and stuff in order to try and do it. But another time, or maybe after this, I'd love to chat with you more about Simpson's Paradox.

Jason Liu: Yeah, exactly.

Raza: You have a new course coming out titled, "Systematically Improving RAG Systems." It's a topic that's very close to my heart, right? Something that we think about almost every day. Why do we need this course now? What will you be covering? You know, why is it timely and what's the advice that people are gonna get from it?

Jason Liu: Yeah. So my background has been recommendation systems and search for the past like six, seven years now. And what I've noticed is when we come into a lot of these companies, we don't really have a framework on how to allocate resources within the team. What experiments should be run? What is important in a business? And so the first part of this course is going to be me teaching the playbook that I've executed at a couple of these AI companies that I've worked for in the past.

And the general idea is that you can first bootstrap a dataset using a language model to focus in on precision and recall scores and make sure that retrieval is good before you do the generation. Then as you ramp up user traffic, one of the biggest things that people are sort of ignoring right now is segmenting the questions that you might be asking and identifying clusters. And when you identify these clusters, there's kind of two families of issues that you run into. One is an inventory issue, where you don't have the data that can answer a question; the other is a capabilities issue, where the metadata or the filtering that would let you improve the system isn't available.

And really the idea is that once you can concretely identify these topic and inventory issues versus these capabilities issues, now you can start being creative in how you solve some of them. For example, in DoorDash, if I do search and it turns out customer satisfaction for Greek restaurants is really bad in the zip code, like better AI is not going to help you. The solution ends up being like reaching out to Greek restaurants and see if they want to be part of DoorDash. If satisfaction is really low after 8 pm, better AI might not help you, but instead you might need to build an "open now" button and make sure that your availability is up to date. And those are some small things that will dramatically improve search without having to convince yourself that better AI is the solution.
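
One way to sketch that question-segmentation step (assuming OpenAI embeddings and scikit-learn; the queries and cluster count are illustrative):

```python
# Hypothetical sketch: embed user queries, cluster them, then review each
# cluster to decide whether failures there are an inventory problem
# (missing data) or a capability problem (missing filters/metadata).
from collections import defaultdict
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

# Your query log would go here; two toy questions keep the sketch short.
questions = [
    "What Greek places deliver after 8pm?",
    "Is this restaurant open right now?",
]

embeddings = [
    item.embedding
    for item in client.embeddings.create(
        model="text-embedding-3-small", input=questions
    ).data
]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

clusters = defaultdict(list)
for question, label in zip(questions, labels):
    clusters[label].append(question)

for label, cluster_questions in clusters.items():
    # Review each cluster by hand and tag it: inventory issue or capability issue?
    print(f"cluster {label}: {cluster_questions[:5]}")
```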

And the rest of the course is just teaching and going over case studies and examples and really working with different companies to figure out like, yeah, what are those edge cases.

Raza: And can we talk, because I think this is related, can we also just talk skills? Like if I'm an individual developer or if I'm an engineering leader, what skills does a developer need that maybe they didn't have before that are more relevant now? And also how do I craft the right team? So what's the right composition?

Jason Liu: Yeah, I think for the most part, practically what you end up needing is just like full stack engineers. I think the goal with something like the course is really to sort of break down these steps in such a way that you don't need much skill except for the sliver of identifying these topics and capabilities. Like that is effectively where all the skill is, but that's where you bring in domain experts who can figure out like, you know, turns out that if I group by dates...

Raza: That was literally about to be my next question. What about the domain experts?

Jason Liu: That's hard. I think that's just like, how do you hire people that are very passionate and interested? I think that's definitely a hard skill to figure out, especially because only recently has looking at data become a skill that now everyone must have. Before you had scientists that were getting PhDs being paid to figure out like DoorDash orders. And then, you know, there's a lot of attention on what's going on. But I think for the most part, it's really been hard to even try to figure out how to communicate what skills are needed.

I wrote a post recently on what are some bad inverted behaviors that you can use to justify data literacy. But a lot of it still has been very hard to capture.

Raza: Wait, wait, so tell me more about that. What do you mean by inverted behaviors?

Jason Liu: Yeah, let me pull up this article because I almost don't remember. So...

Raza: And whilst you're doing that, maybe I can just summarize what I think your advice was, which is like, first off, you probably don't need specialist engineers. Like the magic of the new stuff is that a generalist software engineer who's smart and curious should be able to pick up these skills, which I think I fully resonate with. You mentioned needing to get domain experts in as part of designing evals, which again, perfect sense to me. Is there anything else that I've missed?

Jason Liu: Yeah, I think that's the general idea. It's like, because you have these LLMs, now we only need a couple of domain experts to really navigate the space, whereas before you might need thousands in order to figure out and capture and look at all the data. But the skill is still distilling their knowledge into words that then you can use to train an LLM or distilling their knowledge into labels that you can use to fine tune a model.

I don't like to go into the details of how to write better instructions for data labelers, but I think a lot of the work I had done a couple years ago was really around, you know we wanted to have art students annotate the qualities of a good outfit. And it's not easy.

Raza: This is where it's Stitch Fix.

Yeah, I think fashion is, you know, we had Brian Bischoff on and obviously he was at Stitch Fix and also at Blue Bottle. And I feel like fashion recommendation, coffee recommendation are actually examples of some of the hardest parts of recommender systems, places where it really is extremely subjective. And for me, that feels like great preparation for LLMs. I feel like so many, you know, because Brian was like, nothing's new. And I was like, of course you don't think anything's new. Because you've come from a world where everything you were doing was super subjective to start with. But I think a lot of other people are entering this for the first time.

Jason Liu: Exactly.

Yeah, like I don't know if Brian's done the whole Stitch Fix comparison, but at Stitch Fix you have a human write a request note. The request note is sent to Stitch Fix. That request note turns into a bunch of recommendations that a stylist sees and the stylist creates the final shipment that goes to the user. That effectively is a RAG output, right? So text chunks is inventory. Instead of an LLM, it's a stylist. And instead of a response, you get a shipment.

But still like you have metrics like average order value, you have like, how long does it take to style a fix? Oftentimes at Stitch Fix you run into issues where we don't have the inventory to ship them. And then you have to be like, oh, the way to fix this problem is to like make more clothes. Like I can't, this is not an AI problem.

All right, I found the article. A couple of the ones that I think are interesting: the article basically describes like 10 of these inverted behaviors, in hopes of actually uncovering what data literacy means. The first one is not caring about the quality of data. This might be, what are the mechanisms that might lead to missing data, for example? How good are these sensors? Another mistake that people make is not visualizing the data and only relying on aggregate statistics.

This again goes to the idea that if we actually segment our evals, maybe we find that the better we do with skinny tables, the worse we do with wide tables, and then we have to ask whether to specialize. Others on the list: lacking domain expertise, not testing with the right split.

Raza: That last one reminds me of, there was some research that came out of Chris Ré's lab, I think with the Snorkel folks, about having evals on particular data slices. So being able to create your evals to be weighted in such a way that you want them to be particularly good at some subset of your test set. So it matters overall, but hey, we know this area is more important.

Jason Liu: Exactly, exactly. And then the rest of them are like, you know, improper test splits, not thinking about data drift, confusing correlation with causation, thinking about data privacy, overfitting, sampling bias, and disregarding interpretability and explainability. And those are some of the things that I think are like symptoms of bad data literacy. But even then it's hard to really figure out what data literacy means.

Cause I think a lot of it just comes down to almost having like an "oh shit meter" when looking at something, right? Like you train a model, it's like 96% accurate. It's like, hmm, something went wrong for sure. Or it's like, hey, how come this column being missing is 50% of the variance of this classification task? What is going on? Like why is it missing? And then you sort of get curious when you explore, but it's really hard.

Raza: All right, maybe some final questions to round us out. I'm trying to come up with some questions that I'm going to ask everyone. So sort of like almost like a quick fire round. We'll see how it goes. So first one, sort of reflecting on AI at the moment, do you think it's overhyped, underhyped, something else?

Jason Liu: I think it's definitely overhyped because it's hard to figure out where we've actually unlocked a lot of economic value that is not just minimizing cost centers. You see the biggest success and it's the same like customer service, lead scoring. It's the same kinds of problems.

Raza: Well, I was thinking that, but I feel like you gave a good counterexample to this yourself, not a few minutes ago when you're talking about report generation and decision-making support.

Jason Liu: Yeah. But I still think that's very much not even totally possible yet. Like if you said that there was an LLM that was a hundred dollars for a million tokens that could generate a report that was higher converting than a report I would generate, I'll pay a thousand dollars a month to get access to that language model. But I just don't think that's actually the case.

The smarter these models get, actually what ends up happening is the more I want to prompt it because I know the knowledge is in there. It's like the cost of the intelligence actually makes me want to prompt more, not less. And so in that sense, until we get to a world where most of the economic value is being unlocked in profit centers versus cost centers, I think that's kind of where I'm drawing the line. And again, this is like the third wave that I've seen in terms of AI.

We had like the deep learning and neural networks phase. We had like the big data, data science phase. A lot of it is just like better decision-making under uncertainty.

Raza: So your forecast going forward is kind of S-curve, this kind of like plateaus out. And at the moment, it's allowing us to save costs in places, but people are overestimating how good it's going to be.

Jason Liu: Yeah, and I think when it does get that good, it is because humans have built very good fences around what the AI can do. It's like the economic value can be unlocked, but it's not because of like an agent in a "for" loop. It's because people thought hard about what it means to do a task well and have built a system that does that task well.

Raza: How bullish are you on agents in general? What's your opinion on agents?

Jason Liu: I'm... if I could short that I would right now, I think.

Raza: Hahaha

Jason Liu: I think really what I want is that when I have workflows that I can review and modify, because when I generate a workflow, that workflow can be something I can fine tune and reproduce. Whereas with agents, it's really hard to figure out how do I have an agent execute the same task N times. Whereas the workflow kind of, again, puts more effort and energy on the user or the developer to come up with this workflow.

But that's been the case for all business for all time. Like the first thing you learn as a business...

Raza: Isn't that exactly the promise of agents? Like the hope that you don't have to do that work upfront?

Jason Liu: Yeah, I just think that's been very challenging to implement reliably well.

Raza: Ah, an interesting thing, the technology is not there.

Jason Liu: Yeah. And in this space particularly, I think if we have a 70% success rate, it's not even the case that it's because we do 70% of the tasks 100% correctly every time and 30% of the tasks 0%. I think every task is a coin flip. And I think that's the issue. 'Cause if it were that for 70% of the tasks it worked every time, we could throw away the rest.

But right now, I think for the most part, every task is a coin flip and that is the ultimate risk. The quality of the result is probabilistic.

Raza: If you imagine the world five or maybe 10 years from now, how do you imagine AI has changed it? How does the role of an engineer look different or a product leader look different?

Jason Liu: Five to 10 years, hmm. I've not thought that far ahead. I have some ideas on how I think agents might be implemented in the future. They're a little bit wacky. In terms of engineering and product, that I really don't think about at all. I think it's just gonna be...

I'm very boring in the sense that like the way I describe engineering now is how I think of like pottery. Like you make some pots, everyone needs pots. And then plastic comes in and you realize that, oh, wow, like these things don't break. It's great. That's like mass-produced plastic containers. Then you realize microplastics are bad and now we're spending like $70 on wonky mugs that are made by folks in Brooklyn. I think I wouldn't be surprised if software engineering goes down that same path.

Raza: It's going to become bespoke artisan software.

Jason Liu: Yeah, I think so. Just like weird stuff. Like we don't really pay for like the most perfect cup that looks like a paper cup, but I use a ton of paper cups. But there's something nice about having something that was like made by somebody else. And I think it might just go down that route. The same way that we might like buy art.

Raza: I do want to push you on this bigger picture five or 10 years from now, what does the world look like in terms of how AI has impacted, partly because I feel like it's a different way of asking that same overhyped, underhyped question, and it'll allow me to calibrate what you mean by hyped as well.

Jason Liu: I see no difference than what we're doing with AI now with just how mass production has impacted the creation of goods. I feel like that's super boring.

Raza: But like, do you think the economy looks largely the same? Is my day to day largely the same? Do people still have jobs that look mostly similar to what they are now with just slightly different tools or does society look wildly different?

Jason Liu: Yeah, realistically, I think the jobs will be the same. It's like, I'll be surprised if there's less workers. I think there'll just be more workers having like higher multiples of output. And I think a lot of the, even what the good skills are, if you think about like, what is the difference between a good engineer and a bad engineer? A lot of it is great engineers that are like L6, L5 are ones who are great writers.

Who can allocate and attract talent well, and can describe requirements in a way that can influence both people above them and below them. That roughly will look the same once we have more agentic software engineers. But the people who excel are still the ones that just can describe what they want clearly. And so the only difference is instead of hiring one junior engineer, you could hire two junior engineers and like three LLM engineers. And you might just pay them the same amount. But I think the core skills ends up just being like, can you clearly describe what you want?

And can you write well? And I think that's been the same for all time.

Raza: Okay, well I look forward to a few years from now to see how everyone's different predictions pan out.

Jason Liu: Yeah, because right now, a lot of the work that I've been doing, which is writing memos, has just been dramatically accelerated by language models. But I don't think it's going to get to... If it gets to a point where it replaces that memo, I might just do even more memos, if that is an interesting pursuit.

Raza: Yeah. I think that's consistent with your "overhyped" statement. So I'm satisfied with that at least. Like, you know, if you'd said to me, I think it's overhyped and also five years from now, I think everything's been wildly different, I would have been surprised. So that makes sense. I find that personally, I find that a difficult intuition to hold just because my personal experience of like coming into the field, like I did my PhD in 2014.

Jason Liu: Sure.

Raza: The rate of progress has just felt so fast and has felt accelerating to me. Like when I started, literally like things that I do in five minutes now with an LLM, were research projects. And it feels to me like there's no reason why that shouldn't continue. Like more effort is going into it. More resources are going into it. I don't see any fundamental reason why these problems are intractable in principle. So I just feel like there should be more breakthroughs coming and it kind of surprises me that you don't expect that.

Jason Liu: Yeah, I guess I think my total output is not limited by how much faster I can type, but by my ability to, like, have a great vision and describe what I want. As I slowly hire more people and learn to delegate, I was like, man, I just don't know how to describe what I want, and whether what I want is ambitious enough. And those are not things LLMs will really push me on.

Raza: Yeah, AI won't change our ability to articulate what vision we want of the future or what we want to get done. That definitely resonates. Okay, final question from me. Something that I think everyone's struggling with at the moment is just keeping up with rate of progress, papers, announcements. Are there any papers or techniques or things that you've seen recently that have stuck in your mind and that you think people should be aware of?

Jason Liu: I'm going to be very embarrassed to say that I do not read Twitter or AI news that much, even though I'm on Twitter all the time. Because again, my thesis has been like nothing has changed.

Raza: Yeah. But it doesn't have to be Twitter, right? It could be like, I read a paper on arXiv.

Jason Liu: Yeah, I don't super... I think there's so much noise that if I just close my eyes for six months and open them up, the market will have filtered everything else out.

Raza: Yeah, I think that's actually really good advice. Like this is something that I've said to people multiple times when they ask me sort of how do you keep up? Which is like, to a certain extent, like don't. In that the ideas that are good... I used to think this when I was a PhD student, like you could tell which ideas were good by the ones that were actually being used. Like the reason you knew Adam was like a killer optimizer is like everybody was using it and it was actually helping them get better results.

Jason Liu: Yeah, I just don't think right now we're limited by techniques. I think people don't know what they want to build, and they don't know whether the thing they're building is productive. And I'm not really trying to do research or anything like that. But yeah, the phrase I tell my buddies is, "I don't really read the news. Just wait for it to become history." And yeah, I think if we just close our eyes and open them again in like six months, we'll kind of have a good idea of what is interesting.

I will say one interesting paper that has been on my mind a while has been the Raptor paper. It could be much simpler, but the general idea is that, okay, we're gonna chunk up the text. And what we're gonna realize is that if we do basic retrieval...

Raza: Okay, tell me about this, I don't know this one.

Jason Liu: ...you get weird behavior from just shoving a bunch of different chunks into the context. And so what you can do instead is first cluster the text chunks and then generate synthetic summaries of those clusters. And now you've reorganized the information in a way that is much more tractable for search. For example, say I have 20 PDFs and in each PDF there are three topics.

If I do RAG, I'm actually like fetching chunks of like each PDF, but if I can just organize them some way and then summarize them, that might be a much more useful piece of information. And so what this tells me is...

Raza: So almost like doing topic modeling as a first step or something like that?

Jason Liu: Yeah, but the idea is like, you want to organize all knowledge first and then generate abstract or extractive summaries around them. And then your life will be a lot easier. And I think that just makes a lot of sense. I definitely think that's how I think about things. What are the core concepts? And you realize, these 60 papers actually really have four core ideas. And now that is something that can be reasoned well. And then the hyperparameters there are just...

How good is your clustering and how good is your summarization? All right, can you generate a summary that is retrieved well? And just this idea of like a summarization as an index and a key is the interesting idea. It's like, when you insert a row into a database, if you do a little bit more pre-processing, you can look up that row faster, right? By setting the primary key.

And you can do the same thing with language models. And there's some cost you can incur in the beginning of ingestion that makes retrieval a lot simpler.
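
A rough sketch of that cluster-then-summarize indexing step (not the Raptor paper's exact algorithm; the model names and cluster count are illustrative):

```python
# Sketch: group chunks by embedding similarity, summarize each group, and
# index the summaries alongside the raw chunks so retrieval can hit the
# summary first and then drill into its members.
from collections import defaultdict
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

def summarize(chunks: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed summarization model
        messages=[{
            "role": "user",
            "content": "Summarize the shared topic of these passages:\n\n" + "\n---\n".join(chunks),
        }],
    )
    return resp.choices[0].message.content

def build_summary_index(chunks: list[str], n_clusters: int = 4) -> dict[int, dict]:
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embed(chunks))
    grouped = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        grouped[label].append(chunk)
    return {
        label: {"summary": summarize(members), "chunks": members}
        for label, members in grouped.items()
    }
```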

Raza: Yeah. Something that's a counter to that, which a different interviewee put to me, was that they were doing RAG over some really large corpuses, like ingesting huge corpuses for big companies. And the reason they were pretty negative on a lot of these pre-processing techniques is that they found there was a huge long tail in terms of which documents were retrieved. So paying a lot of upfront cost on processing or augmenting with metadata ended up being wasted effort for like 80% of the documents. There was this 20% that get retrieved a lot and then 80% that very rarely get retrieved. You do need them in the corpus, occasionally people ask stuff that brings them up, but it's wasted effort if you spend a lot of compute upfront doing anything...

Jason Liu: Yeah, I know that makes sense, but I think that's definitely sort of the trade-off and that is interesting to think about now. It's like, okay, well, there's a performance benefit if I put everything in Redis in memory or I put it into a database or into Hive. Now you reason about like, is it a scan? Is it O(1), O(N)? And now you get fun trade-offs you can make to make the system much more...

Raza: Yeah, and I guess we're now back into just conventional engineering land, which is, I guess, one of your core messages throughout.

Jason Liu: Yeah, it's like, you know, if you want to primary key your inserts, your writes are a little bit slower, but your reads are a little bit faster.

Raza: Yeah. All right. Well, on that bombshell, let's end it there. Thanks so much for coming on, Jason. I really enjoyed the conversation.

Jason Liu: It was a lot of fun. Thank you.

About the author

Raza Habib
Cofounder and CEO, Humanloop
X: @RazRazcle
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.