From PyTorch to Fireworks AI: Building AI Infrastructure

Raza Habib

This week we’re talking to Lin Qiao, former PyTorch lead at Meta and current CEO of Fireworks AI. We discuss the evolution of AI frameworks, the challenges of optimizing inference for generative AI, the future of AI hardware, and open-source models. Lin shares insights on PyTorch design philosophy, how to achieve low latency, and the potential for AI to become as ubiquitous as electricity in our daily lives.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts


Fireworks AI is a platform that provides optimized inference for generative AI and compound AI systems. They offer a one-stop shop for developers, handling model and hardware updates while focusing on performance optimization. Fireworks AI uses various techniques, including custom CUDA kernels, smart model sharding, and semantic caching, to achieve low latency and high efficiency in AI inference.

Lessons from PyTorch development

Lin Qiao emphasizes the importance of maintaining a clear product vision and not compromising design for short-term gains:

"We cannot compromise product design just for how people feel... So be extremely opinionated on the product spirit and stick closely to it. It's very important."

The power of declarative approaches in AI frameworks

PyTorch's success was partly due to its declarative approach, which aligns with the increasing complexity of AI models:

"PyTorch is taking more declarative approach... They should just give a system that they say, they tell us what's their instruction like, what they want to get out of it, and we build a complex execution engine to optimize for them."

Challenges in Gen AI development

The shift to GPU-based infrastructure has changed the cost and latency profiles for AI applications:

"Gen AI changed the calculation here. First of all, the general models are on the highest end of model complexity in size, it has to run on GPU... So whole entire infrastructure is very expensive versus before, application ran on heavily commoditized CPU."

Fireworks AI's approach to inference optimization

Fireworks AI uses various techniques to achieve low latency and high efficiency:

"We heavily optimize by handwritten our customized kernel, right? The customized CUDA kernel... we have ways to chop the model into pieces and scale them completely differently. So that's the best way to remove bottlenecks in the most efficient manner."

The future of AI hardware

Lin predicts increased competition in the AI hardware market:

"In any business market, when the margin is high, it's going to invite competition... I think clearly other like big hardware providers are looking into this space, especially AMD, and their new MI 300 is launching on various different CSPs has become available."

Open-source vs. closed-source models

Lin believes the quality gap between open-source and closed-source models is closing:

"The quality gap between open source and closed source are closing, right? So Phi-2 is almost on par with with the best closed source models."

AI's future impact

Lin envisions AI becoming as ubiquitous as electricity in our daily lives:

"I firmly believe that AI will be like our electricity, like utility, deeply ingrained into our day-to-day life... We're at a very early beginning stage of this innovation."

Chapters

00:00 - Introduction and PyTorch Background
04:28 - PyTorch's Success and Design Philosophy
08:20 - Lessons from PyTorch and Transition to Fireworks AI
14:52 - Challenges in Gen AI Application Development
22:03 - Fireworks AI's Approach
24:24 - Technical Deep Dive: How to Achieve Low Latency
29:32 - Hardware Competition and Future Outlook
31:21 - Open Source vs. Proprietary Models
37:54 - Future of AI and Conclusion

Podcast

[00:00:00] Lin Qiao: So I think, first of all, we cannot compromise product design just for how people feel. Pre-Gen AI, post-Gen AI, there are a few things that are changing, a few things that are not changing. What's not changing is those app developers, they need to have viable products, they need to have viable businesses. That goal is not changing. But Gen AI changed the calculation here, and just kind of in a nutshell, right? When you think about performance optimization, when it comes to latency, it's not just one piece, it's this whole entire system. And that's kind of the angle we're tapping into.

[00:00:33] Raza Habib: This is High Agency, the podcast for AI builders. I'm Raza Habib. I'm super excited to speak to Lin Qiao today. I've been looking forward to this one for a while, partly because Lin led the PyTorch team at Meta for five years, and also because she's now the CEO of Fireworks AI that is making inference for generative AI and for compound AI systems accessible to more people. Lin, it's great to have you on the show.

[00:00:53] Lin Qiao: Hey, Raza, I've been also wanting to chat with you for a long time, so thanks for having me here.

[00:00:58] Raza Habib: It's fantastic. So Lin, I actually wanted to start off pre-Fireworks, and chat a little bit about your time at Meta and PyTorch. You know, my machine learning upbringing, as it were, was in the heyday of the autodiff wars. In the early part of my PhD, everyone was rolling their own autodiff solution. We had at least two in my group. And then there was Theano and Caffe and Torch and, you know, Wolfe. And then TensorFlow came out and seemed to be the initial winner, and PyTorch seemed to come from behind, really overtake, and become, I think, the most popular framework today. So I'm kind of curious, what can you tell me about what it was like running that project? Paint the picture of what that was like and how you guys managed to do so well.

[00:01:38] Lin Qiao: This is a great question, Raza. I still remember, that was about seven years ago. Across the industry, there were so many frameworks emerging left and right. It was the same within Facebook - at that time it was not yet Meta. We had three different frameworks: PyTorch for research, Caffe 2 for mobile, and ONNX for server. It got very confusing for our ML engineers - when to use what, and how to make those choices. And we made the decision that we were going to have one framework serve all the different use cases, from research to production, from server to mobile, across the board. After we made that decision, I took on that team, and at that time, to be very honest with you, it felt like Mission Impossible, because one framework needs to satisfy all the different requirements, which sometimes contradict each other. We had very, very strong engineers supporting and driving this project, and we came up with various smart solutions. One that emerged was: hey, we're going to take the best of PyTorch, which is the front-end ease of use and simplicity, and marry it with the fast back-ends of ONNX and Caffe 2.

And boom, you get the best of both worlds, right? The project was called Zipper - literally zipping both together. And then soon after we got into the project, we realized, oh, it's not possible to drive it, because these two systems - or three systems - were never designed to be married together, and there's much bigger overhead to deliver that compared with building from scratch. So we quickly realized this was actually a bad product design. The best product design is trim and opinionated: back-end and front-end need to go as one piece. So we decided to go all in on PyTorch - keep its brilliant front-end and rewrite the entire back-end from scratch. The PyTorch C++ back-end was the first step in that direction. And we brought forward TorchScript, which is an interpreter that can, on behalf of the user, try to drive performance to the next level without changing the user experience too deeply. We further doubled down on this direction and created PyTorch 2.0, released two years ago, which delivers even more automated performance optimization through the compiler stack. The idea is that users focus on the areas where they have expertise - application development, product development, model research - and leave the considerations of latency, TCO, cost efficiency, scalability, and reliability to the framework, to the AI infrastructure. We found that to be the best positioning, and that drove crazy adoption of PyTorch in the industry.

[00:04:28] Raza Habib: The point about simplicity definitely resonates. I remember being an early TensorFlow user, and one of the big differences I recall was that TensorFlow forced you up front: you had to create the graph, specify it, kind of in a compiled language, and then it would be compiled to XLA. There was a lot of overhead upfront as a developer building that. And PyTorch, I think building off Autograd, was one of the first ones where they did the tracing for you. So you would just write your code as normal code, and it would create the graph for automatic differentiation in the background. Do you think that's the reason that PyTorch became more popular, or are there other things that I'm not thinking about? When you were running PyTorch, what were the metrics that you thought to optimize for? Why do you think PyTorch won? I think that's borne out in the stats in terms of usage.

[00:05:28] Lin Qiao: So there's a bigger trend, right? Take a big step back and look at how the modeling space evolved from pre-Gen AI to Gen AI. It's clear that modeling becomes more and more complex, and if you have to spend a lot of time understanding, updating, maintaining, and evolving a very complex graph, that graph becomes unwieldy, and that is not a good user interface. Instead, writing and developing the model in Python code is a much better user interface, and then we handle the rest of the optimization. Remember, our mission for PyTorch was to develop it both for researchers, for fast iteration and innovation, and for production deployment. So all the production concerns need to be baked into the design, but we did not want to ruin the user interface, because these two goals actually conflict with each other, and we took a very determined product design. There are two kinds of design. One is imperative, the other is declarative. Imperative means you just tell the system exactly what to do; that's the graph approach. Declarative means you tell the system what you want to get out of it, and let the system decide what to do. This is not new in the industry, and of course, PyTorch takes the more declarative approach. In the database world, for example, there are DAG-driven workflows and scheduling built on top of databases, and those are completely imperative: you build your execution graph - how you do data pre-processing, how you filter, at what time, on what trigger, which table you want to refresh, and so on. It's an explicit DAG. But the database mechanism itself is declarative, as in you write SQL. SQL basically tells the database what you want to extract out of it; you don't tell the database what to do. And the database is so smart: it builds a distributed execution plan by itself, and it uses all kinds of data structures to accelerate execution - indexes, materialized views, partitioning, all these different techniques. So in this whole AI infrastructure space, we have those different extremes of design points, and based on our PyTorch journey, I carried a lot of that thinking into Fireworks. Part of the design is to focus on the declarative part, because we focus on the cohort of application engineers and product engineers who are riding this wave of new Gen AI technology to innovate and disrupt, and we don't want them to put a lot of thinking into how to teach the system what to do. They should just tell us what they want to get out of it, and we build a complex execution engine to optimize for them, so they can focus on where they should innovate. This mentality and philosophy is deeply ingrained in our minds when we design our product.
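
To make the imperative/declarative contrast concrete, here is a minimal PyTorch sketch (an illustration added for this write-up, not code from the interview): the model is written as ordinary Python and autograd traces it as it runs, while `torch.compile` from PyTorch 2.0 plays the role Lin describes of a complex execution engine that decides how to optimize execution.

```python
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 256)
        self.fc2 = torch.nn.Linear(256, 10)

    def forward(self, x):
        # Ordinary Python: autograd records operations as they run,
        # so there is no separate graph to declare and maintain.
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet()
compiled = torch.compile(model)    # PyTorch 2.0: let the compiler stack decide how to execute
out = compiled(torch.randn(32, 128))
out.sum().backward()               # the backward graph is built automatically by autograd
```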

[00:08:20] Raza Habib: That's going to naturally transition to what I want to talk to you about next. We're talking about history right now, and I want to talk about Fireworks AI. One final question before we jump to that: do you think there are any lessons from that period, the PyTorch and initial autodiff wars that were going on between mainly, I would say, PyTorch and TensorFlow and then later JAX, because Theano and others kind of disappeared quite early? Are there any generalizable lessons for people who are building developer tools? We're building for developers and for AI product people, and so are you now. What do you take with you from that time to Fireworks that maybe informs how you guys are building today?

[00:09:00] Lin Qiao: Yeah, there are so many lessons learned. It was not a straight path to get to where PyTorch is today. So I think, first of all, we cannot compromise product design just for how people feel. Let me elaborate. For example, as I shared earlier, initially we took the Zipper approach. We put multiple teams together, and we wanted to make those teams feel good about the future direction, and it felt like, hey, everyone has work to do here, integrating with each other. But in fact, by going in this direction, we completely compromised the quality of the product. We realized, okay, this is not the way to go, and if we don't have a viable product, nobody's going to be happy. So be extremely opinionated on the product spirit and stick closely to it. It's very important. That's one big lesson learned. The second is that, as we developed PyTorch over the past seven years of this AI journey, one thing that hasn't changed - it has only become more vibrant - is that the ecosystem is constantly moving and developing. It's very important for this framework, and now for Fireworks as an AI platform, to embrace it. So don't fight with the ecosystem.

[00:10:11] Raza Habib: It's interesting you say that. I agree, and I also want to caveat it slightly, because I actually think one of the things that TensorFlow got wrong is that they integrated too much from their contrib library into TensorFlow 1.0. I remember that in their API, versus PyTorch, there were maybe three or four ways to do everything, because they had crystallized an abstraction too early and taken too much community contribution. And so I wondered, how do you handle that at the stage we're at now? Because I think this is true in Gen AI today as well: everything is changing really quickly, the models are changing, and so trying to find the right abstraction is actually quite challenging. Something I worry about is crystallizing abstractions too early, or allowing too much community contribution at a stage when things aren't fully settled. So how do you balance wanting to play nicely with the ecosystem whilst also knowing that things have to adapt very rapidly and you don't want to get stuck in a bad crystallized abstraction too early?

[00:11:11] Lin Qiao: Getting abstraction right is very hard, especially for pioneers who go ahead of the curve without knowing all the data on what the integration points will look like and how to optimize them. So I first want to acknowledge that it's very hard. And take a big step back: abstraction for whom? In general, for AI platforms and systems, there are two layers of abstraction - or by now probably multifaceted layers of abstraction for integration. One is the developer-facing abstraction. That is a bet, and I see two schools of thinking in the industry right now: as I said, imperative versus declarative for developer-facing abstractions, and there are very clear trade-offs. The other side is the ecosystem integration, including the various kinds of hardware, the models, and the data. What does that look like, and how do you make it easy, so that you build a vibrant ecosystem and the developers onboarded to your platform can always leverage the state of the art, the newest thing, without worrying, oh, I need to create a separate path and learn new skills to get the best out of what's emerging in the ecosystem? So I categorize it into those two big, different buckets.

And now I can go into details of how Fireworks thinks about it. Typically people know Fireworks as the inference engine, single model as a service, and that's our lowest layer of product offering. When we started, we knew our product would be multiple layers, but we had to build this lowest layer first, because what is happening in the application development world is very interesting. Pre-Gen AI and post-Gen AI, there are a few things that are changing and a few things that are not changing for application development. What's not changing is that app developers need to have viable products, they need to have viable businesses. That goal is not changing. But Gen AI changed the calculation here. First of all, generative models are at the highest end of model complexity and size, so they have to run on GPUs. Well, models are getting smaller, so there may be a chance they'll run on CPUs as well, but the majority of models still run on GPUs. And the GPU itself is expensive. GPUs require a large amount of power, power is expensive to acquire, they generate heat, and cooling is expensive. So the whole infrastructure is very expensive, versus before, when applications ran on heavily commoditized CPUs. So the latency profile is different, the cost profile is very different, and that puts app developers in a very interesting position: this new technology is going to create fantastic product experiences that never existed before, which means they have a higher chance of hitting product-market fit much faster. But once they hit product-market fit, the problems come. They can quickly get into trouble, because if they lose money at small scale, then as they quickly scale up, they're going to go bankrupt soon. They need a lot of CapEx and a lot of capital to sustain their growth before they figure out how to make the business viable. And second, an important part of the product experience is latency and interactiveness. The majority of Gen AI tech is now powering B2C or B2D applications, because it is generating content for humans to consume. So those are the two big categories.
Interactiveness is extremely important, and because the models are so big, bringing down the latency is very hard - it's completely different. So developers are learning and trying to understand how to get on top of this. You deeply interact with app developers, product engineers, and machine learning engineers - I would love to hear from you: what are the biggest challenges they are facing today?

[00:14:52] Raza Habib: Similar to you, I think about what changed. At Humanloop, I think about what changed for Gen AI - before Gen AI, I was building applied machine learning - and what's new. And I think there are three things that we see as fundamentally different, that break a lot of traditional software workflows. The first is that a lot of the applications are harder to evaluate than they used to be, because they're very subjective in nature. If someone's generating summaries or doing copywriting, or even extracting information, it's hard to know what's correct, right? It's difficult to write the equivalent of a unit test, even compared to normal machine learning, where you could calculate an accuracy metric on a test set. That can be quite hard if the ground-truth answer is subjective in nature. If it depends on the opinions of the end user, that's really different. And if you can't measure things well, then it's difficult to know, as you iterate, am I actually making things better? Am I improving my application? Do users like it? The second thing that I think is really different is that prompts are now a big part of how you build applications. If you go back even two or three years, everything was fine-tuning. There was no prompt engineering, and if you wanted to build an application, you had to build a large annotated dataset and fine-tune. Now you can often get really far just through writing these natural language instructions, writing prompts to the models. But you now have an extra artifact that's different from code, yet has a similar impact to code, and you can't reason about it in the way that you reason about code. If I'm writing code, I can look at it, even in my IDE, and understand what it does. With prompts, you can get a sense, but unless you run them, it's hard to know exactly what their impact is. And the third big difference is who's involved in building. Before, it was mainly just machine learning engineers building AI systems and products. Now it's software engineers more generally, but more importantly, it's also domain experts. The product managers and the subject matter experts are much more involved. We work with companies like Duolingo, for example, where it's actually the linguists who are doing prompt engineering. And if you have that kind of setup, where evaluation is hard, you have these prompts, and you've got to involve domain experts, then your normal software processes are kind of broken, right? How do you version things? Well, if you use Git to version them and keep everything in code, the non-technical people can't look at it. How do you do evaluation? You can't write normal unit and integration tests, so you have to have a new way of doing evaluation and observability. Those are the problems we've been trying to solve at Humanloop. What we provide is an interactive place for non-technical people and engineers to collaborate on prompts, tool definitions, and datasets, and coupled to that are tools for doing rigorous evaluation and observability. Those are the problems we tend to be speaking to people about, though people do also speak to us about fine-tuning and about bringing latency down, which I guess is more your wheelhouse.
I had a question that occurred to me as you were explaining this problem about latency for Fireworks and trying to drive it down. Whilst Gen AI has changed a lot of things and made it easier to get PMF, it hasn't necessarily changed the need for high interactivity and fast experiences. And a question I wanted to put to you is: is that potentially because we're currently going after some of the lower-value use cases? What I mean by that is that the amount of time people are willing to wait is, in some sense, proportional to the value of what they're going to get. The earliest versions of LLM applications were things like copywriting, and for copywriting, each example copy, which is just for inspiration, has to come back pretty fast; it's not super valuable. But I've seen people working now on augmenting a CEO's decision-making, or replacing what might have been a question to a lawyer, where the alternative might have cost me many hours and hundreds of dollars, in which case I, as an end user, might be willing to wait a few minutes if that's what it takes to get an answer. So just as a challenge question back to you: do you think this constraint on latency maybe isn't as hard as it might seem at first?

[00:19:00] Lin Qiao: Yeah. So I think, first of all, latency and cost efficiency are two sides of the same coin; it's general system performance optimization. I will say two things. One is about the nature of content generation. Where is the value creation here? Value creation is that you create a new product experience that becomes a new center of gravity. That's one level of value creation, and the majority of those experiences actually require interactiveness; the value is provided when humans are in the loop and they are getting the right content, in a timely fashion, at the receiving end. So that's a clear use case. But second, as you mentioned, not everything needs to be real-time. It can be batch processed, asynchronous. And we do see many use cases - for example, product catalog cleansing for better search and relevance in e-commerce. There are many use cases where you don't need real-time, sub-second latency, but cost then becomes an issue, because you still have a high volume of content generation, and a lot of the reason you do it is that you want to take the expensive human out of the loop a bit and automate this, so the pricing point needs to make sense. Either way, high system performance optimization is needed. That's where our distributed inference engine comes in to solve the problem. But I do want to tap into a really good point you brought up: the tension between fine-tuning and prompt engineering. It's clear in the industry, from the model provider point of view, that there's a clear trend of context window sizes becoming longer and longer. And the other trend - I completely agree with you - is that the people working with Gen AI are moving from people with deep Gen AI expertise, the machine learning engineers, to a broader, much bigger audience: general software engineers building applications and products, and even PMs. It's much broader, and that's where the value creation is going to be concentrated in the future. And if you think about it, fine-tuning is a very deeply involved process. First you have to trace data, then you have to figure out what data to label, and then you kick off the fine-tuning process. There are so many different ways to fine-tune - whether you do the full model, or LoRA, QLoRA, or DPO, there are so many options - and then you have to create an eval dataset to get a score. And if the score is not going in the direction you want, you have to dissect it and figure out what's going wrong. It just takes forever, and even if you invest the time and bring the talent in, there's no guarantee. So people feel frustrated, especially as we move up the stack to a broader audience. They say, I'm gonna do prompt engineering, at least I can see results instantly. And then the challenge becomes that the problem gets so complicated after a while - how do you maintain and manage it? So it goes back to your point, right?
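
As a rough illustration of the workflow Lin describes (label data, pick a method such as LoRA, fine-tune, then score an eval set), here is a minimal sketch using the Hugging Face transformers, peft, and datasets libraries. The base model name and the two .jsonl files are placeholders, not anything referenced in the interview.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.1-8B"                    # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA: train small low-rank adapters instead of the full weight matrices
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def tokenize(batch):                                # assumes each record has a "text" field
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = load_dataset("json", data_files="labeled_train.jsonl")["train"].map(tokenize, batched=True)
eval_ds = load_dataset("json", data_files="eval_set.jsonl")["train"].map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()
print(trainer.evaluate())                           # the eval score you dissect when quality regresses
```

Even this stripped-down sketch hints at why Lin calls fine-tuning deeply involved: every step, from data curation to the eval set, is a place where things can silently go wrong.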

[00:22:03] Raza Habib: And, you know, for developers who are building AI products, there's obviously a few options now for the inference piece, right? There's Together, and there's Modal and Fireworks, and NVIDIA has Triton. You know, if I'm a developer, what is it about Fireworks that makes Fireworks special? Like, why should I choose Fireworks over one of these alternatives?

[00:22:24] Lin Qiao: Right. So first of all, it's the product design we have, the product offering we have. We are not just an inference provider; we're a one-stop shop. What we heard from our audience, our community, is how frustrating it is to keep chasing models. Every week at least one new model is emerging, and they're asking, should I keep chasing it, or should I just stay on my current course? And new hardware is emerging pretty fast too. Before, the hardware cycle was a new release every two years; now it's every year, and now it's maybe multiple hardware SKUs within a year. How do I even tap into that? It's very frustrating - there's a kind of hardware fatigue as well. So we want to be the platform where you just don't need to worry about it. You don't need to worry about staying on top of the best models or the best hardware; by using our platform, you automatically get it. That's one angle. The second angle is the quality part. You also touched on that: we hear a lot of concerns from developers about how to get the quality that is best for their app and product needs, and it varies. There's no standard, because every product and app is different, so we want to simplify that process through customization. We have a workflow to customize towards your workload, so it's catered, it's tailored to you individually. That saves a lot of your time figuring out the know-how, so you can just move forward. And the third is that, eventually, our world is going to be way more complicated than single-model access. The app developers, the PMs, the product engineers don't necessarily know how to assemble models and APIs together to solve a problem, and we want to be at the forefront of helping them solve that part. That's why we invest our software stack deeply in that layer of problems, way above single model as a service. The vendors you just mentioned focus on single model as a service, and we are moving to a much higher level.

[00:24:24] Raza Habib: Okay, that makes sense. One thing I'd love to dive a little deeper on is that one of the things you strive to do is help people serve these models at extremely low latency - it's something you mentioned earlier. There are a lot of different research techniques that people apply to that, whether it's low-rank fine-tuning or quantization or pruning. I'd be curious if you could share some of the research you've done, or the methods you're using behind the scenes, to give people some sense of how hard it is in practice to actually get that really low latency.

[00:24:57] Lin Qiao: Right. So we're actually pretty transparent; we publish blog posts about our technology, and I would encourage everyone to read more details there. We will definitely publish more. In a nutshell: when you think about performance optimization, when it comes to latency, it's not just one piece, it's the whole entire system, and that's the angle we're tapping into. Of course, at the lowest level there are the different model layers - the heavy hitters are the matrix multiplications, the attention, the cross-GPU communication. We heavily optimize with hand-written customized kernels, customized CUDA kernels. We use various NVIDIA libraries like cuBLAS - those software layers are really good, especially the low-level libraries - but we hand-curate the important kernels to really be the most efficient. So that's one layer of investment. But there's also the fact that different model components are bottlenecked by different things. Some components are bottlenecked by memory bandwidth, some components are bottlenecked by FLOPS, some components are bottlenecked by communication. And we cannot scale one model as one piece, because no matter how you scale it, it will hit a bottleneck here and there. So basically we have ways to chop the model into pieces and scale them completely differently. That's the best way to remove bottlenecks in the most efficient manner.
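
To illustrate the sharding idea in the abstract (a toy sketch, not Fireworks' engine or its custom kernels), here is a column-parallel linear layer in PyTorch: each GPU holds and computes only a slice of the weight matrix, and an all-gather stitches the slices back together, which is exactly the kind of cross-GPU communication that becomes its own bottleneck to manage.

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Toy tensor-parallel linear layer: the weight is split column-wise across ranks."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "output dim must divide evenly across ranks"
        # each rank owns out_features / world output columns
        self.weight = torch.nn.Parameter(torch.empty(out_features // world, in_features))
        torch.nn.init.kaiming_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = x @ self.weight.t()                        # this rank's slice of the output
        shards = [torch.empty_like(local) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local)                     # cross-GPU communication step
        return torch.cat(shards, dim=-1)

# Usage (inside a torch.distributed process group, e.g. launched with torchrun):
#   layer = ColumnParallelLinear(4096, 11008)
#   y = layer(torch.randn(8, 4096))
```

In practice, a FLOPS-bound block and a memory-bandwidth-bound block would get different sharding and replication factors, which is the "chop the model into pieces and scale them differently" idea.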

[00:26:20] Raza Habib: So just to make sure I'm following you, there are two things you've mentioned so far. One is writing custom CUDA kernels to make sure that, at the lowest level, you get the most efficiency possible. And the second big thing is the way you shard the models to get the most benefit from distributed training - there are some smart ways in which you're sharding that change what the limiting factor on speed is for different parts of the architecture. Is that correct?

[00:26:46] Lin Qiao: Yeah, that's right - for distributed inference rather than distributed training, but it's similar thinking, because PyTorch also supports distributed training, and distributed training has all different ways of sharding for different kinds of model architecture. So it follows a similar design philosophy. And the third level is that we build various layers of semantic caching, because a lot of the time the prompts are heavily repetitive. If you go back to our product releases, we rolled out prompt caching last year, and Anthropic just announced it now. But this is just one technique; there are so many others. For example, a lot of assistants are really pushing the boundary of how low the latency should be, and one hot user can screw up the entire latency profile. If you think about it, they have a lot of users, and some users are power users - there's a skewed distribution - so we have a combination of sticky routing plus hotspot management. What's sticky routing? Basically, you want to maximize the cache hit ratio, and of course that helps latency the most, because if you hit the cache, you don't even need to compute anything. You trade memory and cache for computation, and computation is expensive and slow. So there are a lot of techniques we pull together to ensure this extremely low latency.

[00:28:10] Raza Habib: But how do you actually do that? You mentioned the sticky routing. So trying to hit the cache more. How are you doing that? Or, yeah, how does that work?

[00:28:17] Lin Qiao: Yeah. Basically, you can route based on a certain kind of ID to make sure a request is dispatched to the right cache, and so on. Those are the completely new layers we're building. My point is that it's a very complex system, built on expertise extracted from years of serving production at Meta - that's the scale we're talking about. And small-scale production, where many inference providers are operating, versus large-scale production, where Gen AI inference is heading, is a very different ballgame. At small scale, small problems don't matter. At large scale, every small problem matters. I'm not joking. For example, you would never think billing would be a problem, right? We were using some third-party billing code, and that caused a bottleneck for us, because it was not designed for scaling, and then our customer workload started to increase. We had to rewrite part of that code. So every single thing matters. And as I just mentioned, one user can become a hotspot and affect an entire replica of the deployment, and we have to handle that. Those are corner cases that seem unimportant at small scale but become hugely important at large scale, and that requires a lot of expertise to handle as well.
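
Here is a toy sketch of sticky routing with hotspot handling (an illustration of the general idea, not Fireworks' router): a stable routing key, such as a user or prompt-prefix ID, is hashed so repeated requests land on the replica whose cache is already warm, while keys flagged as hot are temporarily spread out so one power user cannot saturate a single replica.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]   # hypothetical replica pool
HOT_KEYS: set[str] = set()                                        # filled in by hotspot monitoring

def pick_replica(routing_key: str, salt: str = "") -> str:
    # Deterministic hash of the key, so the same key always maps to the same replica
    digest = hashlib.sha256((routing_key + salt).encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

def route(routing_key: str, request_id: str) -> str:
    # Hot keys get spread across replicas; everyone else stays sticky to keep caches warm
    if routing_key in HOT_KEYS:
        return pick_replica(routing_key, salt=request_id)
    return pick_replica(routing_key)

# Same key, same replica (and the same warm prefix cache) every time:
assert route("user-42:system-prompt-v3", "req-1") == route("user-42:system-prompt-v3", "req-2")
```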

[00:29:32] Raza Habib: One final question on this thread. You mentioned NVIDIA's software layer, etc. Obviously, they've been having an incredible time recently. But I know that you guys are trying to optimize across all of the best hardware available, no matter where it comes from. Do you think NVIDIA will maintain their dominance? How do you imagine the hardware battle playing out over the next few years?

[00:29:54] Lin Qiao: Oh, so both NVIDIA and AMD are investors, first of all.

[00:30:01] Raza Habib: I'm sure, I'm sure Jensen is truth-seeking. He's going to want to know your true opinion. Don't sugarcoat it.

[00:30:07] Lin Qiao: I think my opinion is not anything controversial. In any business market, when the margin is high, it's going to invite competition. That's just how capitalism works. Clearly other big hardware providers are looking into this space, especially AMD - their new MI300 is launching on various CSPs and has become available - and they're gonna compete on latency and on price-performance, on both ends. In general, competition is good for the community; competition drives efficiency. That's something I have seen a lot with PyTorch: it's a vibrant community, there are multiple providers working in the same area, and they nudge each other to become better.

[00:30:54] Raza Habib: So the community is definitely going to win. But you think that for the hardware providers themselves, the market's going to become busier, and there will be options other than NVIDIA in the near future?

[00:31:05] Lin Qiao: I think so, yeah, I think it's just inevitable. But NVIDIA, of course, is a company with very good execution, so I don't think they're gonna sit on the sidelines and assume there's no competition. I think they must have realized that deeply, and they're gonna move fast too.

[00:31:21] Raza Habib: In many ways, Fireworks is a bet on open-source models over proprietary models. If it turned out that OpenAI and Anthropic were to dominate the space, that would be less good. And I think I share your optimism about there being a plethora of options for people. But can you share a little more of your thinking on why open source might win, or what advantages open source has versus closed models? How should developers who are choosing between these think about it?

[00:31:49] Lin Qiao: Yeah. First of all, if you look at the history here, the very short history of the past year and a half...

[00:31:55] Raza Habib: The beginning feels like a long time ago, doesn't it?

[00:31:58] Lin Qiao: Yes, like a century, right? At the beginning of last year, there were just no good open-source models, and then Llama 1 first came out. That drove a huge wave, all the way to Llama 3.1 today, and the quality gap between open source and closed source is closing, right? So Phi-2 is almost on par with the best closed-source models. There's a profound reason behind that: eventually, where's the differentiation? The differentiation before was data, and of course the training technology. But people move around, so I think that part is going to go away.

[00:32:33] Raza Habib: So you think the insight that people have in terms of how to train these things is not durable, because of how promiscuous the labs are with each other - researchers just moving all over the place.

[00:32:43] Lin Qiao: People move around, right? So over time, the technology differentiation is going to diminish. Of course, there could still be something new - remember, the transformer has been here for a long time, and there's no new architecture yet, but maybe tomorrow someone discovers a new model architecture that will be a breakthrough. But at the same time, who has the data differentiation? Everyone crawled the internet, and they use the same proprietary data providers too. So the data differentiation part is not that big. But there are companies owning private data, with a huge amount of access to content, and they do have an advantage.

[00:33:22] Raza Habib: Is that advantage durable? I used to think that as well. And I remember Bloomberg went and built a custom, fine-tuned LLM using a lot of finance-specific data. And then eventually, when GPT-4 came out, it turned out that it outperformed that model, even on finance, despite the fact that Bloomberg had access to this extra data. To what extent might scale overcome some of these data deficiencies? Or is the private data advantage more durable, in your opinion?

[00:33:54] Lin Qiao: So if you think about the base model, I would just let the model providers compete on base models. It's a hard competition, because in my mind, base models depreciate really fast these days. As I said, every week there's a new model coming up; they're on the leaderboard for a short few weeks, and then someone else goes above them. So they depreciate really hard. Let's ride that wave.

[00:34:20] Lin Qiao: For special domains, there's still a lot of room for post-training alignment, because the base model is trained with public information or some proprietary data, but not as deep as what's inside an institution, and the model providers are making assumptions about what is important, how to curate the diversity of data, and so on. Sometimes the assumptions match certain industry segments, sometimes they don't. So you have to fine-tune through this process of alignment, and that's where we provide that service.

[00:34:51] Lin Qiao: And also, think about aligning a large model versus aligning a small model. Aligning a small model is much easier, right? It's like steering a cruise ship versus steering a tugboat - of course the smaller boat is easier to...

[00:35:07] Raza Habib: I'm not sure if I share that intuition. Can you help me bridge the gap? I can see an argument that would say that larger models have more generalization capability, or more ability to transfer ideas from one domain to another, and so it might be easier to align them. If I'm trying to get a model to understand the concept of honesty, say, then a larger model that has more general knowledge of the world might actually be able to grok that concept more easily than a small model that has less world understanding. What am I missing, or why might I be wrong?

[00:35:41] Lin Qiao: So think about that as in, you have a problem to solve, right? You want the model to align with your problem, but you also want to preserve some existing capability of the model.

[00:35:53] Raza Habib: Yeah, makes sense.

[00:35:55] Raza Habib: But catastrophic forgetting, you know, is worse in smaller models than in larger ones. So again, my intuition would be that it'd be easier to align a larger model.

[00:36:04] Lin Qiao: So you cannot just give the model the fine-tuning dataset for the specific task. You have to blend that data with - almost like you have to emulate the training data - a small portion of training-like data for the areas you care about.

[00:36:18] Raza Habib: But that would be true for both big and small models, right? Like, why would it be better for a small model?

[00:36:24] Lin Qiao: Because small models have fewer parameters, and you can iterate a lot faster. In big models, the training data is much more comprehensive, so the question becomes: how do you drive this gigantic model to hold onto the areas you want it to remember? That's the difference, and we see that difference in alignment. Especially here we're talking about enterprise tasks, not generic tasks. For enterprise, it's very specific, very pinpointed: I just want to solve this problem really well, and that's enough for me. That's their motivation for going small - of course, in the end it makes economic sense and it drives down latency, but the quality needs to match.

[00:37:08] Lin Qiao: And you can also try various fine-tuning mechanisms on small models more easily than on a large model. So for various reasons, driving alignment on a small model is much more attractive, and you have a lot more control. But as we discussed before, we also want to simplify the alignment process, because it's still evolving. If you could provide a small number of samples and the system did a lot of the work for you - apply a small number of samples and, boom, it just works - that would be ideal. But if you need people to provide hundreds of thousands of samples, make sure they're high quality, make sure they have the diversity needed, that's a heck of a lot of overhead, and a lot of people don't want to invest the time to do it. So that's the tension right now.
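
A toy sketch of the data blending Lin mentioned a moment ago: mix a small slice of general, pretraining-style examples back into the task-specific fine-tuning set so the model keeps its broad capabilities while it aligns to the narrow task. The 10% fraction and both datasets are placeholders for illustration, not values from the interview.

```python
import random

def blend(task_examples: list[str], general_examples: list[str],
          general_fraction: float = 0.1, seed: int = 0) -> list[str]:
    """Mix general examples into a task-specific set to reduce catastrophic forgetting."""
    rng = random.Random(seed)
    # how many general examples are needed so they make up `general_fraction` of the final mix
    n_general = int(len(task_examples) * general_fraction / (1.0 - general_fraction))
    mixed = task_examples + rng.sample(general_examples, min(n_general, len(general_examples)))
    rng.shuffle(mixed)
    return mixed

# e.g. 1,000 task samples + ~111 general samples gives roughly a 90/10 mix
```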

[00:37:54] Raza Habib: That makes sense. All right, well, we're coming up on time. So if you picture the world five years from now, what do you think will have changed because of AI, and what do you think will stay the same?

[00:38:06] Lin Qiao: I firmly believe that AI will be like our electricity, like a utility, deeply ingrained into our day-to-day life. It's not fully there yet, and we're at a very early stage of this transition. Remember the previous big wave of tech-driven industry transformation: mobile-first. It basically freed us up, from doing stuff in front of desktops and laptops, stuck in front of a screen, to being able to interact with anybody, with any business, anywhere. That's a huge transformation. It unlocked so many business opportunities, and now we're glued to this little screen. I think AI is going to drive that kind of change; it's going to be deeply embedded in our day-to-day life.

[00:38:47] Lin Qiao: And who's going to push for that? The app developers, the product engineers, the PMs - they're going to push for that. And Fireworks' position is that we enable them to move really fast, experiment with their brilliant ideas, productionize, and get the business value out of it. Again, we're at the very beginning of this innovation. I still remember, in the mobile-first wave, we didn't have many apps, and the apps were pretty lame. I still remember the flashlight used to be its own app. I don't know if you remember.

[00:39:17] Raza Habib: I don't remember that.

[00:39:18] Lin Qiao: And now it's just part of the built-in features. So I think we are at the flashlight-app stage of Gen AI. The mobile-first push created generational companies like Instagram, Uber, and DoorDash, and so many digital natives built on top of mobile. We will get there for Gen AI and for AI in general.

[00:39:40] Raza Habib: Well, Lin thanks so much for joining me. It's been a really interesting conversation, and I'm glad we finally got to do it.

[00:39:46] Lin Qiao: Awesome. Very nice chatting to you.

About the author

Raza Habib
Cofounder and CEO
𝕏: @RazRazcle
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David MacKay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.