Building the Nervous System for AI with Russ d'Sa from LiveKit

Raza Habib

In this episode of High Agency, Russ d'Sa takes us on a journey through the world of real-time communication infrastructure for AI applications. From early experiments with voice assistants to powering OpenAI's ChatGPT voice mode, Russ shares the technical challenges and innovative solutions behind scaling AI communication. He discusses the importance of low-latency, high-quality audio and video streaming in creating convincing AI interactions, and offers insights into the future of multimodal AI and its potential impact on society.

Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts


LiveKit is an open-source infrastructure company that provides real-time audio and video communication capabilities for applications, including AI-powered systems. Their technology is used by major players like OpenAI to enable voice interactions with ChatGPT.

Chapters

00:00 - Introduction and Background
01:34 - The Evolution of AI and Lessons for Founders
05:20 - Timelines and Technological Progress
10:32 - Overview of LiveKit and Its Impact on AI Development
13:39 - Why LiveKit Matters for AI Developers
19:08 - Partnership with OpenAI
21:25 - Challenges in Streaming and Real-Time Data Transmission
30:07 - Building a Global Network for AI Communication
37:21 - Real-World Applications of LiveKit in AI Systems
40:55 - Future of AI and the Concept of Abundance
43:38 - The Irony of Wealth in an Age of AI

Highlights

"My recommendation to folks is one of two things: either build the picks and shovels—work on the infrastructure parts of AI, because it's still difficult to build a complex AI application and all the orchestration you have to do between different pieces of the stack—or go into something at the application layer that has a very domain-specific application."

"That's the hardest thing about working in this space—real-time is unforgiving. It has to just always work, and the user expectation is that when you tap that button, it's just going to be like you're teleported to wherever that other person or AI model is."

"If humans interface with one another using eyes, ears, and mouths, you're probably going to interface with that human-like computer—with AGI—using those same inputs and outputs."

Podcast

[00:00:00] Russ d'Sa: I think that the creative output of humanity is going to explode. And it's creativity that I think is one of the things that we enjoy most about life. I started to play with it a little bit, and I realized, wow, this model can actually respond to me, and it sounds human. You know, at that moment, I said, "Hey, I could build Samantha from Her."

[00:00:24] Raza Habib: I think you and like a thousand developers all around the world were thinking that.

[00:00:27] Russ d'Sa: We want to show people that it's possible, inspire them about the future that is coming—not about making an actual product that is super accurate and that you could deploy somewhere. And so we did this, and we posted it, and it didn't do that well.

[00:00:42] Raza Habib: This is High Agency, the podcast for AI builders. I'm Raza Habib. I'm excited today to be talking to Russ d'Sa, who's the CEO and co-founder of LiveKit, a company that builds the infrastructure for real-time audio and video applications like ChatGPT and Character AI—applications at the forefront of AI today. Russ, it's a pleasure to have you on the show. Thanks so much for being here.

[00:01:02] Russ d'Sa: Thanks for having me, Raza. It's great to be here. I'm excited to talk about all kinds of AI stuff with you.

[00:01:07] Raza Habib: Fantastic. So Russ, you've had a really interesting career. You were early at Twitter, early at 23andMe, you were a founder at YC, and now you're at the forefront of an AI infrastructure company. You've seen a few different waves. If you were a founder today or someone building in the AI space with all the hype that's going on, what lessons would you take from your story? What advice would you give to someone who's starting off today about how to think about that, how to choose problems? Is the current AI situation different from what you’ve experienced before?

[00:01:34] Russ d'Sa: I mean, I've seen, you know, I kind of grew up in Silicon Valley. I've seen my dad go through some of these waves as well—early GPUs, then there were batteries, computer networks like DSL, flash memory, and all kinds of stuff. And yeah, it's moved through these waves. And I think what's been an interesting learning for me—this is maybe my fourth or fifth company—is that they've kind of steadily gotten better and better. I think I've gotten a little bit wiser over time, or at least I like to think so. And I think with AI, what's really exciting is that it's the future. I think everyone sees that now. I mean, it's not necessarily about computers becoming super smart and being able to interface with you like you would with a human. I think those are new ideas, but I think it's now finally seeming like it's going to be possible, and there's kind of a scalable way to get there over time. And I think that is the really exciting part about working in this space and pursuing ideas in this space as a founder. The trickiest part, though, is the timeline. You know, one thing about my previous companies that I worked on—and I put this on my LinkedIn too—I kind of have a habit of working on things that are a bit too early for their time. And that is a real concern when you're starting a company. There's a huge reward for a company that goes after a market risk opportunity where there is no market. It's brand new, it's emerging, and they almost make the market themselves when they go out into the space or when they launch and grow. But it happens very rarely that you strike the timing exactly right. On the other side of the spectrum, there are ideas that are more like execution risk. It's about, in some ways, building a better mousetrap—or hopefully, like, a five to ten times better mousetrap, however you quantify that—and then you effectively take market share away from incumbents that have cemented the space and have a lot of revenue, but their product might not be as good. I'd say with AI, it's interesting because I talk to a bunch of AI founders now, and we talk about, okay, where do you think the market's going? What space should I get into? Is OpenAI going to compete with me and kill my company with the next Dev Day or launch? I think it's a really interesting time to be a founder in AI. My recommendation to folks is one of two things: either build the picks and shovels—work on the infrastructure parts of AI, because it's still difficult to build a complex AI application and all the orchestration you have to do between different pieces of the stack—or go into something at the application layer that has a very domain-specific application. I think the larger foundational model companies have an incentive to encode more and more knowledge, information, and utility into the foundational model in a generic or more general way. But within certain domains, there's a data moat. Within those domains, there are workflows that are very specific to that domain. I think it's a good idea to go deep rather than broad with your application that's powered by AI. So it's one of those two things—either all the way at the bottom of the stack, or all the way at the top where you want to focus on a specific domain.

[00:05:20] Raza Habib: Yeah, I think all of that resonates. You mentioned timelines being critical to making some of these decisions. I'm curious, what are your own timelines? How do you imagine the short-term and long-term future playing out here?

[00:05:32] Russ d'Sa: It's funny. I find myself remembering when I was in college in 2005 or 2006, I read this book, The Singularity Is Near by Ray Kurzweil.

[00:05:47] Raza Habib: I actually met him when I was at... He's still giving that talk, by the way.

[00:05:55] Russ d'Sa: Oh wow, it's nearer now, right?

[00:05:55] Raza Habib: Yes, his latest title is The Singularity Is Nearer.

[00:06:00] Russ d'Sa: Yes, that's right. I remember that too. In the book, he has these S-curves of technological progress. You take all these S-curves together, and it adds up to an exponential curve when you zoom all the way out. When I read it, I started thinking, "Oh wow, the singularity is going to be here in 2030," or maybe it was 2050—I can't remember exactly. But for a long time, I don't know if a lot of people believed that was actually going to pan out. Now, it's interesting to see that accelerating timeline, especially in the context of AI. It's gone through a few winters, and then all of a sudden, it does feel like it's really accelerating. Fast forward to today—we recently raised a round, and in the pitch, we talked a lot about multimodal AI in particular because that's where my company LiveKit tends to focus in terms of the infrastructure we provide. I got asked a lot, "I believe in the future you're painting here, the picture you're painting, I think you're right. I think there's going to be voice and video and computer vision everywhere. But when? I think it's going to be five to ten years from now." I'm like, yeah, it might be, but we're also on an accelerating timeline—maybe it's three to five years. I also tend to find myself sometimes looking back and not realizing that these timelines are accelerating. To put it very concretely, I was anticipating what OpenAI is doing with speech-to-speech. I was thinking to myself, "Okay, I could see someone—OpenAI, Anthropic, or Google—training a model that natively takes in audio and can output audio as well, and it's all in the same embedding space," and all of this stuff. I thought that was going to happen maybe later this year—there would be an initial demo of it, someone would come out with a demo, and then OpenAI does GPT-4o and, like, breaks the internet with that demo. It happened maybe six to eight months before I thought it was going to happen or before I thought someone was going to do it. I was just shocked. I think a lot of other folks were too. It just kind of speaks to this accelerating timeline. So timeline-wise, I think that speech-to-speech, or having these foundational models encompassing multimodality so they can do computer vision, synthesize video, handle audio in and out, and effectively take in any modality and output any modality, and presumably...

[00:08:49] Raza Habib: Within that, some form of action space of interaction as well, being some of the things that come out...

[00:08:55] Russ d'Sa: Totally, that's something else that... We just put something out at the last AGI House Hackathon this past weekend, and I'm happy to talk about that too. But yeah, performing actions, some of these agentic workflows—I think next year, you're going to start to see that stuff happen in a serious way. So it's exciting.

[00:09:11] Raza Habib: It is exciting, and your experience of being constantly surprised mirrors mine as well. I would consider myself a very optimistic person, and yet, when it comes to having been an AI researcher, I've been consistently surprised by how quickly we've blasted through particular milestones. I've tried to update to be more optimistic, and I still keep getting surprised, which—maybe I'm just not learning fast enough—but it definitely does feel like things are accelerating, even now.

[00:09:38] Russ d'Sa: Totally. This understanding of how to actually train these models on sequence data and have them output something that is consistently at a certain level of performance... I think now that the genie is out of the bottle on how to do that, we're starting to see this being applied to all kinds of domains, right? Like, so there's, of course, audio and video, but even just like ACTG coming out of the model—being able to do genetic analysis or things like that. I worked at 23andMe, so I have a little bit of familiarity with some of this stuff. But yeah, it's really going to be incredible—the kinds of domains where this is going to be applied, and this idea of having large amounts of data that follow a sequential format and then being able to output these tokens or sequences... It's a very powerful tool in our chest.

[00:10:32] Raza Habib: Yeah, and that maybe leads us naturally... We've talked a lot about the background here and what we're excited about for the future of AI. Let's maybe turn to LiveKit for a moment. So for people who maybe don't know, what is LiveKit? How is it useful to people who are building AI?

[00:10:49] Russ d'Sa: Yeah, LiveKit is really... We started off as an open-source project in 2021, kind of when it was peak pandemic, and everyone was living on the internet, using their cameras and microphones to connect with one another. There's a protocol called WebRTC, which has been around for a while now. It's a protocol designed for transmitting high-bandwidth data, specifically audio and video, over the internet using a protocol underneath called UDP. So WebRTC is kind of a higher-level protocol, and it's a collection of different standards together. Ultimately, what it amounts to at the bottom of it all is transmitting packets of data over UDP, whereas most of the internet was designed around TCP. If you think of HTTP, it's Hypertext Transfer Protocol. You know, it's designed for transferring text. The internet, for the most part, wasn't designed for transferring audio and video in real time. And so that's where WebRTC really comes from. It's a protocol from Google. And in 2021, if you looked at what was out there, there was no open-source infrastructure for being able to use WebRTC at scale. The thing about WebRTC—just the vanilla protocol that's available in every browser—is it's peer-to-peer. And so when you have three people talking to one another, I'm uploading two copies of my video and audio: one copy to one person, one copy to another person. I add a fourth person there, and now I'm uploading three copies. A fifth, four copies, and most internet connections cannot handle uploading even three copies of 720p video. And so WebRTC, kind of by itself, doesn't really scale. What people end up doing is they end up putting a server in the middle that acts as a WebRTC peer, and everybody just sends their streams to that server, and then the server is the one that routes the streams back out to all the people that need to receive it. What LiveKit does is provide that server and the clients around it, and we wrap it behind a clean but low-level, general-purpose layer that makes it very easy to reason about and wire up as many people as you want together in a real-time session where you can exchange audio and video.
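
To make that "one connection to a server in the middle" model concrete, here's a minimal sketch of the client side using LiveKit's open-source livekit-client SDK. The server URL and token are placeholders, and exact API details can vary by SDK version, so treat this as an illustration rather than production code.

```typescript
// Minimal sketch: joining a LiveKit room from a browser client.
// serverUrl and token are placeholders; in a real app the token is minted
// by your backend. API names follow the open-source livekit-client SDK;
// check the current docs, as details may differ between versions.
import { Room, RoomEvent, RemoteTrack } from 'livekit-client';

async function joinRoom(serverUrl: string, token: string): Promise<Room> {
  const room = new Room();

  // Attach any audio/video tracks published by other participants
  // (humans or an AI agent) as they arrive.
  room.on(RoomEvent.TrackSubscribed, (track: RemoteTrack) => {
    const element = track.attach(); // creates an <audio>/<video> element
    document.body.appendChild(element);
  });

  // One connection to the media server in the middle; the server fans the
  // streams out, so the client never uploads N copies of its media.
  await room.connect(serverUrl, token);

  // Publish the local microphone; capture and Opus encoding are handled
  // by the browser's WebRTC stack.
  await room.localParticipant.setMicrophoneEnabled(true);

  return room;
}
```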

[00:13:39] Raza Habib: Okay, that makes sense. I'm going to ask a ton more questions, specifically about how it works. But before we do that, if I'm someone building an AI product today, why should I care about this? Like, what about LiveKit is relevant to people in the AI space? I have an opinion here—obviously, that's why we're chatting—but I want to hear your version.

[00:13:55] Russ d'Sa: Yeah. So one thing that we knew for sure was that robots in the future would have cameras and microphones on them, and they would need to be able to stream audio and video coming from the robot. But, you know, between here—or, say, a year ago—and that future of humanoid robots walking around, we didn't really foresee—or at least I didn't foresee—this use in multimodal AI. What ended up happening was GPT-2 came out, and it wasn't that impressive. I kind of wrote it off.

[00:14:29] Raza Habib: I was super impressed by it at the time. Maybe it wasn't impressive to you? I don't know. I was working on language modeling at the time, and so for me, it felt like a huge leap. Like GPT-2, for me, was my "Oh shit!" moment already. It was my first...

[00:14:42] Russ d'Sa: I think with your expertise in ML and as an ML researcher, it... Maybe it's just because...

[00:14:47] Raza Habib: I was struggling so hard! Maybe it's because of my lack of expertise as an ML researcher, but certainly the thing I remember most was, you know, the coherence of knowledge and understanding over a long text. They had this example on the GPT-2 website when they launched it. It was a story about researchers in Peru at some university who had discovered a unicorn or something, and there was coherence across three or four paragraphs of text about this clearly fictional and fantastical scenario. The best language model I'd seen before that—we'd get excited when it closed brackets correctly. GPT-2 really did feel like a huge leap.

[00:15:25] Russ d'Sa: That makes sense. And so it took a full year, I guess, and another generation before it got on my radar. When GPT-3 came out around, I think, the end of 2022, I started to play with it a little bit, and I realized, wow, this model can actually respond to me, and it sounds human. What I thought about was, well, okay... And you know, looking back retrospectively, I think OpenAI got into some trouble for it. But, you know, at that moment, I said, "Hey, I could build Samantha from Her."

[00:15:58] Raza Habib: I think you and like a thousand developers all around the world were thinking that.

[00:16:02] Russ d'Sa: Totally. And so you saw a lot of demos of folks kind of trying to make this happen using... and the text-to-speech and speech-to-text models weren't as good back then, so it was a bit slower. But what I saw on Twitter was people doing this: they were recording their audio, and then they were making an HTTP request and uploading their audio. Then they would run it through a speech-to-text transcriber, and then take the text on the server and put it through the LLM, GPT-3. Out would pop the text from GPT-3, and they'd wait until the entire output was done, and then they would convert that into speech or send the text down and then use Google speech synthesis on the device to try to have it talk back to you. And it was just really high latency. It was a slow experience, and it wasn't convincing that you were actually talking to this model in a human-like way. So I said, "What if we use LiveKit and WebRTC underneath? What if we stream all of the audio to the server, and we run WebRTC on the server?" Nobody was doing this at that time. We have a WebRTC client—almost like the one on a phone or a desktop, but running on the server side—that receives the audio, and it's transcribing it in a streaming fashion. Once the user is done speaking, you put it through GPT-3, and the tokens stream out. As the tokens stream out, let's cut it off by sentence—right, limit it by sentence—and as the sentences are coming out, start generating speech on the fly for those sentences and stream them out as soon as they're ready. So we built this flow, and we said, "We'll sacrifice some accuracy in the transcription for speed because what we want to show people is that it's possible, inspire them about the future that is coming, not about making an actual product that is super accurate and that you could deploy somewhere." So we did this, and we posted it, and it didn't do that well. Like, I don't know... I worked at Twitter—or I worked at X—but I don't know how to post on X to get virality or something like that. And so while it didn't get a ton of love on Twitter, or X, I guess I should say, what did happen was, six or seven months later, OpenAI ended up finding that demo and the blog post we wrote, and they were starting to explore an architecture for how to build voice mode. And so it was at that time that they started to play with LiveKit and build a prototype using it. They had some other ones too. They used different technologies. We can get into that. But effectively, in a couple of weeks, they built this prototype of it and decided to... We started a relationship to work together to bring it to market and actually put it out there for people to use.
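
As an illustration of the flow Russ describes, here's a sketch of the sentence-chunking idea in TypeScript. The three injected functions (transcribeUtterance, streamLlmTokens, synthesizeAndPlay) are hypothetical stand-ins for whichever STT, LLM, and TTS services you actually use; the point is simply that speech synthesis starts as soon as the first sentence is complete rather than after the full response.

```typescript
// Illustrative sketch of the pipeline described above: stream transcribed
// user speech into an LLM, cut the token stream at sentence boundaries, and
// start synthesizing speech per sentence instead of waiting for the full
// reply. The injected functions are hypothetical stand-ins, not a real API.

type TokenStream = AsyncIterable<string>;

interface VoicePipelineDeps {
  transcribeUtterance: () => Promise<string>;              // STT: resolves when the user stops speaking
  streamLlmTokens: (prompt: string) => TokenStream;        // LLM: yields tokens as they are generated
  synthesizeAndPlay: (sentence: string) => Promise<void>;  // TTS: speaks one sentence
}

const SENTENCE_END = /[.!?]\s*$/;

export async function runVoiceTurn(deps: VoicePipelineDeps): Promise<void> {
  const userText = await deps.transcribeUtterance();

  let buffer = '';
  for await (const token of deps.streamLlmTokens(userText)) {
    buffer += token;
    // As soon as a sentence is complete, hand it to TTS rather than waiting
    // for the whole LLM response — this is what cuts the perceived latency.
    if (SENTENCE_END.test(buffer)) {
      await deps.synthesizeAndPlay(buffer.trim());
      buffer = '';
    }
  }
  // Flush any trailing partial sentence.
  if (buffer.trim()) {
    await deps.synthesizeAndPlay(buffer.trim());
  }
}
```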

[00:19:08] Raza Habib: I mean, OpenAI is a pretty epic customer to have as an early-stage company. And I guess now it's operating at ridiculous scale for ChatGPT. What has that been like? I know they're also kind of a startup, but they're now a very big startup. What has it been like? How big was your team at this point? And what was it like being attached to that juggernaut?

[00:19:25] Russ d'Sa: The team was about 20 people, 18 or so engineers, so engineering heavy. But it was surreal. It was definitely surreal. I won't say it was the biggest moment of my career—that came later, with GPT-4o—but it was definitely one of the highlights of my career. For sure. I think that the deep realization that I had when we started working with OpenAI on voice mode was, okay, the goal of these foundational model companies is to eventually build AGI. And what is AGI? I mean, AGI is really... We're building a synthetic human being in a lot of ways. It's a computer that can talk to you like a human, perform actions like a human, etc. We're really building ourselves, ultimately—the ultimate tool, I guess. And if that's true, if you believe that, then how will you interface with that human-like computer? If humans interface with one another using eyes, ears, and mouths, you're probably going to interface with that human-like computer—with AGI—using those same inputs and outputs. We have to use a keyboard and mouse today because that's the only way to have kind of a dumber computer understand... Or that's the only way to give it information—we have to adapt ourselves to give it information. But now, in the future, the computer is going to adapt to us. And so in that moment, I realized, okay, they're building the brain. LiveKit is building the nervous system. These are complementary parts of the stack you need to be able to carry the signals to and from that brain. In that moment, I realized that this is what the company is ultimately going to do—this is kind of this huge future and vision where we can play a critical role in what I believe is the next era of computing. Of course, it was a huge moment for me. I think, like, when I first fired up the ChatGPT voice mode demo before it came out, and I used it for the first time, I was just blown away. Of course, the latency was higher than it is now, but it felt incredible. I was driving around with it, having tutoring lessons from a teacher that never judges me. I was doing therapy sessions when I was getting stressed out at work. It was really special.

[00:21:25] Raza Habib: Yeah, and it continues to be. So as you say, the latest models have pushed that even further. People sometimes joke—when we talk about AGI and you're stuck in a Zoom video conference that isn't working—that we're going to solve AGI before we get video and audio streaming to work properly. I don't have a background in this kind of infrastructure. My background is as an AI researcher. For someone who doesn't know this technology from Adam, what about it makes it so hard? What is actually so difficult about getting this to work?

[00:21:53] Russ d'Sa: To build a demo of something I think is quite straightforward. You can even use a WebSocket. So there are a few different protocols that are available to you when you want to build with streaming data. You can stream text with a long-lived HTTP request, and you have that open for, I don't know, 30 seconds, and it can stream data back to you. An alternative version of this, which a lot of folks do actually when they're building these multimodal or voice-based assistants, is they'll use a WebSocket. WebSocket is nice because it's really easy to use. It's almost like a raw TCP socket in some ways. You can send data back and forth. It's full duplex or bidirectional, and it works okay for a demo. But what ends up happening, and the reason why you end up needing something like WebRTC—and really what I mean by that is underneath you need UDP—is that when you get to scale, to the scale of the internet, networks look very different depending on where you are in the world. In the US, on desktop computers, wired connections, Wi-Fi routers—you can probably get away with a WebSocket because there's not a ton of packet loss. The internet connections are quite good and stable. But if you think about the internet like the road system of the world, there are all kinds of different roads and potholes and bridges. And to build something that is robust to all of the different roadways of the world, you need to kind of change how the protocol underneath fundamentally works. For TCP—and WebSocket and HTTP are both built on TCP—there is a fundamental part of the protocol, which is that you have to effectively acknowledge that you received each packet. There are optimizations and stuff that they've done to the protocol to make it a little better, but fundamentally, you have to acknowledge that you received packets. For streaming data, especially real-time streaming data, older packets are not really useful. You always want to be on the edge, consuming the latest packet. And so if you have to acknowledge packets as they come, when you lose a packet, the receiving end is not able to receive any new information until the packet that it missed, or the set of packets that it missed, gets resent. What that leads to is you get choppy information, your video freezes, your audio freezes or stutters. It's not a great user experience. In parts of the world where the internet isn't as good, or there are also other kinds of weird situations... To give you an example, if someone in your house starts to use the microwave, and your computer is reasonably nearby, it will actually increase the amount of packet loss that happens on your network. That's one example. If someone in the same house runs Netflix—which is probably more common—you have issues as well because Netflix is very bursty in how it buffers, and it kind of crowds out other packets. So you end up in all these weird situations that you cannot predict. And so for that reason, TCP doesn't work at scale because baked into the protocol itself is this acknowledgment aspect of it. UDP is an alternative protocol where you do not have to acknowledge the packets. UDP will just keep sending, keep streaming packets to you, and the receiver can make a decision about whether it wants to halt and request a packet again—it's called a NACK, a negative acknowledgment—or if it just wants to keep going and accept a certain amount of packet loss. WebRTC is a protocol that Google built into every browser, and it's built on top of UDP.
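
A toy example of the difference Russ is describing, using Node's built-in dgram module: the UDP receiver reads a sequence number, plays whatever is newest, and drops anything that arrives late instead of stalling for a retransmit the way TCP would. The 4-byte packet format here is invented for the example; real media transport runs RTP inside WebRTC.

```typescript
// Toy illustration of UDP's "keep going, accept some loss" behavior.
import dgram from 'node:dgram';

const PORT = 5004; // arbitrary port for the example

const receiver = dgram.createSocket('udp4');
let highestSeqPlayed = -1;

receiver.on('message', (packet: Buffer) => {
  const seq = packet.readUInt32BE(0); // first 4 bytes: sequence number
  const payload = packet.subarray(4); // rest: (pretend) audio frame

  if (seq <= highestSeqPlayed) {
    // Stale or duplicate frame — for real-time audio it's useless by now,
    // so drop it and keep playing rather than blocking on it.
    return;
  }
  highestSeqPlayed = seq;
  console.log(`playing frame ${seq} (${payload.length} bytes)`);
});

receiver.bind(PORT);

// A matching toy sender: fire-and-forget, no acknowledgments required.
const sender = dgram.createSocket('udp4');
for (let seq = 0; seq < 5; seq++) {
  const packet = Buffer.alloc(4 + 160); // 160 bytes ~ one 20 ms voice frame
  packet.writeUInt32BE(seq, 0);
  sender.send(packet, PORT, '127.0.0.1');
}
```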
Right now, if you want a way to get a UDP socket in every browser—mostly every browser in the world—you need to use WebRTC. That's really the only way to get a UDP socket, period, or access to a form of UDP. But WebRTC adds stuff on top as well that is really useful for sending audio and video. For video in particular, video is very high bandwidth, much higher bandwidth than audio. So when you're sending a lot of video and high-resolution video over a network, you want to be doing things like congestion control. You want to be measuring the network in real time, upgrading and downgrading resolution, maybe altering frame rates. You want to kind of be adaptive to the network conditions, either on the sender side or the receiver side. WebRTC builds that into the protocol. Another really key thing that it does is it has hooks to all these media devices. So if you need to get a handle to a camera or any kind of microphone hardware, WebRTC has that built in. The other thing that's really significant is encoding and decoding. Sometimes, what people will do when they're building multimodal AI that's voice-based, and they're using WebSocket, is they'll just send raw PCM packets over the network. You don't have to do that. You can actually compress that PCM using Opus, which is automatically built into WebRTC, and it will compress that audio in a way that doesn't sacrifice quality and send it over the network—it's much less data, so it's better for the network itself. It scales to more people around the world, who might be on slower networks. Then it gets decoded on the client or the receiving side by WebRTC internals and then played out through your speakers. Ultimately, WebRTC provides a bunch of functionality on top of just UDP that you would have to build yourself. Then there's LiveKit. The tricky part with WebRTC is that it's very complex. You have to have a lot of domain knowledge to be able to use it effectively. And then, on top of that, you have to figure out how you're going to scale it. How do I run thousands and thousands of servers if we're talking about OpenAI scale? OpenAI has users all around the world. How do you have people connect at the edge of the internet backbone that's closest to them, and then transport the information from there? The other reason you want to do that is because when you go over the public internet, you're going over the road system, so to speak, and most packet loss happens between the user and the ISP. So you want to shrink the path where you don't have control, or shrink the path where the network is not as reliable. The private internet backbone is extremely reliable, with very, very little packet loss. So you want to get that user's data to the edge as fast as you possibly can. How do you actually do all of this with vanilla WebRTC? It's very tricky. What LiveKit does is provide an open-source server and clients. So if you're building an application for multimodal AI, you would integrate a LiveKit SDK across any device that you want your users to use, and that SDK is making a connection to a LiveKit server. The LiveKit server is a WebRTC media server. You deploy it somewhere, and everyone connects to that server when a session is started. Let's use an AI example. We can use voice mode. You have an LLM agent that you're having a conversation with that connects to a LiveKit server, and then the user device connects to that LiveKit server, and there's a session between them.
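
For a sense of what the browser gives you out of the box, here's a sketch using the standard WebRTC APIs (getUserMedia and RTCPeerConnection): device capture, Opus encoding, and congestion control are handled for you. What's deliberately left out is the signaling and scaling work described above, which is the part an SDK and a media server have to solve.

```typescript
// Browser-only sketch of the raw WebRTC pieces mentioned above: grab the
// microphone, create a peer connection, and let the browser handle Opus
// encoding and bandwidth adaptation. Signaling (exchanging the offer/answer
// and ICE candidates with the other side) is intentionally omitted.
async function startAudioPublisher(): Promise<RTCPeerConnection> {
  // Media-device hooks are built into WebRTC via getUserMedia.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true },
  });

  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }], // public STUN server
  });

  // Adding the track is all it takes for the browser to capture, encode
  // (typically Opus for audio), and adapt to network conditions.
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream);
  }

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // In a real app you would now send pc.localDescription to the remote peer
  // (or media server) over your signaling channel and apply its answer with
  // pc.setRemoteDescription(...).
  return pc;
}
```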
LiveKit Cloud, which is what OpenAI uses, is a cloud-hosted version of our open-source stack. It works slightly differently from open-source LiveKit. The cloud system is a horizontally scaling mesh network. So what we did was we took our open-source server and built all these hooks into it to allow us to horizontally scale it. What that means is you can put all these servers all around the world. A user that wants to connect, whether it's an AI model or a human user, connects to the closest server to them, usually the geographically closest one.
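
On the server side of such a session, joining typically means minting a short-lived token that names the room and the participant. Below is a sketch using the livekit-server-sdk AccessToken API; the key, secret, room name, and identity are placeholders, and signatures can differ between SDK versions, so check the current docs rather than treating this as copy-paste code.

```typescript
// Sketch of the backend piece: minting a short-lived access token so a user
// (or an AI agent) can join a room on a LiveKit server or LiveKit Cloud.
// The key, secret, room name, and identity are placeholders.
import { AccessToken } from 'livekit-server-sdk';

export async function createJoinToken(
  identity: string,
  roomName: string,
): Promise<string> {
  const token = new AccessToken(
    process.env.LIVEKIT_API_KEY!,
    process.env.LIVEKIT_API_SECRET!,
    { identity },
  );

  // Grant permission to join (and publish/subscribe in) this specific room.
  token.addGrant({ roomJoin: true, room: roomName });

  // In recent SDK versions toJwt() is async; older versions return a string.
  return token.toJwt();
}

// Both the human user and the LLM agent would request tokens for the same
// room name, connect to the nearest server, and end up in one session.
```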

[00:30:57] Raza Habib: And Russ, can I ask a dumb question at this stage? How do you literally decide where to place the servers, or how are you controlling that?

[00:31:05] Russ d'Sa: So the way that we started was very basic. We put a server in every continent—not Antarctica, of course, and technically we don't have one in Africa, but we're about to do that. We put servers in all of these continents. That was the first thing that we did. What we found is that this works okay for some users, but what we also saw—and this is before we did anything with OpenAI—was that when you're running other people's applications at scale, transmitting audio and video, users will complain. They'll tell you, "Hey, I had a bad experience. The video was choppy or it was fuzzy," etc. We'll have our customers' users...

[00:31:49] Raza Habib: Users don't care how hard it is, right? It doesn't matter how crazy complex it is in the background.

[00:31:53] Russ d'Sa: That's the hardest thing about working in this space—real-time is unforgiving. It has to just always work, and the user expectation is that when you tap that button, it's just going to be like you're teleported to wherever that other person or AI model is. It is very unforgiving. We got this feedback from users, or end users of our customers: "This isn't working as well here." I remember when we first started, we didn't have a data center in South America at that time, and we heard a lot of complaints from Brazil. We thought, okay, Brazil to our data center in New York won't be so bad. But it turns out, once we deployed a data center in São Paulo, all of a sudden, all those user complaints started to go away. When we first started, we were a little bit too confident that you didn't need that many data centers. There's definitely a limit. Zoom has something like 17 to 20 data centers. When folks say, "Oh, we have like 100 data centers or POPs around the world," I'm not sure if that's really necessary. Zoom has pretty incredible quality of service all around the world. There's definitely a point of diminishing returns, but having the data center as close as possible to the user is definitely incredibly important. So once we discovered this issue in Brazil, we said, okay, we really need to expand our footprint more and more. Then what we did was we started to build software. We did two things: one, we started to expand our footprint all around the world. The other thing we realized was that different cloud providers have varying degrees of quality depending on where in the world they are. Certain data centers are better quality than others across all the cloud providers. So we built a system that is multi-cloud inherently. We run one overlay network across a whole bunch of different cloud providers, and we have software that measures the network connectivity between them all in real time to figure out the best cloud provider in each region, and the best path to take to connect a user to another participant, whether it's a human or an AI model. We built this over time—this software-defined routing layer that is kind of like a networking brain running in every data center all around the world, measuring the performance and the quality of a session and making decisions on the fly. The other thing that we've started to do now is when we see degraded quality, we look at connectivity times and quality based on where those users are. In certain parts of the world, what we've seen is that, okay, there's degraded quality. Let's test spinning up a data center in another location and see what kind of improvements we get. So we spin up kind of like a light data center in one spot, see what kind of improvements we get, and if there's a significant improvement there, then we actually just add it to the fleet.
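
To illustrate the "measure, score, choose" idea behind that routing layer (and only the idea — this is not LiveKit's actual algorithm), here's a small sketch that picks the lowest-cost path from invented latency and loss measurements.

```typescript
// Toy illustration of software-defined routing: given live measurements of
// latency and packet loss between data centers, score each candidate path
// and pick the best one. All names and numbers here are made up.

interface LinkStats {
  path: string[];        // e.g. ["user-edge-sao-paulo", "backbone-us-east", "agent-edge"]
  rttMs: number;         // measured round-trip time over the candidate path
  packetLossPct: number; // measured packet loss, 0-100
}

function scorePath(link: LinkStats): number {
  // Lower is better: weight loss heavily, because retransmits and audio
  // concealment hurt real-time media more than a few extra milliseconds.
  return link.rttMs + link.packetLossPct * 50;
}

export function pickBestPath(candidates: LinkStats[]): LinkStats | undefined {
  return candidates
    .slice()
    .sort((a, b) => scorePath(a) - scorePath(b))[0];
}

// Example with invented measurements:
const best = pickBestPath([
  { path: ['sao-paulo', 'us-east'], rttMs: 120, packetLossPct: 0.2 },
  { path: ['sao-paulo', 'southamerica-hub', 'us-east'], rttMs: 95, packetLossPct: 0.1 },
]);
console.log(best?.path.join(' -> '));
```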

[00:34:28] Raza Habib: This is not just you on the phone talking to the ISP like, "Hey, we need to open up here, please."

[00:34:32] Russ d'Sa: No, no, this is in some public cloud provider's data center.

[00:34:37] Raza Habib: You're not racking servers somewhere?

[00:34:38] Russ d'Sa: Not yet. I think over time we will get there, but it's a significant CapEx to do that, and you want to have utilization to rationalize making a move like that. One really beneficial part of the way that we've architected the system is because we went multi-cloud out of the gate and we don't depend on any one cloud provider, it allows us to go to a location like, say, Singapore, where we have a lot of utilization. We can put in our own metal in Singapore specifically, and it just becomes part of the fabric because it's all run as one network, and it's agnostic to where that compute is or who's running that compute. It allows us to not have to swap the entire engine out to our own data centers in one shot, but to be able to do this piecemeal and iteratively over time. Eventually, at enough scale, we'll have our own entire global network that is operated by us from the bottom all the way to the top. So that's a goal over time, but we're not there yet.

[00:35:47] Raza Habib: Amazing.

[00:35:48] Russ d'Sa: I would say, in a nutshell, you have to first swap the protocol. You need to make sure that you can send this high-bandwidth data—audio and video—unimpeded over a network. That's the first step. The second step is that you can't scalably send it point-to-point from one user to another, so you have to put a server in the middle and use that as the router—for one, for the fastest path, but also because, if you want to connect more and more users to these sessions, you need a server in the middle since they can't all do point-to-point with one another. So the second step is having a server coordinator that everybody connects to as the router. The third step is that eventually that server is going to run out of capacity—physical resources—so now you need to be able to scale that horizontally and have people connect to separate servers, but have those servers talk to one another. That's the next step of scale, which is the cloud stuff. The fourth part of it is, okay, now that you have a fleet of servers, where are those servers running? Who are they running on? Who are you dependent on? What is the server-to-server connectivity like? Sometimes those networks can change, or you can have data centers go down, so nodes go down in this graph. So you need to have another software layer that is figuring out, okay, I need backups, and I need to route around things, or I need to take certain data centers out of commission or rotate new ones in. Then there's the next step after that. So it ends up being a layered approach, where you layer on the complexity, and that's what makes it a hard problem.

[00:37:21] Raza Habib: Amazing. Can you give me a sense of what some of your favorite use cases have been so far?

[00:37:27] Russ d'Sa: Wow, there's a bunch of them. I talk a lot about my favorite one. There's a company that is putting LLMs in 911 dispatch centers. I think that's just an incredible use case. The way it works is you call 911, and while you're speaking on the phone, there's an LLM that is paired with a dispatch agent. It's enriching the call. It's helping the dispatch agent understand where that user is, showing a map, being able to assess sentiment on the call, and things like that. It helps with routing and dispatch to other service providers out there. One remarkable thing that happens every week with this technology—LiveKit paired with the software this company builds on top of it—is when there's a really significant emergency, for example, someone having a heart attack. The dispatch agent will send a text to your phone with a link. You tap on that link, and you're in a browser streaming video to the dispatch agent. They can do things like, in this heart attack scenario, coach the person on how to administer CPR by watching the video. Someone's life is saved every single week from this exact thing using this software. What's even more exciting is that this is going to be built into iOS 18. So when you install the new version in September, there's going to effectively be a FaceTime button when you call 911. When you tap that FaceTime button, that's going to be streaming video directly to the dispatch agent in addition to your audio. And that's all powered by LiveKit. So that's one really amazing use case.

[00:39:06] Russ d'Sa: There are other ones in the multimodal AI space that I think are pretty interesting as well. There's patient intake—there's a company that's building an agent that is in the hospital, and people are calling in. It has to collect a bunch of information to be able to make an appointment or determine when to tell them to come in. They're deploying this AI across hospitals. There's another one that is really exciting too, in a completely different space, which is NPCs for video games. It's a very natural use case but really hard to get right because, in video games in particular, it's not just a one-on-one conversation with an AI model, but there might be multiple AI NPCs and a human user or player of the game. There might be multiple players and one NPC, there might be multiple NPCs and multiple players. So you have to figure out the conversational dynamics when you use your voice and interact with these AIs. It's just different in a group situation than it is in a one-on-one. It's like, who is looking at whom? Are you having multiple conversations at the same time, but in different pockets within a larger group? There's a lot to get right about that experience that is exciting and kind of like an emerging use case that I think is going to be everywhere, and every game is going to have these dynamic storylines that are based on your interaction with characters in that story, and it's all LLM-powered. So that's another one I'm really excited about.

[00:40:45] Raza Habib: Super cool. So how do you think this is all going to play out? Do you think the current wave of AI is overhyped, or appropriately hyped?

[00:40:55] Russ d'Sa: I think it's underhyped. The reason why I think AI is underhyped is that, for me, in my career—I've done five startups; this is my fifth one—what was so difficult before AI was that I could only really work on one of those ideas at a time. I was thinking about this recently: I can maybe be like my parents and reject technology, say, "Back in my day, this is how it worked, and this is how we used to do things," or I can think about, "How can AI work for me? How can I make AI work for me? How can I leverage the tool?" I think what's really exciting and what you're going to see in the next five years, especially for creative endeavors and folks that are building things, is that AI is going to drastically accelerate how many things you can build and how fast you can build them. If you think about building companies and products as a creative endeavor, just like painting or making music, I think the creative output of humanity is going to explode. And it's creativity that I think is one of the things that we enjoy most about life—making things, seeing other things people have made, and those things get us thinking, leading to other ideas, and that's how humanity ultimately moves forward. So I'm really excited about how AI is going to accelerate that. We're just at the tip of what is possible right now, and I think in five years, the world is going to be somewhat unrecognizable, or the way that we do work is going to be unrecognizable.

[00:42:13] Russ d'Sa: I think it'll be a little bit longer than five years, but I'm also really excited about what's going to happen once you start to embody AI in these humanoid robots that can navigate the world, do certain types of tasks for us. I know Elon Musk talks a lot about an age of abundance. It's a little bit of a weird idea, but I was thinking about this as well. When we do enter into an age of abundance—let's assume that's true—the billionaires, the tech billionaires, and this 1% versus 99%, I fully understand why we have those conversations. I think there's an irony to it, which is if the tech billionaires are working on AI and embodying AI and building robots that can do all of this work, and we enter into an age of abundance where robots do most of the tasks for us, freeing us up to do more creative endeavors, and we can have our dinner... The ingredients for dinner are farmed by robots, and the food is made by robots. What it almost ends up doing is dropping prices of all kinds of things almost to zero, and it starts to diminish the importance of even being a billionaire. If everything is much cheaper and more accessible to more and more people, what...

[00:43:37] Raza Habib: What is money for?

[00:43:38] Russ d'Sa: ...is money for, exactly. So I think that the material wealth part of it diminishes. But I think you're totally right—there's this idea of positional wealth or status, and that part doesn't go away. That's a little bit unfortunate, I think.

[00:43:58] Raza Habib: I think you're right. I think that on the first part about the material wealth, there's definitely an upper limit to materialism. Eventually, there are just no more new things that you can acquire. But the status aspect of humans is unfortunately true, and that part—yeah, that's a great point. If we have more time to think about our positional wealth, maybe that ends up being a negative.

[00:44:22] Russ d'Sa: That's a great point, and you're right.

[00:44:48] Raza Habib: Yeah, as you were talking about this, I was just thinking that it's an election year, and thinking about the current climate. It's easy to pull off from an individual perspective, but very difficult once you scale it out to the world.

[00:45:05] Russ d'Sa: Totally agree.

[00:45:07] Raza Habib: Thank you so much, Russ. I really enjoyed the conversation. I appreciate the questions and am excited to talk again.

[00:45:13] Russ d'Sa: Thank you, Raza. It was great to be here.

About the author

Name: Raza Habib
Role: Cofounder and CEO
Twitter: @RazRazcle
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.