How Paras Jain is building the future of AI video creation
In this week’s episode of the High Agency Podcast, we sat down with Paras Jain to understand his experience working on autonomous vehicles, why he chose academia over an offer from Tesla, and the research-minded approach that has led to AI video generation startup Genmo's rapid success.
Subscribe to Humanloop’s new podcast, High Agency, on YouTube, Spotify, or Apple Podcasts
Chapters
00:00 - Introduction
01:52 - Building data pipelines and scaling video models
07:01 - Working within GPU constraints and transformer architecture
11:18 - Moving from research to startup success
14:36 - Leading the video generation industry
16:05 - Training diffusion models for videos
19:36 - Evaluating AI video generation
24:06 - Scaling laws and data architecture
28:34 - Issues with scaling diffusion models
33:09 - Business use cases for video generation models
36:43 - Potential and limitations of video generation
40:59 - Ethical training of video models
Podcast
[00:01:13] Raza Habib: This is High Agency, the podcast for AI builders. I'm Raza Habib, and I'm delighted today to be joined by Paras Jain, who's the co-founder and CEO of Genmo AI, a company that's aiming to democratize video creation using AI technologies. They have over a million and a half people using their text-to-video models today. And before founding it, Paras did a PhD at Berkeley, where he was focused on machine learning systems. He was one of the founding engineers at DeepScale, an autonomous vehicle or AI for autonomous vehicles company that was acquired by Tesla. So he brings a lot of experience across machine learning research, machine learning industry, and now leading on the frontier of Gen AI. Paras, it's a pleasure to have you here.
[00:01:51] Paras Jain: Wonderful to be here, Raza.
[00:01:52] Raza Habib: First, I wondered if we could start with your background a little bit, the journey from being at DeepScale through PhD and then to Genmo. Can you tell us a little bit about what it was like being at a company that was acquired by Tesla? And are there any lessons from doing machine learning in the autonomous driving space and the scale there that transfer, or is it, you know, what comes across with you?
[00:02:12] Paras Jain: I mean, when I jumped into self-driving, it was such an exciting time. You have to roll back. It's like 2016 or 2017. You know, the story of self-driving as it started with the DARPA Grand Challenges, right, driving through the desert. So let's roll forward about eight to 10 years after that, and the tech was starting to come to market, but it was still really early.
I had always been really passionate about artificial intelligence. I think, for me, what was really interesting about self-driving is the scale that you could get in terms of data acquisition. For a lot of computer vision, you're hand-annotating, literally, the stories of like grad students having their mom or their grandma annotating data as they sit around a dining table. Those are old stories in computer vision, but with self-driving, we could scale up 10-100,000x beyond that. And that was really exciting.
So when I started at DeepScale, we were really big believers in scale, and we would just drive our vehicles all day, collect huge amounts of data, annotate it, and train really high-performing models. And what was really interesting for the first time, is scaling up that data pipeline improved your models, which sounds obvious today, but I think at that time it wasn't quite clear yet. We didn't have that recipe. That was a lesson that was hard-won there. And I think it's something that informs so much of how I think about machine learning at scale today.
[00:03:25] Raza Habib: And any lessons from that that translate across what you're doing now?
[00:03:30] Paras Jain: Yeah, absolutely. I think it's thinking about data pipelines more as an architecture, rather than an end artifact. It's a process that you do every day. You wake up, you collect data, you annotate it, you clean it, and then you use it to train and you do it again the next day, right? That's such an interesting mindset that came from self-driving.
I see a lot of parallels to that with large language models today, where data is ultimately garbage in, garbage out. But I think it's that practice of building data loops really intentionally that's so important. And you see that generalized to self-driving today, and that definitely carries forward to what we do at Genmo.
[00:04:04] Raza Habib: So you went from DeepScale onto doing a PhD at Berkeley. You didn't join Tesla, by the way. How come you chose not to join Tesla? What was the logic?
[00:04:12] Paras Jain: Yeah, it's a really interesting story. So this was again, like 2018 or 2019 I think. So the company was being purchased, and self-driving was there, and I actually had a couple of chances to meet Elon and Andrej and some other teams. So it was really interesting that this Autopilot team was really small at the time of acquisition.
But at the same time, I remember seeing an early talk by Alec Radford and Ilya Sutskever, and I had a brief opportunity to chat with them. I think it was like 2018 or so, and they were talking about scaling laws and some of the work on large language models they were doing, and I just felt like this was tantamount to shaving a minute off the mile time, like it's just groundbreaking. I felt like that was super exciting. And I felt I really had to jump head first into this.
So I came to Berkeley, which, to me, was the epicenter of deep learning at the time with computer vision. And, you know, Trevor Darrell and Caffe and all these interesting projects, Alexei Efros, like, all these heavyweights in the industry of computer vision, were centered there. And I felt like this was the place to go for machine learning and for natural language processing. And so that was compelling enough for me to walk away from an acquisition opportunity at a company like Tesla.
[00:05:18] Raza Habib: Okay, that's exciting. And then, from my understanding, at Berkeley, you worked more in machine learning systems research, rather than on the actual algorithms and models themselves. Do you mind telling me a little about what machine learning systems research is? What are the challenges that you were personally most interested in? Because I assume it's more to do with machine learning in production and at scale versus the algorithms themselves. So what are those challenges?
[00:05:41] Paras Jain: Yeah, I think when coming out of DeepScale, I learned it was the methodology and the systems and the frameworks with which you collected your data and trained your models that were really important, that was actually more important than the architecture itself. And so to me, that pushed me to come to Berkeley with this lens of, how do I actually work at the intersection of AI and systems? I want to train these machine learning algorithms, but at the same time understand how they're scaling computationally and from a data perspective, so that we can scale up the techniques that work well.
And I think through my PhD, models scaled about 10,000x like, from ResNets to modern LLMs. It's very rare to see that in orders of magnitude, and I feel very privileged to have been able to work in this area of machine learning systems at that time.
The key question here in machine learning systems is, from a training perspective, how do you leverage large distributed fleets of computational resources to train these models efficiently? The key interesting tidbit there is, as you throw more GPUs at the problem, sometimes your training gets slower, like it's sublinear scaling, and you're not going to get linear speed-ups. And so it's a waste of money. And so it took a long time for the community to figure out how we can train on hundreds, then 1,000-GPU, and now 10,000-GPU fleets.
[00:06:51] Raza Habib: It sounds like it was worse than sublinear, right? You said you chuck more GPUs at it and actually get slower, not just growing slowly. What makes it actually get slower? Like, what are the things that are actually hard about scaling this up?
[00:07:01] Paras Jain: Yeah. So with large GPU fleets, you have your GPU accelerators that are connected to your hosts, and then you have the East-West network with which GPUs communicate. Deep learning training breaks down like this: you do a forward pass, which means you compute your algorithm. You'll do convolutions or transformers, or whatever architecture you're running. You compute your loss, and then you do your backwards pass, which produces these things called gradients that are basically updates to the model.
And now a bunch of GPUs have a bunch of gradients on them. And there's this funny thing we have to do called all-reduction, or just like this synchronization of these gradients. And so this is the aspect that's really slow when you do distributed training. As you scale up the number of GPUs, you have more and more GPUs you have to communicate between, and this creates a bottleneck as we train large models.
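The gradient synchronization described here can be sketched in a few lines of Python. This is a toy, single-process simulation of what an all-reduce achieves (every worker ends up with the same averaged gradient), not how a real framework or network library implements it. The names `all_reduce_mean` and `sgd_step` and the gradient values are made up for illustration.

```python
from typing import List


def all_reduce_mean(per_worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an all-reduce (mean): every worker gets the same averaged gradient."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    averaged = [
        sum(grads[i] for grads in per_worker_grads) / num_workers
        for i in range(num_params)
    ]
    # Each worker receives its own copy of the averaged result.
    return [list(averaged) for _ in range(num_workers)]


def sgd_step(weights: List[float], grad: List[float], lr: float = 0.1) -> List[float]:
    """Plain gradient descent update applied by one worker."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical gradients computed by 4 workers on different shards of a batch.
local_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]

synced = all_reduce_mean(local_grads)
weights = [0.5, -0.5]
# Because every worker applies the same averaged gradient, the replicas stay identical.
replicas = [sgd_step(weights, g) for g in synced]
```

The point of the sketch is the data dependency: no worker can take its optimizer step until it has heard from every other worker, which is why this step dominates as the fleet grows.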
For a long time, networks were slow. Even within a node, intra-node fabrics were slow. They weren't like NVLink based. They were like PCI Express buses. So they were older technology, and the algorithms for sharding and distributing these models weren't defined yet. So that's what the field of machine learning systems really worked on. The question was, given a model, what is the optimal way for me to break it up across different GPUs, and then how do I go about training the model? You do it by solving this communication problem of reducing the gradients, and that's the part that gets really slow when you scale up.
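To make the "more GPUs can be slower" point concrete, here is a toy cost model, assuming a fixed global batch (strong scaling), a ring all-reduce, and no overlap between compute and communication. All parameter values (`compute_s`, `grad_bytes`, `bandwidth_bytes_per_s`, `latency_s`) are invented for illustration and do not describe any real cluster.

```python
def step_time(
    num_gpus: int,
    compute_s: float = 1.0,               # single-GPU compute time per step (assumed)
    grad_bytes: float = 4e9,              # total gradient size in bytes (assumed)
    bandwidth_bytes_per_s: float = 10e9,  # per-link bandwidth (assumed)
    latency_s: float = 0.005,             # per-message latency (assumed)
) -> float:
    """Toy estimate of seconds per training step for a fixed global batch."""
    compute = compute_s / num_gpus
    if num_gpus == 1:
        return compute  # nothing to synchronize
    # Ring all-reduce: 2 * (n - 1) rounds, each moving grad_bytes / n per GPU.
    rounds = 2 * (num_gpus - 1)
    comm = rounds * (latency_s + (grad_bytes / num_gpus) / bandwidth_bytes_per_s)
    return compute + comm


for n in (1, 2, 4, 8, 16, 64):
    print(f"{n:>3} GPUs: {step_time(n):.3f} s/step")
```

With these made-up numbers the step time improves up to a handful of GPUs and then gets worse, because the compute term shrinks with the GPU count while the latency part of the communication term keeps growing. Faster interconnects and better sharding strategies change the constants, which is the kind of work the field focused on.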
[00:08:15] Raza Habib: Where have we landed? So if people are trying to split their model across multiple GPU training now, people could do it layer-wise, you can do lots of different forms of parallelism, like what works best for different models, like where have we ended up as a community?
[00:08:28] Paras Jain: I think what's really interesting with the transformer becoming this canonical architectural building block for so much of generative AI, is the techniques consolidated. And so what works well in one modality, like large language models, actually transfers really well to the video generation models and video diffusion models we train, surprisingly.
The architectural building blocks there are, you know, there's really interesting work on optimal scaling out of like NVIDIA and Stanford, I think. But essentially, if you have a big model, you try to split it up within your node as much as you can, first, before you spill past your node. Because when you're within a node, you have really high performance. And then for distributed computation, there's techniques like pipeline parallelism that have become canonical for training these big models. That's really communication efficient. I mean, I've seen some work that even does pipeline parallelism across the wide area network, meaning you have GPUs in one continent and GPUs in another continent, and they can communicate using pipeline parallelism. That works really well.
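As a rough illustration of pipeline parallelism, the sketch below simulates a simple forward-only pipeline schedule in which each stage takes one time step per micro-batch, and measures how much of the schedule is idle. The equal stage cost, the forward-only schedule, and the function name `pipeline_idle_fraction` are simplifying assumptions, not a description of any particular training system.

```python
def pipeline_idle_fraction(num_stages: int, num_microbatches: int) -> float:
    """Simulate a simple pipeline schedule and return the fraction of idle slots."""
    total_steps = num_stages + num_microbatches - 1
    busy_slots = 0
    for step in range(total_steps):
        for stage in range(num_stages):
            # Stage s works on micro-batch m at time step s + m.
            microbatch = step - stage
            if 0 <= microbatch < num_microbatches:
                busy_slots += 1
    total_slots = total_steps * num_stages
    return 1.0 - busy_slots / total_slots


for m in (1, 4, 32):
    print(f"4 stages, {m:>2} micro-batches: idle fraction {pipeline_idle_fraction(4, m):.3f}")
```

Under these assumptions the idle fraction works out to (stages - 1) / (micro-batches + stages - 1), so splitting a batch into more micro-batches keeps the stages busier. Each stage only passes activations to its neighbor, which is why the approach tolerates slower links.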
And then I think what's really remarkable is the software tool chain is really good. So there are technologies like fully sharded data parallel that just work out of the box for training big models at scale. It took us about five, six years, though, to build these architectural building blocks. And so I think consolidation on transformer was a big driver for how easy it is to scale distributed deep learning now.
[00:09:37] Raza Habib: Is there a risk that that sort of locks us into the architectures we have today in that if we're optimizing hardware, the way we parallelize things, all of these different stages around the transformer architecture, to what extent does that make it hard to discover the next thing?
[00:09:52] Paras Jain: It absolutely makes it hard. If you're a PhD student sitting there with a few GPUs, I mean, you're going to do what's efficient on that. So I think there's a really interesting paper, I believe it's from Sara Hooker, I think, I'm not sure, on the hardware lottery. It basically talks about this concept of GPU infrastructure and distributed training, what's efficient from an ML systems perspective, constraining the models at the same time.
I do think so much innovation happens in resource-constrained environments. When I talk to PhD students today, and they're demotivated by not having like, 1000 GPU clusters, you know, I always tell them, I think what's interesting is, papers like the lottery ticket hypothesis wouldn't have happened if Jonathan Frankle tried to run it at large scale from day one, right? Like he had to start at small scale to discover the insights which the bigger companies missed. And so I'm really encouraged by new architectures, like, I know people are excited about state space models, for example, as a successor to transformer, but yeah, it is a tough pill to swallow here that transformers just run so well on distributed GPUs.
[00:10:53] Raza Habib: I mean, I think it's definitely the case that academia continues to produce surprisingly good outputs and results despite the resource constraint, right? Something like Flash Attention you would have thought would have come out of one of the big labs, but actually came out of academia first. I mean diffusion models themselves came out of academia first. So I don't think PhD students should despair just yet. I think actually small groups can still do really impressive research, just maybe not on the scaling side.
Okay, so that's a lot of background. So you did DeepScale, you did your PhD. Now you're the CEO of Genmo. You already have over a million users, despite being, I think, less than two years old. Is that right? Tell me a little bit about Genmo. What's the story of how you guys got started, and where you're at now?
[00:11:34] Paras Jain: So at Genmo, we're trying to build some of the state-of-the-art foundation models for video generation. We believe that the best way to understand the world deeply is to work to recreate it, ultimately, with models. And so we are building these general-purpose video generation systems. So today, they take text as input or image as input, and they can generate high-fidelity cinematic videos as output.
We started the company just over about a year and a half ago, and so my co-founder, AJ, and I were both at Berkeley at the time, and we were really excited about some of the potential for some of our research that we had been working on. I mean, I had been working on LLM systems and AJ, my co-founder, had worked on a lot of the early work in diffusion models. And this was, for the first time, technology that was pretty widely being taken up by the industry. I mean, people were really excited about it, and we felt very strongly it was important to train the models to the best of our capability, but also deploy them to real people, because otherwise, I mean, what is the point of the research and what's the impact of what we're doing?
The best way for us to do that was ultimately a startup, and so we started Genmo, and I think the founding story is really interesting. I mean, we were actually in an AI hacker house that we had started in Oakland, in Temescal, with five PhD students in a converted shoe factory. We literally turned that into the hacker house there. So our dinners were always wonderful, and we were sitting around, and it was a really creative space. It was covered in murals. It was a really, really cool space.
And from there, AJ and I launched our first AI video model in January 2023, and so that was an image-to-video model that could take any image and generate kind of an output that was video-like, but different than today's video models. It was more kind of trippy. It was a little bit more psychedelic and interesting, like we really leaned into what made AI video AI here, which is hallucinations, and that grew like wildfire.
I mean, we were sitting there, just bootstrapping the company. And just from that, I remember one day like in May, we were acquiring like 80,000 users a day for some time period. And this was 80,000 users a day, yeah. Completely organic.
[00:13:36] Raza Habib: How are people finding you, yeah? So other than podcast appearances, yeah?
[00:13:41] Paras Jain: I think video markets itself, right? Good video. People want to just share. They take it to TikTok, they take it to YouTube shorts, they take it to Instagram reels, and they just want to share it. And they say, you know, I found this cool AI product that let me make this video. We also, if you don't pay us, we put a little watermark on there and that's like, the minimum fee we kind of have to charge for that premium usage, but that creates a huge amount of word of mouth. And so what's really powerful is when we launched the product, we originally just DM'd like 100 different creators who we found were doing interesting things with AI, and we told them, "This is a tool that we just launched." And from there, the word of mouth loop just started, where they started building things with our product, sharing it on social media, and other people began to come in as they discovered this capability. So we hit, actually, I think, 25 countries within two weeks of our launch, completely organically. And so it just spread like wildfire. And that's not just in the US. Like I said, it just went global. So I just share that, because I think it's so powerful, what the potential for video is.
[00:14:36] Raza Habib: So you are one of the first of the video generation companies to be out there. Since then, there's been a few more. Sora has got its video product. There's Runway. I'm sure Midjourney will eventually release something. There's Pika, et cetera. Are you guys trying to specialize in some way particularly, or what makes Genmo special? What's your USP or secret sauce?
[00:14:57] Paras Jain: Our team is almost entirely research today. We don't have product. We don't have a lot of things, and it certainly might improve the product experience. It might improve things. But I want to be relentlessly focused on one thing, and that's quality, the quality of our video, and the quality, specifically, of our motion in our video. I believe fundamentally, what makes video, video, and not just an image or a live photo, is the motion. And so I spent a lot of time working with our team to understand things like solid-body interactions, like how do liquids move and how do fabrics behave, right? Trying to understand from first principles, how are our models acting as world models, or physics simulators, these kind of physics engines of the world.
[00:15:33] Raza Habib: So can you tell me more about that? Actually, that's something I find fascinating, and it was one of the things I was looking forward to asking about, actually, is like someone who's working on these models directly. Okay? The question I want to ask is the extent to which these models have consistent world models, and how you can test that. Before we go into that, it would probably be useful for the audience to understand like, what is the model we're talking about. So can you maybe tell us a little bit about what the architecture is like, what is the style of model and how it works, and then we can try and dig into how you understand it and improve it?
[00:16:01] Paras Jain: At a high level, what we train at Genmo are basically diffusion models. And so diffusion models follow this paradigm of generative models, where, I mean, they're trained with your standard, like, log likelihood objective. What that means, essentially, is it's just trying to compress the world into its latent space, right? It's just trying to produce a pure model of the full world. And it's a really beautiful loss function, in that it's just learning to recreate and mimic the whole world and compress it into what is really a model that's 1,000 times smaller than its training data set.
[00:16:31] Raza Habib: Maybe just a little explainer there, because I feel like people say this a lot, and I think people hear it and they think about it colloquially. You know, this equivalence between log likelihood and compression, just very briefly, I'd love to just give an opportunity to unpack that a little bit. So when we say log likelihood loss, what we're saying is that the model, the neural network, parameterizes some probability distribution, and the log likelihood is the probability that that parameterized distribution puts on the data, and then we optimize the parameters to make that as big as possible. The reason why that's equivalent to compression is you can actually make a practical lossless compressor out of that model. So given a probabilistic model of some data, using an algorithm like arithmetic coding, you can create a compression algorithm, and the compression rate you get is exactly equal to the log likelihood. So it is literally the same objective. What you're optimizing explicitly is the compression rate of your model. And so when people say compression and log likelihood training, or that what an LLM is doing is just compression, it's not just a turn of phrase or something colloquial. It's literally true. It is what the model is being trained to do.
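A small worked example may help here. The sketch below fits a character-level unigram model (a deliberately simple stand-in for a neural network) and computes the negative log likelihood of a string in bits, which is the ideal code length for that string under the model. The model choice, the example string, and the function names are assumptions for illustration. In practice an arithmetic coder gets within a couple of bits of this total for the whole sequence.

```python
import math
from collections import Counter
from typing import Dict


def fit_unigram(text: str) -> Dict[str, float]:
    """Maximum-likelihood character probabilities (a very simple probabilistic model)."""
    counts = Counter(text)
    total = len(text)
    return {ch: count / total for ch, count in counts.items()}


def negative_log_likelihood_bits(text: str, model: Dict[str, float]) -> float:
    """Ideal code length of `text` in bits under `model`: the sum of -log2 p(char)."""
    return -sum(math.log2(model[ch]) for ch in text)


data = "abracadabra"
model = fit_unigram(data)

ideal_bits = negative_log_likelihood_bits(data, model)
# Baseline: a fixed-rate code that treats all distinct characters as equally likely.
uniform_bits = len(data) * math.log2(len(set(data)))

print(f"model code length:   {ideal_bits:.2f} bits")
print(f"uniform code length: {uniform_bits:.2f} bits")
```

A model that assigns higher probability to the data yields a shorter code, so raising the log likelihood and shrinking the compressed size are the same optimization.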
[00:17:38] Paras Jain: That is a phenomenal explanation, by the way. Yeah, it's a really beautiful objective this way, right? I mean, in some sense, I think about our diffusion models as world models, modeling world physics, and implicitly, through this compression, like, the parameters are 1000 times smaller than the world around them and the data around them, and so they must learn fundamental properties of the world, like you might imagine Newton's third law and so on, right? Like it has to learn all of this in order to have any shot at compressing information at that density that we're asking it to do.
[00:18:09] Raza Habib: And Paras, my understanding of diffusion models, I know the video generation space less well, but my understanding of diffusion models for imaging, right, is you have this process of essentially, like, iteratively denoising an image with a trained sort of denoiser that you apply again and again. How does that work for video? Like, how do you translate that to get a temporal sequence? Are you generating frames one at a time and somehow getting consistency? Or, yeah. How does it work?
[00:18:33] Paras Jain: Yeah. So with diffusion models for video today, they generate all frames at once in one kind of pass. So what's really interesting...
[00:18:40] Raza Habib: Okay, wait, so you're generating, you're generating all the frames of the sequence simultaneously, all in one forward, like pass of this thing, correct?
[00:18:49] Paras Jain: Yeah. So it's all the frames, over multiple iterations. It's like you have this tensor that's like height by width by time, and that's the whole video. And it's really noisy, and you iteratively denoise that until you get your rendered video, right? That's the key idea. You can actually see this in practice, if you go to our website, genmo.ai and make a video, we provide instant, real-time streaming previews of your video as it's generating from the GPU that we're generating it on. And the preview is not an image, it's not a frame, it's a full video. It's just like a blurry video that's still decoding. So you can actually see the process through which the model is actually denoising to produce your output video. And that's what I think is really interesting. It's a whole video. It's concurrently decoding the whole thing. It's not like ChatGPT, where you see it autoregressively decoding one token at a time. It's the whole video.
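The shape of that loop can be sketched in plain Python. This is only an illustration of "start from noise over the whole clip and refine every frame at every iteration". The denoiser here is a hypothetical stand-in that always predicts a fixed made-up clip, the blending rule is a simplification rather than a real diffusion sampler, and the tensor is laid out as time by height by width. None of it reflects Genmo's actual model.

```python
import random


def target_video(frames, height, width):
    """Made-up 'clean' clip: a bright vertical bar that moves right one pixel per frame."""
    return [[[1.0 if x == t % width else 0.0 for x in range(width)]
             for _ in range(height)]
            for t in range(frames)]


def toy_denoiser(noisy_video):
    """Hypothetical stand-in for a trained network that predicts the clean video.
    A real model would condition on the noisy input; this toy ignores it."""
    frames, height, width = len(noisy_video), len(noisy_video[0]), len(noisy_video[0][0])
    return target_video(frames, height, width)


def mean_abs_error(video_a, video_b):
    diffs = [abs(a - b)
             for frame_a, frame_b in zip(video_a, video_b)
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def sample_video(frames=4, height=2, width=4, steps=10, seed=0):
    rng = random.Random(seed)
    # Start from pure noise over the whole time x height x width tensor.
    video = [[[rng.gauss(0.0, 1.0) for _ in range(width)]
              for _ in range(height)]
             for _ in range(frames)]
    previews = []
    for step in range(steps):
        predicted_clean = toy_denoiser(video)
        blend = 1.0 / (steps - step)  # reaches 1.0 on the final step
        # Every frame is updated at every iteration; nothing is generated frame by frame.
        video = [[[v + blend * (p - v) for v, p in zip(row, p_row)]
                  for row, p_row in zip(frame, p_frame)]
                 for frame, p_frame in zip(video, predicted_clean)]
        previews.append(video)  # each preview is a full (still noisy) video
    return video, previews


final_video, previews = sample_video()
errors = [mean_abs_error(p, target_video(4, 2, 4)) for p in previews]
```

Each entry in `previews` has all four frames, which mirrors the streaming preview described above: a complete clip that gets progressively cleaner, in contrast to a sequence that grows one frame at a time.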
Now there's two general approaches to which people have trained these models, and I'll outline kind of the trade-offs there with these different approaches, because I think it's actually really interesting. In the traditional camp, which is how many people train these video models, they would first train an image model which could generate a single image, and so you spent a bunch of time doing that, and then the idea was, okay, I want this image to begin to move. I want to make some motion. So people began to adapt these image models like Stable Diffusion in order to inject some motion, and the way they would do that is kind of fine-tuning. They would do this thing called expansion. They would take the model and they would blow it up to many frames, and then they would try to decode it. This has a huge computational benefit, because you save a ton of resources. You don't have to train in video space for that long, which is very computationally intensive. But the trade-off is, the motion you learn is not that good, like you might imagine. You start with image, you try to learn a video representation. It's not the right representation, ultimately, for high-quality motion.
A simple example here is the prior with which you find images in the world, meaning, on the internet, doesn't match the prior with which you find frames from videos. A simple example is like if you're taking pictures of portraits, like a photo of me or you here, we're probably going to be looking head-on at the camera, or maybe it's a side profile, but you're not getting all of those interpolates in between, right? And so image models suffer this collapse where they represent people looking straight at the camera really well and looking to the side really well, but in between, it's not that good. And so you try to train a video model on this, and you'll see the people's faces snap. They kind of fade out here. And then they'll look straight ahead.
[00:21:05] Raza Habib: So you guys are training diffusion models on like, entire sequences. And so it's, it's still fundamentally the same style of model of this iterative denoising with the log likelihood objective. Now I would love to circle back to that question that I wanted to ask earlier, which is, how do you guys interrogate the extent to which the model has like a consistent, let's say it's a scene. It has to get temporal coherence. Is there some kind of consistent latent representation within the model of the objects, say, and where their positions are, or something like this? Have you been able to get that level of interpretability? How do you even begin to think about it?
[00:21:38] Paras Jain: Evaluating diffusion models is a really big open area right now, to be honest. So many of our users judge the model's output by its visual fidelity, meaning, how detailed is it? How much resolution does it have? Is it cinematic? How aesthetic is it? And these are kind of like surrogate metrics, like as I described, they don't actually measure, is this a faithful representation of physics and how the world behaves? Like that is a really interesting, deep question that we're really interested to study. I mean, ultimately, we've been working to build evaluations for this, but it's still an imperfect science. Like I said, it's literally...
[00:22:12] Raza Habib: How do you do it at the moment? Is it just annotation? Is it just end user feedback? Like, what's the current process?
[00:22:17] Paras Jain: Yeah. I mean, so we have a product in production, more than one and a half million users, and they give us feedback, which we can then, you know, review on a case-by-case basis and understand, maybe, where did the model fail? How do we mitigate that? How do we improve that? That's actually really useful to understand in the wild prompt distribution people have. We also construct our own evaluations in-house to benchmark this capability. And so those are like tests that measure specific interactions, right? So, like a bowling ball colliding with a bowling pin, right?
[00:22:43] Raza Habib: And then, how do you measure that? Like you're getting human annotation in or you have some automated...
[00:22:47] Paras Jain: Human-driven today is where that is today. You know, I'm really interested to study and in general, for our company we've been investing really heavily, to understand, how can we leverage synthetic data environments more heavily, both from a training perspective, but also for evaluations, to more quantitatively benchmark. How are we modeling these kind of, you know, world models effectively, right? Like, are we actually faithfully recreating some phenomena where you really do need some kind of, like, gold standard model that you can index against and measure?
To be clear, I think at some point the video generation models will supersede our ability to simulate conventional physics with these, you know, traditional rendering pipelines. There's a ton of shortcuts that are taken in order to make this stuff work. For example, shadows are completely inaccurate as rendered in traditional 3D engines. But today, where the models are, I think this will be a really effective way to begin to understand in a quantitative way, like a metric-driven approach, how well can we recreate this? But I mean, I think this is an area where I would love to see academic work and standardized benchmarks, honestly.
[00:23:43] Raza Habib: And just from playing with them, like, do you have an intuitive sense for how good that world model is? Like, you spend a lot of time with these models, you know, we've been on a journey where the early versions of these, you know, there's got that very famous meme of the Will Smith video eating spaghetti, and there's no consistency to the physics. And then you've got, you know, more recent, very high fidelity images, but we only see, we tend to see the cherry-picked examples, right? People share the ones that have worked well. They rarely share the ones that are like, absolutely terrible, or maybe unless they're really terrible, then they get shared. You play with them day in, day out. What sense do you have for do the models have a consistent internal representation?
[00:24:21] Paras Jain: I think we're still quite a ways off, honestly, because one of the capabilities that's still very tough is, first, just long video synthesis. How do you make 60-second plus long videos? But number two, how do you have consistency across long temporal horizons? So I think when you look at cherry-picks that are like five seconds long, I mean, that task is not so bad, to be honest. I mean, it's a lot easier, but if you look at that on a long-range time scale, that's where it gets a lot harder.
So, I mean, for example, one metric to look at is like environment consistency. Like you might have a 3D environment, which you're like panning through. And it's just like you might have in SLAM, where you want to say, this is like cycle consistent, or something, where you end up back at the source and the cycle closes. The thing with these video models is they're not actually consistent yet, so you'll have, like, a big error if you tell it to make, like, a U-turn, for example, or something, right?
[00:25:07] Raza Habib: Okay, so if I was to sort of pan in a circle around the room, then when I get back to the start, it's just going to generate something completely new.
[00:25:14] Paras Jain: It'll have some similarity. But the question is, how close of a match is that, right? And, like, how do you maintain those long-range dynamics, I think is a really interesting question. I think it connects to the model's ability to do like, long context reasoning. Ultimately, how has it been trained for these kind of causal relationships over long time horizons? To me, that's one of the better ways of measuring, like, intrinsically, how good a model actually is at reasoning about the world. Fundamentally, like, some amount of it might be compression, some amount of it might be, like, memorization. How much of it is, like, truly from scratch, its inherent ability to actually reason about these things? And I think, like, this kind of test is actually a really good measure for how good these models are. And the fact that there is this error that exists means that the models still have some distance to close.
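One crude way to put a number on this kind of loop-closure test is to compare the first and last frames of a clip whose camera path is supposed to return to its starting point. The sketch below uses per-pixel mean squared error on tiny made-up grayscale frames. The metric choice and the function names are assumptions for illustration, not an evaluation that Genmo describes using, and a practical version would likely compare learned features instead of raw pixels.

```python
def mean_squared_error(frame_a, frame_b):
    """Per-pixel mean squared error between two equally sized grayscale frames."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def loop_closure_error(video):
    """Compare the first and last frames of a clip whose camera returns to its start."""
    return mean_squared_error(video[0], video[-1])


# Tiny made-up 2x2 frames standing in for generated video frames.
start = [[0.0, 1.0], [1.0, 0.0]]
middle = [[1.0, 1.0], [0.0, 0.0]]
drifted = [[0.0, 1.0], [0.0, 0.0]]  # the scene changed by the time the camera came back

consistent_clip = [start, middle, start]
drifting_clip = [start, middle, drifted]

print(loop_closure_error(consistent_clip))
print(loop_closure_error(drifting_clip))
```

A score of zero means the scene was reproduced exactly on return, and larger values indicate more drift.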
[00:25:55] Raza Habib: So what are you actually doing to improve the models day to day? You mentioned, you know, looking at this feedback, and you maybe identify an issue, then you try to fix it. What does that look like in practice? How do you improve the model in a specific way?
[00:26:05] Paras Jain: The thing we spend a ton of time on is the base capabilities of the model. So that comes from pre-training, ultimately, and that means spending more effort to improve the intrinsic reasoning capability of these models. That turns out to be really hard. You're really beholden to scaling laws there. The way we think about it is data, architecture, and scaling. And we've been working hard, from a systems perspective, to get what I'd like to call steeper scaling laws: for each unit of compute we deploy, getting more bang for our buck in the model's capabilities. You can think of that as architectures and data. That's where I talk about synthetic data. I think that will have a lot of impact here from a scaling laws perspective.
[00:26:42] Raza Habib: What's the ratio of time you as a small research team spend, say, on data curation versus doing fundamental research into the methods, versus just trying to scale the systems?
[00:26:55] Paras Jain: Honestly, I would say a solid 60-80% of this effort goes into scaling. So many things break at scale. You know, we just brought up a new training supercomputer, and the GPU failure rates are high enough that you just have to deal with them intrinsically at, like, 1,000-GPU scale. You have to spend a ton of effort on this, because the challenge with training big foundation models is about getting 1,000 things right, rather than one thing. And if any single one of those 1,000 things goes wrong, your model is, like, FUBAR; it doesn't work. And so that's really the challenge of training these models.
I think the strange thing is, I fundamentally feel pre-training is where reasoning emerges in models. No amount of fine-tuning or RLHF will improve that. I think it makes models considerably more usable, considerably safer, and can improve certain capabilities. But when I look at things like physics and stuff like that, it just comes from observing a huge amount of the world's interactions, right? Like objects interacting with each other and mining those interactions at scale.
[00:27:53] Raza Habib: What is the scale we're talking about here? I think there's like a conventional intuition that image and video models don't need as much scale as language models to get good results. Is that conventional wisdom still true? And how big are we getting with these models now? What is the scale?
[00:28:07] Paras Jain: The models are starting to scale pretty quickly in parameter counts, which is really interesting. I think video is unlocking something that traditional image generation was bottlenecked by. It might just be that motion and these kind of physics and stuff like I'm describing are just so much more complex than the image generation setting that they do need that additional scale.
You know, I think the funny thing about scaling laws that people forget is that scaling laws, first and foremost, are actually about data more so than parameters. People get indexed to parameter counts very often, but it's really about dataset scale. And so I think a big question in diffusion, for a long time, has been finding architectures that are sponges, ones that want to just absorb dynamics of the world and absorb more information continuously, and just continue to benefit from scale. I think the emergence of the diffusion transformer was an interesting watershed moment where it really unlocks scalability in these models in a deep way.
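For reference, one parametric form commonly used in the language-model scaling-law literature writes the loss as a function of both parameter count and dataset size. It is not stated in this conversation and is shown only to make the "data more so than parameters" point concrete; the symbols $E$, $A$, $B$, $\alpha$, and $\beta$ are empirically fitted constants, and whether the same form holds for video diffusion models is an open assumption.

```latex
% N = number of model parameters, D = number of training samples (or tokens)
% E = irreducible loss; A, B, \alpha, \beta = fitted constants
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under this form, growing $N$ alone leaves the $B / D^{\beta}$ term untouched, so the dataset term eventually dominates. A "steeper" scaling law would correspond to a larger $\alpha$ or $\beta$, meaning the loss falls faster for each additional unit of parameters or data.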
[00:28:57] Raza Habib: Can you explain a little bit about why?
[00:29:00] Paras Jain: Yeah, I think it's half-half. The transformer is a pretty inductive-bias-free representation compared to, say, convolutional neural networks or other architectures. Because of this n-squared attention and how it's built, transformers generally allow a lot of different patterns of reasoning to emerge in the model. And so it's really powerful. So it benefits from scale naturally. At scale, they can mine long-range correlations that other architectures just couldn't. That's one aspect: architecturally, from an inductive bias perspective, they are more sponge-like, in the sense that they are inherently more capable of absorbing this information.
I think the second one is this hardware lottery that we talked about earlier. They just scale better in big clusters, and so you deal with fewer fires to fight as you scale them. We can benefit from the hard work of big companies, Meta and others who are investing heavily in scaling, and translate that to video generation in a really interesting way. We have our own unique challenges. Generally, it looks like, how do you deal with very large scale, kind of petabyte-scale datasets, and deal with data ingestion and data pipelines of that scale? That's actually a unique challenge you haven't had to deal with in large language models, and it's really difficult.
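The "n-squared attention" mentioned above can be sketched in a few lines. This is a deliberately stripped-down illustration, assuming a single head with no learned projections (queries, keys, and values are all the raw input vectors); real transformer layers add learned weight matrices, multiple heads, and batching.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> Tuple[List[List[float]], List[Vector]]:
    """Scaled dot-product self-attention without learned projections.

    Returns the n x n attention weights and the n output vectors.
    """
    n = len(tokens)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    weights = []
    for query in tokens:
        # Every position is scored against every other position: n * n scores.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))

    outputs = [
        [sum(weights[i][j] * tokens[j][c] for j in range(n)) for c in range(dim)]
        for i in range(n)
    ]
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
print(len(weights), "x", len(weights[0]), "attention weights")
```

Nothing in the structure restricts which positions can interact, which is the sense in which the architecture carries little inductive bias; the cost is that the weight matrix grows with the square of the sequence length.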
[00:30:04] Raza Habib: Can you maybe just run me through, like, what are some of the biggest challenges? If you do, you know, let's say someone's like, "I think I could do this. I'm going to go and try and scale up some of these video models." What are the challenges they're going to hit? What are the ones that actually make this really hard in practice as you're scaling it?
[00:30:18] Paras Jain: So I think, like I said with that framework, you have data, you have architecture, and you have scaling. From a data perspective, just dealing with these bigger datasets is really hard. These models have to consume data at such a high rate that you end up needing really esoteric storage systems during training, for example, in order to sustain the I/O bandwidth needed at scale. And so there's a lot of interesting work in data pipelines, just from a strict distributed systems perspective, to manage this, and then, by the way, also in serving these things. That's actually really difficult.
We've had to build our own in-house CDN and kind of edge accelerator serving architecture in order to provide low-latency serving. That's actually one of the things we're really good at. Within 10 seconds when you use our product, we get pixels to you off the GPU. And so that's a first in this market, and that too, as an architecture from a distributed systems perspective, is also very complex.
I think architecture is really interesting. I actually think, unlike large language models, there's no consensus yet on the playbook for training big diffusion models. It's just not clear yet. I mean, even one simple angle here is, how important are LYDD priors versus just from-scratch, large diffusion models? You know, actually, Ajay worked on some of the first image... (Raza Habib: Ajay is your co-founder?) My co-founder, Ajay, yeah. So again, he did his PhD at Berkeley, working on DDPM and some of the early work in diffusion. He did image generation. He also worked on 3D generation with DreamFusion and Dream Fields.
And, you know, what's interesting is how much video generation begins to hybridize these different architectures in really interesting ways. That leads you to more spaghetti-like architectures, rather than perfectly clear form, fit stack transformer models like we deal with in language models. So that also is really challenging. I would say there's a lot of just fundamental science that has to go on to begin to understand what architectures we should use.
And then the scaling: dealing with big GPU topologies is very fault-prone. I would say, at scale, you're dealing with things like seeing a GPU fail every, let's say, two or three hours somewhere in a large training job. And so you need fault-tolerant and elastic training so that you can just roll with it and continue to train as failures happen. It's just the law of large numbers: GPUs fail with a certain probability, and at large scales that adds up.
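The law-of-large-numbers point can be put into rough arithmetic. This sketch assumes GPUs fail independently at a constant rate, and the per-GPU figure of 2,500 hours between failures is a hypothetical number chosen only so the result lands in the "two or three hours" range mentioned above; it is not a measured value from the conversation.

```python
import math


def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between failures anywhere in the cluster, assuming
    independent failures at a constant per-GPU rate."""
    return per_gpu_mtbf_hours / num_gpus


def prob_failure_within(hours: float, per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Probability that at least one GPU fails within the given window."""
    cluster_rate = num_gpus / per_gpu_mtbf_hours  # failures per hour, whole cluster
    return 1.0 - math.exp(-cluster_rate * hours)


PER_GPU_MTBF_HOURS = 2500.0  # hypothetical, for illustration only
NUM_GPUS = 1000

print(cluster_mtbf_hours(PER_GPU_MTBF_HOURS, NUM_GPUS))  # 2.5
print(prob_failure_within(24.0, PER_GPU_MTBF_HOURS, NUM_GPUS))
```

Under these assumptions a single GPU that fails roughly once every hundred days still produces a failure somewhere in a 1,000-GPU job every 2.5 hours, and a day-long run is almost certain to see one, which is why checkpointing and elastic restart become requirements rather than conveniences.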
[00:32:28] Raza Habib: May I ask, actually, a very naive question that occurred to me earlier when we were talking about this, which is, you mentioned needing to synchronize these gradient updates across all of these different GPUs. And the question that just jumped into my head is, why? Why can't we just do async updates? I mean, SGD is inherently a stochastic algorithm. Why are we trying to synchronize these at all?
[00:32:46] Paras Jain: I think it's really interesting. You buy a huge amount of systems efficiency from async updates. There's a paper I love called PipeDream, for example, which studies pipeline parallelism in a world of asynchronous stochastic gradient descent. And just as background, the world I talk about, which is where most people live in large language models today, for example, is one where you kind of stop the world until you synchronize all the gradients, and then you consider the next data batch.
[00:33:09] Raza Habib: Which, if you think about, like, how, what SGD is as an algorithm, like, that's a very strange thing to do, right? Like, it's a stochastic gradient update anyway, we're forming a mini-batch estimate to this gradient, to an extent the noise is helpful. So yeah, it surprises me that this is where we've ended up.
[00:33:24] Paras Jain: So one thing I can say for diffusion models, for example, is that the loss is so noisy, inherently, from the denoising objective that we're training this thing with. Because it's just so noisy, approximations end up becoming really numerically challenging, I would say. There are benefits to large-batch training for large diffusion models. And so there's inherent noise in the training objective that makes these diffusion models actually uniquely more challenging to scale, in some ways, than large language models.
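The noisiness of the denoising objective, and why larger batches help, can be illustrated with a toy one-dimensional version of a noise-prediction loss. Everything here is an assumption made for illustration: the data and noise are both standard normal scalars, the noise level `alpha_bar` is drawn uniformly, and `toy_eps_prediction` is a hypothetical stand-in for a trained network (the best linear guess of the noise under those assumptions), not any real model.

```python
import math
import random
import statistics


def toy_eps_prediction(x_t: float, alpha_bar: float) -> float:
    """Stand-in for a trained network: the best linear guess of the noise
    given x_t, valid when both the data and the noise are N(0, 1)."""
    return math.sqrt(1.0 - alpha_bar) * x_t


def batch_loss(batch_size: int, rng: random.Random) -> float:
    """Mean squared noise-prediction error over one random mini-batch."""
    total = 0.0
    for _ in range(batch_size):
        x0 = rng.gauss(0.0, 1.0)             # toy "data" sample
        eps = rng.gauss(0.0, 1.0)            # noise the model must predict
        alpha_bar = rng.uniform(0.01, 0.99)  # random noise level (timestep)
        x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
        total += (eps - toy_eps_prediction(x_t, alpha_bar)) ** 2
    return total / batch_size


def loss_spread(batch_size: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the mini-batch loss across repeated batches."""
    rng = random.Random(seed)
    losses = [batch_loss(batch_size, rng) for _ in range(trials)]
    return statistics.pstdev(losses)


for size in (4, 64, 256):
    print(size, loss_spread(size))
```

Even with a fixed predictor the per-sample loss varies a great deal, because each sample draws its own data point, noise, and noise level. Averaging over a larger batch shrinks that spread, which is one plausible reading of why large-batch training helps here.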
I would say this async question is something I was always really interested in during my PhD. I mean, I think, honestly, I kept trying to work on it, and then reviewers would just complain that this is not numerically preserving, and so, you know, that made it a lot harder to get these papers accepted. I think just a lot less innovation and effort has gone into training these things asynchronously.
[00:34:08] Raza Habib: Is it just a path dependency right now that, like, we've just built things for synchronous training, and that's what's optimized right now? There doesn't seem to be any fundamental reason why async wouldn't be good.
[00:34:19] Paras Jain: So there is a small accuracy trade-off. But the question is, is the systems efficiency gain you get from asynchrony worth that accuracy trade-off? Because you can just do more cycles.
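The trade-off between synchronous and asynchronous updates can be mimicked with a toy model of gradient staleness. This is a sketch under heavy simplification, not the scheme from the PipeDream paper: a single scalar weight minimizes w squared over two, gradients carry Gaussian noise, and "asynchrony" is modeled as applying a gradient computed on weights from a few steps earlier. The function name `train` and all constants are illustrative.

```python
import random
from collections import deque


def train(steps: int, lr: float, staleness: int, noise_std: float, seed: int = 0) -> float:
    """Noisy gradient descent on f(w) = w**2 / 2.

    staleness = 0 models synchronous updates (gradient on current weights).
    staleness = k models async updates using weights from k steps ago.
    """
    rng = random.Random(seed)
    w = 5.0
    # Sliding window of recent weights; the oldest entry is what a
    # delayed worker would have seen when it computed its gradient.
    history = deque([w] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        w_seen = history[0]
        grad = w_seen + rng.gauss(0.0, noise_std)  # true gradient of w**2/2 is w
        w -= lr * grad
        history.append(w)
    return w


sync_w = train(steps=500, lr=0.1, staleness=0, noise_std=0.1)
async_w = train(steps=500, lr=0.1, staleness=2, noise_std=0.1)
print(sync_w, async_w)
```

With a modest learning rate both runs settle near the minimum at zero; pushing the learning rate or the staleness up eventually destabilizes the stale variant first. In a real system the stale variant would also avoid waiting at a synchronization barrier, which is the efficiency side of the trade-off being discussed.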
[00:34:28] Raza Habib: I think that's with your like, systems hat on, right? The accuracy trade-off, like, bothers you. But with my like pure, like, probabilistic modeling, like ML hat on, I'm like, the noise is good. Like, let's not worry about it. Like, we're injecting noise into the system anyway. And so like, the fact that you get a slightly noisier estimate to the gradient isn't inherently bad. We can make use of the sources of noise in our system to help us train these systems. We don't necessarily have to mitigate it.
But anyway, this is a large tangent that I suspect you and I could talk about for a long time, and I don't want to lose our listeners completely, so maybe I'll put a pin in that in terms of technical challenges, because I think you've given a really good overview and a sense of why this is hard. What I'd love to spend the last bit of time talking about is a little bit more down to earth. How are people using Genmo? Like, what are the most common use cases? What are the ones you're most excited about for the future? So maybe today, how are they using it? And if everything goes to plan, and you're, like, able to really build the models you want to, how would you hope people use it?
[00:35:24] Paras Jain: Yeah, our 10-year mission is to invent a tiny filmmaker that fits in the pockets of a billion people. I think this will be fundamentally transformative. I'm really excited to see a world where the new baseline for consumption is just creation. People aren't just consuming content produced and scaled for audiences of millions of people; rather, it's on-demand, synthesized for them. Every time you go on YouTube, there's an H100, or, let's say, an X100, or something in the future, spinning in the cloud, streaming that video to you, just completely synthesized. If I search for a tutorial on how to fix some part of my car, and it doesn't exist yet, it's just synthesized for me on demand.
We're really far from that world, I think. And so if we look at where we are today, you know, we have this product, more than one and a half million users actively creating video today. Every day, our product creates many multiples of the entire Netflix video catalog, just like every single day, completely from scratch. I would say it's about half-half, like consumer for fun use cases, and then interesting for work, or B2B kind of use cases where people are productively using this.
[00:36:24] Raza Habib: What are some of the B2B use cases?
[00:36:26] Paras Jain: Yeah, we see a lot of interesting use cases where people just brainstorm and prototype, honestly. Before they cut video for something they want to create, they just hop in and try to prototype with a prompt. Traditionally, that iteration would be whiteboarded out: they would have to sketch it out, or they would type it up in outlines. And now they're synthesizing rough drafts with AI, right? Drafts that are, let's say, 80% of the way there, and that lets them move at the speed of thought. I think it's really powerful. That's a workflow where we see a lot of usage of AI today.
I think another one that's interesting is people beginning to produce what I would say is output suitable for social media today, so they're not like into high fidelity assets, but their assets are going to TikTok or Instagram or YouTube. And what's really remarkable is a ton of businesses today are heavily dependent on video production. I mean, we're doing a podcast here, and this is a form of communicating, video communication that will be disseminated out to the broader public. And every business does this, and it's hugely resource-intensive, and so many are starting to leverage AI as a pathway here to make things faster. I would say the general theme today, the use cases that we're seeing people use our product for are ones where hallucination is a feature and not a bug.
[00:37:32] Raza Habib: Yeah, yeah. So creative tasks and ideation, which is very similar to the pattern we saw with LLMs, right? The early use cases of LLMs were all writing assistance or storytelling or copywriting, where actually the randomness was like helpful and hallucination was a feature, not a bug. I don't think that Genmo has an API today. Is that right? And will you guys in the future?
[00:37:54] Paras Jain: So I'm really excited about a world where developers can access this technology. It's not possible today. The way I see this today is that video is really expensive. It costs hundreds of dollars, sometimes thousands of dollars, per second for high-value video production. And so that's the world within which applications exist. That's why we have big streaming platforms and content catalogs. I'm interested in what's under the iceberg: as we drive the marginal cost of video creation down from where it is today towards zero, ultimately, what becomes possible?
And I think we're going to see this massive explosion of video AI applications emerge. You know, close to the top are those hallucination-tolerant use cases I talked about, right, where people are in social media or creative applications, or maybe gaming, and they're really okay with that hallucination. It's something good. At the bottom of the iceberg, you have really interesting use cases in e-commerce, marketing, advertising, and retail, where hallucinations cannot exist; there are big issues if they happen. But it would fundamentally transform, for example, how we go and purchase products at scale, right? And then, I think at the limit, when the costs are really small, you have AI-generated YouTube where every video is synthesized on demand, right?
And what's missing today is this developer ecosystem, this developer platform to let people build that longer term. It's something I'm really passionate about building. You know, my roots, my PhD, were in the UC Berkeley AMPLab. It was the home of, you know, Spark and Databricks. And, you know, I think open source was something I was really steeped in for a very long time there, and I saw the power of what developers can do when they're given this technology. We don't have an API today. It's one of the biggest customer requests today, and it's something we're keenly working on developing, keeping an eye out for when the technology really is at a point that developers are going to be able to consume it and build next-generation applications powered by it.
[00:39:43] Raza Habib: And is that the main limiting factor today, that you think, if people did have API access, it's not quite good enough yet in terms of these hallucinations, et cetera, for people to want to just programmatically interact with it?
[00:39:54] Paras Jain: Where we are today, I think, is like GPT-3 or GPT-2, where like, there's a lot of cherry-picking needed for the technology just as a whole to be useful. And so that kind of precludes API-driven use cases, if you think about it, because you want programmatic push-button reliable workflows. And so it wasn't until we had the invention of techniques, like RLHF, for example, which did alignment to reduce the variance in output, that LLMs became really useful.
You know, for me, that's a really big open question: how far, as a community, are we from getting that to the point that you can call an API and reliably expect it to produce high-quality output? This is true across the field, but I would say we're actually a lot better than many of our peers in this way. From cherry-picking with some of our competitors' products, you might get two usable videos out of ten, right? Given how much AI generation costs, that might be all you can generate. I would say we benchmark somewhere between six and seven out of ten, depending on the use case. But for those hallucination-tolerant use cases, we ensure users get through enough volume to finally produce something usable. API reliability is what we're working to invest in.
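The "two out of ten" versus "six to seven out of ten" figures translate directly into how many generations a user should expect to pay for. This is back-of-the-envelope arithmetic that assumes each generation is independently usable with a fixed probability, which is a simplification; the function names are illustrative.

```python
def expected_generations(p_usable: float) -> float:
    """Average number of generations until the first usable video,
    assuming independent attempts with the same success probability."""
    if not 0.0 < p_usable <= 1.0:
        raise ValueError("p_usable must be in (0, 1]")
    return 1.0 / p_usable


def prob_at_least_one(p_usable: float, attempts: int) -> float:
    """Chance of getting at least one usable video in a fixed number of tries."""
    return 1.0 - (1.0 - p_usable) ** attempts


for label, p in (("2 in 10", 0.2), ("6 in 10", 0.6)):
    print(label, expected_generations(p), prob_at_least_one(p, attempts=3))
```

At a 20% hit rate a user needs about five generations on average and has just under a coin-flip chance of success within three tries; at 60% the average drops below two and three tries succeed more than 90% of the time. An API caller who expects one request to yield one usable result needs the per-call rate itself to be high, which is why cherry-picking and programmatic use pull in opposite directions.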
[00:40:59] Raza Habib: Last couple of questions. It would be impossible to have an in-depth conversation about video generation without touching on ethical considerations or the consequences for wider society. There was a podcast that Caterina Fake used to do called Should This Exist? where they would deep dive into the societal consequences of new technology and whether they would be good or bad. Video generation feels like it could have amazing positive outcomes, but also scary use cases like disinformation, deepfakes, or child pornography, which are quite worrying. How do you guys think about that? What is the world you're most excited about, and what are the things you hope don't happen? How do you think about the ethics of this?
[00:41:48] Paras Jain: From a safety perspective, we've spent a ton of time thinking carefully about this and have invested heavily in our safety stack, which includes everything from input-output prompt filtering. Our approach has been to ship this technology early. When everyone looks at AI video, they know it's AI. It's far from reality today, so it's not something directly replacing real content. This gives us an environment to safely scale the technology and observe what jailbreaks exist—what people are trying to do that's potentially unsafe—and build safety mitigations today while it's still early and safer to do so.
[00:42:26] Raza Habib: What do you view as a safety mitigation? It sounds like your implicit model is that there are good and bad use cases, and you're going to stop people from doing the bad ones. Is that roughly your mental model?
[00:42:39] Paras Jain: We have community guidelines that outline what we support as the values for this product. What's interesting is that the normative values that should govern these systems vary a lot across the world. We have customers in 40-plus countries, so what adheres to Western standards may not adhere to Eastern standards. We have to think carefully about how to localize appropriately for our users. But I do fundamentally believe it's important to be safety-conscious, and we've invested heavily in this. With our million and a half users, we've observed a huge variety of use cases, which gives us a lot of ability to build defenses in the product. If you ask our productivity tool to do something against our community guidelines, it will sometimes refuse. That's something users are sometimes frustrated by.
[00:43:27] Raza Habib: The safety aspect of preventing models from generating things outside your community guidelines makes a lot of sense. I'm curious about your thoughts on the second-order consequences. If we succeed and video generation gets really good, that will change society in many ways. You mentioned that the cost of video generation could go close to zero, allowing on-demand video generation. That will have effects on the way we work, relationships, and the dissemination of information. Paint me a picture of how you think society might look different and which changes you're excited about, as well as those you think we need to work hard to avoid.
[00:44:10] Paras Jain: I'm generally an optimist about technology, so I spend a lot of time thinking about what good it can do rather than why it shouldn't work. When I think about this technology, I look at large language models as the closest analog. Today, we live in a world where a very small handful of companies control this technology, and they're generally large incumbents. OpenAI has basically conquered and won the left brain of AGI, and it's kind of funny: I view Anthropic as almost the same entity as OpenAI. They split in a sort of mitosis event but came from the same root organism. If OpenAI controls the entire left brain of AGI, it would be a huge shame if they owned the right brain as well. Their model is relatively closed, where they're extremely prescriptive about what you can and can't do with it. Users don't have much choice or control over their models, and they don't have much consent in the process. I don't think that's the best path forward.
It's really important for a startup like us to exist in this ecosystem, trying to chart an alternative path. I believe the best path to safety is transparency and openness—providing open access to these models early and quickly to our user base, not pitching it to Hollywood boardrooms first. I truly believe the end state here is that the models themselves eventually run, potentially at the edge, on our users' devices, where they personalize it with their own data. It serves them rather than being one giant, monolithic model in the cloud for everybody. I think that charts an optimistic path to a world where models are hyper-personalized and serve the needs of end users rather than monolithic organizations.
[00:46:09] Raza Habib: I'm going to ask you a couple of questions we try to ask almost everyone. The first one is, we've spoken a lot about video generation with AI more broadly. How do you think the world will look different five years from now? What are your predictions for how good the tech will get, and what impact that will have?
[00:46:27] Paras Jain: Over the next five years, I think AI will become a basic capability in everyone's life. When I talked to my niece, it was really funny—she actually believed that smartphones had a chip in them that could generate images as a base capability, just like they can add or subtract. Already, the belief in what these computers should be capable of is shifting quickly. In five years, AI will be an everyday capability for a billion-plus people, like water or electricity.
[00:46:54] Raza Habib: The other question we ask is, with the rapid progress and hype in AI, how do you keep up with the latest research papers or track the field? Is there anything you're particularly excited about that might have been overlooked?
[00:47:16] Paras Jain: I spend a lot of time reading papers, specifically studying evaluations. The interesting thing is always Figure 1 and Table 1. It's important to track how fast progress has been made on key benchmarks over time. What's remarkable in large language models is how fast progress has been on core reasoning benchmarks like math or human evaluation—faster than I ever anticipated. I think it's really important to be evaluation-driven when consuming this information to cut through the hype.
I've always believed that dense models can't be the single path forward for the future of large generative foundation models. It's terrifying to think of scaling laws where you need exponentially increasing resources for linear progress. This just can't continue in a resource-constrained world. Sparsity has to be effective at some point. We've seen mixtures of experts begin to emerge, but one area that really needs more investment is low-precision arithmetic. As we conquer the eight-bit boundary, we need natively sparse hardware architectures to support scaling in the future. It's crucial that we find ways to break the quadratic scaling law of modern transformers.
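As a small illustration of what "low-precision arithmetic" at the eight-bit boundary involves, here is a sketch of symmetric 8-bit quantization of a list of values. It assumes a single scale factor for the whole list and round-to-nearest; production schemes typically use per-channel or per-block scales and hardware-specific formats, and the function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of four, at the price of a rounding error of at most half the scale per value. Small values such as 0.003 collapse to zero when they share a scale with much larger ones, which hints at why pushing precision lower than this gets progressively harder.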
[00:48:34] Raza Habib: All right, I think that's a great place to end. Paras, I really enjoyed the conversation and learned a lot. Thank you very much for coming on the show.
[00:48:39] Paras Jain: Thank you. I had a wonderful time.
[00:48:42] Raza Habib: All right, that's it for today's conversation on High Agency. I'm Raza Habib, and I hope you enjoyed our discussion. If you did, please take a moment to rate and review us on your favorite podcast platform like Spotify, Apple Podcasts, or wherever you listen, and subscribe. It really helps us reach more AI builders like you. For extras, show notes, and more episodes of High Agency, check out humanloop.com/podcast.
If today's conversation sparked any new ideas or insights, I'd love to hear from you. Your feedback means a lot and helps us create content that matters most to you. Email me at raza@humanloop.com or find me at RazaHabib on X.