Transcript: How Adaption Labs Is Rethinking AI With Continual Learning | Sudip Roy, Co-founder & CTO of Adaption Labs
In this episode of Startup Project, host Nataraj Sindam sits down with Sudip Roy, Co-founder and CTO of Adaption Labs, to discuss why the AI industry is shifting from scaling compute toward inference-layer efficiency, what is really driving inference costs, and how gradient-free continual learning could close AI's stubborn "last 5%" reliability gap.
2026-06-18

Nataraj: Hello everyone, welcome to Startup Project. My guest today is Sudip. Sudip is the Co-founder and CTO of Adaption Labs. I think we're seeing two trends play out simultaneously in AI. One is scaling purely with compute — adding more GPU clusters, putting more data into large training models. But then there's a set of new efficiencies coming in at the inference layer, and Sudip and his team are working on adapting to those strengths. Sudip was previously a director at Cohere, working on inference and shipping the serving and fine-tuning infrastructure there. Before that, he spent considerable time at Google Brain. He co-authored TFX, the platform that powers production machine learning at Google, helped build Pathways, and is now working on Adaption Labs. Sudip, welcome to Startup Project.
Sudip: Thank you so much for having me. Looking forward to the conversation.
Nataraj: Before jumping into that, I want to talk a little about your background and your experience working at Google Brain. Can you talk about some of the problems you were working on when you first joined Google Brain? And the follow-up: did you imagine the space would go this parabolic in such a short time? It caught everyone off guard in some sense.
Sudip: So I actually come from a data management background. I went to Cornell University in upstate New York, where I did my PhD in data management systems, working mostly on more traditional systems like distributed transaction processing. When I joined Google, I joined as a researcher, again in the data management space. But over the years — back then it was more "machine intelligence" — I saw machine learning becoming pervasive across all products, and I decided that would be an interesting area to stretch myself into. So I moved more toward doing end-to-end machine learning. This was still in the era where every task gets its own set of models, and you'd effectively train thousands of models. The product we developed, TensorFlow Extended, was effectively a harness around TensorFlow. TensorFlow was the core infrastructure used for the training piece, but end-to-end machine learning involves a lot of other pieces — from data processing to data validation to the model training itself to post-model validation and evaluation. TFX basically standardized all of that end-to-end processing.
Then in my last three years at Google DeepMind, I got the opportunity to work on a really interesting project: ML Pathways. The idea was to develop the infrastructure for the next generation of AI models. This was also when the field had pivoted toward foundation models, which were substantially larger than the earlier generation of small machine learning models. We developed Pathways as a system that could run across multiple TPU pods, possibly even different data centers. There were a lot of interesting systems challenges in developing it, and I think the system is still used for training and serving the Gemini series of models now. During those last three years, I also wanted the flip-side experience of viewing these systems problems from the point of view of a machine learning researcher, so I explored a lot of transformer-based models and how they could be used to solve complex optimization problems in systems like ML compilers. But that's briefly my journey at Google DeepMind.
Nataraj: What was your experience working at Cohere? I think your last job was at Cohere, and your focus was primarily on inference. Tell us a little about that.
Sudip: Going back to your earlier question — did I actually anticipate large language models and foundation models taking off? I would say I could see the potential of large language models, and that was one of the motivations for switching to Cohere, which was still early in its days, around the seed stage. I could see the impact this technology could broadly have. But I definitely would not claim I could anticipate the scale of the impact we've seen over the last three years — even more so within just the last six months. The pace of technological innovation and how pervasive it has become in our day-to-day lives were very hard to anticipate. One of the reasons I switched to Cohere was largely because I wanted to work in a fast-paced environment where the gap between research and end-to-end product was much smaller.
Nataraj: I want to point out that, before foundation models, you mentioned we had models for each purpose. If a company like Intuit wanted to convert tax forms into structured data, they used to take all their previous tax documents, scan them, create datasets, tag them, and build a very specific model — or use an existing model hosted on AWS or Gemini specifically for image recognition and then optimize that. So we were focused on very specialized models. I want to get your view on how long it would take to get a project like that online versus doing the same thing with large language models — both in terms of time and resources. We always think of LLMs as this costly thing where we spend hundreds of millions of dollars on training clusters, GPUs, and infrastructure. But there's also the angle of what we used to do before, and a large portion of that is now being outsourced to large language models. A whole set of subtasks we'd otherwise do with specific ML models has gone through LLMs. What are your general thoughts? How would that change?
Sudip: If you think back maybe seven or eight years ago, when we were still doing these very bespoke models for specific tasks, the end-to-end time from envisioning that we needed to solve a problem with a machine learning model to having something deployed as part of the product still used to take one or two quarters — especially at a company like Google, where a lot of production issues also need to be resolved. It's not just about the training piece. The resource footprint of training those models was fairly small, and the infrastructure around it was primarily TensorFlow, which was still in its infancy. There was a period where hardware increasingly moved from general-purpose compute toward accelerators, and frameworks like TensorFlow matured. All of that helped reduce that quarter-or-two to maybe one quarter, but it still took a lot of engineering hours to productize a model.
With foundation models, if you look at the current iteration velocity of new models arriving in the market, it's still a six-month-to-a-year gap. It feels much shorter — like there's a new model every week — but those models are produced by different companies and different groups of people. If you focus on one particular lab, the gestation time is almost six months to a year. And it makes sense, because now you're training one model that is good at thousands of tasks. So you can justify spending massive amounts of compute, human resources, and data to get this one model, because it'll solve a thousand different tasks instead of having to train a thousand models. That's the justification for why it makes sense to train these really large foundation models. Over the last three years, what we've done is spend a humongous amount of resources on training and then optimize serving as much as possible to reduce costs. But now we're running into challenges there as well, because the models have moved into the trillion-parameter regime. Inference costs have skyrocketed for a multitude of factors we can get into later. And we are slowly changing to a world where smaller, verticalized models that can deliver equally good results are more appealing than general-purpose foundation models.
Nataraj: Can you talk a little about why inference costs are still high? What are the different factors driving the cost? Why is inference cost so high?
Sudip: If you take a broader look at inference cost on a per-token basis, the cost has actually fallen by orders of magnitude. By some estimates it's around a 300x drop in inference cost per token, even for the most premier models. But overall, the amount being spent on AI has dramatically increased, even though per-token inference costs have fallen. That's for a few reasons. One, the number of tokens used for any particular task is substantially higher, because now we have reasoning models that like to think a lot. Two, with agentic workloads, a single task can now be served by a parallel set of requests running in the background, and some of those background jobs can run for hours, almost like cron jobs in the traditional systems sense. All of this has contributed to somewhere between a thousand and a million times increase in the number of tokens we're producing. So per-token inference costs have gone down by 300x, but demand has gone up by almost a million x and is still accelerating. That has created a demand-supply gap, which makes inference feel really expensive.
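Illustration (not from the episode): the arithmetic behind this answer can be checked with a small Python sketch. It takes the two figures quoted above, a roughly 300x drop in per-token cost and a thousand- to million-fold rise in token volume, and assumes total spend is simply per-token cost times tokens produced. The function name is made up for this example.

```python
def spend_multiplier(cost_drop: float, token_growth: float) -> float:
    """Relative change in total spend when the per-token cost falls by
    `cost_drop` and token volume grows by `token_growth` (both given as
    multiplicative factors). Assumes spend = cost per token * tokens."""
    return token_growth / cost_drop


COST_DROP = 300  # per-token cost reduction quoted in the conversation

# Low and high ends of the token-growth range quoted in the conversation.
for growth in (1_000, 1_000_000):
    factor = spend_multiplier(COST_DROP, growth)
    print(f"{growth:,}x more tokens at 1/{COST_DROP} the price -> about {factor:,.1f}x the spend")
```

Under that assumption, even the low end of the quoted range leaves total spend a few times higher than before, and the high end leaves it thousands of times higher, despite the cheaper tokens.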
Nataraj: And there's this factor of the probabilistic nature of the output. We could both try to do the same task but give slightly different prompts, and one prompt might get an efficient, immediate answer while another takes longer or more tokens to get the same output. It's like taking two different paths to the same destination, and one path could be costlier. We don't know how efficient the user is at prompting, or just how systems interpret a particular prompt. Previously, when you put out an API, we used to be able to estimate the cost — how many API calls means how much cost in the cloud. That was a more reliable way to calculate cost. With inference APIs, we're seeing pricing changes all the time. There's a product I was using that started out as a subscription product, but now they've had to overlay a credit limit on each subscription. In some sense that's not new — we had bursting credits in APIs for peak volatility, like at 9 a.m. when everyone logs in. But now we have to combine a provisioned model with a pay-go model. And this comes from the fact that these are probabilistic in nature, which is part of why they have higher inference costs.
Sudip: Partly it's the probabilistic or non-deterministic nature, but I think the fundamental property of large language model inference is that it's autoregressive: each new token depends on the tokens generated before it. You cannot emit the entire generation at once; it's really hard to parallelize. It has a very sequential nature. The implication is that most standard systems APIs were built around the fact that each API call is more or less deterministic and homogeneous: I can treat each API call as one atomic unit of work, scale my infrastructure horizontally, and parallelize things. But with the autoregressive nature of LLM inference, that doesn't hold anymore, and it introduced a lot of challenges. Even two years ago we ran into a lot of these challenges, and there was a lot of innovation around continuous batching, paged attention, and flash attention to optimize inference. More recently, TurboQuant was published to optimize KV-cache movement. But a lot of that is not just about the probabilistic nature of LLM inference. Even the traditional ML models before foundation models were probabilistic (a classification model would give you one class or another depending on the model), and a lot of SRE tools went through a change to adapt to that more statistical way of building systems. With foundation models, the shift is not just the probabilistic nature, but more the high variance in request and response profiles combined with the autoregressive nature, and those together make it really hard to build robust and reliable systems. That has implications on cost as well, because it makes it challenging to pack requests efficiently onto a given set of hardware. A lot of GPUs are more or less underutilized because we need to over-provision them to handle surges.
Combined with the fact that serving foundation models is so expensive, that over-provisioning has an associated cost that has to be passed on to the consumer somehow — which is why a lot of subscription-based pricing structures are breaking now, and people are increasingly moving toward usage-based pricing.
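Illustration (not from the episode): a toy Python simulation of why variable-length, sequential decoding makes it hard to pack requests onto fixed hardware, and of the idea behind the continuous batching mentioned above. It assumes each request needs a known number of decode steps and that a server has a fixed number of batch slots, and it ignores prefill, memory limits, and KV-cache management. The request lengths and function names are invented for the example; this is a sketch of the scheduling idea, not of any particular serving system.

```python
import heapq


def static_batch_steps(lengths, slots):
    """Static batching: requests run in fixed groups of `slots`, and each
    group occupies the hardware until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batch_steps(lengths, slots):
    """Continuous batching: as soon as a slot frees up, the next waiting
    request takes it, so short requests do not wait on long ones."""
    free_at = [0] * slots  # step at which each slot next becomes free
    heapq.heapify(free_at)
    for n in lengths:
        start = heapq.heappop(free_at)      # earliest available slot
        heapq.heappush(free_at, start + n)  # slot is busy for n decode steps
    return max(free_at)


def utilization(lengths, slots, total_steps):
    """Fraction of slot-steps that actually produced a token."""
    return sum(lengths) / (slots * total_steps)


# Hypothetical per-request output lengths, measured in decode steps.
decode_steps = [2, 9, 3, 8, 1, 10, 2, 7]
SLOTS = 4

for name, schedule in (("static", static_batch_steps),
                       ("continuous", continuous_batch_steps)):
    steps = schedule(decode_steps, SLOTS)
    print(f"{name}: {steps} steps, {utilization(decode_steps, SLOTS, steps):.0%} slot utilization")
```

With these made-up lengths, static batching needs 19 steps and continuous batching needs 13, so slot utilization rises from about 55% to about 81%. The idle slot-steps that remain are one simplified view of the over-provisioning cost discussed above.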
Nataraj: I think that's a good segue. We talked about some of the inference serving techniques like KV caching, quantization, and LoRA. So what led to Adaption Labs? What is the main thesis you started with — that you need to start a new company — and what are you trying to attack?
Sudip: With Adaption, we saw two sets of users who were being fairly limited in their access to AI. The first set is users working in underserved communities or underserved languages, who don't necessarily have the resources to own their AI and are limited to whatever the frontier model APIs provide them. On the other hand, enterprises had a lot of private data they were not able to leverage properly to build a moat for themselves through custom vertical models. Largely, that was because foundation model training or customization was considered a bit of a black art limited to only the frontier labs. Our mission at Adaption is to enable all of these people to have more control over their AI end-to-end. That was the long-term vision and mission we started with.
Our technical approach is to invest in gradient-free continual learning, where we want to enable intelligence to evolve as the world around it changes — and to do it in a gradient-free manner. What that means is we want to make sure the learning, or the change in behavior of the AI stack, feels almost instantaneous. You shouldn't have to wait weeks or months. As the environment around it changes, the model or AI stack should be able to interact with the environment and evolve very naturally and gracefully over time.
Nataraj: So "gradient" comes from gradient descent, which underpins everything in ML and AI.
Sudip: Pretty much — gradient comes from the fact that you want to update the weights of the model. But one of the positions we're taking at Adaption is that we want to take a more full-stack view of AI, as opposed to just considering the model to be AI. Even today, if you look at the wider field, a lot of the innovation has actually moved out of the model itself to the systems built around it and the interfaces designed around it. The model is still an integral part, but to have a successful AI system you need to invest in and consider what you're building around it. At Adaption, we want to innovate and co-optimize across the full stack — starting from the interface or environment and how the model interacts with the external world, whether humans or other agents, to the harnesses built around the models, to the models themselves. We strongly believe that by co-optimizing across the full stack, we can uncover solutions that are just not possible by focusing on any one of these layers in isolation.
Nataraj: The way I'm understanding it: take coding, for example. Since around last November, coding with AI spread like wildfire. But theoretically, even when GitHub Copilot was launched — I think with GPT-1 or GPT-2, I might be wrong, but very early on — what really changed from there to here is, yes, the underlying model quality definitely helped, but the form factor and the harnesses built around it for reasoning, and the agentic form factor, are what really created the advantage and got more people to use it. Sometimes a simple UI change actually improved things a lot. And then you added multiple reasoning capabilities — making multiple calls to plan better, reason better, store a plan, and reiterate on top of it. Some of that is pure technique on top of the model. You could do a version of this on top of GPT-2; it might not be as good because you still need a model that's better at coding, but you could already see this evolution coming. So you're almost saying we'll have these models coming out from large labs, but you'd still gain a huge advantage just by building new things on top of them — whether at the inference level or in how you think about the interfaces on top of models. Is that the right way to think about what you're approaching?
Sudip: Yes, we definitely want to think about the interface and everything else you just said. One of the reasons we emphasize that is, as I mentioned, we want to do continuous learning, and the way to do continuous learning is to interact with the environment. In this case, the environment is either the human user providing feedback, or other agents or sensors you're interacting with — which is why the interface is important. That's your primary feedback-collection mechanism. We want to use those signals to optimize long-horizon tasks — tasks that may take hours or even days in the future. And that's only possible if you think about the right mechanisms for collecting that feedback and how to fold it into the model layer through the harness. That's why we believe it's important to innovate across the full stack. One reason coding especially has been really successful over the last six months is not just the interface or harness, but also that it's an easily verifiable problem — the code either executes and does what it should or it doesn't. So the feedback loop when it makes a mistake is really fast. But there are a plethora of other use cases where that mechanism is not as binary — it's not just true or false; it's mostly about how a human feels about a particular piece of content. Those are much softer signals, and there are still a lot of interesting problems to solve in how we make feedback loops around soft signals help the system improve over time.
Nataraj: What would some of those use cases look like? Are they the same use cases where, in some form, we're not doing so well, where we're solving maybe 80% with Claude Code, M365 Copilot, or Glean? I'm asking about your second set of users, the enterprise context. We have some tooling, it's good, and some use cases are unlocked. Give me an example of what could be better, where Adaption really succeeds if you nail that last signal.
Sudip: The most canonical example is a customer support agent. Let's say you use some form of AI to power a chat interface driven by a customer support agent. Today, if a user asks for something, the agent spawns subagents, tries to do the task, and comes up with something the user doesn't like, and the user says, "This is not what I wanted." That's usually a bit of a dead end — the system did not learn from that. The next time another user comes in and makes the same request, it's going to give them the same result after making those, say, 100 inference calls. What we want is for the system to be able to learn. Sure, it failed in that instance, but how does it adapt? Now that it has an explicit signal from the human that this is not what they intended, how do we fix it automatically? That's an example of what we want to solve.
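Illustration (not from the episode): the dead end described above, and the simplest possible way around it, can be sketched in Python. The sketch remembers which answers users rejected for a given request and skips them the next time, without touching any model weights. It is a deliberately naive stand-in and not Adaption's method, which the conversation does not detail; the class, method names, and sample strings are all hypothetical.

```python
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical requests match."""
    return " ".join(text.lower().split())


@dataclass
class FeedbackMemory:
    """Remembers, per request, which answers a user explicitly rejected."""
    rejected: dict = field(default_factory=dict)

    def record_rejection(self, request: str, answer: str) -> None:
        """Store an explicit 'this is not what I wanted' signal."""
        self.rejected.setdefault(normalize(request), set()).add(answer)

    def choose(self, request: str, candidates: list):
        """Return the first candidate not previously rejected for this
        request, or None if every candidate has been rejected."""
        blocked = self.rejected.get(normalize(request), set())
        for candidate in candidates:
            if candidate not in blocked:
                return candidate
        return None


memory = FeedbackMemory()
options = ["refund to original card", "issue store credit"]

first = memory.choose("Refund my order", options)
memory.record_rejection("Refund my order", first)    # user: "not what I wanted"
second = memory.choose("refund my   ORDER", options)  # a later user, same request
print(first, "->", second)
```

The point of the sketch is only that the second user no longer receives the answer the first user rejected. Handling soft signals, paraphrased requests, and long-horizon tasks is the hard part the conversation goes on to describe.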
Nataraj: This is something I encounter even when I'm coding with AI: it forgets that over the last three chats we spent almost 20 minutes figuring out the problem, and somehow the next time it deploys in the same wrong way we'd already ruled out, because it doesn't keep track of it. So now I have to explicitly tell it to document the right solution when we find it. Or I create a README or a deploy file so it doesn't forget, and I also note down why we made the decision, so we stop reinventing it.
Sudip: Yeah, we kind of went through this journey with prompt engineering, to be honest. We had LLMs that understood instructions in a very specific format, and as humans we basically adapted ourselves to give really elaborate instructions. If you've seen the prompt structure of any app using AI, it could be multiple pages of "do this, don't do this, do this, don't do this." That makes the system very fragile, and it's an unnatural way of communicating with the AI. We're seeing something similar with the example you gave. I experience the same thing when I write code: I have to give it the context of the overall system architecture it has to keep in mind, then the context around the current design I'm working on, and then it may do the right thing. But it would be much easier if the system just learned that, kept the relevant context for me, and learned from it over time.
Nataraj: I think it's also proof, in a way, that LLMs are not conscious, because consciousness by itself is self-contained.
Sudip: That's going into more of a philosophical conversation now.
Nataraj: So for the first product — I think you mentioned Adaptive Data — what does Adaptive Data do, and how does it solve this gap of that last-mile context?
Sudip: Adaptive Data was the first product Adaption launched. We launched it about four weeks ago, and we've gotten a really strong response from the community. We've had roughly 25 million data points processed within the last four weeks or so through the product. What it enables you to do is, if you have low-quality data, or if you just have an intent but don't even have the data in the right structure or format to connect to your downstream AI systems, Adaptive Data makes that transformation really seamless. Today the product enables you to do that for downstream training purposes, but in the future we'll expand it to other integration points with AI as well. The idea is, if you're an enterprise and you have some high-quality but sensitive data you can't use as-is, or you want to customize a model but you're starting with zero or very little data and you want a larger dataset to create that custom model, Adaptive Data can help you seamlessly achieve that.
We recently also announced AutoScientist, which is a step further in the direction of, "Why stop at the data itself? We'll also solve the model problem for you." AutoScientist allows us to co-optimize both the data and the model so that you ultimately get a custom model trained on that adapted data end-to-end. What we've seen is that by doing this co-optimization across data and model, we can achieve much higher quality — both in the model generations and in fidelity — than you'd be able to do by just iterating on the model itself.
Nataraj: What types of customers are finding this useful?
Sudip: We've seen a broad variety of users. We're working with some financial companies who are using it to create vertical models that are really good at trading. We're working with companies that have customers in low-resource languages across the world, which are usually not very well catered to by the standard set of models, so they're using Adaptive Data to generate high-quality data in those low-resource languages and train custom models on top of that. We have customers using it to generate really long-context data because they want to fix long-context issues they're encountering with the models. So, really a wide variety of use cases.
Nataraj: So if I look at the abstraction layer, you're sort of between pre-training and inference — you're offering techniques at the customization layer. Is that a good way to say it?
Sudip: With Adaptive Data, it's mostly offering you high-quality data that you can use to customize models either at the mid-training or post-training stages.
Nataraj: So basically you're creating synthetic data. Is that how I should understand this?
Sudip: Yeah, the output is going to be a high-quality SFT dataset.
Nataraj: Got it. You also talked about adaptive intelligence and adaptive interfaces. What is the next step here in terms of providing more intelligence?
Sudip: The next step, which I mentioned, is AutoScientist, which is a step toward adaptable intelligence. There we're not just stopping at "here's a really high-quality output dataset," but rather we'll also take care of the training process for you, where we co-optimize both the data and the model and deliver a model that is very highly tuned to your particular task, with a guarantee that it does really well on that task. You don't necessarily have to have an in-house set of expert researchers to get to that model, because that's one of the unsolved problems. There are a lot of players who offer fine-tuning APIs, which remove the infrastructure complexity from training itself, but a lot of people run into hurdles using them because they either don't have the high-quality data needed to do the fine-tuning, or they don't have the in-house knowledge to tune the training process itself. With Adaptive Data we solve the first piece of the puzzle, and with AutoScientist we solve the second piece — but more importantly, we co-optimize it with the data itself, as opposed to just doing the second piece by itself.
That's a step toward adaptable intelligence. If you think more broadly about what model training is, it's an example of a long-horizon task — the training process can take hours to days. So it's an instance of a long-horizon task that we're now optimizing, and with adaptable intelligence we want to push more in the direction of optimizing other long-horizon tasks in the future. Interfaces is the third pillar we're focusing on. We want to explore what intuitive, task-specific interfaces look like — where it's easy for users to consume information in a very native interface mapped to their specific task, but also easy enough for them to provide feedback seamlessly, so that, using adaptable intelligence, they can see the system correct those behaviors or errors over time.
Nataraj: Give me an example of an adaptive interface. I know you're still thinking about it, but what would an adaptive interface look like?
Sudip: I can't go into too much detail because it's still in the works, but hopefully we'll be able to show more about it. Think about it this way: today, if you go to a chat interface, chat is the primary way of interacting with AI. It's not very task-specific — you get the same interface irrespective of what you want to accomplish. But as humans, we find it much more palatable if information is organized in a specific format that makes it easy to grasp. It may be charts in some cases, images in others, or a combination. An adaptable interface builds on that thesis to create really task-specific interfaces that make it much easier for you to consume information.
Nataraj: So in some ways it's almost a full-stack play. You start by providing data — if you have data, good, bring it; if you don't, we start by providing high-quality data for the specific task you're optimizing for. Then once you have the data, you build the intelligence layer with all the post-training techniques. And then once you have those techniques in your system, you provide interfaces that feed back into your techniques, which you can implement to again give back more optimized results. It's a full-stack approach starting from data, then intelligence, then interfaces. It also looks like, if I'm trying to create a new Cursor competitor and I don't have a lot of coding data, I could bring some data, get more data from you, take an open-source model that's really good, use all the post-training, fine-tuning, or inference-layer techniques, have your intelligence APIs incorporated in my product, and then get that feedback cycle going and improve further. Is that a good way to think about it?
Sudip: Yeah, absolutely. You captured it really well. Our overall mission is to enable a much broader class of users to exercise control over AI across the full stack, as opposed to being relegated to building products on top of rigid APIs. Right now, there are maybe 100 to 1,000 people in the world who have the knowledge to train frontier models. But what if we could 10x, or 100x, or 1,000x that number? That's what we want to achieve in the long run.
Nataraj: It almost feels like this is the next version of fine-tuning. If you map out what different people are doing, there are only five to ten real companies pre-training, and then the rest are doing some version of fine-tuning — also a small class. If you think about the broader market, the Fortune 1000, maybe 80% might not even be doing fine-tuning yet. And you're creating a new class of techniques on top of fine-tuning. How do you think about this versus fine-tuning — when do you do fine-tuning versus the more inference-layer, adaptive techniques you're talking about?
Sudip: In the long run, we almost don't want the user to even be aware of when to fine-tune or why to fine-tune; the system just does the right thing for you behind the scenes. In some ways it's a bug in the system that we developers have to be so exposed to the nuts and bolts in order to change the behavior of the system. Of course, there'll be incremental milestones we take to get to that long-term vision, but our long-term vision is that the system takes care of the adaptation for you — whether it's fine-tuning, something else, or a combination of the two. Those are all implementation details in some ways.
Nataraj: One related question: where are the new techniques coming from? Once the whole inference-optimization wave started, people shifted toward KV caching, LoRA, and quantization, and companies started building on top of those techniques. Where are the new techniques coming from these days? Is it mostly the big training labs? Is it open source? Or are you internally looking aggressively for the next technique?
Sudip: We are a frontier research lab. The term that's popular in the industry these days is "neolab," so we're one of the neolabs, and there's a host of other neolabs. A lot of them are innovating in various spaces — some are more specific, like "we'll solve AI for science." We're definitely innovating as well; we have our own in-house research teams constantly looking for new approaches and techniques. And obviously there's also the broader research community outside the industry that's constantly innovating. It's a broad enough set of problems that there's plenty of space for research and innovation, and we think it's a very valuable problem to solve in the long run.
Nataraj: Are there techniques that maybe the San Francisco research community knows about that aren't yet popular, where you're like, "this thing is going to be big"? Are there techniques like that you see coming?
Sudip: There are definitely some in-house techniques within frontier research labs that haven't been widely available. Having been at some of those frontier research labs, part of our mission is to productize these techniques so a broader class of users can access them. Part of our mission is also to consistently continue to innovate and bring new techniques to the world. Are there unpublished techniques? Sure — many companies consider them valuable intellectual property. But the space of innovation is rapid enough that I'd hope those techniques don't remain limited to a small set of individuals, but become more broadly accessible over time.
Nataraj: You use this phrase, the "last 5% reliability gap," and it really resonated with me, because in some sense all AI problems are starting to look like the self-driving car problem. We make 80% progress very quickly, then 10–15% progress over six months, but the last 5% takes forever. The last 5% is where you wish it would change, and then it changes again. I see this in coding, in agents doing Excel updates — whenever I'm trying to adapt to more of these use cases, the last 5% is the dealbreaker. The last 5% is where all the real value is, because that's what tells you whether it goes from a demo to production. Talk to me a little about that last 5% and your thesis there.
Sudip: We see this fairly often in enterprises. Enterprises want to integrate AI into their products, but in many cases they're used to the 99.9% reliability they had before they used AI. Now, suddenly, even getting to 90% is really hard because AI is very stochastic in nature. A big reason for that last 5–10% gap is that the environment is very dynamic: the context changes over time, and the relationships you need to be aware of to make an accurate deduction change over time. But AI is more or less static, so it doesn't do well with that drift in the environment or the data around it. That's what I'd attribute a lot of that last 5% failure to. If the AI system itself were more dynamic and could continuously learn, we'd have a real opportunity to bridge that gap. It's not that it won't make mistakes. Let's say it makes a mistake, and you tell it today that going forward this is not what you want, and it internalizes that and learns from it. Over time that 5% becomes 1%, and the 1% becomes 0.1%. But it has to be through systems that continuously learn and evolve, as opposed to static systems with a fixed set of behaviors that don't change with the environment.
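[Editor's illustration] The contrast Sudip draws between a static system and one that internalizes feedback can be sketched with a toy example. This is not Adaption Labs' method, which the conversation does not detail. The class names, the `refund_window` query, and the correction-memory design are all hypothetical, chosen only to show how a single piece of feedback stops a repeated error after the environment drifts.

```python
from dataclasses import dataclass, field


@dataclass
class StaticModel:
    """Answers from a fixed rule table that never changes after deployment."""
    rules: dict[str, str]

    def predict(self, query: str) -> str:
        return self.rules.get(query, "unknown")


@dataclass
class AdaptiveModel(StaticModel):
    """Same rule table, plus a memory of corrections (hypothetical design)."""
    corrections: dict[str, str] = field(default_factory=dict)

    def predict(self, query: str) -> str:
        # A stored correction always overrides the original rule.
        return self.corrections.get(query, super().predict(query))

    def feedback(self, query: str, correct_answer: str) -> None:
        # No gradients or retraining: just remember what the user said.
        self.corrections[query] = correct_answer


def count_errors(model: StaticModel, stream: list[tuple[str, str]]) -> int:
    """Replay (query, truth) pairs and count wrong answers."""
    errors = 0
    for query, truth in stream:
        if model.predict(query) != truth:
            errors += 1
            # Only the adaptive model can absorb the correction.
            if isinstance(model, AdaptiveModel):
                model.feedback(query, truth)
    return errors


# Made-up drift: the policy changes from 30 days to 14 days mid-stream.
rules = {"refund_window": "30 days"}
stream = [("refund_window", "30 days")] * 2 + [("refund_window", "14 days")] * 8

static_errors = count_errors(StaticModel(dict(rules)), stream)
adaptive_errors = count_errors(AdaptiveModel(dict(rules)), stream)
print(static_errors, adaptive_errors)  # prints "8 1"
```

In this sketch the static model repeats the same mistake on every query after the drift, while the adaptive one errs once and then recovers. A real system would need to generalize from feedback rather than memorize exact queries, which is the hard part this toy leaves out.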
Nataraj: I work in storage, and one thing that surprises people is that most storage systems still use HDDs, not SSDs. It underscores a point: at all the tech companies, we spent a lot of time making those systems so efficient that even though the hardware isn't cutting-edge, you can get a lot more efficiency from software, through software-defined storage and other techniques on the software side rather than the hardware side. In AI, because things are moving so rapidly, we haven't gotten to that stage of optimizing systems on top of the hardware layer. I think you're one of the examples of that shift slowly happening: optimizing systems on top of raw training. Until the last three years, we were getting benefits mostly from training, and we've seen only one or two form factors, primarily chat-defined products, come out of it. But there's a whole set of products that will come out of optimizing things, and that will yield a lot of value. You fall into that next era of products we'll see.
Sudip: Yeah, absolutely. The storage analogy is really good, and it's a sign of maturity in the landscape that we're now trying to focus on broader optimization of systems, as opposed to just one component. A lot of it is also driven by the fact that there's this big demand-supply gap in the industry right now. Because of all the agentic workloads, demand for AI inference has gone up by multiple orders of magnitude. On the hardware side, there's still a lot of supply constraint, which is driving up compute prices, and that supply constraint is going to take a while to resolve. It's not something that will go away in the next six months — it's probably optimistically 12 to 24 months, but conservatively it could take up to five years, because it starts right at the level of energy and acquiring land to build data centers. There are a lot of fundamental core infrastructure issues that need to be solved to bridge that demand-supply gap. And if you have that gap, you need to innovate at the system level to extract as much efficiency as possible from the capacity you have. So the gap is essentially creating an opportunity for a lot of innovation at the systemic efficiency level.
Nataraj: It reminds me that pretty much every new product goes through this cycle. You start with some foundational change directly tied to hardware. Even with the Apple iPhone, initially it was all about form-factor innovation, but after that it was about relentlessly improving the system. The full-stack system is what led to what we see today.
Sudip: Yeah, absolutely. The only difference is just how fast things are moving. In previous shifts, the changes happened over a few years, but now it's almost instant.
Nataraj: You had breathing space to take a step back. Now everything is immediate.
Sudip: Exactly. It's much more relentless now.
Nataraj: I think that's a good note to end the conversation. Sudip, thanks for coming on the show. One last question: a couple of years down the line, will we still be 10x-ing our GPU clusters for training, or do you see that dying down? What's your take on what the next three years look like?
Sudip: I think we've already seen a major shift from training to inference. Three-plus years ago, more than two-thirds of compute was going to training and one-third to inference. Over the last three years it has shifted, and now two-thirds is going to inference and one-third to training. That's a sign of maturity, because it means people are actually using those models much more. I'd expect that trend to accelerate. More broadly, I also expect compute to become increasingly decentralized. AI is going to become much more pervasive — it'll move onto devices and toward the edge. That will enable innovation at every level of the stack, but it also requires a lot of continuous learning, because if you have AI deployed on an edge device, you want it to continuously learn while it's deployed. In that sense, the gap between "this is training" and "this is inference" will also blur over time.
Nataraj: Thanks for coming on the show. I'm looking forward to what you guys launch next.
Sudip: Great. Thank you for having me.