Transcript: How Vast Data Built a $30B AI Data Platform | Jeff Denworth, Co-founder of Vast Data

Nataraj: Hello, everyone. Welcome to Startup Project. We're in a big AI super cycle. Post-AGI, we've seen the scaling of GPUs driving new AI labs to emerge. We've encountered different sets of bottlenecks — the GPU bottleneck, the power bottleneck, the H100 supply bottleneck. But I think it has created new opportunities for companies to come up with innovative solutions that support AI workloads. One of the fundamentals of AI workloads — or any workloads in the cloud — is having a great storage product. One of the companies that emerged in this process is Vast Data. Today we have Jeff Denworth from Vast Data. He's one of the co-founders, and we'll talk about what they're doing differently from other storage products, how they're supporting AI workloads, and more. They're backed by Nvidia, they recently announced a new funding round valued at an impressive $30 billion, and they have a deal with CoreWeave. With that, Jeff, welcome to the show.

Jeff: Nataraj, thank you for having me.

Nataraj: Jeff, let's talk about the origin story. You guys started in 2016. Storage in 2016, I remember, is not a super exciting story. So why start a storage company in 2016? What was the thesis?

Jeff: It was super unsexy. There was an investor at Andreessen Horowitz (I won't name him) who, a year or two prior, had taken a bath on a bunch of investments that Andreessen had made in storage. He traveled around the world saying that this market was totally dead. What he didn't appreciate was that most of the world was still building new derivatives of an architecture that came out in 2003, when Google introduced a new concept called the Google File System. That concept, where you take large data sets and spray them across a bunch of commodity servers, was very popular. It created what is today probably a hundred-billion-dollar industry of distributed systems, storage, databases, hyperconverged systems, backup systems, and everything in between. You even have companies today whose claim to fame is that the founders came from the Google File System team. So it was really impactful.

We saw three opportunities — or maybe one challenge and two opportunities — that we wanted to take a stab at solving. First, when we looked at the compromises introduced by what Google made, we realized a lot of them could be solved through the implementation of a new distributed systems architecture. The best example is scalability: transaction performance never really scaled as you built really large clusters. The most obvious representation of that is a classic Hadoop cluster — you can't really write real-time data into it. That's why you have to go build other systems, like Kafka-based processing. These things weren't really transactional. The bigger clusters get, the more they slow down due to internal communication between nodes. Most organizations outside of Google couldn't figure out a way to scale that architecture to exabyte scale.

Second, the world was still very much convinced that the only way to build capacity was with hard drive-based media. What we realized is that the only value hard drives had in an enterprise data center was low cost per capacity. If you could solve that one problem and build a very low cost all-flash system, two things happen: nobody needs hard drives anymore, and once your archive is built from flash, you don't need a faster tier of storage. From a first principles perspective: capacity equals performance when you get to flash. If you can afford to put all your data there, you don't need a classic storage hierarchy.

Then the challenge was: if you have this highly parallel all-flash system, what's the thing that's going to unlock this new concept? We spent a lot of time looking at the knock-on effects of deep learning: what would happen if it became commercially viable, and how it would change the relationship between computing and data. Something was already happening. That year, Google's AlphaGo beat a top professional at the game of Go, OpenAI was founded, and Jensen introduced the DGX server, basically regarded as one of the first AI computers. When you put those three things together, the world was starting to get its act together on how deep learning could scale. We realized there was a huge opportunity to solve the data problem and get ready for that scale with a new systems architecture.

When we started, there were probably 20 other new flash companies entering the market, all shipping new spins on old architectures. What we realized is that it's almost never the case that a new systems architecture gets invented in storage. We feel like we're picking up the work that Google laid down back in 2003 and carrying it forward. That has given us a ton of advantages and helped people really start to scale up their AI initiatives.

Nataraj: What were the initial customers, and what workloads were primarily being used before mainstream adoption happened?

Jeff: We were looking at what was happening at places like Meta, Google, Tesla, Uber, and Baidu. The reality is we weren't selling any of them at the time. We needed to get the first product to market. The first version was introduced as a product we called Universal Storage. With a very low cost parallel all-flash system, we realized the high-performance computing market looked prime for modernization. We started reaching out to two types of organizations that largely do similar types of processing. First was the life sciences market — people doing high-scale genomics, for example. First customers were organizations like the National Institutes of Health. The other market was quantitative and high-frequency trading-based hedge funds — organizations that would spend any amount of money for competitive advantage.

While we were waiting for the thesis to manifest — getting flash to the cost of disk — those customers were paying a small premium for flash because it made their compute a lot more efficient. People justified the expenditure on VAST on the back of: if this makes my pipeline two to three times faster, that's essentially like getting back half, or 65%, of my servers to use for other applications.
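The "faster pipeline frees up servers" argument can be written out as simple arithmetic. This is a minimal sketch, assuming a fixed workload and throughput that scales linearly with server count; the function name is illustrative and not something from the conversation.

```python
def servers_freed_fraction(speedup: float) -> float:
    """Fraction of servers freed when the same work finishes `speedup` times faster.

    Assumption: the workload is fixed and throughput scales linearly with
    the number of servers, so the work now needs only 1/speedup of them.
    """
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


for s in (2.0, 3.0):
    print(f"{s:.0f}x faster -> {servers_freed_fraction(s):.0%} of servers freed")
```

Under those assumptions, a 2x speedup frees half the servers and a 3x speedup frees about two-thirds, which is in line with the "half, or 65%" figure quoted above.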

Nataraj: At that point, were there other players with flash-based storage systems, such as NetApp or Dell EMC? Were they technically competitors?

Jeff: Back then — and today we kind of view the whole world as our competitor — the other unsexy decision we made was to build a system for on-premises deployment. I took a lot of time to decide to come to VAST just because I thought the whole world was going cloud. The reason we didn't start in cloud was because a lot of the concepts and technologies we bet on — containers, NVMe over fabrics, and very low-cost NVMe storage devices — just weren't available in cloud. Try as you'd like to get Amazon to ship a new hardware SKU just because you have a cool concept; it doesn't really work that way.

So with a few partners, we co-engineered reference architectures that we brought to market. We started to compete with two styles of companies. On one hand, VAST was highly parallel and designed for extreme levels of performance — that had us competing with parallel file system companies used in high-performance computing. On the flip side, we were very quickly building in enterprise data features that allowed us to compete with legacy enterprise products that nobody had ever thought of from a high-performance perspective. The same product that today powers five of the world's top supercomputers from a storage perspective is also being used as a backup appliance in a Fortune 100 insurance company.

The idea is: if you look far enough out and deep learning becomes commercially viable, all the world's data will need to be processed by GPUs. That's why we spent a lot of time building in enterprise features — so people could be ready when that moment hit.

Nataraj: In some sense, you were competing with NetApp and Dell because they were the big players.

Jeff: All of the legacy storage players for sure, as well as some of the older incumbent high-performance computing storage players.

Nataraj: Today, you've obviously partnered with CoreWeave. Are you primarily on-premises or in the cloud? What does the footprint look like?

Jeff: In 2022, ChatGPT hit. Right on the back of that, starting in 2023, all these new AI clouds were emerging — some coming from Bitcoin miners evolving their business. They were popping up very frequently in the back half of the year with new requirements for large-scale GPU deployments that needed things like multi-tenancy and enterprise data management features, because they didn't just want to serve AI labs. They also wanted bigger enterprises to use their platforms. If you think about CoreWeave, they sell to companies like Jane Street — a very serious trading business — and ServiceNow — a very serious enterprise software business. Once you get to that level, you can't just use legacy high-performance computing research technologies.

We had built all those features — the scale, the performance, the cost efficiency. That got us into the Neo cloud business, where we started selling to the biggest players in the space. CoreWeave made an early bet on VAST, and it's been a great partnership. We sell to companies like Lambda, Crusoe, and Nscale. Around the world, we probably sell to about 60 AI clouds. I'd estimate roughly 90% of on-premises and Neo cloud GPU deployments — lumped together — are powered by VAST.

Most of that is Neo cloud-based or on-premises for enterprises building their own infrastructure. We're now also starting to work with the biggest cloud players to deploy our software in hyperscale platforms. We've made announcements with Amazon, Google, and Microsoft, and we're slowly rolling customers into those environments as the platform matures. The cool thing is customers don't need to choose — they can federate everything together into one unified data management system. Jane Street spoke at GTC two years ago and said they have VAST systems in their data center, VAST systems at CoreWeave, and a unified environment stitched together by VAST that helps them manage data across both destinations.

Nataraj: So you're present in all three clouds. That's a similar strategy other storage companies have taken — being part of hyperscalers is now a must for companies like VAST.

Jeff: I agree. There's a whole spectrum of partnership types you can make with them. We've announced a very strategic relationship with Microsoft that we're ramping up now. There'll be another one announced over the next couple of weeks that we think is equally strategic. When someone mentions VAST's valuation, I think of that as a reflection of market traction and product-market fit. We've done a great job showing people that we're the right platform to scale up AI training and inference investments. That memo has been received by the bigger cloud players, who are now starting to take us very seriously as they build out the next generation of their infrastructure.

Nataraj: What is the primary competitive differentiator, do you think? Because when someone like Crusoe or Lambda is evaluating you versus DDN, Hammerspace, or some of the other players out there...

Jeff: First, as I mentioned, we got into the market with a very strong focus on multi-tenancy and enterprise features. For some of those other companies, such as DDN, where I used to work, the pedigree was really about enabling very high performance, with the feature compromises and complexity that have historically come with those platforms. Great for research computing, not necessarily great if that's the back end of your business.

What you get from VAST is a system with all the enterprise features you'd expect from a traditional IT platform, with the added benefit of the scale and performance you need for AI workloads. But the more interesting part of our story is that in 2023, we made a move that nobody in our industry has ever attempted — and it's now paying a lot of dividends. We moved into the tabular data space. Now the competitive landscape really gets expanded: you can use VAST for everything from event brokerage to large-scale SQL-based analytics to vector database search — all from the same platform you originally started using as a file, object, and block storage system.

The relevance to VAST's AI customers: for reinforcement learning, I need to capture feedback from models using Kafka streams. As I contextualize data, I need very scalable and low-cost vector search services for RAG routines. For analytics, I need to analyze that information for model evaluation. All of that is already in the platform. If you think about the larger players in the AI data space, Snowflake and Databricks are used today for data preparation and model evaluation by some of the larger AI model builders. VAST now has a platform that covers both of those problems, for training and for inference, in the same environments and the same Neo clouds where you already train and infer. We're in a different zip code than where those parallel file systems are: now you've got a product competitive with the best streaming platforms, the best vector databases, and the best analytics systems on the market.

Earlier last year, we took the covers off a compute engine that also runs in the system. Think of it as a serverless compute platform coupled with workflow orchestration tools and eventing infrastructure that triggers pipelines when something happens in the system. That looks like a combination of Kafka, Apache Airflow, and AWS Lambda — in the same system that also has all your database infrastructure, streaming infrastructure, and unstructured data storage all the way down to bare metal. It's basically a full computing stack you can deploy wherever you want for full-stack AI inference. I would argue there is no natural competitor to what we do. The only place you can find the equivalent of this capability is in the cloud — but in a large cloud, you'd have 10 to 20 different services you'd have to put together to get the same thing VAST gives you with one codebase.

Nataraj: And if you're building in your own cloud, you'd have to reinvent the whole thing yourself.

Jeff: Right. And time to market is absolutely critical. These customers tend to really like our value proposition. We had a physical AI end customer just yesterday who needed a vector database that supports half a trillion vectors. It was just sitting there already in the same space where he's doing his training. It makes it really easy.

Nataraj: Does VAST implement its own vector database or use open source?

Jeff: We try to avoid integration as much as possible and build wherever we find big opportunities to save customers money and scale applications. The whole vector database market is built on approaches that require memory. We said: we want to build something that doesn't require memory. A lot of our early focus on using flash as network-attached persistent memory took us to a place where we could build much more cost-efficient vector database infrastructure than was ever thought possible.

The way VAST systems work: you have cores that handle all processing — think of that as the logic of the system — and then over a network, a shared set of SSDs that all the cores can see in parallel. Unlike commodity server architectures where each node owns a portion of the system and must communicate with others to coordinate a transaction, in our case all the cores see the same global volume and none have to talk to each other. They can all see the transactional data structures in parallel.

For vector databases, not only can I build something you can search in constant time regardless of the size of the vector space, getting through billions to trillions of vectors in less than a second, which is very unusual without memory-based indices, but because the system has no East-West traffic, I can also write into all the vector database servers in parallel. That makes it very nice: if I'm putting a RAG pipeline at the front of my business, VAST can keep up with it. What am I vectorizing? Probably data that lives in my object store or file system. If I combine those into one unified data management platform, there are no inconsistencies between my data and its vector representation, because all updates are atomic. When permissions change on my objects, those permissions immediately propagate into the vector database, so you never have a synchronization problem. Synchronization problems are common with SaaS-based document search platforms that scan data from other environments and try to keep pace with changes. That doesn't work for a CISO. With VAST, it's all completely atomic.
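The point about permissions never drifting out of sync with the vector index can be illustrated with a toy store that keeps each object's embedding and access list in the same record. This is only a sketch under stated assumptions: the class and method names are hypothetical, the search is a brute-force linear scan rather than the constant-time search described above, and nothing here reflects how VAST is actually implemented.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    """One object: its embedding and its access list live in the same record."""
    vector: tuple[float, ...]
    allowed_users: set[str] = field(default_factory=set)


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class UnifiedStore:
    """Hypothetical store where search reads permissions from the record itself."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def put(self, key: str, vector, allowed_users) -> None:
        self._records[key] = Record(tuple(vector), set(allowed_users))

    def revoke(self, key: str, user: str) -> None:
        # There is no separate index to update: the next search sees this change.
        self._records[key].allowed_users.discard(user)

    def search(self, user: str, query, k: int = 3) -> list[str]:
        # Only records the user may read are scored at all.
        scored = [
            (cosine(query, rec.vector), key)
            for key, rec in self._records.items()
            if user in rec.allowed_users
        ]
        scored.sort(reverse=True)
        return [key for _, key in scored[:k]]


store = UnifiedStore()
store.put("contract.pdf", [1.0, 0.0], {"alice", "bob"})
store.put("memo.txt", [0.9, 0.1], {"alice"})
before = store.search("bob", [1.0, 0.0])
store.revoke("contract.pdf", "bob")
after = store.search("bob", [1.0, 0.0])
print(before, after)
```

Because the access list and the vector are one record, there is no window in which a revoked user can still retrieve the document through search. A system that copies data into a separate index would need an extra synchronization step to get the same guarantee.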

Nataraj: Can you talk a little about the business model — when you have a Neo cloud customer versus a direct customer? With AWS FSx or S3, the pricing is straightforward, but how does it work when your software is behind Crusoe or CoreWeave?

Jeff: The first thing: we don't want customers to have to choose by protocol. It's a fully multi-protocol system. You could have a unified data model and see that data through an object presentation or a file system presentation. Amazon just announced a new file service — I think they call it S3 Files — but we've been in that business for eight years now. Customers never had to choose file or object; they just get unified access. That translates to an advantageous pricing model: you get a protocol-independent lens into a large capacity store. We charge by gigabyte per month. That becomes essentially a revenue share with our cloud computing partners, who offer it to their end customers for more than what we charge them.

Nataraj: I'm thinking out loud here — different clouds have their own offerings. If CoreWeave wants a different pricing model, they'd have to bring you along for that, right? Are you telling all the clouds "this is how we operate," or is it flexible?

Jeff: The PayGo model — we first started working on that with Lambda. They were one of the first to really get into multi-tenant, short-term contracts. They pay based on the infrastructure they consume. We have other customers that buy based on long-term contracts, and the same principles apply — you just have less tenancy in those environments.

Nataraj: Talk to me a little about the size of the opportunity. One of the things we've realized with AI is that a lot of data we were not able to monetize — even though we said "data is the new oil" — is now unlocking.

Jeff: Data was the new oil for model builders to start with. If you have a large enough training data set, you can apply that with GPUs to manufacture intelligence. But then the question becomes: how do you monetize that? If you think about an AI agent coupled with a model to accomplish a task — agents get memories, which is context, they get MCPs they can use as data resources, and they can work with other agents to accomplish tasks. We view all of that as a data opportunity. Even agents talking to agents needs to be captured and logged so that AI becomes explainable.

Our perspective is that most agents can't accomplish rational tasks without access to a robust enterprise data set. In the early days it was: let's build a supercomputer with a parallel file system and do R&D to get things working. Now it's: we need very real-time, highly robust and redundant, fully governed data infrastructure that can withstand high-scale agency. The foundational thesis of VAST is now starting to come into perspective — the world will redefine its relationship with data once you have deep learning applied at a commercial level.

We've got customers now eclipsing a million agents within their environment, with some going toward a hundred million right now. Imagine the computational pressure when you have agents upon agents — agent swarms — all working on a common data set. Inference puts data at the center of the computing paradigm, back the way it was during the big data era, coupled with the fact that you now also have LLMs for tabular data. An agent doesn't just need information in an object storage system — agents need whatever they need to accomplish their tasks. If you have a supply chain planning agent, it needs access to large-scale analytical infrastructure. Whereas AI storage over the last five years was focused on unstructured data, going forward you'll have a mix of structured and unstructured data that's important to feed to these machines.

Nataraj: Yeah, and even with unstructured data, the majority of it still sits in on-premises systems.

Jeff: S3 is a pretty popular service.

Nataraj: True, but estimates — Gartner has a stat where 85% of the world's data is still in NAS systems. It's a crazy statistic.

Jeff: Could be. There are some interesting generators of that. One classic example of a workload that hasn't gone to cloud — and is a huge data generator — is video surveillance cameras. Those now also need agents applied to them for things like video search, summarization, and action-taking on that data. We're working with a few of the world's largest government organizations to activate that, and oftentimes that data can't go to the cloud.

Nataraj: One curious thing: because of VAST's architecture, does it unlock new things technically — like maybe longer context windows during the inference layer?

Jeff: This is a very active dialogue within the company right now. We have researchers who work specifically on context for disaggregated inference. Key-value caching is the approach the market has taken to store context in persistent media so you don't have to recalculate it every time you have a long-running or multi-turn inference operation. Jensen, as of January, started making more forward-looking announcements about the work NVIDIA is doing at the compute level to integrate storage more tightly into these systems. The concept is now called CMX, Context Memory Extension. For every 1,000 GPUs you deploy, you need something like 15 petabytes of storage, very tightly coupled to them over a high-performance network.
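The idea of storing context so it is not recalculated on every turn can be sketched with a toy prefix cache. This is an illustration only: a real key-value cache holds per-layer attention tensors, whereas here the "state" is a stand-in list of integers, and the class name and counters are hypothetical.

```python
import hashlib


class PrefixContextCache:
    """Toy context cache: maps a token prefix to its already-computed state."""

    def __init__(self) -> None:
        self._store: dict[str, list[int]] = {}
        self.computed_tokens = 0  # how many tokens were actually processed

    @staticmethod
    def _key(tokens: list[str]) -> str:
        return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

    def _compute(self, token: str) -> int:
        # Stand-in for the expensive per-token key/value computation.
        self.computed_tokens += 1
        return len(token)

    def state_for(self, tokens: list[str]) -> list[int]:
        # Reuse the longest prefix that is already cached.
        for cut in range(len(tokens), 0, -1):
            cached = self._store.get(self._key(tokens[:cut]))
            if cached is not None:
                state = list(cached)
                break
        else:
            cut, state = 0, []
        # Compute only the tokens that come after the cached prefix.
        for token in tokens[cut:]:
            state.append(self._compute(token))
        self._store[self._key(tokens)] = list(state)
        return state


cache = PrefixContextCache()
turn1 = ["system", "hello", "world"]
turn2 = turn1 + ["and", "more"]
cache.state_for(turn1)
state2 = cache.state_for(turn2)
print(cache.computed_tokens)
```

The second turn has five tokens but only two are computed, because the first three come from the cache. The longer and more numerous the conversations, the more stored context there is, which is where the capacity requirement per GPU comes from.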

The question becomes: where do VAST's advantages add value to how people think about context? One thing we did to remedy the price differential of flash was build a new form of global compression we call similarity. Using the same math that powers vector databases and search engines — fuzzy math and distance calculation — we can determine when two blocks look similar to each other. When we find they do, we compress them against each other. We do this at block level at global scale across an exabyte-scale cluster. Our customers typically save 50 to 70% on their flash investments because of this aggressive form of data reduction.

That also applies to key-value cache data, where you can get a 50 to 70% reduction on that payload. People really like this because they want to store a lot of context. We're working with some AI labs that see a lot of value in keeping potentially infinite context for the models they deploy.
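Compressing one block against a similar one can be sketched with a preset dictionary in Python's standard `zlib` module. This is a stand-in technique, not VAST's algorithm: the similarity-detection step is skipped entirely (the two blocks are assumed to be already known as similar), and the block size and helper names are hypothetical.

```python
import random
import zlib


def compress_alone(block: bytes) -> bytes:
    """Compress a block with no knowledge of any other block."""
    return zlib.compress(block, 9)


def compress_against(block: bytes, reference: bytes) -> bytes:
    """Compress a block using a similar reference block as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(block) + comp.flush()


def decompress_against(payload: bytes, reference: bytes) -> bytes:
    """Reverse compress_against; the same reference block must be supplied."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(payload) + decomp.flush()


# A 4 KiB block of incompressible data, and a near-copy with three bytes changed.
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))
changed = bytearray(reference)
for offset in (100, 2000, 4000):
    changed[offset] ^= 0xFF
similar = bytes(changed)

alone = compress_alone(similar)
delta = compress_against(similar, reference)
print(len(similar), len(alone), len(delta))
```

On its own the near-copy barely compresses, since its bytes are random. Compressed against the reference it shrinks to a small fraction of its size, because almost every byte can be expressed as a back-reference into the reference block. The trade-off is that reading the block back requires the reference block to be available.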

The second, maybe more interesting thing: a lot of key-value cache research assumes the data is throwaway. We can show that the data is actually perceptible: you can get a semantic understanding of key-value data, whether it's a prompt or a response, and that could be used for malicious purposes by an attacker. That means there's a mandate to protect this data and govern how it's accessed. All the enterprise features we've been building into the platform now come back into the equation when you think about handling long-context data that you can store affordably. Where does it all lead? I have no clue, but it's an interesting space people should really be watching. We're watching the construction of a few systems with larger data estates than anything I've ever seen in AI, and I've seen the biggest of the big.

Nataraj: We've seen the shortage in HDDs, and you're all-flash. Are you able to get what you need from the supply chain, or is there a huge pipeline you're waiting behind?

Jeff: We're ultimately a software company. We don't actually sell hardware. We work with server OEM and ODM supply chain partners that take VAST-based appliances to market. Having said that, if these platforms don't make it out to market, we can't charge our customers — so we care very deeply about the flash supply chain. And it is arguably as bad as, if not worse than, the hard drive market right now.

We did a situational assessment in January, just as things were starting to get bad. What surprised us was that our business has been growing exponentially, and the thing about exponential curves is you don't realize what's going on until it hits you hard, because what you did yesterday wasn't nearly as big as what you're doing today. What we didn't realize is that VAST now represents a very significant component of total enterprise storage deployment. We're talking tens of exabytes delivered per year from a software perspective, when there are only about 100 exabytes of enterprise flash being made across the world. We now account for a double-digit percentage of that market, and our business is growing 2x to 3x annually.

VAST may be contributing to this supply chain problem — or our customers are. On the flip side, our data reduction means our customers need half as much flash as they would with alternate approaches, so we're also acting as a pressure-release valve in the market. Going forward, it's going to be murky until 2028. People are going to have a hard time getting access to flash. We're working with all the major manufacturers, most of whom say they want to bet on VAST because it's very obviously the future. So we're in an advantageous position with great partnerships across the biggest companies.

One thing we're doing while they get their supply chain in order: we're going to customers and taking the flash they bought over the last few years, re-platforming it into VAST systems. They take a server bought for one application, run VAST software on it, and get two to three times the utilization out of that resource. We've got customers giving us half an exabyte of servers saying "please take this and make it bigger" — as a result, they won't need to buy anything for the foreseeable 18 months. That just kicks the can down the road, and it's happening more and more. So we're replacing data lake infrastructure, streaming infrastructure, and object stores — and customers get two to three times better hardware utilization by using our software.

Nataraj: We're almost out of time. I want to ask one last question — talk a little about profitability. I've seen mentioned that VAST is profitable. Talk about the economics.

Jeff: We have a very unique business from a commercial perspective. Typically when you see companies growing as fast as VAST, they're burning mountains of venture capital. In the AI space, there aren't a lot of profitable companies — in part because they're spending so much money on GPU infrastructure. VAST is a software company. Our gross margins are over 90% and we're growing fast.

What customers tend to do is sign multi-year software subscription contracts and prepay upfront. Two things happened: about three years ago, we became cashflow positive — we stopped burning cash. On an annualized basis, the company now generates hundreds of millions of dollars in free cash. Then last year, we became accounting-profitable. If we were a public company, you'd say: this is the first profitable, cashflow-positive company growing at this rate that we've ever seen.

There's a measure for this in the market called the Rule of 40, which says if you add up your growth percentage plus your free cash flow percentage and get to 40% or more, you've got a pretty good business. Our Rule of 40 was actually 228 — which means we printed the best number any investor we've talked to has ever seen.
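The Rule of 40 as described above is a one-line sum. The sketch below uses made-up inputs purely to show the calculation; the individual growth and free-cash-flow figures behind VAST's quoted score of 228 are not given in the conversation, so none are assumed here.

```python
def rule_of_40_score(growth_pct: float, fcf_margin_pct: float) -> float:
    """Growth rate plus free cash flow margin, both in percentage points."""
    return growth_pct + fcf_margin_pct


def passes_rule_of_40(score: float, threshold: float = 40.0) -> bool:
    """A score at or above the threshold is conventionally considered healthy."""
    return score >= threshold


# Hypothetical company: 30% annual growth and a 15% free cash flow margin.
example = rule_of_40_score(growth_pct=30.0, fcf_margin_pct=15.0)
print(example, passes_rule_of_40(example))

# The score quoted in the conversation, checked against the same threshold.
print(passes_rule_of_40(228.0))
```

The hypothetical company scores 45 and clears the bar by a small margin. A score of 228 is more than five times the threshold.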

Nataraj: And that's how you raise a billion dollars.

Jeff: Well, we didn't raise a billion. A billion was invested — not all of that went to the company. A lot of it went to early investors. The company raised just enough to publish that valuation, because as I mentioned, we don't need cash. No sense diluting ourselves.

Nataraj: That's a great note to end on. Thanks Jeff, thanks for coming on the show and sharing all about Vast Data. I'm super excited about the space and what you'll accomplish in the next couple of years. I'll be keeping an eye out.

Jeff: Thank you for having me.