• Activeloop raised $20M to solve AI's hidden bottleneck: unstructured data management.
• Here's the Princeton origin story and why the data layer matters as much as the model.

The Princeton Lab Problem That Became a $20 Million AI Company

Before Davit Buniatyan founded Activeloop, he was doing neuroscience research at Princeton — working with petabyte-scale brain imaging datasets. The science was compelling. The data infrastructure was a disaster.

Managing large, complex datasets for AI research meant cobbling together incompatible tools, writing brittle custom pipelines, and watching AI engineers spend more time wrestling with data plumbing than building models. When Buniatyan looked at the broader AI industry in 2018, he saw the same problem everywhere — and almost nobody building a dedicated solution for it.

"When we started building Activeloop in 2018, our focus was on solving a critical challenge that many had yet to recognize: managing large, complex, unstructured datasets for AI," Buniatyan wrote. The company, incorporated as Snark AI Inc. and backed by Y Combinator in the Summer 2018 batch, spent years building what it calls the Database for AI while the rest of the industry debated model architectures.

In March 2024 that bet paid off: Activeloop closed an $11 million Series A led by Streamlined Ventures, bringing its total funding to approximately $20 million. Months later it was named a Gartner Cool Vendor in Data Management — independent validation from the enterprise analyst community that the problem is real and the solution is credible.

Why AI Has a Data Infrastructure Problem

The received wisdom in AI coverage focuses almost entirely on models: which lab released what, which benchmark was broken, which company raised the biggest round. The data layer — how training data is stored, versioned, queried, and streamed to models — receives far less attention, despite being the bottleneck that determines whether an enterprise AI project succeeds or stalls.

The core issue is that most enterprise data is unstructured. Images, X-rays, videos, audio recordings, documents, and sensor readings account for the vast majority of what companies actually have, and none of it fits cleanly into traditional SQL or even modern NoSQL databases. When an AI team needs to train or fine-tune a model on this data, it typically faces a choice between painful custom engineering and physically copying data to GPU compute, a process that for large datasets can mean hours of idle GPU time before a single training step begins.

"Instead, Activeloop enables companies to hand off just enough data to compute for the GPU to be fully utilized," Buniatyan explained. The efficiency gain is the entire value proposition: stop paying for GPU time spent waiting for data, and start paying only for GPU time spent computing.

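The idea of handing off "just enough data" can be illustrated with a small prefetching sketch. This is not Activeloop's implementation; it only assumes the general technique of a bounded buffer that a background thread keeps filled while the consumer works. The names stream_batches, fake_fetch, and train are hypothetical stand-ins.

```python
import queue
import threading


def stream_batches(fetch_chunk, num_chunks, prefetch=2):
    """Yield chunks in order while a background thread fetches the next ones."""
    # Bounded buffer: at most `prefetch` chunks wait in memory at any time.
    buffer = queue.Queue(maxsize=prefetch)
    sentinel = object()

    def producer():
        for i in range(num_chunks):
            buffer.put(fetch_chunk(i))  # blocks while the buffer is full
        buffer.put(sentinel)  # signals that no more chunks will arrive

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def fake_fetch(i):
    # Hypothetical stand-in for reading one chunk from remote storage.
    return [i * 4 + j for j in range(4)]


def train(batches):
    # Hypothetical stand-in for a training loop: it only counts samples.
    seen = 0
    for batch in batches:
        seen += len(batch)
    return seen


total = train(stream_batches(fake_fetch, num_chunks=5))
print(total)
```

The consumer starts as soon as the first chunk arrives rather than after the whole dataset has been copied, which is the contrast the paragraph above draws.
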
What Deep Lake Actually Does

Deep Lake is Activeloop's core product — an open-source, multimodal database built specifically for AI workloads. Where a traditional database stores rows and columns, Deep Lake stores tensors: the mathematical objects that neural networks actually consume. Audio files, video frames, images, text documents, and vector embeddings can all live in the same database, queryable through a Tensor Query Language that Activeloop built to mirror how AI teams actually think about their data.
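
As a rough mental model only, the difference from rows and columns can be sketched with a toy store in which each named column holds samples of one kind and a query filters across them. This is not Deep Lake's API or its Tensor Query Language; ToyDataset, TensorColumn, and where are hypothetical names, and plain Python lists stand in for real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class TensorColumn:
    """One named column of same-kind samples (hypothetical, not Deep Lake's API)."""
    name: str
    samples: list = field(default_factory=list)


class ToyDataset:
    def __init__(self, *column_names):
        self.columns = {n: TensorColumn(n) for n in column_names}

    def append(self, **row):
        # Every row must supply one sample per column so indices stay aligned.
        if set(row) != set(self.columns):
            raise ValueError("row must supply every column")
        for name, value in row.items():
            self.columns[name].samples.append(value)

    def __len__(self):
        return len(next(iter(self.columns.values())).samples)

    def where(self, predicate):
        """Return the indices of rows for which predicate(row) is true."""
        rows = (
            {n: c.samples[i] for n, c in self.columns.items()}
            for i in range(len(self))
        )
        return [i for i, row in enumerate(rows) if predicate(row)]


ds = ToyDataset("image", "label", "embedding")
ds.append(image=[[0, 255], [255, 0]], label="fracture", embedding=[0.1, 0.9])
ds.append(image=[[12, 12], [12, 12]], label="normal", embedding=[0.8, 0.2])
ds.append(image=[[90, 10], [10, 90]], label="fracture", embedding=[0.2, 0.7])

matches = ds.where(lambda row: row["label"] == "fracture")
print(matches)
```

Pixel arrays, labels, and embeddings sit side by side under one index here, so a single filter can select across modalities.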

Activeloop says Deep Lake improves knowledge retrieval accuracy with LLMs by 22.5% on average, and that its embedded, AI-native architecture allows on-premises setup in a few lines of code while meeting security standards such as SOC 2 Type II. That last point, on-premises deployment with enterprise-grade security, is what opened the door to regulated industries. A hospital cannot send patient imaging data to a cloud vendor without significant compliance overhead. Deep Lake runs in the hospital's own environment.

Bayer Radiology, a unit within the pharmaceutical giant, used Deep Lake to streamline AI data preparation in radiology, enabling "chat with X-rays" capability. Steffen Vogler, a Senior Imaging Technology Scientist at Bayer, noted that AI developers previously spent 50% of their time making data AI-ready. Deep Lake compressed that overhead dramatically, giving the team back time they were previously spending on plumbing rather than science.

The platform claims to boost AI engineering team productivity by up to 5x and reduce costs by up to 75% compared to market offerings. The productivity figure is the more believable one — the cost claim depends heavily on what you're comparing against.

From Dataset Library to Agentic Infrastructure

What has made Activeloop worth watching beyond its initial dataset management thesis is how cleanly its roadmap maps onto the industry's shift toward AI agents.

In October 2024, the team released Deep Lake 4.0, billed as the fastest multimodal AI search on data lakes. In February 2025, they launched Deep Research for Multi-Modal Data. By May 2025, they had released Activeloop-L0: Agentic Reasoning on Your Multimodal Data. Each release moves further from "database for AI training" toward "memory and knowledge layer for AI agents."

The most recent product, Hivemind, is the most telling. Hivemind is a continual learning layer for coding agents — working across Claude Code, Codex, Cursor, and others — that stores all your data on your own cloud. It solves a specific problem: when a senior engineer's agent debugs a tricky bug, that knowledge evaporates when the session ends. A junior engineer hitting the same bug the next day starts from zero. Hivemind gives coding agents persistent organizational memory — the difference between a tool and a colleague.
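
The senior and junior engineer scenario can be made concrete with a toy example. This is a minimal sketch of persistent shared memory in general, not Hivemind's design: SharedMemory, remember, and recall are hypothetical names, the store is a local JSON file, and matching is naive keyword overlap.

```python
import json
import tempfile
from pathlib import Path


class SharedMemory:
    """Hypothetical file-backed note store that outlives a single agent session."""

    def __init__(self, path):
        self.path = Path(path)
        # Load whatever earlier sessions wrote, if anything.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, problem, fix):
        self.notes.append({"problem": problem, "fix": fix})
        self.path.write_text(json.dumps(self.notes))  # persist immediately

    def recall(self, query):
        # Naive retrieval: any shared word between the query and a stored problem.
        words = set(query.lower().split())
        return [
            n["fix"]
            for n in self.notes
            if words & set(n["problem"].lower().split())
        ]


store = Path(tempfile.mkdtemp()) / "memory.json"

senior = SharedMemory(store)
senior.remember("timeout in payment webhook retries", "raise the retry backoff cap")

junior = SharedMemory(store)  # a fresh session, e.g. the next day
fixes = junior.recall("webhook timeout")
print(fixes)
```

The second session starts with the first session's notes already loaded, which is the behavior the paragraph describes; a real system would need far better retrieval than keyword overlap.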

This is a genuine product evolution, not a pivot. The same core infrastructure that stores and retrieves multimodal datasets for model training is now being applied to make AI agents less amnesiac.

Bottom Line

Activeloop operates with approximately 15 employees, generates an estimated $4.5 million in annual revenue, and carries a reported valuation of $14.4 million. For a team that size, the revenue figure works out to roughly $300,000 per employee, which points to strong capital efficiency. The deeper signal is strategic positioning: as AI moves from models to agents, the winners in the data infrastructure layer will be the companies that understood the data problem before most of the industry did. Activeloop started building in 2018, when the problem was not yet widely recognized. That head start of six-plus years is now a moat.


Edited by Nabarun