Research That Ships
Most AI research is designed to be cited. Not deployed.
There's a pattern: build a model, beat a benchmark, publish the paper, move on. The benchmark gets more competitive. The models get bigger. The gap between what's published and what's useful stays the same.
I started On Ground Labs because I got tired of watching this cycle.
The Benchmark Problem
AI benchmarks measure what's easy to measure, not what matters. A document AI system can score 95% on field extraction accuracy while systematically confusing net pay with gross pay on every paycheck it reads. The benchmark sees a correct extraction. The mortgage underwriter using it sees a catastrophic error.
An AI memory system can perfectly recall that you're allergic to peanuts. But does it change how it behaves because it knows you? Does it adapt its tone over time? Does it know when not to bring something up? No benchmark tests for this. They test recall. They don't test understanding.
We're building evaluation frameworks that measure what actually matters for deployment — not just whether a model can extract text, but whether it understands what that text means. Not just whether a system remembers facts, but whether it acts differently because of what it knows.
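To make the distinction concrete, here's a minimal sketch of the difference. Everything in it is illustrative, not our actual framework: the field names, the tolerance, and the checks themselves are assumptions about what a paycheck-extraction eval might look like. The point is that an exact-match metric only asks "did the strings come out right?", while a deployment-oriented check asks "do these values make sense together?"

```python
def semantic_check(extracted: dict) -> list[str]:
    """Consistency checks a surface-level accuracy metric never runs.

    Returns a list of human-readable errors; empty means the fields
    are at least mutually plausible. Field names are hypothetical.
    """
    errors = []
    net = float(extracted["net_pay"])
    gross = float(extracted["gross_pay"])
    deductions = float(extracted.get("total_deductions", 0))

    # Net pay can never exceed gross pay. A model that swapped the two
    # fields produces values that still "look like" correct extractions.
    if net > gross:
        errors.append("net pay exceeds gross pay")

    # If deductions are present, the arithmetic has to close.
    if deductions and abs((gross - deductions) - net) > 0.01:
        errors.append("gross minus deductions does not equal net")

    return errors


# A document where the model swapped net and gross pay: every field is
# a well-formed number, so an extraction benchmark sees nothing wrong.
swapped = {"net_pay": "5200.00", "gross_pay": "4100.00"}
print(semantic_check(swapped))  # ['net pay exceeds gross pay']
```

The same idea generalizes: a deployment eval is a set of domain invariants the output must satisfy, not a string comparison against a gold label.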
The Cost Question
Here's a number that surprises people: the hard cost cap for v0.1 of any benchmark we build is $200.
Not because we're cheap. Because if your evaluation framework needs $10,000 in compute to run, nobody outside a Big Tech lab will ever use it. And if only Big Tech labs can evaluate models, only Big Tech labs will set the standard for what "good" means.
We default to open-source models and free tiers. We run experiments on Colab. We design benchmarks that a graduate student at a tier-3 Indian university can reproduce on their laptop.
That's not a constraint. That's a feature.
What Grounded Research Looks Like
23 researchers. 14 projects. 4 papers in conference review. 3 patent domains filed.
The projects span three pillars. Agentic systems that operate in messy real-world environments — enterprise data spread across Slack, CRMs, contracts, and billing systems where no single source tells the truth. Evaluation frameworks that catch the failures current benchmarks miss. And model training research on making small models smarter instead of making big models bigger.
Every project follows the same principle: if it can't be deployed, it doesn't count. A paper that only works on a curated dataset isn't research. It's a demo.
The Alternative to Publish or Perish
Academic research is stuck in a loop. Publish or perish means optimizing for paper count, not impact. Researchers chase incremental improvements on established benchmarks because that's what gets accepted. Nobody gets tenure for building something useful that doesn't beat a state-of-the-art number.
We're trying a different approach. Our researchers are building things they can own and extend. Open datasets rooted in real community needs, not synthetic collections designed to make models look good. Benchmarks that practitioners actually use, not ones that only exist in conference proceedings.
The goal isn't to publish more papers. It's to build the evaluation and deployment infrastructure that makes AI actually work outside a lab.
The Bet
The best AI research won't come from the biggest labs. It'll come from the ones closest to real problems. The ones that build for ₹15,000 devices, not ₹15,00,000 servers. The ones that measure what matters, not what's easy.
That's what "On Ground" means. Applied, field-tested, close to life. The opposite of "in the cloud."
I expanded on the small-models thesis at Cypher 2025 and in a deeper technical piece later that year. The book we're writing — Inference Intuition — is part of the same effort: making this knowledge accessible, not gated.