Why GPT-5 Is a Step, Not a Leap
For months, the industry held its breath for GPT-5. The narrative from Silicon Valley was clear: this is the one that gets us to AGI. A model that doesn't just predict the next word, but actually thinks.
GPT-5 shipped. It's faster. It's cheaper. It's better at coding. And it is not AGI. Not close. The gap between what benchmarks say and what these models actually do in production has never been wider. Someone needs to say it plainly, so let me.
The Reasoning Problem
I've been following the research closely, and a paper from ASU earlier this year crystallized something I'd been seeing in practice. They took standard logic problems — the kind LLMs now "solve" impressively — and changed a single irrelevant variable. Something a 10-year-old would ignore without thinking.
The models collapsed. Not degraded slightly. Collapsed.
This tells you something fundamental about what's happening under the hood. These models aren't reasoning from first principles. They're navigating a high-dimensional map of patterns. When the map matches their training data, they look brilliant. When it shifts even slightly, they get lost.
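The kind of experiment described above can be sketched in a few lines. This is an illustrative probe, not the ASU paper's actual protocol: hold the problem fixed, inject one irrelevant detail, and check whether the model's answer survives. The `model` interface here is an assumption for the sketch.

```python
# A toy robustness probe: same arithmetic problem, with one
# irrelevant sentence injected. A genuine reasoner's answer
# should not change; a pattern-matcher's often does.

BASE = ("Alice has {n} apples. She gives {k} to Bob. "
        "How many apples does Alice have left?")

PERTURBED = ("Alice has {n} apples. She bought them on a Tuesday. "
             "She gives {k} to Bob. "
             "How many apples does Alice have left?")

def probe_pairs(cases):
    """Yield (original, perturbed, ground_truth) triples."""
    for n, k in cases:
        yield (BASE.format(n=n, k=k),
               PERTURBED.format(n=n, k=k),
               n - k)

def robustness(model, cases):
    """Fraction of cases where the model's answer both survives
    the irrelevant edit and matches the ground truth."""
    triples = list(probe_pairs(cases))
    ok = sum(model(a) == model(b) == truth for a, b, truth in triples)
    return ok / len(triples)
```

A model that actually extracts the quantities and subtracts scores 1.0 here; one whose answer depends on surface features of the prompt does not. That gap is the whole finding.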
I say this often: AI doesn't have a mental model of the world. It has a statistical model of language. Those are very different things. One can generalize. The other can only interpolate.
Benchmarks Don't Mean What You Think
In 2025, we saw models scoring 90%+ on bar exams, medical licensing tests, coding interviews. The headlines wrote themselves: "AI passes the bar!" "AI beats 99% of doctors!"
Two problems with this.
First, training data contamination. Many of these exams — or problems structurally identical to them — are already in the training corpus. The model isn't passing the test. It's remembering the answer key. That's not intelligence. That's memorization with extra steps.
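Contamination is also measurable, at least crudely. A minimal sketch, assuming nothing beyond plain string handling: flag an eval item whose word n-grams overlap heavily with a training document. Real contamination audits are far more sophisticated, but the principle is this.

```python
# Toy contamination check (illustrative only): high verbatim
# n-gram overlap between an eval item and a training document
# suggests the model may be recalling, not solving.

def ngrams(text: str, n: int = 5) -> set:
    """All word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(eval_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the eval item's n-grams appearing verbatim
    in the corpus document. Near 1.0 means the 'test' was
    effectively in the answer key."""
    e = ngrams(eval_item, n)
    if not e:
        return 0.0
    return len(e & ngrams(corpus_doc, n)) / len(e)
```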
Second, the production gap. Passing a coding interview is one thing. Maintaining a million-line codebase with 50 business-specific edge cases, a legacy database nobody fully understands, and compliance requirements that change quarterly — that's the actual job. No model does this today. Not even close.
When an engineering lead reaches out to me asking "should I trust GPT-5 to work on our core system?" — my answer is always the same. For isolated, well-defined tasks? Yes. For anything that requires understanding the full system? You're the understanding. The model is the tool.
Coding Is the Exception, Not the Rule
I'll be honest about where AI genuinely delivers: code. GPT-5 and its peers write boilerplate, catch syntax errors, and refactor functions faster than I can. I use these tools daily. They're real.
But this gets misread as proof that AGI is around the corner. It's not. It's proof that code is uniquely suited to statistical models. Code is the most structured, logical, well-documented data humans have ever produced. Billions of lines of it, with clear input-output relationships, type systems, and test suites that define correct behavior.
Of course a statistical model excels at it. Code is the easy case.
Now try the same model on a nuanced legal negotiation. On a medical diagnosis where the patient's symptoms don't match any textbook case. On a business decision where the data is incomplete and the stakeholders disagree. That's where the "reasoning" falls apart — because those problems require exactly the kind of first-principles thinking that the ASU paper showed these models can't do.
Syntax is solved. Judgment is not.
What Actually Gets Us Further
More GPUs and more parameters won't close this gap. We've been scaling for three years, and the fundamental limitations — brittleness, hallucination, inability to self-verify — haven't gone away. They've just gotten harder to spot because the outputs are more polished.
The real progress will come from architectural shifts. Models that can plan multi-step actions, verify their own outputs against real-world constraints, and self-correct without human intervention. Agentic systems, not bigger predictive models.
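The loop I'm describing can be sketched in a dozen lines. The `generate` and `verify` interfaces here are assumptions for illustration, not any framework's API: `verify` checks the draft against a hard, external constraint and returns an error message on failure, `None` on success.

```python
from typing import Callable, Optional

def agentic_solve(generate: Callable[[str, list], str],
                  verify: Callable[[str], Optional[str]],
                  task: str,
                  max_rounds: int = 3) -> Optional[str]:
    """Minimal agentic loop: draft, verify against a real-world
    constraint, feed the failure back, retry. Crucially, it
    returns None (refuses) rather than shipping an unverified
    answer when the budget runs out."""
    feedback: list = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        error = verify(draft)
        if error is None:
            return draft          # passed an external check
        feedback.append(error)    # self-correct, no human in the loop
    return None                   # refuse rather than guess
```

The point is structural: the verifier is outside the model. A bigger predictive model makes the drafts smoother; only this kind of closed loop makes them trustworthy.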
At On Ground Labs, this is part of why we focus on small, specialized models rather than chasing scale. A 270M-parameter model that deeply understands one domain and can verify its own outputs is more useful than a trillion-parameter model that sounds confident about everything and is right 90% of the time. The last 10% is where production systems live or die.
The Honest Middle Ground
I'm not dismissing current AI. I use it every day. My engineering output has genuinely multiplied. These are the best tools I've ever had.
But I'm not pretending it's AGI either. The industry has a pattern: overpromise on capabilities, under-deliver on reliability, and hope the next model fixes everything. GPT-5 didn't fix the fundamental problem. GPT-6 won't either. Not without a different approach.
Use these models for what they're good at — automating the structured, repetitive, well-defined parts of your work. Save your own time for the parts they can't do: judgment calls, system design, understanding messy real-world problems.
AGI is a long road. We have a world to build with the tools that exist right now. Focus on that.
A month after writing this, I gave a keynote at Cypher making the case for why small models are the real future — not bigger versions of the same architecture. And the research lab I'm building is focused on exactly this: models that deploy, not models that benchmark.