Why I Let AI Fight With Itself Before I Ship Anything
I build with AI most of the day. Multi-agent systems, content pipelines, infrastructure decisions, database schemas. It's my primary toolchain.
And I ran into a problem.
The longer I worked with one AI in a single session, the more I noticed the output starting to drift. The answers stayed confident, but the gaps got bigger — and harder to spot because everything still looked clean.
The two-pass discovery
Today I was planning a persistent memory system for my platform. Claude Desktop (Sonnet 4) gave me a plan. Clean architecture, three layers, a timeline of 5-8 days.
Then I asked Google Antigravity (Claude 4.6 Opus) to review it.
Antigravity came back with a table of gaps. The embedding model was assumed, not decided. The vector index strategy was not addressed. There was no conflict resolution for when two agents write contradictory memories about the same preference.
The quality of the compression prompts was hand-waved. And the timeline was flagged as a prototype estimate, not a production one. The realistic number was 2-4 weeks.
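One of those gaps is easy to make concrete. Here's a minimal sketch of what conflict resolution could look like when two agents disagree about the same preference. The record fields and the policy are my illustration, not what either model proposed:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MemoryRecord:
    key: str           # what the memory is about, e.g. "user.preferred_db"
    value: str         # the remembered content
    agent_id: str      # which agent wrote it
    written_at: datetime
    confidence: float  # writer's self-reported confidence, 0.0-1.0

def resolve(existing: MemoryRecord, incoming: MemoryRecord) -> MemoryRecord:
    """Pick a winner when two agents write contradictory memories for one key.

    Naive policy: higher confidence wins; ties go to the newer write.
    A production system would keep the loser with provenance and flag
    the conflict for review instead of silently discarding it.
    """
    if incoming.confidence != existing.confidence:
        return max(existing, incoming, key=lambda r: r.confidence)
    return max(existing, incoming, key=lambda r: r.written_at)

a = MemoryRecord("user.preferred_db", "Postgres", "agent-1",
                 datetime(2025, 1, 3, tzinfo=timezone.utc), 0.9)
b = MemoryRecord("user.preferred_db", "SQLite", "agent-2",
                 datetime(2025, 1, 5, tzinfo=timezone.utc), 0.6)
print(resolve(a, b).value)  # Postgres: confidence beats recency here
```

Even a naive policy like this is a decision. The original plan didn't have one.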
Claude Desktop's plan was not wrong. It was incomplete. And confident about being incomplete, which is the part that gets you into trouble.
Different models, different blind spots
This is not a knock on Claude Desktop. I use it every day. But every model has a perspective shaped by how it was trained and what it was optimized for.
Claude Desktop (Sonnet 4) is optimized for clarity and structure in a consumer chatbot context. Antigravity (Claude 4.6 Opus) pushed back on assumptions using the local institutional data it was grounded in: project history, schema decisions, prior failures.
Neither is the full picture on its own.
The process now
I run a two-pass review on any significant decision. One AI generates the plan. A different AI interrogates it. Not to be adversarial — to be thorough.
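Mechanically it's simple. Here's a sketch of the loop, assuming the Anthropic Python SDK; the model IDs and both prompts are placeholders for whatever pair you run:

```python
# Two-pass review: one model drafts the plan, a different model interrogates it.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    msg = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Pass 1: generate. The model ID is a placeholder.
plan = ask("generator-model-id", "Plan a persistent memory system for ...")

# Pass 2: interrogate with a different model and an adversarial brief.
review = ask(
    "reviewer-model-id",
    "Review this plan as a skeptical staff engineer. List every assumption "
    "made without a decision, every missing failure mode, and say whether "
    "the timeline is a prototype estimate or a production one.\n\n" + plan,
)
print(review)
```

The only requirement is that the reviewer sees the plan cold, with no memory of the session that produced it.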
This connects to something I've been measuring around context windows and token budgets. Every AI brings context biases into a session. The more specialized the model, the more pronounced those biases get. A model trained heavily on code will pattern-match technical problems differently than one trained on reasoning tasks.
The cost of shipping a plan built on a 5-8 day estimate that should be 2-4 weeks is real. I've shipped plans that looked airtight and fell apart in week two. Every time, the plan had been reviewed only by the same system that wrote it.
Running a second model over the first isn't redundancy. It's the cheapest quality gate I've found for catching the gaps that confidence hides.