Why I Let AI Fight With Itself Before I Ship Anything
I build with AI most of the day. Multi-agent systems, content pipelines, infrastructure decisions, database schemas. It's my primary toolchain.
And I ran into a problem.
The longer I worked with one AI in a single session, the more I noticed the output starting to drift. The answers stayed confident, but the gaps got bigger — and harder to spot because everything still looked clean.
The two-pass discovery
Today I was planning a persistent memory system for my platform. Claude Desktop (Sonnet 4) gave me a plan. Clean architecture, three layers, a timeline of 5-8 days.
Then I asked Google Antigravity (Claude 4.6 Opus) to review it.
Antigravity came back with a table of gaps. The embedding model was assumed, not decided. The vector index strategy was not addressed. There was no conflict resolution for when two agents write contradictory memories about the same preference.
The quality of the compression prompts was hand-waved. And the timeline was flagged as a prototype estimate, not a production one. The realistic number was 2-4 weeks.
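One of those gaps is easy to make concrete. Here's a minimal sketch of what conflict resolution could look like when two agents disagree about the same preference. The record fields and the policy are my illustration, not what either model proposed:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MemoryRecord:
    key: str           # what the memory is about, e.g. "user.preferred_db"
    value: str         # the remembered content
    agent_id: str      # which agent wrote it
    written_at: datetime
    confidence: float  # writer's self-reported confidence, 0.0-1.0

def resolve(existing: MemoryRecord, incoming: MemoryRecord) -> MemoryRecord:
    """Pick a winner when two agents write contradictory memories for one key.

    Naive policy: higher confidence wins; ties go to the newer write.
    A production system would keep the loser with provenance and flag
    the conflict for review instead of silently discarding it.
    """
    if incoming.confidence != existing.confidence:
        return max(existing, incoming, key=lambda r: r.confidence)
    return max(existing, incoming, key=lambda r: r.written_at)

a = MemoryRecord("user.preferred_db", "Postgres", "agent-1",
                 datetime(2025, 1, 3, tzinfo=timezone.utc), 0.9)
b = MemoryRecord("user.preferred_db", "SQLite", "agent-2",
                 datetime(2025, 1, 5, tzinfo=timezone.utc), 0.6)
print(resolve(a, b).value)  # Postgres: confidence beats recency here
```

Even a naive policy like this is a decision. The original plan didn't have one.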
Claude Desktop's plan was not wrong. It was incomplete. And confident about being incomplete, which is the part that gets you into trouble.
Different models, different blind spots
This is not a knock on Claude Desktop. I use it every day. But every model has a perspective shaped by how it was trained and what it was optimized for.
Claude Desktop (Sonnet 4) is optimized for clarity and structure in a consumer chatbot context. Antigravity (Claude 4.6 Opus) pushed back on assumptions using the local institutional data it was grounded in: project history, schema decisions, prior failures.
Neither is the full picture on its own.
The process now
I run a two-pass review on any significant decision. One AI generates the plan. A different AI interrogates it. Not to be adversarial — to be thorough.
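Mechanically it's simple. Here's a sketch of the loop, assuming the Anthropic Python SDK; the model IDs and both prompts are placeholders for whatever pair you run:

```python
# Two-pass review: one model drafts the plan, a different model interrogates it.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    msg = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Pass 1: generate. The model ID is a placeholder.
plan = ask("generator-model-id", "Plan a persistent memory system for ...")

# Pass 2: interrogate with a different model and an adversarial brief.
review = ask(
    "reviewer-model-id",
    "Review this plan as a skeptical staff engineer. List every assumption "
    "made without a decision, every missing failure mode, and say whether "
    "the timeline is a prototype estimate or a production one.\n\n" + plan,
)
print(review)
```

The only requirement is that the reviewer sees the plan cold, with no memory of the session that produced it.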
This connects to something I've been measuring around context windows and token budgets. Every AI brings context biases into a session. The more specialized the model, the more pronounced those biases get. A model trained heavily on code will pattern-match technical problems differently than one trained on reasoning tasks.
The cost of shipping a plan built on a 5-8 day estimate that should be 2-4 weeks is real. I've shipped plans that looked airtight and fell apart in week two. Every time, the plan had been reviewed only by the same system that wrote it.
Running a second model over the first isn't redundancy. It's the cheapest quality gate I've found for catching the gaps that confidence hides.