Michael Ouroumis

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: Developer Review


The March 2026 AI Arms Race

Within a span of three weeks in early 2026, OpenAI, Anthropic, and Google all shipped major updates to their flagship models. GPT-5.4 dropped on March 3rd. Claude Opus 4.6 followed on March 10th. Gemini 3.1 Pro rounded it out on March 17th.

Developers are drowning in marketing benchmarks and cherry-picked demos. I spent the past two weeks using all three models on real projects — a Next.js monorepo migration, a Python data pipeline, and a TypeScript SDK — to give you an honest take on what actually matters.

The Models at a Glance

| Feature | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| Context Window | 256K tokens | 1M tokens | 2M tokens |
| SWE-bench Score | 74.1% | 80.8% | 76.2% |
| Tool Use | Excellent | Excellent | Good |
| Multimodal | Text, image, audio, video | Text, image | Text, image, audio, video |
| Best For | All-around versatility | Deep code tasks | Large codebase analysis |

Code Generation Quality

Benchmarks tell part of the story. Claude Opus 4.6's 80.8% on SWE-bench is genuinely impressive, and it shows in practice. But the gap narrows or widens depending on what you're building.

TypeScript and React

I asked all three models to refactor a complex React component that managed form state, validation, and API submission into smaller, composable pieces.

Claude Opus 4.6 produced the cleanest separation. It extracted a custom hook for form state, a validation utility, and kept the component focused on rendering. The TypeScript types were precise — discriminated unions where appropriate, no unnecessary any types.

GPT-5.4 gave a solid result but over-engineered it slightly, adding an abstraction layer I didn't need. It did, however, include helpful JSDoc comments without being asked.

Gemini 3.1 Pro produced working code but missed some TypeScript nuances. It used Record<string, unknown> where a proper interface would have been better. The component logic was correct but less idiomatic.

Python Data Pipeline

For a pandas-to-Polars migration of an ETL pipeline, the results shifted.

GPT-5.4 excelled here. It understood Polars idioms well and suggested lazy evaluation patterns that genuinely improved performance. It also caught a subtle bug in my original pandas code.

Claude Opus 4.6 was close behind, with slightly more conservative suggestions. It asked clarifying questions about data volume before recommending streaming versus in-memory processing, which was a thoughtful touch.

Gemini 3.1 Pro handled the migration competently but fell back on pandas-like patterns translated to Polars syntax rather than leveraging Polars-native approaches.

SQL and Database Work

All three models handled complex SQL well. The differences were marginal for query writing. Where Claude stood out was in migration planning — given a schema diff, it produced a safer migration with proper rollback steps and data backfill considerations.

Context Windows: Does Size Matter?

Gemini's 2M token window is the headline feature, and yes, it matters for specific workflows. If you need to analyze an entire codebase in a single prompt, Gemini wins by default.

But in practice, I found Claude's 1M token window sufficient for everything I threw at it. I loaded our entire monorepo's TypeScript source files (roughly 400K tokens) and asked for architectural suggestions. Claude handled it without degradation. Gemini handled the same plus our test files and documentation.

GPT-5.4's 256K window is the constraint you'll feel most. For large codebases, you need to be strategic about what you include. This isn't a dealbreaker — most tasks don't need your entire codebase in context — but it's a real limitation for broad refactoring tasks.
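If you're packing a 256K window by hand, a crude greedy selector gets you surprisingly far. This sketch uses a made-up 4-characters-per-token estimate and invented file names; swap in a real tokenizer for anything serious:

```python
# Rough sketch of being "strategic about what you include": rank candidate
# files by relevance, then greedily pack them under a token budget.
def pack_context(files: dict[str, str], ranked: list[str], budget_tokens: int) -> list[str]:
    """files maps path -> source text; ranked is paths in priority order."""
    chosen, used = [], 0
    for path in ranked:
        est = len(files[path]) // 4 + 1  # crude token estimate, not a tokenizer
        if used + est > budget_tokens:
            continue  # skip files that don't fit; a smaller one may still
        chosen.append(path)
        used += est
    return chosen

files = {"core.ts": "x" * 4000, "utils.ts": "y" * 2000, "docs.md": "z" * 9000}
print(pack_context(files, ["core.ts", "docs.md", "utils.ts"], budget_tokens=2000))
```

Ranking by relevance first matters more than the packing itself; the budget check just keeps you from blowing the window.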

Retrieval Accuracy at Scale

Here's where it gets interesting. I embedded a specific function with a known bug deep inside a 500K token context and asked each model to find it.

  • Claude Opus 4.6: Found it and explained the bug correctly.
  • GPT-5.4: Context too large for its window — couldn't test.
  • Gemini 3.1 Pro: Found the function but misidentified the bug on the first attempt. Got it right when I narrowed the search area.

Larger context doesn't automatically mean better retrieval. Claude's attention mechanism seems more consistent across its full window.
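The probe is easy to reproduce at toy scale. This sketch plants one buggy function among filler definitions; both the needle and the filler are invented for illustration:

```python
# Needle-in-a-haystack setup: bury a known-buggy function deep in a pile
# of boilerplate, then ask the model to find and diagnose it.
buggy = (
    "def median(xs):\n"
    "    xs.sort()\n"
    "    return xs[len(xs) // 2]  # wrong for even-length lists\n"
)
filler = "def helper_{i}(x):\n    return x + {i}\n"

chunks = [filler.format(i=i) for i in range(200)]
chunks.insert(137, buggy)  # bury the needle deep in the context
haystack = "\n".join(chunks)
print(len(haystack))
```

Scale the filler up until you hit the context size you want to stress, and vary the insertion point: models often degrade in the middle of the window before the edges.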

Tool Use and Agentic Capabilities

This is where the models diverge significantly.

GPT-5.4 has the most mature tool-use ecosystem. Function calling is reliable, and the model handles multi-step tool chains well. If you're building agents that interact with external APIs, GPT-5.4 is the safest bet.

Claude Opus 4.6 has caught up significantly. Its tool use in agentic frameworks like Claude Code is excellent — it plans multi-step operations, handles errors gracefully, and knows when to ask for clarification. The computer use capability adds a dimension the others lack for UI testing and browser automation.

Gemini 3.1 Pro supports tool use but I experienced more friction. The model occasionally called tools with slightly malformed arguments, and multi-step chains required more prompt engineering to keep on track.
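That friction is manageable if you validate arguments before executing anything. Here's a minimal, provider-agnostic sketch; the tool registry and schema format are my own invention, not any SDK's:

```python
# Defensive dispatcher: check model-supplied tool arguments against a
# declared schema before executing, and return a structured error the
# model can see and retry on instead of crashing the agent loop.
TOOLS = {
    "get_weather": {
        "fn": lambda city, unit="c": f"22 deg {unit} in {city}",  # stub tool
        "required": {"city": str},
        "optional": {"unit": str},
    },
}

def dispatch(name: str, args: dict) -> dict:
    spec = TOOLS.get(name)
    if spec is None:
        return {"error": f"unknown tool: {name}"}
    for key, typ in spec["required"].items():
        if key not in args:
            return {"error": f"missing required argument: {key}"}
        if not isinstance(args[key], typ):
            return {"error": f"argument {key} must be {typ.__name__}"}
    extra = set(args) - set(spec["required"]) - set(spec["optional"])
    if extra:
        return {"error": f"unexpected arguments: {sorted(extra)}"}
    return {"result": spec["fn"](**args)}

print(dispatch("get_weather", {"city": "Athens"}))  # valid call
print(dispatch("get_weather", {"town": "Athens"}))  # malformed, caught
```

Feeding the error string back as the tool result usually gets the model to correct itself on the next turn, which is cheaper than hand-tuning the prompt.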

Pricing: The Uncomfortable Conversation

Let's talk money, because in production this matters more than benchmarks.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| GPT-5.4 | $8 | $24 |
| Claude Opus 4.6 | $15 | $75 |
| Gemini 3.1 Pro | $3.50 | $10.50 |

Claude Opus 4.6 is expensive. For batch processing or high-volume tasks, the cost difference is substantial. Gemini is the clear winner on price-to-performance for tasks where all three models perform comparably.

My approach: use Claude for complex, high-stakes code tasks where quality matters most. Use GPT-5.4 for general-purpose work and tool-heavy agents. Use Gemini for large-context analysis and cost-sensitive batch operations.
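In code, that split is just a routing table. The model identifiers below are placeholders, not real API model strings; substitute whatever your provider actually exposes:

```python
# Task-type routing: send each job to the model whose strengths (and
# price) fit it, defaulting to the cheapest option for unclassified work.
ROUTES = {
    "refactor":        "claude-opus-4.6",   # high-stakes code quality
    "migration":       "claude-opus-4.6",
    "agent":           "gpt-5.4",           # tool-heavy workflows
    "general":         "gpt-5.4",
    "codebase_review": "gemini-3.1-pro",    # large-context analysis
    "batch":           "gemini-3.1-pro",    # cost-sensitive volume work
}

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, "gemini-3.1-pro")

print(pick_model("refactor"))
print(pick_model("something_unclassified"))
```

The table is deliberately dumb: a static mapping is easy to audit and adjust as pricing or model quality shifts, which all three did within a single month this year.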

The Tier-Down Strategy

All three providers offer smaller, cheaper models. Claude Sonnet 4.6 at roughly one-fifth the cost of Opus handles 80% of my daily coding tasks. GPT-5.4 Mini is solid for simpler generation. Don't default to the flagship for everything.

Real-World Workflow Recommendations

Use Claude Opus 4.6 When:

  • Refactoring complex TypeScript or JavaScript codebases
  • You need high-confidence code that requires minimal review
  • Working with large files or multi-file changes that need consistency
  • Writing migrations or infrastructure code where correctness is critical

Use GPT-5.4 When:

  • Building agentic workflows with multiple tool calls
  • You need multimodal input (screenshots, diagrams, audio)
  • Working across many languages and frameworks in a single session
  • You want the most balanced all-around assistant

Use Gemini 3.1 Pro When:

  • Analyzing very large codebases or documentation sets
  • Cost is a primary concern and the task isn't highly complex
  • You need video understanding for debugging UI issues
  • Working in Google Cloud–heavy environments with tight integrations

What All Three Still Get Wrong

Let's be honest about the shared limitations:

  1. API hallucinations — All three still occasionally invent function signatures or package APIs that don't exist. Claude does this least frequently in my testing, but none are immune.

  2. Edge cases in concurrency — Ask any of them to write lock-free concurrent code and you'll get something that looks right but has subtle race conditions. Always review concurrent code carefully.

  3. Test quality — Generated tests tend to test the implementation rather than the behavior. They'll mock exactly what the current code does rather than asserting what it should do.

  4. Outdated patterns — Despite training data through early 2025, all three occasionally suggest deprecated APIs or older patterns. Always verify against current documentation.
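Point 3 deserves a concrete illustration. With a toy function invented for this example, the difference between a behavioral test and an implementation mirror looks like this:

```python
def normalize_email(raw: str) -> str:
    return raw.strip().lower()

# Behavior-focused: states the observable contract in literal terms.
# Survives any refactor that preserves behavior; fails if behavior breaks.
def test_normalizes_case_and_whitespace():
    assert normalize_email("  Dev@Example.COM ") == "dev@example.com"

# Implementation-coupled (what generated tests often look like): this
# just restates the current code, so it passes even if the contract is
# wrong, and tells you nothing the source didn't already.
def test_mirrors_implementation():
    raw = "  Dev@Example.COM "
    assert normalize_email(raw) == raw.strip().lower()

test_normalizes_case_and_whitespace()
test_mirrors_implementation()
```

When reviewing generated tests, look for literal expected values; if the assertion recomputes the answer using the same logic as the function under test, it's a mirror, not a test.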

The Verdict

There's no single winner. The honest answer is that the best model depends on your specific task, budget, and workflow.

If I could only pick one for pure coding work, I'd pick Claude Opus 4.6. The SWE-bench numbers reflect reality β€” it produces the most reliable, idiomatic code with the least need for manual correction. The 1M token context window is large enough for any reasonable task.

If I needed a general-purpose AI assistant that also codes well, I'd pick GPT-5.4. It's the most versatile and has the strongest ecosystem of integrations and tools.

If I were optimizing for cost or working with massive codebases, I'd pick Gemini 3.1 Pro. The price-to-performance ratio is hard to beat, and the 2M token context window is genuinely useful for large-scale analysis.

My Actual Setup

For what it's worth, here's how I use all three daily:

  • Claude Code (Opus 4.6) for all hands-on development — writing features, debugging, refactoring
  • GPT-5.4 via API for automation scripts and multi-tool agents
  • Gemini 3.1 Pro for codebase-wide analysis and documentation review

The era of picking one AI model is over. The developers shipping the fastest are the ones matching the right model to the right task.


What's your experience with these models? I'd love to hear which combinations you've found most effective. Connect with me on X or LinkedIn to continue the conversation.
