Gemini 3.5 Flash vs Sonnet 4.6: should you re-route your agent stack?
Google shipped 3.5 Flash this week with a "frontier intelligence plus action" pitch and a 4x output-tokens-per-second claim. If your routing layer is on Sonnet 4.6 today, this is the week to re-benchmark. Here is what I am actually moving, what I am leaving alone, and the cost-per-completed-task maths nobody is doing in public.
Tuesday morning I had four engineers on a call comparing routing tables across three client stacks. By Tuesday afternoon Google had shipped Gemini 3.5 Flash, and the routing tables we had been comparing were out of date before the day was over. Welcome to the model-routing layer in 2026.
3.5 Flash dropped publicly at Google I/O on May 19. Google's framing is "frontier intelligence with action," which I read as a quiet admission that the 3.x line had been lagging on tool-use reliability and they have closed the gap. The benchmark claims (outperforms 3.1 Pro on agentic and coding work, four times the output tokens per second of other frontier models) put real weight behind the framing. Whether it earns a seat in your routing layer is the harder question, and the press cycle is not the place to answer it.
I have spent the last 48 hours testing 3.5 Flash against workloads we are already running on Sonnet 4.6 across three engagements. One is a 180-engineer IT services firm in Ahmedabad processing support classifications at roughly 90k events a day. One is a smaller team doing code-review pipelines on a private repo. One is a content-and-SEO operation running pre-agentic data fetching plus a planning step. The picture across all three is messier than the launch deck suggests. Here is the actual call I would make this week.
What actually shipped on May 19
Before getting to the routing question, the facts on the ground. There are five things in the Google announcement that matter for engineers, and most press write-ups are burying them under the consumer launch story (Spark, Omni, the eyewear).
- Gemini 3.5 Flash is the model that shipped publicly. Available in the Gemini app, AI Mode in Search, Google Antigravity, Gemini API, and Android Studio.
- Gemini 3.5 Pro is internal-only as of launch day. Google says it is rolling out next month. Every comparison you read using 3.5 Pro numbers is comparing against a model the public cannot access yet.
- Gemini Omni Flash is the first multimodal generative model in the new Omni family, also available the same day for paid subscribers. This is the model behind the conversational video editing demos.
- Gemini Spark, the new 24/7 agent assistant, is built on 3.5 plus the agentic harness from Google Antigravity. Trusted-tester beta first, AI Ultra subscribers next week.
- AI Ultra dropped from $250 to $200 per month. Not relevant to engineering budgets directly, but it changes the personal-stack maths if your developers are paying out of pocket for assistant tooling.
The bit that affects production agents is the first one. Flash, not Pro, is what you can route to today. Everything below is about that model.
The benchmark Google led with vs the benchmark that actually matters
Google's headline claim was "four times the output tokens per second of other frontier models." That is true on the synthetic benchmark Google ran. It is also mostly irrelevant for an agent workload. Most production agents are not generation-bound. They are tool-call bound. The model picks a tool, waits for the tool to return, processes the result, picks the next tool. The time the LLM spends generating tokens is a fraction of total wall-clock per task.
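A rough way to see why, as an Amdahl-style sketch in Python. The generation shares in the loop are illustrative assumptions, not measurements from any of the three engagements; the only figure taken from the launch is the 4x throughput claim.

```python
def end_to_end_speedup(generation_fraction: float, generation_speedup: float) -> float:
    """Amdahl-style bound: only the token-generation share of wall-clock gets faster.

    generation_fraction: share of task wall-clock spent generating tokens (0..1).
    generation_speedup:  how much faster generation becomes (4.0 for a 4x claim).
    """
    if not 0.0 <= generation_fraction <= 1.0:
        raise ValueError("generation_fraction must be between 0 and 1")
    if generation_speedup <= 0:
        raise ValueError("generation_speedup must be positive")
    # Tool waits and other non-generation time are unchanged; generation shrinks.
    remaining = (1.0 - generation_fraction) + generation_fraction / generation_speedup
    return 1.0 / remaining


# Assumed generation shares, purely for illustration.
for fraction in (0.1, 0.2, 0.5, 1.0):
    speedup = end_to_end_speedup(fraction, 4.0)
    print(f"generation share {fraction:.0%}: task finishes {speedup:.2f}x faster")
```

If generation is 20% of wall-clock on a task, a 4x faster generator makes the task about 1.18x faster. The full 4x only appears when nothing but generation sits on the critical path.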
What you should actually benchmark is cost-per-completed-task on your own eval set. That number folds in token cost, retry rate, tool-selection accuracy, and JSON-schema adherence on structured outputs. None of those show up in a tokens-per-second chart.
On the IT services workload, 3.5 Flash currently runs the classification step at roughly 62% of the Sonnet 4.6 cost per successful classification, when both models are scored against the same 800-case eval set. The catch: Sonnet 4.6 hits 94.5% accuracy on that set. 3.5 Flash hits 91.2%. Which one wins depends on what a misclassification costs you downstream. For routing low-stakes support tickets, the cost win matters more than the accuracy gap. For routing legal or compliance traffic, it does not.
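The comparison above can be sketched as a small scorer. This is a minimal version under stated assumptions: `RouteResult` and `pick_model` are hypothetical names, the success counts approximate the accuracies quoted above on an 800-case set, and the dollar totals are placeholders chosen for illustration, not measured spend.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouteResult:
    """Outcome of running one model over the eval set for one route."""
    model: str
    total_cost_usd: float  # all spend on the eval run, retries included
    cases: int
    successes: int

    @property
    def accuracy(self) -> float:
        return self.successes / self.cases

    @property
    def cost_per_success(self) -> float:
        if self.successes == 0:
            return float("inf")
        return self.total_cost_usd / self.successes


def pick_model(candidates: List[RouteResult], accuracy_floor: float) -> Optional[RouteResult]:
    """Cheapest model per successful task among those that clear the accuracy floor."""
    eligible = [c for c in candidates if c.accuracy >= accuracy_floor]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_success)


# Placeholder spend figures; success counts approximate the quoted accuracies.
sonnet = RouteResult("sonnet-4.6", total_cost_usd=14.40, cases=800, successes=756)
flash = RouteResult("flash-3.5", total_cost_usd=8.62, cases=800, successes=730)

low_stakes = pick_model([sonnet, flash], accuracy_floor=0.90)
compliance = pick_model([sonnet, flash], accuracy_floor=0.93)
print(low_stakes.model, compliance.model)
```

The accuracy floor is where the downstream cost of a misclassification enters the decision. Set it per route, not per stack.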
Where Sonnet 4.6 still wins for me
For most agentic work that ships real tool calls in production, Sonnet 4.6 is still the workhorse. The reasons are unglamorous and they show up after the first 200 tool calls, not in a flashy benchmark.
- Tool-selection telemetry. Anthropic exposes per-call selection scores at the model boundary. Google does not, yet. When a Sonnet agent picks the wrong tool, I can see the score delta and know whether to fix the registry, the description, or the prompt. With Flash today, I am back to guessing.
- Prompt-caching maturity. Sonnet's cache hit rates on stable system prompts settle around 70 to 85% on the workloads I have measured. Gemini's prompt cache exists but the hit-rate behaviour under tool-heavy workflows is under-documented. I have not had a full week to characterise it yet.
- JSON schema adherence. On structured outputs with deep nested schemas (the kind that go into Pydantic models, not toy examples), Sonnet 4.6 hits structurally valid output 99.1% of the time on my eval set. Flash sits at 96.3%. The 3-point gap costs you in retries.
- 1M context behaviour. Sonnet 4.6 at 1M still acts coherent on the whole-codebase reads I throw at it. Flash's behaviour on extended context is more variable, especially on retrieval-style questions buried in the middle of the window.
None of this is a takedown of Flash. It is the predictable shape of any model release: the benchmarks look impressive, the production primitives lag behind. Anthropic's lead on tool-use telemetry alone is about a year of accumulated product depth.
Where 3.5 Flash genuinely beats Sonnet right now
Equally honest in the other direction. There are routes where Flash is the better choice this week.
- Speed-sensitive UX. Real-time conversational interfaces, voice-driven assistants, anything where a user is watching the agent type. Flash's throughput is real in those contexts because there is no tool call sitting between user and model.
- High-volume pre-classification. Routing a stream of incoming events to one of N downstream handlers. Cheap, fast, accuracy gap tolerable.
- Some coding workloads. On smaller, well-scoped code transformations (single-file refactors, structured search-and-replace), Flash is competitive on accuracy and faster on wall-clock.
- Multimodal pipelines. Gemini has had a multimodal lead since 2024 and Omni Flash reinforces it. If your agent processes image, video, or document inputs as first-class content, Gemini is still the easier integration.
- Search-grounded retrieval. Built-in Google Search grounding is a feature Anthropic does not match. For agents that need fresh world knowledge inline, Flash with grounding is a one-fewer-component architecture.
Note what is missing from that list: nothing about agentic reliability, nothing about tool-call observability, nothing about multi-step task completion in workflows that go beyond two or three turns. That is the gap I would not paper over with a press release.
The cost-per-completed-task maths nobody is doing in public
Routing-layer decisions look obvious on the token-price chart. They are not. Token price is a sub-input. The thing you optimise is the cost of a completed task, which includes retries, downstream failure cost, and the engineering time to maintain the routing logic itself.
A worked example from the support-classification workload. 90k events per day, 800-event eval set, both models on the same prompt template.
Switching this route to Flash saves about $630 a day on the model bill, which is roughly $230k a year before tax. Real money. But the accuracy gap means an extra ~2,500 misclassified tickets a day flowing to the wrong queue. The cost of a misrouted ticket on this client is roughly 7 minutes of human triage to catch and re-route. 2,500 tickets at 7 minutes each is about 290 person-hours a day, which is more expensive than the model bill the routing change was supposed to save. The maths flips.
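The same arithmetic as a few lines of Python, using only the figures quoted above. The break-even hourly rate is a derived number, not something measured on the client.

```python
DAILY_MODEL_SAVING_USD = 630.0      # daily saving quoted above
EXTRA_MISROUTES_PER_DAY = 2_500     # extra misclassified tickets quoted above
TRIAGE_MINUTES_PER_MISROUTE = 7.0   # human triage time per misrouted ticket


def triage_hours_per_day(misroutes: int, minutes_each: float) -> float:
    """Human time spent catching and re-routing misclassified tickets."""
    return misroutes * minutes_each / 60.0


def break_even_hourly_rate(daily_saving_usd: float, triage_hours: float) -> float:
    """Loaded hourly cost above which triage labour outweighs the model saving."""
    return daily_saving_usd / triage_hours


hours = triage_hours_per_day(EXTRA_MISROUTES_PER_DAY, TRIAGE_MINUTES_PER_MISROUTE)
rate = break_even_hourly_rate(DAILY_MODEL_SAVING_USD, hours)
print(f"Triage load: {hours:.1f} person-hours/day")
print(f"Annualised model saving: ${DAILY_MODEL_SAVING_USD * 365:,.0f}")
print(f"Break-even triage rate: ${rate:.2f}/hour")
```

Any loaded triage cost above roughly $2.16 an hour wipes out the saving, which is why the switch does not pay on this route.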
Where the Flash maths does work cleanly: when the downstream cost of a misclassification is also low, the volume is high, and a human review layer was going to sample the output anyway. Most enterprise routes do not fit that shape. Some do.
The retry economics nobody talks about
There is a second-order cost effect that gets ignored in the vendor benchmarks and matters a lot in production. When a model fails (returns malformed JSON, picks a non-existent tool, hallucinates a parameter), the recovery path is rarely "fail and move on". It is usually "retry with a stricter prompt", or "retry with a more capable model in fallback", or "log and escalate to a human queue". Each of those has a cost. The model-choice decision has to include them.
On the same support-classification workload, I measured the retry cascade behaviour. Flash returns a malformed structured output (JSON that fails Pydantic validation) on roughly 3.7% of inputs. Sonnet returns malformed output on 0.9%. The cost difference shows up like this:
- Flash with no retry: 96.3% of inputs pass validation. The 3.7% malformed cases route to the human queue at $0.42 per case in triage time. That is $0.0156 per input averaged across the route.
- Flash with one retry on the same model: malformed rate drops to roughly 1.1% (the first malformed response gives the model a hint). Cost per input rises by the retry token cost ($0.0034). Net cost per accepted input: $0.0144.
- Flash with a Sonnet fallback on malformed: malformed rate drops below 0.3%. Per-input cost is the expected cost of one Flash call plus 0.037 Sonnet calls: $0.0123.
- Sonnet alone: $0.018 per input, malformed rate 0.9% (already low).
The interesting line is the third one. Flash with Sonnet fallback is the cheapest end-to-end option once you account for the retry behaviour. Most teams who do this comparison just compare Flash vs Sonnet as standalone choices and miss the cascade pattern. The cascade is usually where the real win lives.
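The cascade itself is small. Here is a minimal sketch, assuming each model is wrapped as a plain function from prompt to raw text. `ModelCall`, `REQUIRED_FIELDS`, and `classify_with_cascade` are hypothetical names, and the hand-rolled validator is a standard-library stand-in for the Pydantic validation used on the real workload.

```python
import json
from typing import Callable, Optional

ModelCall = Callable[[str], str]  # hypothetical wrapper: prompt in, raw model text out

# Hypothetical schema for illustration; the real one lives in a Pydantic model.
REQUIRED_FIELDS = {"category": str, "queue": str}


def parse_classification(raw: str) -> Optional[dict]:
    """Return the parsed object if it is structurally valid, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def classify_with_cascade(prompt: str, primary: ModelCall, fallback: ModelCall,
                          escalate: Callable[[str], None]) -> dict:
    """Try the cheap model, then the stronger one, then hand off to the human queue."""
    for tier, call in (("primary", primary), ("fallback", fallback)):
        parsed = parse_classification(call(prompt))
        if parsed is not None:
            return {"tier": tier, "result": parsed}
    escalate(prompt)
    return {"tier": "human", "result": None}


# Stub models standing in for real API calls.
human_queue = []
outcome = classify_with_cascade(
    "Customer cannot log in after password reset",
    primary=lambda p: "Sure! Here is the JSON you asked for",  # malformed on purpose
    fallback=lambda p: '{"category": "auth", "queue": "tier1"}',
    escalate=human_queue.append,
)
print(outcome)
```

Logging which tier produced each result is worth the extra line, because the fallback rate is the number that tells you whether the cascade is still paying for itself.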
When to put Flash in the routing layer (and when to keep it out)
After 48 hours of testing, this is the rule of thumb I am running at clients this week.
- Put it in for: pre-classification routes where the next stage can absorb a 2 to 4 point accuracy drop. Real-time UI routes where the latency win is visible to a user. Multimodal preprocessing routes where Gemini's image and document handling is genuinely better. Search-grounded retrieval routes where you need fresh knowledge inline.
- Keep Sonnet for: structured-output routes where the schema is non-trivial. Multi-step tool-call routes where you need score-level telemetry. Code generation in production paths. Anything where a retry costs more than the model call.
- Hold for Pro: open-ended reasoning, code-review and code-synthesis at scale, anything where the current Flash accuracy is the blocker. Pro is the model people actually want. It lands next month, allegedly.
If you have a single-model stack on Sonnet today and Flash gives you a 30% cost reduction on a route where accuracy can absorb the hit, take the win. If you have a Haiku 4.5 dispatcher layer already, Flash is in roughly the same cost band and competitive on a different axis (throughput vs Haiku's latency). The interesting head-to-head is Flash vs Haiku, not Flash vs Sonnet.
Flash vs Haiku 4.5: the actual head-to-head
Most of the press cycle compared Flash to Sonnet because the benchmark numbers look more impressive that way. The fairer comparison, for the routes where Flash is actually a candidate, is against Haiku 4.5. Both are cheap. Both are fast. Both are positioned as routing-layer or dispatcher models. They differ in interesting ways that show up in production but not in the benchmark slide.
- First-token latency. Haiku wins, consistently. On a 32k-token input with a short structured output, Haiku 4.5 is at first-token in around 240ms p50; Flash is at around 410ms. For a user-visible UX (chat reply, suggestion popup) Haiku is the right pick.
- Streaming throughput. Flash wins. Once generation starts, the 4x output-tokens-per-second claim shows up in real workloads. For a long generation route (multi-turn explanation, document summary, code transformation in a single shot) Flash finishes faster wall-clock.
- Tool-selection accuracy in chains. Haiku inherits the Anthropic telemetry. On a 3-step tool-call chain (retrieve, decide, write) Haiku has hit 93.4% end-to-end completion on my eval set. Flash sits at 87.8%. The 5-point gap compounds over chain length.
- Multimodal. Flash wins clearly. Haiku handles text well; Flash handles image + text inputs natively without an extra preprocessing model.
- Cost per million input tokens. Comparable as of launch. The cost battle is going to be quarter-by-quarter for the next year. Do not over-fit your stack to current pricing.
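On the chain-length point in the list above, a rough projection helps. The sketch assumes every step in a chain is equally reliable and fails independently, which is a simplification real chains will not honour exactly. The two inputs are the 3-step completion rates quoted above; the function names are hypothetical.

```python
def per_step_rate(end_to_end: float, steps: int) -> float:
    """Implied per-step success rate, assuming equal and independent steps."""
    return end_to_end ** (1.0 / steps)


def projected_completion(end_to_end: float, measured_steps: int, target_steps: int) -> float:
    """Project a measured chain completion rate onto a longer or shorter chain."""
    return per_step_rate(end_to_end, measured_steps) ** target_steps


# 3-step end-to-end completion rates quoted above.
measured = {"haiku-4.5": 0.934, "flash-3.5": 0.878}

for steps in (3, 6, 9):
    row = {name: projected_completion(rate, 3, steps) for name, rate in measured.items()}
    gap = row["haiku-4.5"] - row["flash-3.5"]
    print(f"{steps} steps: " + ", ".join(f"{n} {v:.1%}" for n, v in row.items())
          + f", gap {gap * 100:.1f} points")
```

Treat the projection as a direction, not a forecast. The only way to know how a 9-step chain behaves is to put 9-step traces in the eval set.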
The reasonable shape of a routing layer in mid-2026, on the workloads I have measured: Haiku 4.5 as the default dispatcher and pre-classifier for chained tool flows. Flash for streaming UX routes and multimodal preprocessing. Sonnet 4.6 for the production reasoning steps where accuracy on tool selection matters. Opus 4.7 with 1M context for the few hardest reasoning calls. Four-model stacks are common in 2026; that is not over-engineering anymore, that is the cost-aware default.
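Written down as configuration, that stack is just a lookup table with per-route overrides. A minimal sketch: the route names are hypothetical, and the model labels are shorthand for this post, not vendor API model identifiers.

```python
from enum import Enum
from typing import Mapping, Optional


class Route(Enum):
    DISPATCH_AND_PRECLASSIFY = "dispatch_and_preclassify"
    STREAMING_UX = "streaming_ux"
    MULTIMODAL_PREPROCESS = "multimodal_preprocess"
    PRODUCTION_REASONING = "production_reasoning"
    HARDEST_REASONING = "hardest_reasoning"


# Shorthand labels only; map these to real model IDs in your adapter layer.
DEFAULT_TABLE = {
    Route.DISPATCH_AND_PRECLASSIFY: "haiku-4.5",
    Route.STREAMING_UX: "flash-3.5",
    Route.MULTIMODAL_PREPROCESS: "flash-3.5",
    Route.PRODUCTION_REASONING: "sonnet-4.6",
    Route.HARDEST_REASONING: "opus-4.7",
}


def model_for(route: Route, overrides: Optional[Mapping[Route, str]] = None) -> str:
    """Resolve a route to a model label, letting a per-route flag override the default."""
    if overrides and route in overrides:
        return overrides[route]
    return DEFAULT_TABLE[route]


print(model_for(Route.STREAMING_UX))
print(model_for(Route.DISPATCH_AND_PRECLASSIFY, {Route.DISPATCH_AND_PRECLASSIFY: "flash-3.5"}))
```

Keeping the table in one place means a re-benchmark changes one dictionary entry instead of a dozen call sites.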
The migration recipe I would run this week
- 01. Lift the eval set. Pull 200 to 500 real traces from production. Score them with your existing eval harness. If you do not have one, write a minimal scorer in a day. Without this step, every routing decision below is folklore.
- 02. Wire Flash behind a feature flag. Do not rip out Sonnet. Add Flash as a parallel branch in the routing layer, behind a per-route flag. Mirror traffic at 5% to start, in shadow mode so user-facing latency is unaffected.
- 03. Compare on cost-per-completed-task. Run the eval set on each route. Compute the cost-per-success ratio for each model on that route. Flag the routes where Flash hits the accuracy floor at lower cost.
- 04. Promote per-route, not whole-stack. Flip Flash to 100% on routes where it wins on cost-per-success. Leave Sonnet on the rest. Resist the urge to migrate everything at once. The win and the loss live at the route level, not the stack level.
- 05. Re-benchmark in 30 days. Google will iterate on Flash. Pro lands next month. Anthropic will respond. Re-run the eval and adjust. Routing is a quarterly decision in 2026, not an annual one.
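The shadow-mode step in the recipe above can be sketched in a few lines. This is a minimal version under assumptions: `in_shadow_sample` and `handle` are hypothetical names, and the models are wrapped as plain functions. The shadow call runs inline here for brevity. In production it belongs on a queue or background worker, otherwise it does add user-facing latency.

```python
import hashlib
from typing import Callable

ModelCall = Callable[[str], str]  # hypothetical wrapper: prompt in, model text out


def in_shadow_sample(request_id: str, route: str, percent: float) -> bool:
    """Deterministically place a request in the mirrored slice for a route."""
    digest = hashlib.sha256(f"{route}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


def handle(request_id: str, route: str, prompt: str,
           serving: ModelCall, shadow: ModelCall,
           record: Callable[[str, str, str], None],
           shadow_percent: float = 5.0) -> str:
    """Serve from the incumbent model; mirror a sample to the candidate for scoring."""
    answer = serving(prompt)  # the only output the user ever sees
    if in_shadow_sample(request_id, route, shadow_percent):
        try:
            record(request_id, answer, shadow(prompt))
        except Exception:
            pass  # a shadow failure must never affect the served answer
    return answer


# Stub models standing in for real API calls.
pairs = []
served = handle("req-1", "support", "classify this ticket",
                serving=lambda p: "served-answer",
                shadow=lambda p: "shadow-answer",
                record=lambda rid, a, b: pairs.append((rid, a, b)),
                shadow_percent=100.0)
print(served, pairs)
```

Hashing the request ID rather than calling a random number generator keeps the sample stable, so the same request lands in the same bucket when you replay traces through the eval harness.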
Nothing about this recipe is novel. It is the standard practice for any model swap. The reason teams skip it is that the eval set takes a week to build and the routing change feels urgent. The routing change is not urgent. The eval set is the leverage.
What I am still skeptical about
Three things keep me from going further than "add Flash to specific routes" this week. None of them are dealbreakers. All of them are worth keeping in mind before you commit hard.
First, Pro is the model people actually wanted to see. Google holding it back for another month is interesting and probably means it is not as production-ready as the marketing implied. If Flash is the everyday workhorse and Pro is the harder-reasoning option, fine. But that means Flash inherits the entire benchmark story while Pro inherits the deployment risk. We will not know how Pro behaves until it is in the wild.
Second, the AI Ultra pricing structure for Spark is going to push Gemini-first stacks at the consumer end. That is not a bad thing for Google, but it means a lot of enterprise buyers will mentally price the whole agentic stack against Ultra's $200, which is not a fair comparison to API-priced workloads. The procurement conversation is about to get noisier and the comparisons are about to get sloppier.
Third, vendor lock-in is back on the table in a way it was not in 2025. Spark only runs on Google Cloud VMs. Anthropic's tool-use telemetry is Anthropic-only. The MCP server registry is the only universal layer in the agentic stack, and it is now the most important piece of architectural defence you can ship. If you have not invested in a portable agent abstraction (or at least a thin model-adapter layer between your agent logic and the API call), do that before you commit to either vendor for the long pipeline.
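A thin model-adapter layer can be very thin indeed. The sketch below is one possible shape, not a recommendation of a specific library: `Completion`, `ModelAdapter`, and `EchoAdapter` are hypothetical names, and a real adapter would wrap a vendor SDK call where the stub fabricates a reply.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    """Vendor-neutral result that the agent logic is allowed to depend on."""
    text: str
    input_tokens: int
    output_tokens: int


class ModelAdapter(Protocol):
    name: str

    def complete(self, system: str, user: str) -> Completion:
        ...


class EchoAdapter:
    """Stand-in adapter for tests; a real one would call a vendor SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, system: str, user: str) -> Completion:
        # Word counts stand in for token counts in this stub.
        return Completion(
            text=f"[{self.name}] {user}",
            input_tokens=len((system + " " + user).split()),
            output_tokens=len(user.split()),
        )


def run_step(adapter: ModelAdapter, system: str, user: str) -> Completion:
    """Agent logic sees only the adapter interface, never a vendor client."""
    return adapter.complete(system, user)


for adapter in (EchoAdapter("vendor-a"), EchoAdapter("vendor-b")):
    print(run_step(adapter, "You classify tickets.", "Printer on fire").text)
```

The design choice that matters is that `run_step` never imports a vendor package. Swapping a route from one model to another then becomes a change to which adapter is constructed, not a change to agent code.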
Two field tests from this week
Two concrete experiments from the engagements I mentioned at the top, because abstract benchmark talk is not the same as production behaviour.
The code-review pipeline
A team running pull-request review through Sonnet 4.6 wanted to know if Flash could absorb the first-pass triage (classify the PR by risk class, identify which reviewer guidance applies, surface obvious issues for the human reviewer). We ran the same 240 PRs through both models on the same prompts.
Flash matched Sonnet on risk classification accuracy (within 1.2 points on a manually graded set of 60). It lagged Sonnet badly on the actual issue-surfacing step, returning shallower comments and missing two thirds of the subtle race conditions Sonnet flagged. The decision was easy: use Flash for the classifier step in front of Sonnet, not as a Sonnet replacement. Net cost saving: 41% on the front-end routing without measurable quality loss. The deeper analysis stayed on Sonnet because the gap was real.
The content pipeline
A content-and-SEO operation running pre-agentic data fetching plus a writer step. The writer was Sonnet 4.6. The fetch-and-plan step had been Haiku 4.5. We tested Flash as a replacement for Haiku on the planning step.
Flash produced slightly better outlines (more concrete sub-sections, less tendency to over-bullet) but at noticeably higher latency on first-token (412ms vs 240ms for Haiku). For this client the latency was invisible to the end-user (the pipeline runs as a background job, not interactively) so the quality win mattered more than the speed loss. Net call: replace Haiku with Flash on the planning step. The downstream writer step stayed on Sonnet.
Two engagements, two different routing-layer decisions, both arrived at by running the eval set rather than reading the benchmark. That is the actual recommendation. (For the eval-set construction details, see "Eval datasets beyond the happy path"; we treat the suite as the binding gate on every routing change.)
My current call across the board: ship Flash into the routing layer for short-horizon, high-volume work this week, behind a flag, with the eval suite watching. Keep Sonnet 4.6 in the seat for everything that does serious tool calls in production. Revisit when Pro lands.
Source: blog.google announcement at https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/. Related: I/O 2026 developer highlights at https://blog.google/innovation-and-ai/technology/developers-tools/google-io-2026-developer-highlights/.