Multi-agent research synthesis — open PoC for swarm vs supervisor
An open-source experiment comparing orchestration patterns on a real research task.
Public R&D · open-source on GitHub
What the team was actually solving
Every team building multi-agent systems faces the same orchestration question and answers it from intuition, not measurement. "Supervisor is cleaner" vs "swarm is faster" gets stated as fact in a hundred conference talks without a single side-by-side benchmark anyone can reproduce. This PoC builds and measures both, on a task with a defensible ground truth.
Where the old process broke
- No public side-by-side benchmark of supervisor vs swarm orchestration
- Research synthesis tasks are usually evaluated qualitatively, not with reproducible scoring
- Most multi-agent libraries do not let you swap orchestration patterns without rewriting
- Eval datasets for synthesis tasks are scarce; this PoC ships one
The AI / technical solution we shipped
A reproducible benchmark: same task (synthesise a literature review across 12 papers on a given topic), same model, same MCP tool registry, same eval rubric. Three runners — single-agent (baseline), supervisor pattern, swarm pattern — each scored on factuality, citation accuracy, coverage, and cost. Code + eval data + raw runs all open-sourced.
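To make the scoring concrete, here is a minimal sketch of what one scored run can look like as a record: one entry per run, one field per rubric axis, and a mean across the N=30 runs. The names (`RunScore`, `aggregate`) are illustrative, not the repo's actual API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunScore:
    """One benchmark run, scored on the four rubric axes."""
    runner: str               # "single_agent" | "supervisor" | "swarm"
    factuality: float         # 0..1, rubric-scored
    citation_accuracy: float  # 0..1, fraction of citations that resolve correctly
    coverage: float           # 0..1, fraction of the 12 papers meaningfully used
    cost_usd: float           # API spend for the run
    wall_time_s: float

def aggregate(runs: list[RunScore]) -> dict[str, float]:
    """Mean of each axis across the N runs for one runner."""
    return {
        "factuality": mean(r.factuality for r in runs),
        "citation_accuracy": mean(r.citation_accuracy for r in runs),
        "coverage": mean(r.coverage for r in runs),
        "cost_usd": mean(r.cost_usd for r in runs),
        "wall_time_s": mean(r.wall_time_s for r in runs),
    }
```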
How the system is wired
[Architecture diagram: the specialist roles and the PoC stack; see the repo for the full wiring.]
How the PoC was built
Week 1 — task + eval design
Twelve papers picked from a public topic; literature-review structure defined; rubric written before any runner was built.
Week 2 — three runners
Single-agent baseline, supervisor, swarm. Each runner is one file, ~150–250 lines. Same MCP tool registry.
Week 3 — bench + write-up
Each runner executed N=30 times. Scores aggregated. Cost-per-run captured. Public write-up + repo published.
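Week 2's constraint, that the three patterns stay swappable against the same task and the same MCP tool registry, implies a single shared entry point. A minimal sketch of that contract, with hypothetical names (`Runner`, `ReviewResult`, `bench`), not the repo's actual interface:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ReviewResult:
    review_markdown: str
    citations: list[str]   # paper IDs cited in the synthesised review
    cost_usd: float
    wall_time_s: float

class Runner(Protocol):
    """Common contract all three patterns implement, so the bench
    harness can swap orchestration without touching anything else."""
    name: str
    def run(self, task: str, tools: dict[str, Any]) -> ReviewResult: ...

def bench(runner: Runner, task: str, tools: dict[str, Any],
          n: int = 30) -> list[ReviewResult]:
    """Execute one pattern N times against the fixed task and registry."""
    return [runner.run(task, tools) for _ in range(n)]
```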
Reproducibility primitives
Every benchmark run is fully traced. Anyone can re-run the same task and see the same patterns; the raw traces from the published runs are committed alongside the code. A sketch of the pinning and seeding setup follows the list.
- Pinned model version (locked at publish time)
- Seeded sampling for reproducibility within the model's nondeterminism band
- Raw Langfuse exports committed for each runner's 30 runs
- CI script that reruns the benchmark on any contributor PR
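A minimal sketch of what the pinning and seeding primitives can look like in practice. The config keys and the model ID are illustrative, and the seed assumes a provider that accepts one; seeding narrows variance within the model's nondeterminism band, it does not make runs bit-identical.

```python
# bench_config.py -- illustrative, not the repo's actual config.
BENCH_CONFIG = {
    # Pinned model version: a dated snapshot ID, never a floating alias,
    # so a rerun months later hits the same weights.
    "model": "claude-sonnet-4-20250514",
    # Seeded sampling: assumes the provider accepts a seed parameter.
    "temperature": 0.0,
    "seed": 1337,
    "runs_per_runner": 30,
    # Raw per-run traces are exported and committed with the code.
    "trace_export_dir": "traces/",
}
```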
Headline benchmark numbers (N=30 per runner)
| Runner | Coverage | Citation accuracy | Cost / run | Wall time |
| --- | --- | --- | --- | --- |
| Single-agent | 64% | 71% | $0.42 | 38 s |
| Supervisor | 88% | 94% | $0.38 | 21 s |
Cross-runner comparison
The supervisor pattern won on every metric for this task shape: knowable structure, finite corpus, defensible ground truth. Swarm narrowed the gap on wall time only when the corpus exceeded what the supervisor could plan against in one call. Single-agent lost everywhere.
What we'd tell another team building this
- On structured tasks with knowable scope, supervisor wins on every axis once you measure. The "swarm is always faster" claim is too coarse; it depends on the corpus shape and the planner's scope.
- Citation accuracy is the metric that exposes weak multi-agent systems. Coverage can be gamed; citation accuracy cannot.
- Caching swung cost. Three runners hitting the same MCP tool registry shared the cache on the system prefix. The supervisor pattern called the model more times but cheaper per call (sketched after this list).
- A reproducible benchmark, even imperfect, is more useful to the field than a dozen unreproducible claims. Eval > intuition.
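The caching lesson in practice: because all three runners share one system prefix (the task framing plus the MCP tool registry description), the provider's prefix cache can serve it across patterns. A minimal sketch using the Anthropic SDK's prompt caching, assuming that is the provider; the prefix constant and model ID are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# One shared system prefix across all three runners. Marking it as
# cacheable means the supervisor's many short calls pay the full prefix
# cost once and hit the cache afterwards. Note: providers enforce a
# minimum cacheable prefix length, so a real prefix is much longer.
SHARED_PREFIX = "You are a research-synthesis agent. Tools available: ..."

def call(messages: list[dict]) -> anthropic.types.Message:
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": SHARED_PREFIX,
            "cache_control": {"type": "ephemeral"},  # cache the prefix
        }],
        messages=messages,
    )
```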
Where the PoC is going
The next versions extend the benchmark to (a) code-generation tasks (different scoring rubric), (b) longer corpora that exceed supervisor planning capacity (the regime where the swarm gets interesting), and (c) model-comparison runs across Opus / Sonnet / Haiku for each orchestration pattern.
- Add code-generation task variant with executable test rubric
- Larger corpus runs where supervisor saturates planning
- Model-comparison runs (Opus / Sonnet / Haiku) per orchestration
- External contributors welcome — open eval data + open benchmark scripts
Working on multi-agent orchestration?
The benchmark is open-source. Use it. Run your own task variants. Drop a PR with results. If you want help applying these patterns inside your codebase, start from the Multi-Agent Workflows solution.
Read what we publish on this
Why I am replacing supervisor patterns with handoffs
Supervisors looked clean on paper and shipped slow in production. Handoffs read messier in the code but recover better when an agent loses the plot. Two real systems and where supervisors still earn their keep.
Read the post

Production · Eval datasets: stop testing your agents on the happy path
If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.
Read the post

Architecture · Three patterns I broke in 2025 — and what I do instead now
Self-correction loops without budgets, single-agent solutions to multi-domain problems, and using JSON mode to force structure I should have built into the schema. An honest review.
Read the post

Solutions & topics worth reading next
Agentic AI Consulting
Designed, built, and handed off — production agentic systems for enterprise teams.
AI Observability
Tracing, eval, cache-hit telemetry, and cost attribution for production agents.
Multi-Agent Workflows
Supervisor + handoff orchestration for portfolios of agents that need to cooperate without arguing.
Agentic AI
Designing, building, and shipping production agents.
Multi-Agent Systems
Orchestrating many agents without losing the plot.
AI Observability
Tracing, eval, and telemetry for production agents.
AI Engineering
The discipline of shipping AI systems, not demos.
More implementation proof
The Agentic Operating System — workshop build
A live multi-agent ops shell, designed and built with 40 engineers in one room.
Read this case study

Enterprise Engagement · PR review pipeline cuts senior-engineer time 4×
Multi-agent CI workflow for a 180-engineer monorepo.
Read this case study

Enterprise Engagement · ERP support triage agent eliminates the Level-1 backlog
Supervisor-pattern agent integrating Odoo with customer-facing email + chat.
Read this case study