Open-Source / Research PoC · Apr 2026 · 3 weeks · weekend builds

Multi-agent research synthesis — open PoC for swarm vs supervisor

An open-source experiment comparing orchestration patterns on a real research task.

Public R&D · open-source on GitHub

Multi-Agent Systems · AI Engineering Architecture · Agentic AI Implementations
Business problem

What the team was actually solving

Every team building multi-agent systems faces the same orchestration question and answers it from intuition, not measurement. "Supervisor is cleaner" vs "swarm is faster" gets stated as fact in a hundred conference talks without a single side-by-side benchmark anyone can reproduce. This PoC builds and measures both, on a task with a defensible ground truth.

Existing workflow

Where the old process broke

  • No public side-by-side benchmark of supervisor vs swarm orchestration
  • Research synthesis tasks are usually evaluated qualitatively, not with reproducible scoring
  • Most multi-agent libraries do not let you swap orchestration patterns without rewriting
  • Eval datasets for synthesis tasks are scarce; this PoC ships one
Proposed solution

The AI / technical solution we shipped

A reproducible benchmark: same task (synthesise a literature review across 12 papers on a given topic), same model, same MCP tool registry, same eval rubric. Three runners — single-agent (baseline), supervisor pattern, swarm pattern — each scored on factuality, citation accuracy, coverage, and cost. Code + eval data + raw runs all open-sourced.
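A minimal sketch of what "same task, same runners, same rubric" means structurally. The `ReviewTask`, `RunResult`, and `Runner` names are illustrative placeholders, not the repo's actual types:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewTask:
    topic: str
    paper_ids: list[str]          # the fixed 12-paper corpus

@dataclass
class RunResult:
    review: str                   # generated literature review
    citations: list[str]          # paper ids cited in the review
    cost_usd: float
    wall_time_s: float

# Single-agent, supervisor, and swarm runners all satisfy the same contract,
# so the rubric (factuality, citation accuracy, coverage, cost) never needs
# to know which orchestration pattern produced the output.
Runner = Callable[[ReviewTask], RunResult]
```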

System architecture

How the system is wired

Benchmark architecture
Task input (topic + paper corpus) → Orchestrator A/B/C (single · supervisor · swarm) → MCP tools (paper-fetch · search) → Output (literature review) → Eval (rubric-scored)
Agent workflow

The specialist roles

Three runners on the same task
  • Single — one agent · all 12 papers
  • Supervisor — planner + readers
  • Swarm — parallel readers · merge
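A control-flow sketch of how the two multi-agent runners differ, assuming a hypothetical `call_model` wrapper around one Claude call through the shared MCP tools; the real runners are the one-file implementations in the repo:

```python
import asyncio

async def call_model(prompt: str) -> str:
    ...  # stand-in for one model call with the paper-fetch / web-search / note-taker tools
    return ""

async def supervisor_run(topic: str, paper_ids: list[str]) -> str:
    # One planner call fixes the outline, then readers work against it in parallel.
    outline = await call_model(f"Plan a literature review on {topic} over {paper_ids}")
    notes = await asyncio.gather(*[
        call_model(f"Read {pid} and take notes against this outline:\n{outline}")
        for pid in paper_ids
    ])
    return await call_model(f"Write the review from these notes:\n{notes}\nOutline:\n{outline}")

async def swarm_run(topic: str, paper_ids: list[str]) -> str:
    # Readers fan out with no shared plan; the merge step discovers structure itself.
    notes = await asyncio.gather(*[
        call_model(f"Read {pid} and summarise what matters for {topic}")
        for pid in paper_ids
    ])
    return await call_model(f"Merge these notes into a literature review on {topic}:\n{notes}")
```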
Technology

PoC stack

  • Reasoning models — Claude Sonnet 4.6 (all three runners use the same model)
  • Tool layer — Custom MCP servers: paper-fetch · web-search · note-taker
  • Orchestration — Three parallel implementations sharing the same tool registry
  • Eval — Open eval rubric · LLM-judge cross-check · citation accuracy scorer
  • Reproducibility — Pinned model version · seeded sampling · raw traces committed
Methodology

How the PoC was built

01

Week 1 — task + eval design

Twelve papers picked from a public topic; literature-review structure defined; rubric written before any runner was built.
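One way to read "rubric written before any runner was built": the dimensions and scoring anchors are fixed up front so no orchestration pattern can be tuned against them. The weights and anchor wording below are illustrative assumptions; only the dimension names come from the benchmark:

```python
# Illustrative rubric skeleton; weights and anchor text are assumptions.
RUBRIC = {
    "coverage": {
        "weight": 0.35,
        "anchor": "fraction of the 12 papers whose core claim appears in the review",
    },
    "citation_accuracy": {
        "weight": 0.35,
        "anchor": "fraction of in-text citations pointing to the paper that actually makes the claim",
    },
    "factuality": {
        "weight": 0.30,
        "anchor": "LLM-judge check of each claim against the cited paper",
    },
}
```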

02

Week 2 — three runners

Single-agent baseline, supervisor, swarm. Each runner is one file, ~150–250 lines. Same MCP tool registry.

03

Week 3 — bench + write-up

Each runner executed N=30 times. Scores aggregated. Cost-per-run captured. Public write-up + repo published.
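A sketch of the week-3 bench loop: every runner gets the identical task, 30 repetitions, and per-metric averages (including cost per run). The function and argument names are hypothetical:

```python
import statistics
from typing import Callable

N_RUNS = 30

def bench(runners: dict[str, Callable], task, score: Callable) -> dict:
    """Run each orchestration pattern N_RUNS times on the same task and average each metric."""
    results = {}
    for name, runner in runners.items():
        scores = [score(runner(task)) for _ in range(N_RUNS)]   # one rubric score dict per run
        results[name] = {m: statistics.mean(s[m] for s in scores) for m in scores[0]}
    return results   # e.g. {"supervisor": {"coverage": 0.88, "cost_usd": 0.38, ...}}
```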

Observability

Reproducibility primitives

Every benchmark run is fully traced. Anybody can re-run the same task and see the same patterns. The raw traces from the published runs are committed alongside the code; a minimal run-config sketch follows the list below.

  • Pinned model version (locked at publish time)
  • Seeded sampling for reproducibility within the model's nondeterminism band
  • Raw Langfuse exports committed for each runner's 30 runs
  • CI script that reruns the benchmark on any contributor PR
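A minimal sketch of how these primitives could look as one pinned run config; the model string, seed, and directory name are placeholders, not the repo's actual values:

```python
# Placeholder run config; values are illustrative, not the published ones.
RUN_CONFIG = {
    "model": "claude-sonnet-4.6-pinned",   # locked at publish time, never an alias like "latest"
    "temperature": 0.2,
    "seed": 1234,                          # reproducible only within the model's nondeterminism band
    "trace_export_dir": "traces/",         # raw Langfuse exports committed per runner
}

def run_once(runner, task, repetition: int):
    # One seed per repetition so all 30 runs are distinct but replayable.
    cfg = {**RUN_CONFIG, "seed": RUN_CONFIG["seed"] + repetition}
    return runner(task, config=cfg)
```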
Before vs after

Headline benchmark numbers (N=30 per runner)

Before
  • Single-agent — coverage: 64%
  • Single-agent — citation accuracy: 71%
  • Single-agent — cost / run: $0.42
  • Single-agent — wall time: 38 s
After
  • Supervisor — coverage: 88%
  • Supervisor — citation accuracy: 94%
  • Supervisor — cost / run: $0.38
  • Supervisor — wall time: 21 s
Performance

Cross-runner comparison

The supervisor pattern won on every metric for this task shape: knowable structure, finite corpus, defensible ground truth. Swarm narrowed the gap on wall time only when the corpus exceeded what the supervisor could plan against in one call. Single-agent lost everywhere.

88% vs 64%
coverage — supervisor vs single
94% vs 71%
citation accuracy — supervisor vs single
21 s vs 38 s
wall time — supervisor vs single (parallel reads)
~10%
cheaper despite more model calls (cache-hit win)
Lessons learned

What we'd tell another team building this

  • On structured tasks with knowable scope, supervisor wins on every axis once you measure. The "swarm is always faster" claim is too coarse — it depends on the corpus shape and the planner's scope.
  • Citation accuracy is the metric that exposes weak multi-agent systems. Coverage can be gamed; citation accuracy cannot (a scorer sketch follows this list).
  • Caching swung cost. Three runners hitting the same MCP tool registry shared the cache on the system prefix. The supervisor pattern called the model more times, but each call was cheaper.
  • A reproducible benchmark, even imperfect, is more useful to the field than a dozen unreproducible claims. Eval > intuition.
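A sketch of the kind of scorer that makes citation accuracy hard to game, with the LLM-judge cross-check injected as a callable. The `[P07]`-style citation markers and all function names are assumptions, not the published scorer:

```python
import re
from typing import Callable

def citation_accuracy(review: str, corpus: dict[str, str],
                      judge: Callable[[str, str], bool]) -> float:
    """corpus maps paper ids (e.g. 'P07') to abstracts; judge(sentence, abstract)
    is the LLM-judge cross-check, injected so the scorer itself stays deterministic."""
    scored = correct = 0
    for sentence in review.split("."):
        for pid in re.findall(r"\[(P\d+)\]", sentence):   # assumed [P07]-style markers
            scored += 1
            # A citation only counts if the paper exists and supports the sentence citing it.
            if pid in corpus and judge(sentence, corpus[pid]):
                correct += 1
    return correct / scored if scored else 0.0
```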
What's next

Where the PoC is going

The next versions extend the benchmark to (a) code-generation tasks (different scoring rubric), (b) longer corpora that exceed supervisor planning capacity (the regime where the swarm should become interesting), and (c) model-comparison runs across Opus / Sonnet / Haiku for each orchestration pattern.

  • Add code-generation task variant with executable test rubric
  • Larger corpus runs where supervisor saturates planning
  • Model-comparison runs (Opus / Sonnet / Haiku) per orchestration
  • External contributors welcome — open eval data + open benchmark scripts

Working on multi-agent orchestration?

The benchmark is open-source. Use it. Run your own task variants. Drop a PR with results. If you want help applying these patterns inside your codebase, start from the multi-agent workflows solution.