Open-Source / Research PoC · Apr 2026 · 3 weeks · weekend builds

Multi-agent research synthesis — open PoC for swarm vs supervisor

An open-source experiment comparing orchestration patterns on a real research task.

Public R&D · open-source on GitHub

Multi-Agent Systems · AI Engineering Architecture · Agentic AI Implementations
Business problem

What the team was actually solving

Every team building multi-agent systems faces the same orchestration question and answers it from intuition, not measurement. "Supervisor is cleaner" vs "swarm is faster" gets stated as fact in a hundred conference talks without a single side-by-side benchmark anyone can reproduce. This PoC builds and measures both, on a task with a defensible ground truth.

Existing workflow

Where the old process broke

  • No public side-by-side benchmark of supervisor vs swarm orchestration
  • Research synthesis tasks are usually evaluated qualitatively, not with reproducible scoring
  • Most multi-agent libraries do not let you swap orchestration patterns without rewriting
  • Eval datasets for synthesis tasks are scarce; this PoC ships one
Proposed solution

The AI / technical solution we shipped

A reproducible benchmark: same task (synthesise a literature review across 12 papers on a given topic), same model, same MCP tool registry, same eval rubric. Three runners — single-agent (baseline), supervisor pattern, swarm pattern — each scored on factuality, citation accuracy, coverage, and cost. Code + eval data + raw runs all open-sourced.
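A minimal sketch of what "same task, same runners, same rubric" means structurally. The `ReviewTask`, `RunResult`, and `Runner` names are illustrative placeholders, not the repo's actual types:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewTask:
    topic: str
    paper_ids: list[str]          # the fixed 12-paper corpus

@dataclass
class RunResult:
    review: str                   # generated literature review
    citations: list[str]          # paper ids cited in the review
    cost_usd: float
    wall_time_s: float

# Single-agent, supervisor, and swarm runners all satisfy the same contract,
# so the rubric (factuality, citation accuracy, coverage, cost) never needs
# to know which orchestration pattern produced the output.
Runner = Callable[[ReviewTask], RunResult]
```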

System architecture

How the system is wired

Benchmark architecture
Task input (topic + paper corpus) → Orchestrator A/B/C (single · supervisor · swarm) → MCP tools (paper-fetch · search) → Output (literature review) → Eval (rubric-scored)
Agent workflow

The specialist roles

Three runners on the same task
  • Single — one agent · all 12 papers
  • Supervisor — planner + readers
  • Swarm — parallel readers · merge
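A control-flow sketch of how the two multi-agent runners differ, assuming a hypothetical `call_model` wrapper around one Claude call through the shared MCP tools; the real runners are the one-file implementations in the repo:

```python
import asyncio

async def call_model(prompt: str) -> str:
    ...  # stand-in for one model call with the paper-fetch / web-search / note-taker tools
    return ""

async def supervisor_run(topic: str, paper_ids: list[str]) -> str:
    # One planner call fixes the outline, then readers work against it in parallel.
    outline = await call_model(f"Plan a literature review on {topic} over {paper_ids}")
    notes = await asyncio.gather(*[
        call_model(f"Read {pid} and take notes against this outline:\n{outline}")
        for pid in paper_ids
    ])
    return await call_model(f"Write the review from these notes:\n{notes}\nOutline:\n{outline}")

async def swarm_run(topic: str, paper_ids: list[str]) -> str:
    # Readers fan out with no shared plan; the merge step discovers structure itself.
    notes = await asyncio.gather(*[
        call_model(f"Read {pid} and summarise what matters for {topic}")
        for pid in paper_ids
    ])
    return await call_model(f"Merge these notes into a literature review on {topic}:\n{notes}")
```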
Technology

PoC stack

  • Reasoning models — Claude Sonnet 4.6 (all three runners use the same model)
  • Tool layer — Custom MCP servers: paper-fetch · web-search · note-taker
  • Orchestration — Three parallel implementations sharing the same tool registry
  • Eval — Open eval rubric · LLM-judge cross-check · citation accuracy scorer
  • Reproducibility — Pinned model version · seeded sampling · raw traces committed
Methodology

How the PoC was built

01

Week 1 — task + eval design

Twelve papers picked from a public topic; literature-review structure defined; rubric written before any runner was built.
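One way to read "rubric written before any runner was built": the dimensions and scoring anchors are fixed up front so no orchestration pattern can be tuned against them. The weights and anchor wording below are illustrative assumptions; only the dimension names come from the benchmark:

```python
# Illustrative rubric skeleton; weights and anchor text are assumptions.
RUBRIC = {
    "coverage": {
        "weight": 0.35,
        "anchor": "fraction of the 12 papers whose core claim appears in the review",
    },
    "citation_accuracy": {
        "weight": 0.35,
        "anchor": "fraction of in-text citations pointing to the paper that actually makes the claim",
    },
    "factuality": {
        "weight": 0.30,
        "anchor": "LLM-judge check of each claim against the cited paper",
    },
}
```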

02

Week 2 — three runners

Single-agent baseline, supervisor, swarm. Each runner is one file, ~150–250 lines. Same MCP tool registry.

03

Week 3 — bench + write-up

Each runner executed N=30 times. Scores aggregated. Cost-per-run captured. Public write-up + repo published.
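A sketch of the week-3 bench loop: every runner gets the identical task, 30 repetitions, and per-metric averages (including cost per run). The function and argument names are hypothetical:

```python
import statistics
from typing import Callable

N_RUNS = 30

def bench(runners: dict[str, Callable], task, score: Callable) -> dict:
    """Run each orchestration pattern N_RUNS times on the same task and average each metric."""
    results = {}
    for name, runner in runners.items():
        scores = [score(runner(task)) for _ in range(N_RUNS)]   # one rubric score dict per run
        results[name] = {m: statistics.mean(s[m] for s in scores) for m in scores[0]}
    return results   # e.g. {"supervisor": {"coverage": 0.88, "cost_usd": 0.38, ...}}
```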

Observability

Reproducibility primitives

Every benchmark run is fully traced. Anybody can re-run the same task and see the same patterns. The raw traces from the published runs are committed alongside the code; a minimal run-config sketch follows the list below.

  • Pinned model version (locked at publish time)
  • Seeded sampling for reproducibility within the model's nondeterminism band
  • Raw Langfuse exports committed for each runner's 30 runs
  • CI script that reruns the benchmark on any contributor PR
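A minimal sketch of how these primitives could look as one pinned run config; the model string, seed, and directory name are placeholders, not the repo's actual values:

```python
# Placeholder run config; values are illustrative, not the published ones.
RUN_CONFIG = {
    "model": "claude-sonnet-4.6-pinned",   # locked at publish time, never an alias like "latest"
    "temperature": 0.2,
    "seed": 1234,                          # reproducible only within the model's nondeterminism band
    "trace_export_dir": "traces/",         # raw Langfuse exports committed per runner
}

def run_once(runner, task, repetition: int):
    # One seed per repetition so all 30 runs are distinct but replayable.
    cfg = {**RUN_CONFIG, "seed": RUN_CONFIG["seed"] + repetition}
    return runner(task, config=cfg)
```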
Before vs after

Headline benchmark numbers (N=30 per runner)

Before
  • Single-agent — coverage: 64%
  • Single-agent — citation accuracy: 71%
  • Single-agent — cost / run: $0.42
  • Single-agent — wall time: 38 s
After
  • Supervisor — coverage: 88%
  • Supervisor — citation accuracy: 94%
  • Supervisor — cost / run: $0.38
  • Supervisor — wall time: 21 s
Performance

Cross-runner comparison

The supervisor pattern won on every metric for this task shape: knowable structure, finite corpus, defensible ground truth. Swarm narrowed the gap on wall time only when the corpus exceeded what the supervisor could plan against in one call. Single-agent lost everywhere.

88% vs 64%
coverage — supervisor vs single
94% vs 71%
citation accuracy — supervisor vs single
21 s vs 38 s
wall time — supervisor vs single (parallel reads)
~10%
cheaper despite more model calls (cache-hit win)
Lessons learned

What we'd tell another team building this

  • On structured tasks with knowable scope, supervisor wins on every axis once you measure. The "swarm is always faster" claim is too coarse — it depends on the corpus shape and the planner's scope.
  • Citation accuracy is the metric that exposes weak multi-agent systems. Coverage can be gamed; citation accuracy cannot (a scorer sketch follows this list).
  • Caching swung cost. Three runners hitting the same MCP tool registry shared the cache on the system prefix. The supervisor pattern called the model more times, but each call was cheaper.
  • A reproducible benchmark, even imperfect, is more useful to the field than a dozen unreproducible claims. Eval > intuition.
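A sketch of the kind of scorer that makes citation accuracy hard to game, with the LLM-judge cross-check injected as a callable. The `[P07]`-style citation markers and all function names are assumptions, not the published scorer:

```python
import re
from typing import Callable

def citation_accuracy(review: str, corpus: dict[str, str],
                      judge: Callable[[str, str], bool]) -> float:
    """corpus maps paper ids (e.g. 'P07') to abstracts; judge(sentence, abstract)
    is the LLM-judge cross-check, injected so the scorer itself stays deterministic."""
    scored = correct = 0
    for sentence in review.split("."):
        for pid in re.findall(r"\[(P\d+)\]", sentence):   # assumed [P07]-style markers
            scored += 1
            # A citation only counts if the paper exists and supports the sentence citing it.
            if pid in corpus and judge(sentence, corpus[pid]):
                correct += 1
    return correct / scored if scored else 0.0
```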
What's next

Where the PoC is going

The next versions extend the benchmark to (a) code-generation tasks (different scoring rubric), (b) longer corpora that exceed supervisor planning capacity (the regime where the swarm should become interesting), and (c) model-comparison runs across Opus / Sonnet / Haiku for each orchestration pattern.

  • Add code-generation task variant with executable test rubric
  • Larger corpus runs where supervisor saturates planning
  • Model-comparison runs (Opus / Sonnet / Haiku) per orchestration
  • External contributors welcome — open eval data + open benchmark scripts

Working on multi-agent orchestration?

The benchmark is open-source. Use it. Run your own task variants. Drop a PR with results. If you want help applying these patterns inside your codebase, start from the multi-agent workflows solution.