PR review pipeline cuts senior-engineer time 4×
Multi-agent CI workflow for a 180-engineer monorepo.
Mid-market IT services firm · Ahmedabad · 180 engineers
What the team was actually solving
Senior engineers were spending 8–12 hours per week each on first-pass PR review across a 6-team monorepo. Junior PRs waited 2+ days for sign-off; velocity stalled; the highest-judgement people were doing the lowest-judgement work.
Where the old process broke
1. Senior-engineer time concentrated on first-pass review, not architecture or mentorship
2. Median PR-open-to-merge time of 41 hours, with the long tail blocking releases
3. Inconsistent review depth — security and test-coverage gaps caught reactively after merge
4. Six teams, six different review conventions; cross-team PRs stalled on style debates
The AI / technical solution we shipped
A multi-agent CI workflow triggered on every PR open. Three specialist agents run in parallel — a reviewer (Claude Sonnet 4.6) for code correctness and conventions, a security agent for risk patterns, and a test-generator agent for coverage gaps. Outputs are consolidated into a single PR comment within 90 seconds. Humans review the agent's synthesis, not the raw diff.
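A minimal sketch of the fan-out step. The prompts, helper names, and model id below are illustrative stand-ins, not the production code; the real pipeline adds retries, the token-budget guard, and tracing:

```python
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stand-in prompts; the real specialist prompts are long and eval-tuned.
REVIEWER_PROMPT = "Review this diff for correctness and convention violations."
SECURITY_PROMPT = "Flag risky patterns in this diff: injection, authz, secrets handling."
TESTGEN_PROMPT = "List coverage gaps in this diff and propose the missing tests."
INTEGRATOR_PROMPT = "Merge these specialist findings into one reviewer-ready PR comment."


async def call_agent(system_prompt: str, payload: str) -> str:
    """One specialist = one system prompt over the same payload."""
    resp = await client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": payload}],
    )
    return resp.content[0].text


async def first_pass_review(diff: str) -> str:
    """Fan the three specialists out in parallel, then consolidate."""
    reviewer, security, tests = await asyncio.gather(
        call_agent(REVIEWER_PROMPT, diff),
        call_agent(SECURITY_PROMPT, diff),
        call_agent(TESTGEN_PROMPT, diff),
    )
    # The integrator sees all three outputs and writes the single
    # PR comment humans actually read.
    combined = f"REVIEWER:\n{reviewer}\n\nSECURITY:\n{security}\n\nTESTS:\n{tests}"
    return await call_agent(INTEGRATOR_PROMPT, combined)
```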
How the system is wired
How it plugs into the existing workflow
The agents run as a GitHub Actions workflow triggered on PR open and on every subsequent push. They write to the PR via the GitHub REST API as a bot account scoped to the org — read on all repos, write only to PR comments. Existing branch protections and required reviewers are untouched. A sketch of the comment-posting call follows the list below.
- GitHub Actions runner: existing org runners, no new infrastructure
- Bot account: scoped to PR comments + status checks only
- No changes to branch protection rules — the agent is advisory, not gatekeeping
- Slack notifier piggy-backs on the existing PR-events channel
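A minimal sketch of that comment-posting call, assuming a `BOT_TOKEN` environment variable injected from Actions secrets and a `repo` string of the form `owner/name`. The endpoint and headers follow the GitHub REST API (PR comments ride the issues endpoint); everything else is illustrative:

```python
import os

import requests

GITHUB_API = "https://api.github.com"


def post_review_comment(repo: str, pr_number: int, body: str) -> None:
    """Post the consolidated synthesis as a single PR comment.

    The bot token is injected from GitHub Actions secrets and scoped
    to PR comments + status checks only.
    """
    resp = requests.post(
        f"{GITHUB_API}/repos/{repo}/issues/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['BOT_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```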
Security & scalability considerations
Least-privilege bot scope
The bot has read-only access to repository content and write access only to PR comments and status checks. It cannot push, merge, or modify branch settings.
No secrets in prompts
Secret detection runs before the diff reaches the model. Any secrets in the diff trigger a different code path that does not call Claude.
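A sketch of that gating logic. The detection here is deliberately simplified (the production gate runs a dedicated secret scanner, not three regexes), and `quarantine` is an illustrative stand-in; `first_pass_review` is the fan-out sketch above:

```python
import re

# Deliberately simplified patterns; the production gate runs a real
# secret scanner, not three regexes.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub classic PAT
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # private key blocks
]


def quarantine(diff: str) -> str:
    # Hypothetical handler: flag the PR for rotation, no model call made.
    return "Possible secret in diff; agent review skipped pending rotation."


async def route_diff(diff: str) -> str:
    """Secrets never reach the model: scan first, branch on the result."""
    if any(p.search(diff) for p in SECRET_PATTERNS):
        return quarantine(diff)           # alternate path, no LLM call
    return await first_pass_review(diff)  # normal agent pipeline
```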
Per-PR cost cap
A per-PR token-budget guard kills the agent if it exceeds the cap. Large monorepo refactors were the load test; the cap held.
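A minimal sketch of the guard, with an illustrative cap value:

```python
class BudgetExceeded(RuntimeError):
    """Raised to kill the run; the PR falls back to plain human review."""


class TokenBudget:
    """Hard per-PR spend guard: every agent call charges the same budget."""

    def __init__(self, cap_tokens: int = 200_000):  # cap value is illustrative
        self.cap = cap_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.cap:
            raise BudgetExceeded(f"used {self.used} tokens, cap {self.cap}")
```

In practice each agent call charges the budget with the token counts the API reports back (the Anthropic SDK exposes them as `usage.input_tokens` and `usage.output_tokens` on the response).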
Eval-driven regressions
47 adversarial PR cases re-run on every model or prompt change. Score drops on the eval suite block the model upgrade until investigated.
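A sketch of the CI gate, assuming scores are serialised as `{case_id: score}` JSON; the file layout is hypothetical, but the rule is the real one: any drop blocks the bump.

```python
import json
import sys


def eval_gate(baseline_path: str, candidate_path: str) -> None:
    """CI step: block a model/prompt bump if any eval case scores worse."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)

    regressions = sorted(
        case for case, score in candidate.items()
        if score < baseline.get(case, 0.0)
    )
    if regressions:
        print(f"Eval regression on {len(regressions)} case(s): {regressions[:5]}")
        sys.exit(1)  # non-zero exit fails the CI job and blocks the upgrade
    print(f"All {len(candidate)} eval cases at or above baseline.")
```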
Delivery methodology
Discovery (1 wk)
Studied 30 historic PRs across 4 teams. Built a profile of what senior engineers actually catch on first-pass review vs. what is convention.
Architecture (1 wk)
Three-agent supervisor pattern selected over a single big agent. Eval suite seeded from the 30 historic PRs as the ground truth.
Implementation (3 wks)
GitHub MCP server first, then specialist prompts, then the integration step. Working agents in staging from week 2 of implementation.
Eval + observability (1 wk parallel)
Langfuse wiring on every step, cost attribution per PR, score-deltas surfaced in the eval-set dashboard.
Handoff (1 wk)
Maintenance playbook, eval-set update process, 30/60/90-day check-ins. Team trained on adding eval cases as new failure modes emerge.
Orchestration pattern: supervisor + parallel fan-out
Deployment
Runs on the customer's existing GitHub Actions runner pool. No new infrastructure provisioned. The MCP server runs as a container alongside the runner; bot credentials live in GitHub Actions secrets and never enter prompts.
- Runtime: GitHub Actions self-hosted runner (existing pool)
- MCP server: containerised Python service, scoped tokens
- Secret manager: GitHub Actions secrets (not in repo, not in prompts)
- Rollback: feature-flag PR-level skip — single env var pause (sketched below)
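The pause flag is as small as it sounds. A sketch, with an illustrative flag name:

```python
import os
import sys


def maybe_pause() -> None:
    """Single-env-var kill switch, checked before any model call."""
    if os.environ.get("PR_AGENT_PAUSED") == "1":  # flag name is illustrative
        print("PR agent paused via feature flag; skipping review.")
        sys.exit(0)  # exit clean so the Actions job stays green
```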
Observability
Every agent step traces to Langfuse with the PR number as the session key. Per-agent cost, model version, cache-hit rate, and the integrator's score-delta are all surfaced in a single Langfuse dashboard the team checks daily. A tracing sketch follows the list.
- Per-step Langfuse trace with PR-id session key
- Cost attribution: per-PR · per-agent · per-team
- Cache-hit rate on stable system prompts: 78–84% steady state
- Eval-suite regression CI step on every prompt/model bump
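A minimal sketch of the wiring, assuming the Langfuse Python SDK's trace/span interface; the names and metadata are illustrative:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # keys come from LANGFUSE_* environment variables


def traced_step(pr_number: int, diff: str) -> None:
    """Every agent step lands under one trace keyed by the PR number."""
    trace = langfuse.trace(
        name="pr-review",
        session_id=f"pr-{pr_number}",   # PR number as the session key
        metadata={"diff_bytes": len(diff)},
    )
    span = trace.span(name="reviewer-agent", input=diff[:500])
    # ... the actual agent call goes here ...
    span.end(output="consolidated findings")
```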
Before vs after

| Metric | Before | After |
| --- | --- | --- |
| Senior eng review hours / week | 8–12 | 2–3 |
| Median PR open-to-merge | 41 hrs | 6.4 hrs |
| % merged within 90 minutes | 23% | 78% |
| Post-merge regressions (90 days) | Tracked | Zero from AI-passed reviews |
Automation impact
The agent handles first-pass review on every PR. Senior engineers now review the agent's synthesis, intervening only on the cases it flagged as uncertain. The share of PRs that still need a full human pass over the raw diff dropped below 15%.
Performance numbers
End-to-end agent pipeline runs in under 90 seconds for the median PR. The largest refactors (10K+ line diffs) take 4–6 minutes, still well inside the team's SLA expectations.
Business outcomes
Six senior engineers reclaimed ~6 hours each per week (≈36 engineer-hours / week across the team). At the customer's loaded-cost rate, the agent pays for itself in under three days of operation.
What we'd tell another team building this
1. The eval suite was the unlock. Without 47 adversarial cases scoring every model bump, the team would not trust the agent enough to merge on it.
2. Three specialists beat one big agent. A single agent with all the tools loaded was 18% less accurate on the same eval set than the three-specialist pattern.
3. The integrator is the highest-leverage prompt in the system. It is the only one that sees the full picture; iterating on it produced the largest accuracy jumps.
4. Cost surprised us positively. The cache-hit rate on the stable system prefix turned out to be the dominant cost lever, not the model choice (see the caching sketch below).
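On point 04: the lever is marking the long, stable system prefix cacheable so that only the per-PR diff is billed at the full input rate. A sketch against Anthropic's prompt-caching interface; the model id and prompt are placeholders:

```python
from anthropic import Anthropic

client = Anthropic()
REVIEWER_PROMPT = "..."  # the long, stable specialist prompt


def cached_review_call(diff: str):
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": REVIEWER_PROMPT,
            # Stable prefix marked cacheable; only the diff varies per call.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": diff}],
    )
```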
Future scalability
The same agent pattern is being extended to release-note generation, dependency-upgrade review, and incident-postmortem drafting. The shared MCP server and observability stack carry over — each new agent is ~1 week to ship instead of 6 weeks.
- Release-note generator using the same GitHub MCP server (in progress)
- Dependency-upgrade review agent reusing the security-scan agent's prompt patterns
- Postmortem-drafter agent fed from incident-channel exports
- Cross-team eval-set library so new agents inherit regression coverage from day one
Want a PR review pipeline like this?
Most engagements like this take 6–8 weeks from discovery to handoff. A 60-minute scoping session is enough to tell whether your repo shape and team are a fit.
Read what we publish on this
Why I am replacing supervisor patterns with handoffs
Supervisors looked clean on paper and shipped slow in production. Handoffs read messier in the code but recover better when an agent loses the plot. Two real systems and where supervisors still earn their keep.
Production · The cheapest LLM call is the one you do not make — GitHub's 19–62% token cut, decoded
GitHub published an instrumented analysis of their agentic CI workflows and reported 19–62% token-cost reductions. The savings are the headline. The technique — pre-agentic data fetching and tool-registry hygiene — is the story most teams will miss.
Production · The agent observability stack we ship to every client
Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic — covers Langfuse, Honeycomb, and rolling your own.
Solutions & topics worth reading next
Agentic AI Consulting
Designed, built, and handed off — production agentic systems for enterprise teams.
AI Systems Engineering Training
Eight-day corporate training programs that take dev teams from AI-assisted coding to production agentic systems.
Enterprise AI Architecture
Reference architectures for organisations standing up an AI platform — not one agent, but the foundation for many.
AI Observability
Tracing, eval, cache-hit telemetry, and cost attribution for production agents.
Multi-Agent Workflows
Supervisor + handoff orchestration for portfolios of agents that need to cooperate without arguing.
Agentic AI
Designing, building, and shipping production agents.
Multi-Agent Systems
Orchestrating many agents without losing the plot.
AI Observability
Tracing, eval, and telemetry for production agents.
AI Engineering
The discipline of shipping AI systems, not demos.
More implementation proof
The Agentic Operating System — workshop build
A live multi-agent ops shell, designed and built with 40 engineers in one room.
Open-Source / PoC · Multi-agent research synthesis — open PoC for swarm vs supervisor
An open-source experiment comparing orchestration patterns on a real research task.