Running a PM→Architect→Engineer→QA Agent Pipeline: What Actually Happened

I’ve been building Code & Cast using a multi-agent workflow where Claude agents debate each other across phases. Not just “ask Claude to write code” — a proper structured pipeline where each phase has a dedicated agent, and nothing moves forward until I say so.

Here’s how it actually went.

The Pipeline

The setup has five phases. Most run paired agents that work independently and then debate; Engineer instead splits three agents by layer:

PM (alpha + beta debate)
  → Architect (alpha + beta debate)
    → Designer
      → Engineer (alpha=domain, beta=infra, gamma=presentation)
        → QA (alpha=UX, beta=technical)

Each phase ends with a mandatory stop. I review the output. I say “approved” or “here’s what’s wrong.” Only then does the next phase start.

That human gate ended up being the most important part of the whole thing.
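To make the gate concrete, here is a minimal sketch of the control flow, not the actual orchestration code. Every name in it (Phase, Verdict, runPipeline, and the two callbacks) is hypothetical, and it is synchronous for brevity where a real gate would wait on human input.

```typescript
// Hypothetical names throughout; this only illustrates the gating logic.
type Phase = "pm" | "architect" | "designer" | "engineer" | "qa";

type Verdict =
  | { approved: true }
  | { approved: false; feedback: string };

// Runs one phase's agents and returns their merged output (stubbed by the caller).
type RunPhase = (phase: Phase, upstream: string[], feedback?: string) => string;

// The human gate: a person reads the output and returns a verdict.
type Review = (phase: Phase, output: string) => Verdict;

const PHASES: Phase[] = ["pm", "architect", "designer", "engineer", "qa"];

function runPipeline(runPhase: RunPhase, review: Review, maxRounds = 3): string[] {
  const approvedOutputs: string[] = [];
  for (const phase of PHASES) {
    let feedback: string | undefined;
    let approved = false;
    for (let round = 0; round < maxRounds && !approved; round++) {
      // Each phase only ever sees upstream output that a human has approved.
      const output = runPhase(phase, approvedOutputs, feedback);
      const verdict = review(phase, output);
      if (verdict.approved) {
        approvedOutputs.push(output);
        approved = true;
      } else {
        feedback = verdict.feedback; // rerun the same phase with the reviewer's notes
      }
    }
    if (!approved) {
      throw new Error(`Phase "${phase}" not approved after ${maxRounds} rounds`);
    }
  }
  return approvedOutputs;
}
```

The point of the shape is that the next phase cannot start from a draft: a rejected output loops back into the same phase with the feedback attached.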

Why Two Agents Per Phase?

I stumbled into this. I had pm-alpha write a PRD, thought it looked fine, then had pm-beta write one independently. They disagreed on what the core user problem even was.

The debate forced both agents to defend their reasoning. The merged output was noticeably more rigorous than either individual draft — edge cases got surfaced, assumptions got challenged. Now I run paired agents for every phase where reasoning matters.

For engineering, I split differently. Engineer-alpha handles the domain layer, beta handles infrastructure and queries, gamma handles presentation. Before they start, alpha does a dependency analysis — which files does each engineer actually need? This prevents two agents from stepping on each other or blocking on missing types.

engineer-alpha: analyze dependencies first
  → domain layer (types, entities, zod schemas)
engineer-beta: infra + queries
  → depends on domain types from alpha
engineer-gamma: pages + components
  → depends on query return types from beta

Without that upfront analysis step, I kept hitting ordering problems: gamma would import types that alpha hadn’t written yet.
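The ordering above amounts to a topological sort over the engineers' dependencies. Here is a rough sketch of that idea, assuming the analysis produces a simple map from each agent to the agents whose output it imports. The map shape and the runOrder helper are illustrative, not part of the actual setup.

```typescript
// Hypothetical dependency map: each agent lists the agents whose output it imports.
type DependencyMap = Record<string, string[]>;

const engineerDeps: DependencyMap = {
  "engineer-alpha": [],                 // domain layer: types, entities, zod schemas
  "engineer-beta": ["engineer-alpha"],  // infra + queries need the domain types
  "engineer-gamma": ["engineer-beta"],  // pages + components need query return types
};

// Returns a run order in which every agent comes after its dependencies,
// using a depth-first topological sort. Throws on unknown agents or cycles.
function runOrder(deps: DependencyMap): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (agent: string): void => {
    if (state.get(agent) === "done") return;
    if (state.get(agent) === "visiting") {
      throw new Error(`Dependency cycle involving ${agent}`);
    }
    const needs = deps[agent];
    if (needs === undefined) throw new Error(`Unknown agent: ${agent}`);
    state.set(agent, "visiting");
    for (const dep of needs) visit(dep);
    state.set(agent, "done");
    order.push(agent);
  };

  for (const agent of Object.keys(deps)) visit(agent);
  return order;
}

// Prints the agents in a safe order: alpha, then beta, then gamma.
console.log(runOrder(engineerDeps));
```

A cycle in the map is worth failing loudly on, since it means two agents are each waiting for files the other has not written.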

The Human Gate Is Not Optional

Early on I tried removing the gate between PM and Architect to speed things up. Bad idea.

The architects built an entire schema and repository layer on top of a PRD detail that I would’ve caught in thirty seconds. One small ambiguous line about how sessions are identified — the architects each made different assumptions and spent their whole context budget on architectures that were fundamentally incompatible.

Two agents debating, converging, and then going in the wrong direction is worse than one agent going in the wrong direction. The confidence of convergence makes it harder to spot.

The gate exists because downstream phases build on upstream decisions. If PM is 10% wrong, Architect amplifies it, and by Engineering the agents are implementing something that doesn’t match what I actually wanted.

QA Found Things Tests Didn’t

I was running npm run build and npm run test as the pass criteria for Engineering. Both passed. Then QA ran through user journeys and found real issues — a page title that was still in English on the zh-TW route, a featured post card that showed the wrong date format depending on which language you were in.

None of these were type errors. None would fail a build. They were the kind of thing you find by actually navigating through the site as a user.

qa-alpha does UX and user journey testing — it walks through flows. qa-beta does architectural compliance and logic bugs. They cross-review each other’s findings before writing the final QA report. The combination surfaced things neither would’ve caught alone.
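Once a walkthrough has found a bug like the untranslated page title, it can be pinned down with a small regression check so the build catches it next time. This is only a sketch: the RenderedTitles shape and findUntranslatedTitles are hypothetical, and the "no Han characters" test is a rough heuristic that would misfire on titles that are legitimately in Latin script, such as a brand name.

```typescript
// Hypothetical shape: rendered <title> text keyed by route. Not the project's real data.
type RenderedTitles = Record<string, string>;

// Flags routes under the zh-TW prefix whose title contains no Han characters,
// a rough signal that the title was never translated.
function findUntranslatedTitles(titles: RenderedTitles, localePrefix = "/zh-TW"): string[] {
  // Matches the basic CJK Unified Ideographs block only.
  const hasHan = /[\u4E00-\u9FFF]/;
  return Object.keys(titles)
    .filter((route) => route === localePrefix || route.startsWith(`${localePrefix}/`))
    .filter((route) => !hasHan.test(titles[route]));
}

// Example input with made-up routes and titles.
const sampleTitles: RenderedTitles = {
  "/en/blog": "Blog",
  "/zh-TW": "首頁",
  "/zh-TW/blog": "Blog", // still English on the zh-TW route
};

console.log(findUntranslatedTitles(sampleTitles)); // flags only "/zh-TW/blog"
```

A check like this only covers bugs someone has already found, so it complements the walkthrough rather than replacing it.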

What I’d Change

The pipeline is slower than just asking Claude to implement something directly. A full iteration — PM through QA — probably takes 3-4x longer than winging it. Whether that’s worth it depends entirely on whether you care about the result being right on the first pass.

For Code & Cast, I care. I’m writing about this project publicly, and I don’t want to spend weekends fixing architecture decisions I made in a rush.

The thing that made the biggest difference wasn’t the number of agents — it was the debate structure plus the human gate. A single agent writing a PRD and then immediately coding it produces mediocre results. Two agents debating a PRD, me reviewing it, then two agents debating an architecture on top of that — the compounding is noticeable.

I extracted the whole setup into a reusable template (my_ai-coding-template) so I can start new projects from the same structure without rebuilding the agent definitions from scratch. That part alone probably saved me a few hours on the next project.