Multi-Agent Autonomous Coding

The AutoCode Pipeline

It starts with people. A user-led workshop frames the problem and gathers requirements. Then two AI agents, Codex the implementer and Claude the validator, build, test, fix, and document the product autonomously, passing through stage gates and guardrails until it ships.

4 Workshop Phases · 8 Stage Gates · 6 Test Layers · 88 MCP Tools

Plan Top-Down · Before a Line of Code

The 6 Mindsets for a Successful Product

Great products are reasoned about from six altitudes — from why it matters down to how it ships. Put on each hat in turn, capture the thinking as diagrams, and only then drop into engineering. Each mindset feeds the workshop, the requirements, and the pipeline below.

The vibe-coding trap. Pick up the tool, say "build me app XYZ," and Claude — or any capable agent — will faithfully build it: fast, confident, and often in completely the wrong direction. Because the direction was never set, you usually don't notice until it's too late to change course cheaply. The six mindsets set the direction first — so the agent builds the right thing.

01 · Founder / CEO · Why
02 · CTO · Whether
03 · Enterprise Architect · Where it fits
04 · Solutions Architect · How it connects
05 · Application Architect · How it's built
06 · Engineering · Build it

1 · Founder / CEO · Vision
"Why are we building this, and for whom?"
  • The problem worth solving & who feels it
  • Market opportunity & target customer
  • Business model — how it makes money
  • North-star metric & definition of success
  • Funding, runway & timing
  • Competitive moat & build-vs-buy
2 · CTO · Strategy
"Can we build it, sustain it, and afford it?"
  • Technology strategy & platform bets
  • Build vs buy vs partner
  • Team, skills & org to deliver
  • Security, compliance & risk posture
  • Total cost of ownership & cloud strategy
  • Velocity vs quality & tech-debt strategy
3 · Enterprise Architect · Landscape
"How does it fit the whole organization?"
  • Alignment with standards & technology radar
  • Integration with existing systems & data domains
  • Governance, security & compliance (GDPR/SOC2)
  • Reuse of shared platforms & capabilities
  • Data strategy & master-data ownership
  • Portfolio & roadmap fit, rationalization
4 · Solutions Architect · Solution
"How do the systems fit together for this solution?"
  • End-to-end design across systems & teams
  • Integration patterns, APIs, events & contracts
  • Non-functional requirements (scale, latency, uptime)
  • Identity, auth & security boundaries
  • Deployment topology & environments
  • Migration / cutover & trade-off analysis
5 · Application Architect · Application
"How is this application structured?"
  • Layers, modules & component boundaries
  • Design patterns & framework choices
  • Data model & internal API design
  • State, error handling & observability
  • Testability & maintainability
  • Coding standards (captured in CLAUDE.md)
6 · Engineering · Build it
"Now build it — visibly, with guardrails." → hands off to the pipeline below.
  • Implement → validate → iterate (Codex + Claude)
  • Tasks, tests, stage gates & guardrails
  • Real-time visibility in Code Easy
  • Documentation captured as you go

Each mindset produces artifacts — vision, strategy, landscape, solution & application designs — that flow straight into the user-led workshop, the requirements matrix, and the AutoCode pipeline below. You descend the altitudes once; the agents build from the result.

Phase 0 · Before the Code

It starts with a user-led workshop

Long before AutoCode runs, a facilitated engagement frames the real problem and gathers requirements with the people who live it. Code Easy's documented method runs in four phases over roughly 6–11 days.

PHASE 1 · 1–2 days

Discovery

1–2 days to frame the problem deeply — interviews, requirements, constraints.
Requirements Matrix · Stakeholder Map · Success Criteria

PHASE 2 · 1–2 days

Architecture

1–2 days of system & component mapping, validated visually with stakeholders.
Architecture docs · CLAUDE.md · ADRs

PHASE 3 · 3–5 days

Prototyping

3–5 days of AI-powered rapid building with daily stakeholder iteration.
Working prototype · Updated matrix ✓ · Committed code

PHASE 4 · 1–2 days

Handoff

1–2 days to finalise documentation, ADRs & roadmap — bridges into AutoCode.
Production roadmap · Code Easy export · Handoff package

The Lens · Design Thinking

Every phase runs on a design-thinking loop

Discovery and architecture aren't box-ticking — they follow the five design-thinking stages, keeping the work centred on the people who'll actually use what gets built.

STAGE 1
Empathise

Interview sponsors, users & tech stakeholders. Understand the real pain, not the stated ask.

STAGE 2
Define

Frame the problem & success criteria. Capture the requirements matrix and what's out of scope.

STAGE 3
Ideate

Map architecture & solution options with stakeholders. Explore trade-offs before committing.

STAGE 4
Prototype

Build to learn — AI-powered rapid prototyping, visible in Code Easy, iterated daily.

STAGE 5
Test

Validate against criteria with real users; feed findings back. Then hand off to AutoCode.

Discovery · Design Thinking

Three conversations frame the problem

“Discovery isn't about documenting what the client says they want — it's about understanding the problem deeply enough to build something that actually solves it.”

Empathise

Business sponsors

  • What business problem are we solving?
  • How do you measure success today — and after?
  • What happens if we don't build this?
  • What's the budget & timeline expectation?
Understand

End users

  • Walk me through your current workflow, step by step.
  • What's the most frustrating part of the process?
  • What workarounds do you use today?
  • What would “delightful” look like?
Constrain

Technical stakeholders

  • What systems must this integrate with?
  • Authentication / authorization requirements?
  • Compliance or security requirements?
  • Who maintains this after handoff?
Requirements Gathering

Prioritised by MoSCoW, mapped to requirement types

Every requirement is logged to the matrix (type · title · value · rationale · priority · source), prioritised Must → Won't, and categorised across five types so coverage is comprehensive.

Priority · Requirement Types
Must have · Technical, Functional
Should have · Architectural, Design rationale
Could have · Enterprise alignment
Won't have · Out of scope

Captured as a simple CSV — type,title,value,rationale,priority,source — so the whole matrix imports straight into Code Easy and feeds the spec the agents build from.
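
A minimal sketch of how such a file might be read and ordered Must → Won't. It assumes a header row, no quoted commas inside fields, and priority labels spelled as shown. The names parseMatrix and PRIORITY_ORDER are hypothetical and the sample rows are illustrative; this is not Code Easy's importer.

```javascript
// Hypothetical sketch: parse a requirements-matrix CSV and sort it Must → Won't.
// Assumes a header row and no quoted commas inside fields.
const PRIORITY_ORDER = ["Must", "Should", "Could", "Won't"];

function parseMatrix(csvText) {
  const [headerLine, ...rows] = csvText.trim().split(/\r?\n/);
  const columns = headerLine.split(",").map((c) => c.trim());
  return rows
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const cells = line.split(",").map((c) => c.trim());
      return Object.fromEntries(columns.map((name, i) => [name, cells[i] ?? ""]));
    })
    // Unknown priority labels get index -1 and therefore sort first.
    .sort(
      (a, b) => PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority)
    );
}

// Illustrative rows only.
const sampleCsv = `type,title,value,rationale,priority,source
Functional,Export report,PDF export,Audit need,Should,End user
Technical,SSO login,OIDC,Security policy,Must,Tech lead
Architectural,Plugin API,Extensibility,Future growth,Could,Architect`;

const matrix = parseMatrix(sampleCsv);
console.log(matrix.map((row) => `${row.priority}: ${row.title}`));
```

A real matrix with commas inside the rationale column would need a proper CSV parser rather than a plain split.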

Architecture Mapping

The system is drawn with stakeholders

Six live Code Easy visual modes validate structure with the right audience — business domains for sponsors, dependency graphs for engineers, journeys for UX.

1. Force Graph: dependencies
2. Treemap: size & complexity
3. Tree: hierarchy
4. Architecture: technical layers
5. Business Architecture: domains
6. User Flow: journeys

CLAUDE.md is authored here as the architecture context file (layers, patterns, constraints), and significant choices become Architecture Decision Records in /docs/adr/ — title · context · decision · consequences · alternatives.
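
As an illustration, the five ADR fields can be rendered into a file under docs/adr/. This is a sketch under assumptions: the zero-padded numbering, the slugged file name, the Markdown headings, and the helper names renderAdr and slugify are all hypothetical, not a documented Code Easy convention.

```javascript
// Hypothetical sketch: render an ADR from the five fields named above.
const ADR_SECTIONS = ["context", "decision", "consequences", "alternatives"];

function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function renderAdr(number, adr) {
  const id = String(number).padStart(4, "0");
  const sections = ADR_SECTIONS.map(
    (field) => `## ${field[0].toUpperCase()}${field.slice(1)}\n\n${adr[field]}`
  );
  return {
    // Zero-padded numbering is an assumption, not a documented convention.
    path: `docs/adr/${id}-${slugify(adr.title)}.md`,
    body: [`# ${id}. ${adr.title}`, ...sections].join("\n\n") + "\n",
  };
}

// Illustrative record only.
const exampleAdr = renderAdr(3, {
  title: "Use SQLite for local state",
  context: "The prototype needs persistence without extra infrastructure.",
  decision: "Store state in a single SQLite file.",
  consequences: "Simple setup; limited concurrent writers.",
  alternatives: "PostgreSQL, flat JSON files.",
});
console.log(exampleAdr.path);
```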

Rapid Prototyping

A tight build-to-learn cycle

Each loop is 30–60 minutes, monitored live in Code Easy. Intervene the moment files land in the wrong layer or scope creeps.

Plan

Pick 2–3 features for the cycle.

Prompt

Direct Claude with matrix context.

Monitor

Watch activity & architecture live.

Review

Validate against acceptance criteria.

Iterate

Refine, or promote to AutoCode.

What The Workshop Captures

One pass captures the whole picture

The wizard turns selectable, framework-grounded inputs into the spec & CLAUDE.md the agents build from, so choices align with your strategy, budget, standards and team.

Vision
Why & for whom
user types · success metrics · outcomes
CEO mindset · design thinking
Requirements
Prioritised scope
MoSCoW · 5 types · CSV import
Must · Should · Could · Won't
Architecture
Stack & targets
platforms / OS · cloud · SLA / OLA · API / CLI · auth
CLAUDE.md templates
Strategy
Budget & evolution
invest / optimise / outsource · build vs run/operate
Wardley mapping
Governance
Delivery model
Team Topologies · fitness functions · ADRs · evolutionary
TOGAF ADM → Emergent Stack
Team
Skills alignment
languages · frameworks · seniority · known gaps
Build to what the team can maintain
Integrations
External services
3rd-party providers · custom APIs · webhooks
Connect & consume
UI & Design
Look & standards
light / dark · colour schemes · design systems · WCAG AA · sketches
Suggestions; agents still apply best judgment
Output
Build-ready spec
requirements · CLAUDE.md · ADRs · UI refs · draft spec
Straight into AutoCode
Requirements Matrix + CLAUDE.md → Claude Plan Mode → Specification → AutoCode pipeline ↓
▶ Run the workshop & export a bundle →
The Philosophy

Crossing the Boundary

The method brings together four ways of building software, then adds a fifth. Move at vibe-coding speed with traditional-coding quality, with visible architecture instead of a black box.

World 1

Traditional coding

“Write every line yourself. Understand every byte.”

World 2

Copiloting

“AI suggests, I decide.”

World 3

Vibe coding

“Describe it, generate it, ship it.”

World 4

Rapid prototyping

“Build to learn, not to keep.”

The fifth mode · Multi-Agent Autonomous Coding

“Define the goal, let AI agents collaborate to build it.” Codex implements, Claude validates — inside guardrails, with everything visible. This is where the workshop hands off to AutoCode.

Vibe coding
Generate → Ship

Black box · hope it works.

Crossing the boundary
Generate → Review → Refine → Validate → Ship

Visual architecture · verify it works.

Reality Check

Why a real app takes days — not ten seconds

The ads promise you'll think of an app and AI builds it. That's a compelling story and a systematically flawed one. Here's where the time actually goes — and why the missing work can't be prompted away.

wall-clock ≈ tasks × round-trips each × (latency + verification)

Typing speed isn't in this equation. The loop is. A large app is hundreds of tasks, each iterating until it passes.
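
A worked example of that relation, with purely illustrative inputs: 300 tasks, 2.5 round-trips each, 60 s of agent latency and 180 s of verification per round-trip. None of these figures are measurements; they only show how the terms multiply.

```javascript
// Illustrative arithmetic only: the inputs are assumptions, not measured values.
function estimateWallClockHours({ tasks, roundTripsPerTask, latencySec, verificationSec }) {
  const seconds = tasks * roundTripsPerTask * (latencySec + verificationSec);
  return seconds / 3600;
}

const estimatedHours = estimateWallClockHours({
  tasks: 300,
  roundTripsPerTask: 2.5,
  latencySec: 60,
  verificationSec: 180,
});
console.log(estimatedHours); // 50
```

Fifty hours of loop time is a little over two days of continuous running, before any human checkpoint adds waiting time.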

Where the time actually goes

01

It's the loop, not the keystrokes

Each task runs implement → validate → fix → re-validate. Many iterate 2–3 times before passing. Generation is cheap; the round-trips are the cost.

02

A dependency graph, not parallel work

No API before the schema, no UI test before the API. Most of a build is sequential — more agents can't collapse the timeline.
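
One way to see this is to compute the longest dependency chain: even with unlimited agents, a build cannot finish faster than that chain. The sketch below uses a made-up five-task graph and hypothetical durations; criticalPathMinutes is an illustrative helper, not part of the pipeline.

```javascript
// Illustrative task graph: each task lists its dependencies and a duration in minutes.
const buildGraph = {
  schema: { deps: [], minutes: 30 },
  api: { deps: ["schema"], minutes: 45 },
  ui: { deps: ["api"], minutes: 40 },
  uiTest: { deps: ["ui"], minutes: 20 },
  docs: { deps: ["schema"], minutes: 15 },
};

// Earliest possible finish time of the whole graph, assuming unlimited parallel agents.
function criticalPathMinutes(graph) {
  const finish = new Map();
  const visiting = new Set();

  function finishTime(name) {
    if (finish.has(name)) return finish.get(name);
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    visiting.add(name);
    const { deps, minutes } = graph[name];
    // A task can start only when its slowest dependency has finished.
    const start = Math.max(0, ...deps.map((dep) => finishTime(dep)));
    visiting.delete(name);
    finish.set(name, start + minutes);
    return start + minutes;
  }

  return Math.max(0, ...Object.keys(graph).map((name) => finishTime(name)));
}

const serialMinutes = Object.values(buildGraph).reduce((sum, t) => sum + t.minutes, 0);
console.log(criticalPathMinutes(buildGraph), serialMinutes); // 135 150
```

Here unlimited parallelism saves only 15 of 150 minutes, because only the docs task sits off the main chain.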

03

Guardrails throttle on purpose

A cap of ~20 files & ~1000 lines per task forces large features into many small, reviewable units. Reviewability costs wall-clock by design.

04

Verification is slow, real work

Wiring, smoke, e2e, startup & Chrome UI checks — gates with 60–300s timeouts that fail and restart the loop. You can't generate your way past testing.

05

Context windows force chunking

No model holds a large codebase in memory. It reads a slice, reasons, writes, re-reads — thousands of times across the build.

06

Correction is structural, not rare

LLMs emit plausible code, not verified code. Subtle wrongness is the normal mode, which is exactly why drift detection flags a task after 3 failed attempts.

Why “describe it and it's built” can't scale

The pitch doesn't remove the hard work — it hides where it lives. Three structural reasons it breaks down beyond toy apps.

PILLAR 1

Underspecification

A one-line idea maps to millions of valid apps. The hard part was always deciding precisely what to build. That ambiguity doesn't vanish — the AI either asks you (a requirements process) or guesses (the wrong app). The magic prompt conceals the problem; it doesn't solve it.

PILLAR 2

Essential vs accidental complexity

AI crushes accidental complexity — boilerplate, syntax, glue. But the essential complexity — the domain, constraints, edge cases, trade-offs — is irreducible (Brooks' “No Silver Bullet”). You can pay it faster, never skip it.

PILLAR 3

Verification doesn't compress

Even with perfect generation, you must confirm real behaviour across every state — errors, auth, concurrency, persistence, deploy. That means running & observing, bounded by real time, not token speed.

The demos dodge all three. A to-do app or landing page is well-trodden, low-ambiguity, and sits squarely in the training data — the model recalls a memorised pattern, it doesn't reason about your novel requirements. Scale up or go original and the illusion collapses into the loop above.

The advertised promise
Idea → Done
  • Assumes typing was the bottleneck
  • No specification, no iteration
  • Correctness taken on faith
  • Works only for memorised toy apps
The engineering reality
Frame → Specify → Generate → Verify → Correct → Ship
  • Requirements & verification dominate the cost
  • Hundreds of validate-and-fix cycles
  • Correctness is proven, not assumed
  • Holds up for real, novel systems

Spending days in autonomous mode isn't the system being slow. It's the system being honest about where software actually gets built.

Phase 0 hands off here ↓
End-to-End Flow

From specification to shipped — autonomously

The autonomous controller drives a state machine. Each card below corresponds to a real state in autonomous-controller.js.

Spec Created

draft → ready

Planning

initializing → planning

Implementing

codex writes code

Validating

claude reviews

Tested & Gated

wiring · smoke · e2e

Completed

100% → pre_deploy

Loop: implementing ⇄ validating ⇄ fixing
Throughout: drift detection watches scope, loops & stalls
Checkpoints: awaiting_approval pauses for the human

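
A state machine like this can be sketched as a transition table. The state names below come from the cards above; the allowed transitions are an assumption made for illustration and may not match autonomous-controller.js.

```javascript
// Hypothetical transition table: state names are from the text, the edges are assumed.
const TRANSITIONS = {
  initializing: ["planning"],
  planning: ["implementing", "awaiting_approval"],
  implementing: ["validating", "awaiting_approval"],
  validating: ["implementing", "fixing", "pre_deploy"],
  fixing: ["validating"],
  awaiting_approval: ["implementing", "pre_deploy", "completed"],
  pre_deploy: ["awaiting_approval", "completed"],
  completed: [],
};

// Returns the next state, or throws if the move is not in the table.
function transition(state, next) {
  const allowed = TRANSITIONS[state];
  if (!allowed) throw new Error(`unknown state: ${state}`);
  if (!allowed.includes(next)) throw new Error(`illegal transition: ${state} -> ${next}`);
  return next;
}

// One pass through the loop with a single fix cycle.
const happyPath = ["planning", "implementing", "validating", "fixing", "validating", "pre_deploy", "completed"];
let runState = "initializing";
for (const next of happyPath) runState = transition(runState, next);
console.log(runState);
```

Keeping the edges in one table makes an illegal jump, such as fixing straight to completed, fail loudly instead of silently skipping validation.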
The Autonomous Loop

Two agents, one feedback cycle

Codex claims work from the queue and writes code. On completion the system auto-queues a validation task for Claude. Pass → next task. Needs changes → a fix task flows back to Codex.

Implementer

Codex

agent type codex · caps: code · implement · fix
  • implement: Write new features & modules
  • update: Modify existing code
  • refactor: Improve structure & quality
  • bugfix: Fix specific defects
  • fix: Apply Claude's review findings
  • phase_validate: Run build · typecheck · lint

Codex → Claude: code submitted
Claude → Codex: fix / needs_changes

Validator

Claude

agent type claude · reviews correctness · style · security
  • validate: Review each code change
  • passed ✓: Mark task done, advance spec
  • needs_changes: Emit findings → new fix work
  • error analysis: Diagnose failed tasks & suggest fixes

The two agents communicate through the agent_work_queue & agent_messages tables; the state machine drives the handoff.
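
The handoff can be sketched with an in-memory stand-in for agent_work_queue. Everything here is hypothetical: the field names, the helper functions, and the use of an array instead of a database table are assumptions chosen to show the shape of the cycle.

```javascript
// Hypothetical in-memory stand-in for the agent_work_queue table.
function createQueue() {
  const items = [];
  let nextId = 1;
  return {
    items,
    enqueue(agent, type, payload) {
      const item = { id: nextId++, agent, type, payload, status: "pending" };
      items.push(item);
      return item;
    },
    // An agent claims the oldest pending item addressed to it.
    claim(agent) {
      const item = items.find((i) => i.agent === agent && i.status === "pending");
      if (item) item.status = "claimed";
      return item ?? null;
    },
  };
}

// When Codex finishes, a validation task is queued for Claude automatically.
function completeCodexTask(queue, item) {
  item.status = "done";
  return queue.enqueue("claude", "validate", { taskId: item.payload.taskId });
}

// Pass ends the cycle; needs_changes sends a fix task back to Codex.
function completeValidation(queue, item, verdict, findings = []) {
  item.status = "done";
  if (verdict === "passed") return null;
  return queue.enqueue("codex", "fix", { taskId: item.payload.taskId, findings });
}

const workQueue = createQueue();
workQueue.enqueue("codex", "implement", { taskId: "T-1" });
const codexWork = workQueue.claim("codex");
const reviewTask = completeCodexTask(workQueue, codexWork);
const claudeWork = workQueue.claim("claude");
const fixTask = completeValidation(workQueue, claudeWork, "needs_changes", ["missing null check"]);
console.log(fixTask.type, fixTask.payload.findings);
```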
Quality Pipeline

Eight stage gates guard the path to deploy

Gates flagged AUTO run a command and pass on their own when it succeeds; gates flagged MANUAL pause in awaiting_approval until a human approves. All required gates must pass before pre_deploy.
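
A rough sketch of that rule, assuming each gate carries a mode, a required flag and, for AUTO gates, a command. The gate names, the runCommand callback and the approvals set are hypothetical; the real gate list is not reproduced here.

```javascript
// Hypothetical gate evaluation: AUTO gates depend on a command, MANUAL gates on a human.
function evaluateGate(gate, { runCommand, approvals }) {
  if (gate.mode === "AUTO") return runCommand(gate.command) ? "passed" : "failed";
  return approvals.has(gate.name) ? "passed" : "awaiting_approval";
}

// pre_deploy is reachable only when every required gate has passed.
function canEnterPreDeploy(gates, context) {
  return gates
    .filter((gate) => gate.required)
    .every((gate) => evaluateGate(gate, context) === "passed");
}

// Illustrative gates only; names and flags are made up for the example.
const exampleGates = [
  { name: "lint", mode: "AUTO", required: true, command: "npm run lint" },
  { name: "security_review", mode: "MANUAL", required: true },
  { name: "docs_check", mode: "AUTO", required: false, command: "npm run docs" },
];

const alwaysGreen = () => true; // stands in for actually running the command
console.log(canEnterPreDeploy(exampleGates, { runCommand: alwaysGreen, approvals: new Set() }));
console.log(canEnterPreDeploy(exampleGates, { runCommand: alwaysGreen, approvals: new Set(["security_review"]) }));
```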

Safety System

Guardrails keep the agents inside the lines

Guardrails are enforced at the watcher (codex-watcher/config.js), per-spec in the database, and continuously by the drift detector.

20

Max files / task

A single task may touch at most maxFilesPerTask files before requiring approval.

1000

Max lines changed

maxLinesChanged caps the diff size of any one task to keep changes reviewable.

3

Failure flag

Drift detector flags a task after maxFailuresBeforeFlag = 3 failed attempts — no runaway fix loops.

5

Stall threshold

Flags the run when staleWorkThreshold = 5 items pass with no real progress.
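
The four limits can be read as simple threshold checks. The option names and values below are the ones quoted above; the object layout, the task fields and the exact comparisons (strictly greater for size caps, at-or-above for failure and stall counts) are assumptions, not the contents of codex-watcher/config.js.

```javascript
// Values quoted in the text; the object shape itself is an assumption.
const guardrailLimits = {
  maxFilesPerTask: 20,
  maxLinesChanged: 1000,
  maxFailuresBeforeFlag: 3,
  staleWorkThreshold: 5,
};

// Returns the list of guardrail flags a task would raise (empty means clean).
function reviewTask(task, limits = guardrailLimits) {
  const flags = [];
  if (task.filesTouched > limits.maxFilesPerTask) flags.push("needs_approval: too many files");
  if (task.linesChanged > limits.maxLinesChanged) flags.push("needs_approval: diff too large");
  if (task.failedAttempts >= limits.maxFailuresBeforeFlag) flags.push("drift: repeated failures");
  return flags;
}

// The run counts as stalled once enough work items pass without real progress.
function isStalled(itemsSinceProgress, limits = guardrailLimits) {
  return itemsSinceProgress >= limits.staleWorkThreshold;
}

console.log(reviewTask({ filesTouched: 24, linesChanged: 400, failedAttempts: 3 }));
console.log(isStalled(4), isStalled(5));
```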

Blocked paths

Agents can never write to secrets or keys:

  • .env · .env.*
  • *.key · *.pem
  • secrets/** · credentials/**
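
A minimal sketch of matching a path against these patterns. It assumes that patterns without a slash apply to the file name at any depth and that the directory patterns are anchored at the repository root; the real matcher may differ. globToRegExp and isBlocked are hypothetical helpers.

```javascript
// Patterns quoted in the text above.
const BLOCKED_PATTERNS = [".env", ".env.*", "*.key", "*.pem", "secrets/**", "credentials/**"];

// Converts a simple glob to a RegExp: "*" stays within one path segment, "**" crosses segments.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // protect "**" before handling single "*"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

// Assumption: slash-free patterns match the base name; others match the whole relative path.
function isBlocked(filePath, patterns = BLOCKED_PATTERNS) {
  const path = filePath.replace(/\\/g, "/").replace(/^\.\//, "");
  const base = path.split("/").pop();
  return patterns.some((glob) => globToRegExp(glob).test(glob.includes("/") ? path : base));
}

console.log(isBlocked(".env.local"), isBlocked("config/server.key"), isBlocked("src/index.js"));
```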

Requires approval

Sensitive actions pause for a human:

  • Deleting files
  • Editing package.json
  • Editing lock files

Semi-autonomous checkpoints

Optional human approval at:

  • pre_implement · post_implement
  • critical_change (>5 files / sensitive)
  • pre_deploy
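
The approval triggers above could be checked per change set roughly as follows. This is a sketch: the change-set shape, the set of lock-file names treated as sensitive, and the helper approvalReasons are assumptions; only the rules themselves (deletions, package.json, lock files, more than 5 files) come from the text.

```javascript
// Common npm-ecosystem lock files; which ones the tool actually watches is an assumption.
const LOCK_FILE_NAMES = new Set(["package-lock.json", "yarn.lock", "pnpm-lock.yaml"]);

// Returns the reasons a change set would pause for a human (empty means none).
function approvalReasons(change) {
  const reasons = [];
  const modifiedNames = change.modified.map((path) => path.split("/").pop());
  if (change.deleted.length > 0) reasons.push("deletes files");
  if (modifiedNames.includes("package.json")) reasons.push("edits package.json");
  if (modifiedNames.some((name) => LOCK_FILE_NAMES.has(name))) reasons.push("edits a lock file");
  if (change.modified.length + change.deleted.length > 5) {
    reasons.push("critical_change: more than 5 files");
  }
  return reasons;
}

console.log(approvalReasons({ modified: ["src/app.js", "package.json"], deleted: [] }));
console.log(approvalReasons({ modified: ["src/app.js"], deleted: [] }));
```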
Auto-Testing

Six layers of testing run automatically

The test runner queues and executes these as work flows through the pipeline. When autoFixOnFailure is enabled, a failing layer spawns a fix task straight back to Codex.

Wiring test

Connectivity & integration sanity after each phase.

npm run test:wiring || npm run lint

Smoke test

Core functionality builds & runs.

npm run test:smoke

Debug check

No critical errors — lint & types are clean.

npm run lint && npm run typecheck

Startup test

Validates package.json, boots the app, watches for errors (30s).

spawn & observe · 30s timeout

Integration test

Opt-in via guardrail — cross-module behaviour.

npm run test:integration || npm test

E2E validation

Full end-to-end run before the pre_deploy gate.

npm run test:e2e || npm run test
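
The layers and their commands can be sketched as data plus a small runner. The commands are the ones listed above; the runner, the injected exec callback, the integrationEnabled option and the fix-task shape are hypothetical. The startup test is omitted because it spawns and observes the app rather than running a single command.

```javascript
// Commands as listed above; the startup test is left out (spawn & observe, no single command).
const TEST_LAYERS = [
  { name: "wiring", command: "npm run test:wiring || npm run lint" },
  { name: "smoke", command: "npm run test:smoke" },
  { name: "debug", command: "npm run lint && npm run typecheck" },
  { name: "integration", command: "npm run test:integration || npm test", optIn: true },
  { name: "e2e", command: "npm run test:e2e || npm run test" },
];

// exec is injected: it receives a shell command string and returns true on success.
function runLayers(layers, { exec, autoFixOnFailure = true, integrationEnabled = false }) {
  const results = [];
  const fixTasks = [];
  for (const layer of layers) {
    if (layer.optIn && !integrationEnabled) {
      results.push({ name: layer.name, status: "skipped" });
      continue;
    }
    const ok = exec(layer.command);
    results.push({ name: layer.name, status: ok ? "passed" : "failed" });
    if (!ok && autoFixOnFailure) {
      fixTasks.push({ agent: "codex", type: "fix", reason: `${layer.name} test failed` });
    }
  }
  return { results, fixTasks };
}

// Fake executor for the example: only the smoke test fails.
const fakeExec = (command) => !command.includes("test:smoke");
const testRun = runLayers(TEST_LAYERS, { exec: fakeExec });
console.log(testRun.results, testRun.fixTasks);
```

In a real runner, exec would wrap something like Node's child_process.execSync in a try/catch and report whether the command exited cleanly.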
Browser-Level Verification · v1.3.2

Chrome UI auto-testing

When enabled per-repo, Code Easy launches a real Chrome (no manual debug port), drives it over the DevTools Protocol, and feeds console / JS / network errors and screenshots back into the autonomous fix loop — so the agents can verify the running UI, not just the source.

01

Toggle on

Per-repo switch in the Chrome tab gates everything (403 if off).

02

Launch Chrome

Spawns system Chrome with an isolated profile + debug port.

03

Navigate + connect

Opens the test URL & auto-connects the CDP debugger.

04

Capture signals

Console logs, JS exceptions, failed requests, screenshots.

05

Feed the loop

Critical errors → a fix task back to Codex.

MCP tools

chrome_launch · chrome_close · chrome_test_status · chrome_get_errors · chrome_screenshot · chrome_evaluate

Per-repo config

  • enabled · debug_port (9222)
  • test_url · chrome_path
  • headless mode
  • stored in chrome_test_config

Safe by design

  • Isolated --user-data-dir
  • No puppeteer / playwright dep
  • Auto-closed on shutdown
  • Only runs if the repo opts in
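
Putting the per-repo config and the safety properties together, a launch command might be assembled like this. The config keys mirror the names listed above, but the headless key name, the profileDir parameter, the example paths and the helper buildChromeLaunch are assumptions. The command-line switches themselves are standard Chrome flags.

```javascript
// Hypothetical helper: turn a chrome_test_config-style record into a launch command.
function buildChromeLaunch(config, profileDir) {
  // Nothing launches unless the repo has opted in.
  if (!config.enabled) return null;
  const args = [
    `--remote-debugging-port=${config.debug_port ?? 9222}`,
    `--user-data-dir=${profileDir}`, // isolated profile, separate from the user's own
    "--no-first-run",
  ];
  if (config.headless) args.push("--headless");
  args.push(config.test_url);
  return { command: config.chrome_path, args };
}

// Illustrative values only.
const chromeLaunch = buildChromeLaunch(
  {
    enabled: true,
    debug_port: 9222,
    test_url: "http://localhost:3000",
    chrome_path: "/usr/bin/google-chrome",
    headless: true,
  },
  "/tmp/code-easy-chrome-profile"
);
console.log(chromeLaunch.command, chromeLaunch.args.join(" "));
```

The returned command and args can be handed to a process spawner; once Chrome is up, the DevTools Protocol endpoint is reachable on the configured debug port.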
Living Documentation

Knowledge is captured as the project is built

As specs are processed and code lands, the project's intent and history are recorded through the MCP knowledge tools — so the docs grow with the codebase.

Requirements

Specs & acceptance criteria stored via store_requirement.

Decisions (ADRs)

Architecture choices logged with store_decision.

Plans & tasks

Every generated plan and task breakdown persists in the DB.

Session memory

Summaries via store_session_summary at session boundaries.

How it works today: documentation is MCP-assisted — requirements, decisions, plans and session summaries are captured through the knowledge tools and surfaced in the dashboard, rather than auto-written to files. Hooks nudge an agent to record a session summary at natural stopping points.

Global Setting

MCP Safeguards

All 88 tools are exposed to Claude through the MCP server and on by default. Each can be switched on or off from one global setting — so you control exactly which capabilities the agents can use, grouped here by function.