I Built a Multi-Agent System That Fixes Flaky Tests

CI is like a box of chocolates — you never know what you're going to get.

That red pipeline? It might be a real bug. Or it might be the same flaky test that failed yesterday, passed this morning, and will fail again tomorrow for no discernible reason. You re-run the job. It passes. You move on. And then it happens again next week, to someone else, who spends the same 30 minutes figuring out the same problem.

Flaky tests are not dramatic. They don't crash production or page anyone at 3am. They're a slow leak — small enough to ignore, persistent enough to compound. Over months, they erode trust in the pipeline. We stop believing red means broken. Reviews stall waiting for "one more green run." New team members, without the tribal knowledge of which failures are "safe to ignore," get blocked entirely.

I've spent years working in CI-heavy environments: building test infrastructure, debugging pipeline failures, onboarding teams onto test frameworks they didn't write. I've given conference talks about this and run consulting engagements around it. The same pattern shows up everywhere: flaky tests are everyone's problem and nobody's priority.

So I built a system that makes them its priority.

The idea

I figured I could take what I know about how flaky tests actually get fixed and transfer it to AI: make a copy of my workflow.

The result is a multi-agent orchestrator that downloads CI artifacts from a failed pipeline, spins up parallel AI investigators to analyze the failure from different angles, selects the most credible root cause, applies a targeted fix, verifies the fix compiles, optionally runs it through an AI code reviewer, pushes it to a branch, triggers a verification pipeline, and repeats until the fix sticks — or escalates to a human when it can't.

It runs autonomously. I kick it off, go do other work, and come back to a merge request with a fix, an explanation of the root cause, and a history of everything the agents tried.
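The control flow described above can be sketched as a small loop. This is a hedged illustration, not the actual code: `runIteration` stands in for the whole investigate-fix-verify cycle, and every name here is a placeholder.

```typescript
// Illustrative sketch of the top-level orchestration loop. The real system
// drives AI agents and a CI pipeline; this stub only models the control flow.
type IterationResult = "pass" | "fail" | "no-fix";

function orchestrate(
  runIteration: (attempt: number, notes: string[]) => IterationResult,
  maxIterations = 5
): "fixed" | "escalated" {
  const notes: string[] = []; // context carried forward between iterations
  let consecutiveNoFix = 0;

  for (let attempt = 1; attempt <= maxIterations; attempt++) {
    const result = runIteration(attempt, notes);
    if (result === "pass") return "fixed";
    if (result === "no-fix") {
      // Two iterations in a row with no proposed change: hand back to a human.
      if (++consecutiveNoFix >= 2) return "escalated";
    } else {
      consecutiveNoFix = 0;
      notes.push(`attempt ${attempt}: fix did not hold`);
    }
  }
  return "escalated";
}
```

The key property is that "escalated" is a first-class outcome, not an error path.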

How the agents work

The system uses a dozen or so specialized AI agents, each with scoped permissions and a specific role. No agent has full access to everything. Investigators can only read. The fixer can edit files but only within an allowlist. The reviewer can only read the diff. This is defense in depth: the orchestrator controls what each agent can see and do. It's not only a security concern; it's also about context efficiency. Whenever an agent's role was too broad or its task too large, it started hitting timeouts.
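One way to picture the scoping is as data, not behavior: write access is a property of the agent, checked in one place. The shapes and names below are illustrative assumptions, not the system's real types.

```typescript
// Hypothetical per-agent permission scope. Read access is implied for all
// agents here; only write access is modeled.
interface AgentScope {
  name: string;
  canWrite: boolean;
  writeAllowlist: string[]; // path prefixes the agent may modify
}

const investigator: AgentScope = { name: "investigator", canWrite: false, writeAllowlist: [] };
const fixer: AgentScope = { name: "fixer", canWrite: true, writeAllowlist: ["src/test/"] };

// Central check: an edit is allowed only if the agent can write at all AND
// the target path falls under one of its allowlisted prefixes.
function mayEdit(scope: AgentScope, path: string): boolean {
  return scope.canWrite && scope.writeAllowlist.some((prefix) => path.startsWith(prefix));
}
```

A hallucinated edit from a read-only agent simply never reaches the filesystem.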

Phase 0: Reconnaissance

Before any analysis begins, the system fetches the relevant job from the pipeline, pulls the logs, and collects basic context: which job is failing, which tests didn't pass, what the reports say. This is a pure data-gathering phase — no hypotheses, no fixes. Only once this context is assembled can the investigators start work.
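A minimal sketch of one recon step, extracting failing test names from a raw log. The `FAILED <test id>` line format is purely an assumption for illustration; real CI logs would need a parser per test framework.

```typescript
// Pull failing test identifiers out of a CI log, assuming (hypothetically)
// that failures appear as lines of the form "FAILED <fully.qualified.test>".
function parseFailingTests(log: string): string[] {
  return log
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("FAILED "))
    .map((line) => line.slice("FAILED ".length));
}
```

The point of the phase is exactly this kind of mechanical extraction: facts in, no hypotheses out.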

Phase 1: Investigation (parallel)

Three investigators run simultaneously, each approaching the failure from a different angle:

  • The first focuses on test isolation: preconditions, setup/teardown, state leaking between tests
  • The second focuses on shared infrastructure: cross-test state, singleton resources, global configuration
  • The third focuses on timing: async operations, race conditions, environment-dependent waits

Each investigator reads the job logs, test reports, and the actual test source code. Each produces a root cause analysis with a proposed fix and a confidence level. Three in parallel means three independent hypotheses.
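The fan-out itself is simple to sketch. Each "agent" below is a stub async function; in the real system it would be an AI session with its own prompt. The types are assumptions.

```typescript
// Run all investigators concurrently. A single failed investigation yields
// null rather than aborting the whole phase.
interface Hypothesis {
  angle: string;       // e.g. "isolation", "shared-state", "timing"
  rootCause: string;
  confidence: number;  // 0..1
}

type Investigator = (context: string) => Promise<Hypothesis>;

async function investigate(
  agents: Investigator[],
  context: string
): Promise<(Hypothesis | null)[]> {
  // Promise.all preserves order, so results line up with the agent list.
  return Promise.all(agents.map((agent) => agent(context).catch(() => null)));
}
```

Tolerating per-agent failure matters: one timed-out investigator shouldn't cost you the other two hypotheses.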

Phase 2: Verification (parallel)

Three verifiers cross-check each investigator's hypothesis:

  • Does the file the investigator referenced actually exist?
  • Is the proposed change syntactically valid?
  • Does the reasoning hold up against the surrounding code?

This catches hallucinated file paths, impossible edits, and analyses that sound plausible but don't match reality. It's a cheap sanity gate before committing to a fix.
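The gate can be pictured as a pure predicate. File existence and syntax checking are injected here so the sketch stays self-contained; the real checks would hit the filesystem and a parser. All names are illustrative.

```typescript
// Cheap sanity gate over one investigator's output: reject hallucinated
// paths and impossible edits before the fixer spends an iteration on them.
interface Analysis {
  file: string;
  proposedChange: string;
}

function passesSanityGate(
  analysis: Analysis,
  fileExists: (path: string) => boolean,
  isSyntacticallyValid: (change: string) => boolean
): boolean {
  return fileExists(analysis.file) && isSyntacticallyValid(analysis.proposedChange);
}
```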

Phase 3: Selection

A selector agent reviews all three analyses and picks the most credible one. It prefers analyses that reference specific files, have higher confidence, and propose concrete changes over vague suggestions. If all three agree, it uses the consensus. If only one investigation succeeded, it uses that.

This phase was the hardest to implement. The agent kept trying to redo the previous investigators' analysis, which burned time at this stage and produced no result once the context window was exceeded. The fix was strictly limiting this agent's role: it only selects, it doesn't re-analyze.
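The "only selects" constraint can be sketched as a pure ranking function: no agent call, no re-analysis, just a decision rule over what already exists. The scoring below is an illustrative stand-in for the real preference order.

```typescript
// Pick the most credible analysis. Concrete file references outrank raw
// confidence; confidence breaks ties. Null entries are failed investigations.
interface RankedAnalysis {
  file: string;        // "" if the analysis named no concrete file
  confidence: number;  // 0..1
}

function selectAnalysis(candidates: (RankedAnalysis | null)[]): RankedAnalysis | null {
  const valid = candidates.filter((c): c is RankedAnalysis => c !== null);
  if (valid.length === 0) return null;
  return [...valid].sort(
    (a, b) =>
      Number(b.file !== "") - Number(a.file !== "") || b.confidence - a.confidence
  )[0];
}
```

Because it's deterministic over existing inputs, this step costs almost no context.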

Phase 4: Implementation

The fixer receives the selected analysis and implements the change. It can read files, edit them, and write new ones — but only within paths defined in the project's allowlist. This gives it autonomy while making sure it can't break anything in the project. After making changes, the orchestrator runs a compile check (TypeScript typecheck, Kotlin build, whatever the project needs) to catch syntax errors before pushing.
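Since the check command varies per project, the compile gate is easiest to sketch with the check injected. This is an assumed shape, not the system's real interface.

```typescript
// Post-edit compile gate: nothing is pushed unless the project's check
// command (tsc, gradle build, ...) comes back clean.
interface CheckResult {
  ok: boolean;
  stderr: string;
}

function compileGate(
  runCheck: () => CheckResult
): { push: true } | { push: false; feedback: string } {
  const result = runCheck();
  if (result.ok) return { push: true };
  // Compile errors become feedback for the next fixer iteration
  // instead of ever reaching the remote branch.
  return { push: false, feedback: result.stderr };
}
```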

Phase 5: Review (conditional)

If the diff is larger than a configurable threshold, a reviewer agent evaluates the change against project-specific review criteria. If it rejects the fix, the fixer gets the feedback and tries again — up to three rounds. Small, surgical fixes skip review entirely and go straight to CI.
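The round-trip logic reduces to a small loop. The threshold and round limit below are illustrative defaults, not the system's actual configuration.

```typescript
// Conditional review: small diffs skip review, large diffs get up to
// maxRounds of reviewer feedback before the change is rejected.
type Review = { approved: boolean; feedback: string };

function reviewLoop(
  diffLines: number,
  review: (round: number) => Review,
  threshold = 50,
  maxRounds = 3
): "skipped" | "approved" | "rejected" {
  if (diffLines <= threshold) return "skipped"; // surgical fixes go straight to CI
  for (let round = 1; round <= maxRounds; round++) {
    if (review(round).approved) return "approved";
    // Otherwise the feedback goes back to the fixer and a new round starts.
  }
  return "rejected";
}
```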

[Image: agents working in parallel]

The loop

After a fix is committed and pushed, the orchestrator triggers a verification pipeline. If it passes, it runs again — and again — until it hits the required number of consecutive passes (typically three). Only then does it declare the fix verified.

If the pipeline fails, the system compares the new failures against the baseline. If a new test broke (regression), it flags it. Then it downloads fresh artifacts and starts a new investigation cycle, carrying forward context from previous iterations — what was tried, what failed, what the compile errors were. The fixer sees the full history and avoids repeating the same mistake.

If the fixer can't propose any changes two iterations in a row, the system exits with an "escalated" status. It doesn't force bad fixes. Better to hand it back to a human with a detailed analysis than to push something wrong.
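Two of the loop's checks are easy to make concrete: regression detection against the baseline, and the "N consecutive green runs" rule. Function names and the default of three runs are taken from the description above; everything else is a sketch.

```typescript
// A regression is any test failing now that was not failing in the baseline.
function findRegressions(baselineFailures: string[], currentFailures: string[]): string[] {
  return currentFailures.filter((test) => !baselineFailures.includes(test));
}

// The fix counts as verified only after N consecutive passing pipeline runs;
// any failure resets the streak.
function isVerified(runResults: boolean[], requiredConsecutive = 3): boolean {
  let streak = 0;
  for (const passed of runResults) {
    streak = passed ? streak + 1 : 0;
    if (streak >= requiredConsecutive) return true;
  }
  return false;
}
```

The streak reset is the whole point for flaky tests: a pass followed by a fail proves nothing.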

This loop is still under active development — and you can probably imagine how many edge cases keep showing up.

The interface

A lightweight web dashboard streams logs in real time. You see which phase the system is in, which agents are active, what each investigator is focused on, and the fixer's summary of changes. There's a kill button if you need to abort. It's a nice addition that gives you a quick read on what's happening without digging through raw logs. When the system finishes, it plays a sound.

Security model

Giving AI agents write access to a codebase requires guardrails. The system enforces several layers:

  • Tool scoping — investigators and reviewers are read-only. The fixer can edit files but only within the allowlist. No agent can run arbitrary shell commands unless the project explicitly opts in.
  • Branch protection — the orchestrator verifies the active branch before every commit. If the fixer somehow lands on the wrong branch, the system hard-exits.
  • Path allowlisting — every modified file is checked against the project's allowlist. Changes outside permitted directories are reverted automatically.
  • API scope limits — the CI integration is restricted to specific operations: trigger pipelines, fetch logs, download artifacts, create merge requests. Branch deletion, member management, and project settings are blocked.
  • Timeouts — every agent invocation has a hard timeout to prevent runaway sessions.
  • Hooks blocking file deletion — an agent cannot delete any file from the repository, even if it decides that would be the right move.
  • Service account — agents operate under a dedicated account, so it's immediately clear in GitLab history which changes were made by the system and which by a human.
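Two of those layers can be sketched directly: the allowlist sweep over a fixer's changes, and the branch check before commit. Both shapes are illustrative assumptions, not the real enforcement code.

```typescript
// Partition a fixer's changed files: anything outside the allowlisted path
// prefixes goes into "reverted" and is restored automatically.
function partitionChanges(
  changedFiles: string[],
  allowlist: string[]
): { kept: string[]; reverted: string[] } {
  const kept: string[] = [];
  const reverted: string[] = [];
  for (const file of changedFiles) {
    (allowlist.some((prefix) => file.startsWith(prefix)) ? kept : reverted).push(file);
  }
  return { kept, reverted };
}

// Branch protection: being on the wrong branch means something upstream went
// badly wrong, so the system hard-exits rather than committing.
function assertOnBranch(active: string, expected: string): void {
  if (active !== expected) {
    throw new Error(`on branch ${active}, expected ${expected}`);
  }
}
```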

The system is designed so that the worst-case failure mode is "it didn't fix anything" — not "it broke something."

What I learned building this

Multi-agent beats single-agent for investigation. A single agent asked "why did this test fail?" often fixates on one hypothesis. Three agents with different analytical lenses consistently produce better root cause analyses. That structure — investigate, then verify, then select — filters out false positives before they reach the fixer.

Iterative beats one-shot for fixing. The first fix attempt doesn't always work. But the system learns across iterations. By the second or third try, the fixer has seen the compile errors, the reviewer's feedback, and the fresh pipeline results. That accumulated context makes each subsequent attempt more targeted. Sometimes the first iteration fixes something, the second removes it, and only the third — with the context of both — lands on the right outcome. Just like in life.

Read-only is underrated. Most agents in this system can't write anything. That constraint forces them to think carefully about what they recommend, knowing someone else will implement it. It also means a hallucinating investigator can't do any damage — the worst it can do is waste a cycle.

A few practical details

Every run produces full logs from each phase — both the prompt sent to the agent and the result of its work. On top of that, the system saves several context files: selected fixes, pipeline outcomes, a short summary of root causes. This serves a few purposes: it makes debugging easier when something goes wrong, it feeds the MR description so the person reviewing the merge request understands what was changed and why — and if the system didn't make it to the end, you can drop those logs directly into Claude and finish the job manually. The context is already gathered, the analysis is partly done — you just pick up the baton.

The investigator prompts are stored as separate .md files — one per agent. Easy to edit, version, and improve. And they evolve — every time the system encounters a new type of failure that could have been anticipated, I update the prompt with that knowledge. That's where the "copy of my workflow" actually lives.

MRs after verification are not auto-merged — they wait for human approval. At this stage, every merge is also an opportunity to observe: does the fix make sense, did the agent cut corners, does the prompt need adjusting. It's still a process of teaching the system, not just fixing tests.

Access restrictions for each agent — what they can read, edit, and which tools they have available — are configured directly in Claude's settings. This means there's no need to enforce it in the orchestrator code — the model respects its assigned scope on its own.

I wanted to use Claude on a subscription plan rather than the API. With a dozen or so agents spinning up in a single session — ideally on the highest available model — the per-fix cost on token-based pricing can be surprising. A subscription with unlimited access changes the economics entirely.

Where it stands

The system runs daily against real pipelines. It produces merge requests with real fixes that get reviewed and merged by humans.

It's not perfect. It escalates when it should, it sometimes takes three iterations where a human would take one, and it can't fix tests that require understanding business logic it hasn't been taught. But it handles the mechanical, time-consuming category of flaky test investigation that nobody wants to do — and it does it consistently, thoroughly, and without complaining.

Flaky tests are everyone's problem. Now there's a system that makes them nobody's burden.