An AI agent module dropped into an existing B2B SaaS, behind a feature flag, with an eval harness that gated every PR.
A workflow SaaS with 1,200 enterprise customers wanted to ship an AI assistant inside their product without disrupting the existing surface or risking regressions. We built the agent module, the eval corpus, and the rollout plan, and merged the module behind a flag on day 19.
▷ outcomes
0 · Regression incidents in existing surface
312 · Eval cases gating every PR
T+672h · Design-partner rollout
94% · Eval threshold cleared at merge
[ §01 ] the cycle
How 720 hours actually ran.
Day 01 — 04
Audit + spec
scope.agent read the existing codebase end-to-end and produced an audit memo: where the agent could live without entangling the existing surface, which APIs to extend, which to leave untouched. The spec was a 12-page document with eight signed ADRs.
↳audit.md ↳8 ADRs ↳module boundary diagram
Day 05 — 14
Module build
build.agent shipped the agent module: planner, tool layer, memory store, response formatter. Forty-one PRs, every one reviewed by their head of engineering. Zero changes to the existing surface code outside three extension points.
↳41 PRs ↳module isolated ↳ext points × 3
Day 15 — 22
Eval corpus + threshold gating
qa.agent built a 312-example eval corpus drawn from anonymized customer workflows. The CI gate scored four metrics, each with a threshold the agent had to clear before any PR could merge. Two PRs were blocked and rewritten.
↳312 eval cases ↳4 scoring axes ↳CI gate live
Day 23 — 30
Flagged rollout + monitoring
deploy.agent provisioned the feature flag tree (cohort-based, per-tenant). Soft rollout to 12 design-partner accounts on day 27, with monitor.agent on a custom alert harness watching for response quality drift.
↳flag tree ↳12 design partners live ↳drift alerts armed
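A minimal sketch of how a cohort-based, per-tenant flag lookup could resolve, assuming the most specific rule wins: tenant override first, then cohort, then a default of off. The `FlagNode` class, the `is_enabled` function, and the tenant and cohort names are hypothetical illustrations, not the flag tree deploy.agent provisioned, and a single node stands in for the full tree.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlagNode:
    """One flag with cohort and tenant rules. All names are hypothetical."""
    default: bool = False                        # off unless a rule enables it
    cohorts: dict = field(default_factory=dict)  # cohort name -> bool
    tenants: dict = field(default_factory=dict)  # tenant id -> bool


def is_enabled(flag: FlagNode, tenant_id: str, cohort: Optional[str] = None) -> bool:
    """Resolve most-specific first: tenant override, then cohort, then default."""
    if tenant_id in flag.tenants:
        return flag.tenants[tenant_id]
    if cohort is not None and cohort in flag.cohorts:
        return flag.cohorts[cohort]
    return flag.default


# Assumed example: the design-partner cohort is on, one tenant is opted out.
agent_flag = FlagNode(
    cohorts={"design_partner": True},
    tenants={"tenant-042": False},
)
```

With this ordering, a single tenant can be switched off without touching the rest of its cohort, which is one way to keep a soft rollout reversible per account.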
[ §02 ] agent log · selected
What the loop looked like.
[ §03 ] notes from the cycle
AURORA is the engagement shape we get asked about most often: an established B2B SaaS with paying customers, real revenue, and a leadership team that wants to ship an AI feature without becoming an AI company by accident.
The constraint we agreed to up front
Existing code stays untouched outside three explicit extension points where the agent module talks to the host application. The module lives in its own namespace, behind a feature flag. This isn’t an architectural preference; it’s a risk posture. The existing surface has 1,200 customers depending on it. The agent module starts at zero.
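One way to picture an extension point under this risk posture is a host-side call site that returns nothing whenever the flag is off, no handler is registered, or the agent fails, so the existing code path runs unchanged. This is a sketch under stated assumptions: the registry, `register`, `call_extension`, and the point names are hypothetical and not taken from AURORA's codebase.

```python
from typing import Callable, Optional

# Hypothetical registry: extension point name -> handler supplied by the agent module.
_handlers: dict = {}


def register(point: str, handler: Callable[[dict], Optional[dict]]) -> None:
    """Called from the agent module's namespace; the host only knows point names."""
    _handlers[point] = handler


def call_extension(point: str, payload: dict, flag_on: bool) -> Optional[dict]:
    """Host-side call site.

    Returns None when the flag is off, nothing is registered for the point,
    or the handler raises, so the caller falls back to its existing behaviour.
    """
    if not flag_on:
        return None
    handler = _handlers.get(point)
    if handler is None:
        return None
    try:
        return handler(payload)
    except Exception:
        # Assumption: an agent failure must never break the host surface.
        return None
```

The host treats `None` as "no agent output" and renders exactly what it rendered before the module existed.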
What the eval harness actually does
Every PR runs the agent module against 312 anonymized customer workflows, scored on four axes: task completion, format compliance, tool-call correctness, and tone fidelity. Below threshold on any axis, the PR is blocked. The corpus and the thresholds are now AURORA’s permanent property — they will gate every future AI feature their team ships.
This is the part of an AI build that consultancies skip and product teams discover late. We make it the artefact you walk away with.
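The gating rule can be sketched as follows, assuming each eval case yields a score between 0 and 1 on each axis and the gate compares the per-axis mean against a threshold. The four axis names come from the description above; the threshold values, the `gate` function, and the score layout are illustrative assumptions, not the harness that was shipped.

```python
from statistics import mean

# The four axes named in the text. Threshold values are placeholders;
# the source does not state the per-axis thresholds.
THRESHOLDS = {
    "task_completion": 0.90,
    "format_compliance": 0.90,
    "tool_call_correctness": 0.90,
    "tone_fidelity": 0.90,
}


def gate(case_scores, thresholds=THRESHOLDS):
    """Return (passed, failures) for one PR's eval run.

    case_scores: list of dicts, one per eval case, mapping axis name to a
    score in [0, 1]. The PR passes only if the mean score on every axis
    meets its threshold; failures maps each failing axis to its mean.
    """
    if not case_scores:
        raise ValueError("eval run produced no scored cases")
    failures = {}
    for axis, minimum in thresholds.items():
        axis_mean = mean(case[axis] for case in case_scores)
        if axis_mean < minimum:
            failures[axis] = axis_mean
    return (not failures, failures)
```

In CI, a wrapper script would exit non-zero when `passed` is false and print `failures`, so the blocked PR shows which axis fell short.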
What we explicitly didn’t build
A user-facing chat surface in the product. AURORA already had a clear workflow product; adding a generic chat would have weakened the existing UX. The agent surfaces as small inline interventions inside the existing workflow — a recommendation card, an inline auto-complete, a “fix this” button — never as a separate chat panel.
That call was made on day three, by their head of product, after a 30-minute discussion of scope.agent’s draft. It’s the kind of judgement call that compounds into a product that doesn’t feel grafted on.
from the client
"We've been quoted nine months by larger shops for this exact piece of work. Kaedax did it in thirty days and the eval harness they built is the artefact we now reuse on every AI feature."
— Head of Engineering · AURORA