Runbook 14 — Artifact contract + failure taxonomy for agents

Objective

Create a deterministic artifact contract that an agent can consume without guessing.
Represent each failed test as a structured record with evidence paths.
Normalize first-pass classification into a controlled vocabulary.
Make failure evidence reviewable without a rerun.

Rule

If a failure is not diagnosable by a human, it is not ready for an agent.

Definition of done

✔ At least one pipeline run emits structured failure artifacts.
✔ Each failed test maps to one primary classification.
✔ Artifact paths are stable and repository-relative.
✔ A reviewer can validate whether the classification is reasonable.

Artifact contract

Canonical location	`test-results/`
Failure records	`test-results/failures/<test-id>.json`
Triage output	`test-results/triage/triage-report.json`
Supporting evidence	Trace, screenshot, video, stdout, stderr, JUnit, Playwright results.

Recommended structure

test-results/
  junit.xml
  results.json
  failures/
    <test-id>.json
  triage/
    triage-report.json
  traces/
  screenshots/
  videos/
  logs/
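The sample `testId` in this runbook contains `::`, `/`, and spaces, so it cannot be used verbatim as the `<test-id>` file name. The runbook does not say how the id is turned into a file name; the sketch below assumes a simple lowercase slug. `failureRecordPath` is a hypothetical helper, not part of any existing tool.

```javascript
// Hypothetical helper: the slug scheme is an assumption, not defined by the runbook.
function failureRecordPath(testId) {
  const slug = testId
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse "::", "/", ".", and spaces into single dashes
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return `test-results/failures/${slug}.json`;
}

console.log(
  failureRecordPath('auth-chromium::specs/sprint5/article.publish.spec.js::user can publish article')
);
```

Slugging is lossy, so two distinct ids could in principle map to the same file. If that matters, appending a short hash of the original id is one option.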

Canonical failure schema

{
  "schemaVersion": "1.0",
  "testId": "auth-chromium::specs/sprint5/article.publish.spec.js::user can publish article",
  "title": "user can publish article",
  "project": "auth-chromium",
  "file": "specs/sprint5/article.publish.spec.js",
  "line": 42,
  "environment": "qa",
  "status": "failed",
  "startedAt": "2026-03-28T10:00:00Z",
  "durationMs": 18432,
  "errorType": "TimeoutError",
  "errorMessage": "Timeout 10000ms exceeded while waiting for getByRole('button', { name: 'Publish Article' })",
  "step": "Click Publish Article",
  "evidence": {
    "trace": "test-results/traces/publish-article-trace.zip",
    "screenshot": "test-results/screenshots/publish-article.png",
    "video": "test-results/videos/publish-article.webm",
    "stdout": "test-results/logs/publish-article.stdout.log",
    "stderr": "test-results/logs/publish-article.stderr.log"
  },
  "networkContext": {
    "webBaseUrl": "http://sut.testlab:3000",
    "apiBaseUrl": "http://sut.testlab:3001/api",
    "ciNatMode": false
  },
  "classification": {
    "primary": "LOCATOR",
    "secondary": "TIMING",
    "confidence": 0.86,
    "ruleId": "locator_not_found_role_name"
  }
}
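A consumer can reject malformed records before attempting any triage. The following is a minimal sketch of such a check. Which fields count as required is an assumption made for illustration; the runbook shows a sample record but does not mark fields as mandatory. `validateFailureRecord` and `REQUIRED_FIELDS` are hypothetical names.

```javascript
// Assumption: these are the fields a consumer cannot work without.
const REQUIRED_FIELDS = ['schemaVersion', 'testId', 'status', 'errorMessage', 'evidence', 'classification'];

// Returns a list of human-readable problems; an empty list means the record passed.
function validateFailureRecord(record) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (record[field] === undefined || record[field] === null) {
      problems.push(`missing field: ${field}`);
    }
  }
  for (const [kind, evidencePath] of Object.entries(record.evidence ?? {})) {
    // Repository-relative means: no leading slash, no drive letter, no ".." segment.
    const isRelative =
      typeof evidencePath === 'string' &&
      !/^(\/|[A-Za-z]:[\\/])/.test(evidencePath) &&
      !evidencePath.split('/').includes('..');
    if (!isRelative) {
      problems.push(`evidence.${kind} is not repository-relative: ${evidencePath}`);
    }
  }
  if (record.classification && !record.classification.primary) {
    problems.push('classification.primary is required');
  }
  return problems;
}
```

Returning a list of problems instead of throwing on the first one lets a reviewer see everything wrong with a record in a single pass.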

Failure taxonomy

ENV — hostname resolution, refused connection, unavailable dependency, bad config
DATA — collisions, stale records, seed/reset failures, non-idempotent state
AUTH — login/setup/token/session/storage state failures
LOCATOR — missing role/name, unstable selector, duplicate match, detached element
TIMING — waits exceeded, rendering race, eventual consistency issue
ASSERTION — product mismatch, wrong persisted state, contract mismatch
INFRA — runner issue, artifact upload problem, transient CI/platform fault
UNKNOWN — evidence insufficient or rule gap
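To keep the vocabulary controlled, a producer can refuse any category outside the list above. A small sketch, with `FAILURE_CATEGORIES` and `isKnownCategory` as hypothetical names:

```javascript
// The eight categories from the taxonomy, frozen so they cannot be extended at runtime.
const FAILURE_CATEGORIES = Object.freeze([
  'ENV', 'DATA', 'AUTH', 'LOCATOR', 'TIMING', 'ASSERTION', 'INFRA', 'UNKNOWN'
]);

// True only for an exact, case-sensitive match against the taxonomy.
function isKnownCategory(value) {
  return FAILURE_CATEGORIES.includes(value);
}
```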

Deterministic rules first

// Intended to be evaluated in order: specific rules sit above generic ones
// (for example, the locator rule precedes the generic timeout rule).
const FAILURE_RULES = [
  { id: 'dns_lookup_failed', match: /ENOTFOUND|getaddrinfo/i, primary: 'ENV' },
  { id: 'connection_refused', match: /ECONNREFUSED/i, primary: 'ENV' },
  // Word boundaries keep 401/403 from matching inside durations such as "Timeout 14010ms".
  { id: 'unauthorized_forbidden', match: /\b(401|403)\b|unauthorized|forbidden/i, primary: 'AUTH' },
  { id: 'duplicate_or_exists', match: /duplicate|already exists|unique constraint/i, primary: 'DATA' },
  { id: 'locator_not_found_role_name', match: /locator|getByRole|getByLabel|strict mode violation|not found/i, primary: 'LOCATOR' },
  { id: 'timeout_waiting', match: /Timeout|waiting for/i, primary: 'TIMING' },
  { id: 'assertion_failure', match: /expect\(|toBe|toEqual|received|expected/i, primary: 'ASSERTION' }
];
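The runbook lists the rules but not how they are applied. One plausible reading, sketched below, is that the first matching rule supplies the primary classification and the next matching rule with a different category supplies the secondary. `classifyFailure` is a hypothetical function. The `confidence` field is left out because the runbook does not define how it is computed.

```javascript
// Sketch: first match wins as primary; no match falls back to UNKNOWN.
function classifyFailure(errorMessage, rules) {
  const hits = rules.filter((rule) => rule.match.test(errorMessage));
  if (hits.length === 0) {
    return { primary: 'UNKNOWN', secondary: null, ruleId: null };
  }
  const [first] = hits;
  // Secondary is the next hit that names a different category, if any.
  const second = hits.find((rule) => rule.primary !== first.primary);
  return {
    primary: first.primary,
    secondary: second ? second.primary : null,
    ruleId: first.id
  };
}

// Two rules copied from FAILURE_RULES, kept in the same relative order.
const sampleRules = [
  { id: 'locator_not_found_role_name', match: /locator|getByRole|getByLabel|strict mode violation|not found/i, primary: 'LOCATOR' },
  { id: 'timeout_waiting', match: /Timeout|waiting for/i, primary: 'TIMING' }
];

const result = classifyFailure(
  "Timeout 10000ms exceeded while waiting for getByRole('button', { name: 'Publish Article' })",
  sampleRules
);
console.log(result);
```

For the error message from the sample record, this yields `LOCATOR` as primary, `TIMING` as secondary, and `locator_not_found_role_name` as the rule ID, which matches the `classification` block in the canonical schema.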

Commands

Validate artifact surface

find test-results -maxdepth 3 -type f | sort
jq . test-results/failures/sample-failure.json
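Listing files shows what exists, but not whether a record's evidence paths point at real files. A minimal Node.js sketch of that check follows. `findMissingEvidence` is a hypothetical helper, and it assumes evidence paths are relative to the repository root passed in as `repoRoot`.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Returns the evidence entries whose files do not exist under repoRoot.
function findMissingEvidence(record, repoRoot = process.cwd()) {
  return Object.entries(record.evidence ?? {})
    .filter(([, relPath]) => !fs.existsSync(path.join(repoRoot, relPath)))
    .map(([kind, relPath]) => ({ kind, path: relPath }));
}
```

An empty array means every referenced artifact is present. Anything else is a list a reviewer can act on without rerunning the test.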

Human override rule

If the classification is wrong, fix the rule or add a rule. Do not patch output manually and call that “agent intelligence.”

Expected outputs

✔ Each failed test produces one normalized JSON artifact.
✔ Evidence paths line up with the actual trace/screenshot/video files.
✔ Classifications are explicit and reviewable.
✔ JSON is the source of truth; summaries are derived from it.
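Because summaries are derived from the JSON, a report can be regenerated at any time from the failure records alone. A sketch of one such derivation, counting failures per primary classification (`summarizeByPrimary` is a hypothetical name):

```javascript
// Counts records per primary classification; records without one count as UNKNOWN.
function summarizeByPrimary(records) {
  const counts = {};
  for (const record of records) {
    const key = record.classification?.primary ?? 'UNKNOWN';
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```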

Failure modes

Artifact too sparse: only error message, no evidence paths. Action: include trace, screenshot, stdout, stderr.
Multiple classifications compete: locator timeout vs generic timeout. Action: preserve primary + secondary and rule ID.
Local vs CI path drift: consumer cannot find artifacts. Action: standardize on repository-relative paths.
Human-readable output disagrees with JSON: the report and the record diverge. Action: JSON wins; regenerate the report from it.
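Path drift between local and CI runs usually comes from absolute paths leaking into the record. A sketch of normalizing a path before it is written, assuming the producer knows the repository root (`toRepoRelative` is a hypothetical helper):

```javascript
const path = require('node:path');

// Converts an absolute artifact path to a repository-relative, forward-slash path.
// Throws if the artifact lives outside the repository root.
function toRepoRelative(absolutePath, repoRoot) {
  const rel = path.relative(repoRoot, absolutePath);
  const escapesRoot = rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel);
  if (escapesRoot) {
    throw new Error(`artifact is outside the repository: ${absolutePath}`);
  }
  // Use forward slashes so records written on Windows match those written on Linux.
  return rel.split(path.sep).join('/');
}
```

Failing loudly on paths outside the repository keeps an unusable path from being stored silently.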

Operational value

Structured inputs for CI triage agents.
Reviewable evidence without a rerun.
Stable taxonomy for trend analysis later.
Machine-readable failures instead of screenshot archaeology.