Runbook 03 — Diagnosability via Intentional Failure

Validate that failures are classified, explainable, and actionable. Sprint 3 intentionally produces red builds to prove that CI acts as a diagnostic system, not a guessing engine.

Tags: intentional red, failure classification, owned outcomes, fix without rerun, mode-aware boundary

Status: This sprint is red by design.

Objective + success criteria

Objective: Prove that when failures occur, the automation framework produces immediate, unambiguous diagnostic signals that route ownership without reruns.

Rule: CI output must be sufficient to classify the failure without “try again” reruns.

Default boundary (CI_NAT_MODE!=1): no port-forwarding, no tunnels, no localhost HTTP targets. CI must reach the SUT by hostname (e.g., sut.testlab).

Recording exception (CI_NAT_MODE=1): localhost targets (e.g., 127.0.0.1:3000) are allowed only for sanitized recording runs. This must be explicit and logged.

Success criteria:

  • Failures are intentionally triggered and deterministic
  • Each failure includes a named scenario ID (S3.x)
  • Failure output names the owning role (Test / Infra / Security)
  • Root cause is identifiable from CI output alone
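
To make the criteria above concrete, here is a minimal sketch of a one-line failure signal that carries the scenario ID, the cause, and the owner. The helper name formatFailure and its validation rules are assumptions for illustration, not part of the framework; the output format mirrors the messages printed by this runbook's preflight classifier.

```javascript
// Hypothetical helper (not part of the framework): builds a one-line failure signal.
const KNOWN_OWNERS = new Set(["Test", "Infra", "Security"]);

function formatFailure({ scenario, owner, cause }) {
  // Scenario IDs follow the S3.x pattern used in this runbook (S3.1, S3.CONFIG, ...).
  if (!/^S3\.[A-Za-z0-9]+$/.test(String(scenario))) {
    throw new Error(`invalid scenario ID: ${scenario}`);
  }
  // Allow shared ownership such as "Test/Infra", but only with known roles.
  const roles = String(owner).split("/");
  if (!roles.every((role) => KNOWN_OWNERS.has(role))) {
    throw new Error(`unknown owner: ${owner}`);
  }
  // A failure without a cause cannot be diagnosed from CI output alone.
  if (!cause) {
    throw new Error("cause is required so the failure is explainable");
  }
  return `${scenario}: ${cause} (Owner: ${owner})`;
}

console.log(
  formatFailure({
    scenario: "S3.2",
    owner: "Infra",
    cause: "connect ECONNREFUSED sut.testlab:5999",
  })
);
// prints "S3.2: connect ECONNREFUSED sut.testlab:5999 (Owner: Infra)"
```

Keeping the signal on a single line makes it easy to grep in CI logs and to route by the "(Owner: ...)" suffix.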

Environment contract used by this runbook

Prefer WEB_BASE_URL and API_BASE_URL. BASE_URL may exist for backward compatibility; treat it as an alias for WEB_BASE_URL unless the framework says otherwise.

# Default (production-like)
export CI_NAT_MODE="0"
export WEB_BASE_URL="http://sut.testlab:3000"
export API_BASE_URL="http://sut.testlab:3001/api"
export BASE_URL="${WEB_BASE_URL}"   # compatibility only

Recording mode example (CI_NAT_MODE=1)

# Temporary during sanitized recordings
export CI_NAT_MODE="1"
export WEB_BASE_URL="http://127.0.0.1:3000"
export API_BASE_URL="http://127.0.0.1:3001/api"
export BASE_URL="${WEB_BASE_URL}"

Minimal preflight classifier (recommended)

node -e '
// Preflight: classify config and boundary errors before any test runs.
const nat = process.env.CI_NAT_MODE === "1";
const web = process.env.WEB_BASE_URL || process.env.BASE_URL || "";
const api = process.env.API_BASE_URL || "";
if (!web) { console.error("S3.CONFIG: missing WEB_BASE_URL/BASE_URL (Owner: Test)"); process.exit(2); }
if (!api) { console.error("S3.CONFIG: missing API_BASE_URL (Owner: Test)"); process.exit(2); }
if (!nat) {
  // Default mode: loopback and wildcard addresses (IPv4 and IPv6) are outside the boundary.
  const bad = /(localhost|127\.0\.0\.1|0\.0\.0\.0|\[::1\])/i;
  if (bad.test(web) || bad.test(api)) {
    console.error("S3.BOUNDARY: localhost targets forbidden in default mode (Owner: Test/Infra)");
    process.exit(2);
  }
}
// The OK line logs the mode, so a recording run (CI_NAT_MODE=1) is visible in CI output.
console.log("S3.CONFIG OK:", { CI_NAT_MODE: nat ? "1" : "0", WEB_BASE_URL: web, API_BASE_URL: api });
'

Scenario matrix

Scenario | Trigger                                 | Expected classification                 | Owner
S3.1     | Wrong environment (cross-contamination) | Config mismatch (env label vs target)   | Test
S3.2     | SUT unreachable / upstream unavailable  | DNS/TCP/HTTP transport failure          | Infra
S3.3     | Auth rejection (invalid credentials)    | Explicit 401/403/422 from auth endpoint | Security

Classification rule: if an auth check returns 5xx (or a gateway/proxy error), classify as S3.2 (Infra). You cannot evaluate auth while the upstream is unhealthy.
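
The classification rule above can be sketched as a small pure function. This is an illustration under stated assumptions: the function names parseRejectCodes and classifyAuthStatus are hypothetical, and the default reject codes are taken from the scenario matrix (401, 403, 422).

```javascript
// Hypothetical helper: turns a comma-separated list such as "401,403,422" into a Set.
function parseRejectCodes(csv = "401,403,422") {
  return new Set(
    String(csv)
      .split(",")
      .map((part) => Number(part.trim()))
      // Drop empty or malformed entries (Number("") is 0, Number("abc") is NaN).
      .filter((code) => Number.isInteger(code) && code >= 100)
  );
}

// Hypothetical helper: maps the HTTP status of an auth check to a scenario and owner.
function classifyAuthStatus(status, rejectCodes = parseRejectCodes()) {
  // 5xx (including gateway/proxy errors such as 502 or 504) means the upstream
  // is unhealthy, so auth cannot be evaluated: this is Infra, not Security.
  if (status >= 500) {
    return {
      scenario: "S3.2",
      owner: "Infra",
      reason: `upstream returned ${status}; auth cannot be evaluated`,
    };
  }
  if (rejectCodes.has(status)) {
    return {
      scenario: "S3.3",
      owner: "Security",
      reason: `auth endpoint rejected credentials with ${status}`,
    };
  }
  // Anything else is not covered by the rule and needs a different classifier.
  return { scenario: null, owner: null, reason: `status ${status} is unclassified` };
}
```

Checking the 5xx branch first is the point of the rule: a 503 from a proxy in front of the auth endpoint is routed to Infra before the reject codes are consulted.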

Exact commands

Run full Sprint 3 suite

# Tip: run the preflight classifier first (shown above)
npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts

Run a single scenario

npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts -g "S3\.1"

Trigger S3.1 (wrong environment)

Example: claim “qa” but point at a “dev” hostname. The suite should fail fast with a message that explains the mismatch.

export CI_NAT_MODE="0"
export TEST_ENV="qa"
export WEB_BASE_URL="http://dev.testlab:3000"
export API_BASE_URL="http://dev.testlab:3001/api"
export BASE_URL="${WEB_BASE_URL}"

npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts -g "S3\.1"
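
A minimal sketch of how such a mismatch check could work, assuming the suite keeps a mapping from environment label to allowed hostnames. The function checkEnvMatch and the ENV_HOSTS entries are hypothetical placeholders; the real framework may derive the expected host differently.

```javascript
// Hypothetical mapping from environment label to allowed hostnames.
// These entries are placeholders; replace them with the real inventory.
const ENV_HOSTS = new Map([
  ["qa", ["sut.testlab"]],
  ["dev", ["dev.testlab"]],
]);

// Hypothetical helper: compares the claimed environment with the target host.
function checkEnvMatch(testEnv, baseUrl, envHosts = ENV_HOSTS) {
  const host = new URL(baseUrl).hostname; // "http://dev.testlab:3000" gives "dev.testlab"
  const allowed = envHosts.get(testEnv);
  if (!allowed) {
    return { ok: false, message: `S3.1: unknown TEST_ENV=${testEnv} (Owner: Test)` };
  }
  if (!allowed.includes(host)) {
    return {
      ok: false,
      message:
        `S3.1: TEST_ENV=${testEnv} but target host is ${host}; ` +
        `expected one of: ${allowed.join(", ")} (Owner: Test)`,
    };
  }
  return { ok: true, message: `S3.1 OK: TEST_ENV=${testEnv} matches ${host}` };
}
```

With the variables exported above, checkEnvMatch("qa", "http://dev.testlab:3000") returns ok: false with a message that names both the claimed environment and the actual host, which is the information needed to fix the configuration without a rerun.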

Trigger S3.2 (SUT unreachable)

Example: point at a closed port to force a transport failure (connection refused or timeout). DNS failures belong to the same class. This must classify as Infra.

export CI_NAT_MODE="0"
export WEB_BASE_URL="http://sut.testlab:5999"
export API_BASE_URL="http://sut.testlab:5999/api"
export BASE_URL="${WEB_BASE_URL}"

npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts -g "S3\.2"
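
As an illustration of how a transport failure could be classified from the error alone, the sketch below maps common Node.js system error codes to S3.2. The helper name classifyTransportError is hypothetical, and the code list is a starting point rather than a complete inventory; how the suite actually surfaces these errors is an assumption.

```javascript
// Common Node.js system error codes for DNS and TCP failures.
const TRANSPORT_CODES = {
  ENOTFOUND: "DNS lookup failed",
  EAI_AGAIN: "DNS lookup temporarily failed",
  ECONNREFUSED: "TCP connection refused (port closed or service down)",
  ECONNRESET: "connection reset by peer",
  ETIMEDOUT: "connection timed out",
};

// Hypothetical helper: returns an S3.2 classification or null if not a transport error.
function classifyTransportError(err) {
  // Socket errors carry the code directly; Node's fetch wraps it in err.cause.
  let code = err?.code ?? err?.cause?.code;
  if (!(code in TRANSPORT_CODES)) {
    // Fall back to scanning the message, e.g. "connect ECONNREFUSED 10.0.0.5:5999".
    const message = String(err?.message ?? "");
    code = Object.keys(TRANSPORT_CODES).find((known) => message.includes(known));
  }
  if (!code) return null;
  return {
    scenario: "S3.2",
    owner: "Infra",
    reason: `${code}: ${TRANSPORT_CODES[code]}`,
  };
}
```

Returning null for unknown errors keeps the classifier honest: an unrecognized failure stays unclassified instead of being routed to Infra by default.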

Trigger S3.3 (auth rejection)

Example: keep connectivity valid but force credentials to be rejected with expected auth codes.

export CI_NAT_MODE="0"
export WEB_BASE_URL="http://sut.testlab:3000"
export API_BASE_URL="http://sut.testlab:3001/api"
export BASE_URL="${WEB_BASE_URL}"

export AUTH_PATH="/api/users/login"
export AUTH_REJECT_CODES="401,403,422"
# If the suite reads user/pass from env, set a known-bad password:
export RW_USER_EMAIL="rwuser@example.com"
export RW_USER_PASSWORD="definitely-wrong"

npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts -g "S3\.3"

Recording mode example (CI_NAT_MODE=1)

Only use this during sanitized recordings. Localhost/port-forwards are allowed only because CI_NAT_MODE=1 is explicit.

export CI_NAT_MODE="1"
export WEB_BASE_URL="http://127.0.0.1:3000"
export API_BASE_URL="http://127.0.0.1:3001/api"
export BASE_URL="${WEB_BASE_URL}"

npx playwright test sprints/Sprint3.INTENTIONAL_FAILURES.TO_ENSURE_FIX_WITHOUT_RERUN.spec.ts -g "S3\.2"

Note for S3.3: if the SUT uses a different auth route or payload shape, set AUTH_PATH and (optionally) AUTH_PAYLOAD_JSON to match the app.
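
One way the auth request could be assembled from these variables is sketched below. The helper buildAuthRequest is hypothetical, and the default payload shape ({ user: { email, password } }) is an assumption; the suite may build the request differently.

```javascript
// Hypothetical helper: builds the auth request target and body from the environment.
function buildAuthRequest(env) {
  if (!env.API_BASE_URL) {
    throw new Error("S3.CONFIG: missing API_BASE_URL (Owner: Test)");
  }
  const path = env.AUTH_PATH || "/api/users/login";
  // A leading slash resolves from the host root, so the "/api" suffix on
  // API_BASE_URL is replaced rather than duplicated.
  const url = new URL(path, env.API_BASE_URL).toString();

  let body;
  if (env.AUTH_PAYLOAD_JSON) {
    try {
      body = JSON.parse(env.AUTH_PAYLOAD_JSON);
    } catch (err) {
      // A malformed override is a config problem, not an auth rejection.
      throw new Error(
        `S3.CONFIG: AUTH_PAYLOAD_JSON is not valid JSON: ${err.message} (Owner: Test)`
      );
    }
  } else {
    // Assumed default shape; override with AUTH_PAYLOAD_JSON if the app differs.
    body = { user: { email: env.RW_USER_EMAIL, password: env.RW_USER_PASSWORD } };
  }
  return { url, body };
}
```

With the S3.3 variables shown earlier, buildAuthRequest(process.env) would target http://sut.testlab:3001/api/users/login. Failing with S3.CONFIG on a malformed AUTH_PAYLOAD_JSON keeps a test-side mistake from being misread as a Security outcome.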

Expected outputs

Scenario            | Expected output signal
S3.1                | Immediate failure identifying environment mismatch (TEST_ENV vs target host) and aborting execution
S3.2                | Failure indicating connectivity / DNS / transport / upstream availability issue (Infra-class)
S3.3                | Explicit auth rejection (401 / 403 / 422) with Security ownership
Evidence to capture | CI logs sufficient to diagnose without rerun; UI artifacts optional

Rerun policy (critical)

  • Sprint 3 failures are intentional and deterministic
  • Do not rerun jobs to “see if it passes”
  • Fix the root cause, then rerun once to confirm
  • Repeated reruns without change indicate a process failure

Why it matters (production relevance)

Diagnosability transforms CI from a binary gate into an operational signal. By proving that failures classify themselves and route ownership correctly, Sprint 3 reduces wasted engineering time, prevents flakiness from hiding real risk, and establishes trust in automation as a decision system.