Failure Museum

What broke, why it broke, and what changed because of it. Failures are primary sources here — not postmortems, not retrospectives. The actual thing that went wrong, with the actual fix.

F-001 · 2026-06-21 · PKT-BCM-2026-0015

Wrong OIDC Audience — Builds 7–9

SYMPTOM: aws sts assume-role-with-web-identity rejected with InvalidIdentityToken. All three IAM trust policies refused every Buildkite OIDC token.
ROOT CAUSE: The IAM trust policy condition was 'agent.buildkite.com:aud = https://buildkite.com', but the pipeline requests tokens with --audience sts.amazonaws.com. Every token therefore carried the audience sts.amazonaws.com while the trust policy expected https://buildkite.com, so the condition never matched.
FIX: Changed all trust policy aud conditions to "sts.amazonaws.com". Match the exact string the --audience flag produces.
LESSON: The OIDC audience in the trust policy must be an exact string match to the --audience argument passed to buildkite-agent oidc request-token. https://buildkite.com and sts.amazonaws.com are different values, so a policy written for one rejects tokens issued for the other. sts.amazonaws.com is the conventional audience for AWS STS AssumeRoleWithWebIdentity flows.
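A minimal pre-flight sketch of the check this lesson implies, assuming the trust policy condition is available as a Python dict. The helper name audience_matches and the dict layout are illustrative, not part of the pipeline.

```python
# Hypothetical trust-policy fragment. The key agent.buildkite.com:aud and both
# audience strings come from the entry above; the dict layout is illustrative.
FIXED_CONDITION = {"StringEquals": {"agent.buildkite.com:aud": "sts.amazonaws.com"}}
BROKEN_CONDITION = {"StringEquals": {"agent.buildkite.com:aud": "https://buildkite.com"}}


def audience_matches(trust_condition: dict, requested_audience: str) -> bool:
    """Return True only when the policy's aud value equals the requested audience exactly."""
    expected = trust_condition["StringEquals"]["agent.buildkite.com:aud"]
    return expected == requested_audience
```

Running a comparison like this against the --audience value before a build would have flagged the mismatch without waiting for STS to reject the token.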

F-002 · 2026-06-21 · PKT-BCM-2026-0015

StringEquals + Wildcard — OIDC Sub Never Matches (Build 8)

SYMPTOM: AssumeRoleWithWebIdentity passed the audience check but failed the sub condition. The role refused every token even after the audience fix.
ROOT CAUSE: Trust policy used StringEquals for the sub condition with a value ending in :*. StringEquals matches the literal string :* — not a wildcard. The actual Buildkite sub includes the commit SHA and build step key, so it never equals the template string literally.
FIX: Split the condition: StringEquals for aud (exact match, no wildcards needed), StringLike for sub (supports * expansion). StringLike with * matches any suffix.
LESSON: IAM condition operators are not interchangeable. StringEquals is for exact string matching. StringLike is for patterns with * and ?. Using StringEquals with a * in the value pattern makes the asterisk literal — it only matches if the actual string contains an asterisk character.
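The difference between the two operators can be sketched in a few lines. This is a simplified model of the matching semantics described above, not AWS's evaluator; the sub claim and pattern are hypothetical values shaped like a Buildkite OIDC subject.

```python
import re


def string_equals(actual: str, expected: str) -> bool:
    """Exact, case-sensitive comparison; '*' has no special meaning here."""
    return actual == expected


def string_like(actual: str, pattern: str) -> bool:
    """Case-sensitive match where '*' is any run of characters and '?' is any single one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), actual, flags=re.DOTALL) is not None


# Hypothetical values for illustration only.
SUB = "organization:acme:pipeline:deploy:ref:refs/heads/main:commit:abc123:step:build"
PATTERN = "organization:acme:pipeline:deploy:*"
```

With these values, string_equals(SUB, PATTERN) is false because the trailing asterisk is compared literally, while string_like(SUB, PATTERN) is true.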

F-003 · 2026-06-21 · PKT-BCM-2026-0015

iam:PermissionsBoundary Condition on Read Operations (Build 10)

SYMPTOM: iam:GetRole and iam:ListRolePolicies calls failed with AccessDenied even though the role had explicit Allow for those actions.
ROOT CAUSE: The IAM policy had a Condition on every IAM action requiring iam:PermissionsBoundary to be set. The iam:PermissionsBoundary context key is only populated for mutating IAM calls (CreateRole, PutRolePolicy, etc.). On read operations (GetRole, ListRolePolicies), the context key is absent — the condition evaluates to false — and the Allow doesn't apply.
FIX: Split IAM permissions into two Sids: one for read operations (GetRole, List*) with no conditions, one for mutating operations (CreateRole, PutRolePolicy, etc.) with the PermissionsBoundary condition.
LESSON: IAM condition context keys are call-specific. Not every context key is populated for every API call. Always check the AWS docs for which context keys are available for a given action before writing conditions that depend on them.
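A rough sketch of the split, with the two statements written as Python dicts and a deliberately simplified evaluator. The Sid names, the boundary ARN, and the Resource value are placeholders, and statement_allows models only the "absent key means the condition is false" behaviour described above.

```python
# Placeholder ARN; the real boundary policy is not named in this entry.
BOUNDARY_ARN = "arn:aws:iam::123456789012:policy/ExampleBoundary"

READ_STATEMENT = {
    "Sid": "IamRead",
    "Effect": "Allow",
    "Action": ["iam:GetRole", "iam:ListRolePolicies"],
    "Resource": "*",  # simplified for the sketch
}

WRITE_STATEMENT = {
    "Sid": "IamWriteWithBoundary",
    "Effect": "Allow",
    "Action": ["iam:CreateRole", "iam:PutRolePolicy"],
    "Resource": "*",  # simplified for the sketch
    "Condition": {"StringEquals": {"iam:PermissionsBoundary": BOUNDARY_ARN}},
}


def statement_allows(statement: dict, action: str, context: dict) -> bool:
    """Simplified check: the action must be listed and every condition key must match."""
    if action not in statement["Action"]:
        return False
    for key_values in statement.get("Condition", {}).values():
        for key, expected in key_values.items():
            # A key missing from the request context makes the condition false.
            if context.get(key) != expected:
                return False
    return True
```

A read call arrives with an empty context for this key, so it is allowed only by the unconditioned statement; attaching the condition to iam:GetRole reproduces the AccessDenied.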

F-004 · 2026-06-21 · PKT-BCM-2026-0015

YAML Block Scalar + Multi-line Python at Column 1 (Build 15)

SYMPTOM: Buildkite pipeline.yml failed to parse at line 56. Build couldn't start. Error: YAML scanner exception.
ROOT CAUSE: Multi-line Python code was embedded inside a YAML | block scalar, but the Python started at column 1 (no indentation). A block scalar ends at the first non-blank line that is indented no deeper than its parent key, so the scalar ended before the Python began. The parser then tried to read the statement import json, sys as YAML at the root indentation level and raised the scanner exception.
FIX: Replaced the multi-line Python with single-line shell commands that use the AWS CLI --query flag (JMESPath filtering, similar in purpose to jq). For example, aws iam list-attached-role-policies --query 'AttachedPolicies[*].PolicyName' --output text eliminates the need for Python parsing entirely.
LESSON: In Buildkite pipeline YAML, never start code that lives inside a | block scalar at column 1. YAML block scalars terminate when content reaches the indentation level of the parent block. Single-line shell expressions and --query flags are safer than embedded multi-line scripts for CI pipelines.
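A heuristic lint for this failure mode might look like the sketch below. It is not a YAML parser: it tracks only "key: |" style block scalars and flags the line that ends one if that line does not resemble a key, list item, or comment. The function name and both regular expressions are assumptions made for illustration.

```python
import re

# A mapping key (optionally behind "- " list markers) that opens a block scalar.
BLOCK_HEADER = re.compile(r"^(\s*(?:- )*)[\w.-]+:\s*[|>][+-]?\d*\s*$")
# Rough shape of a line that plausibly continues the YAML structure.
YAML_LIKE = re.compile(r"^\s*(?:- |[\w.-]+:(?:\s|$)|#)")


def find_escaped_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines that end a block scalar but do not look like YAML."""
    suspects = []
    parent_indent = None  # column of the key that owns the open block scalar
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines never end a block scalar
        indent = len(line) - len(line.lstrip(" "))
        if parent_indent is not None:
            if indent > parent_indent:
                continue  # still inside the scalar
            parent_indent = None  # the scalar ended on this line
            if not YAML_LIKE.match(line):
                suspects.append(number)
        header = BLOCK_HEADER.match(line)
        if header:
            parent_indent = len(header.group(1))
    return suspects
```

Running it over pipeline.yml before upload would point at the first unindented Python line instead of leaving the scanner exception to surface at build time.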

F-005 · 2026-06-21 · PKT-BCM-2026-0015

Buildkite Cluster Secret Key Prefix Rejection (HTTP 422)

SYMPTOM: POST /v2/clusters/.../secrets returned HTTP 422 with a validation error. The cluster secret binding script silently fell back to Keychain only.
ROOT CAUSE: The cluster secret key was named bk_api_token_control_plane. Buildkite rejects any cluster secret key that starts with bk or buildkite (case-insensitive). The 422 response included a validation message about reserved prefixes. The fallback handler then tried PUT /secrets/bk_api_token_control_plane — using the key name as a URL path segment — which returned 404 because Buildkite secrets are addressed by UUID, not key name.
FIX: Renamed key to zentari_ctrl_api_token (no reserved prefix). Fixed the 422 fallback to: GET /secrets → find UUID by key name → PUT /secrets/{uuid}.
LESSON: Buildkite cluster secret keys cannot start with bk or buildkite. These are reserved namespace prefixes. When updating an existing secret, always address it by UUID retrieved from GET /secrets, not by key name string.
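The two rules in this lesson can be sketched as pure functions, leaving the HTTP calls out. This assumes the GET /secrets response is a list of objects carrying "id" and "key" fields; the function names are hypothetical and the path is abbreviated the same way as in the entry above.

```python
RESERVED_PREFIXES = ("bk", "buildkite")


def has_reserved_prefix(key: str) -> bool:
    """True when the key starts with a reserved prefix, ignoring case."""
    return key.lower().startswith(RESERVED_PREFIXES)


def find_secret_uuid(secrets: list, key: str):
    """Return the id of the secret whose key matches, or None when absent."""
    for secret in secrets:
        if secret.get("key") == key:
            return secret.get("id")
    return None


def update_path(secrets: list, key: str) -> str:
    """Build the PUT path from the UUID, never from the key name."""
    uuid = find_secret_uuid(secrets, key)
    if uuid is None:
        raise KeyError(f"no secret with key {key!r}")
    return f"/secrets/{uuid}"
```

Checking has_reserved_prefix before the POST turns the 422 into a local error, and update_path makes the fallback fail loudly when the key does not exist rather than sending a PUT that returns 404.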

F-007 · 2026-06-22 · PKT-BCM-2026-0017

create_hosted_zone=false Plans Zone Destroy When Zone Is In Tofu State (Build 40)

SYMPTOM: Build #40 production tofu plan showed aws_route53_zone.buildcam[0] will be destroyed. Apply attempted the delete. AWS returned: HostedZoneNotEmpty — cannot delete a zone that contains non-required record sets.
ROOT CAUSE: production.tfvars was updated to create_hosted_zone=false after domain-bootstrap (zone Z058930713XQYAP42V3C5 was created by Build #39 and lives in tofu state). The variable controls count = create_hosted_zone ? 1 : 0 on the zone resource. Setting it to false makes count=0, which tofu interprets as "destroy this resource." The zone has A records + cert validation CNAMEs, so AWS refused the delete.
FIX: Set create_hosted_zone=true in production.tfvars. The zone is managed by tofu (it was created by domain-bootstrap), so it must stay count=1. The hosted_zone_id fallback only applies to zones that exist outside tofu management.
LESSON: If a resource was created by tofu (lives in state), keep its conditional flag enabled so count stays at 1. Only use create=false with an external ID when the resource is pre-existing and not managed by tofu. Confusing these two patterns causes tofu to plan a destroy instead of a no-op.
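As a toy model of the rule, the planned action depends on two facts: whether the zone is already in state and what count evaluates to. The function below is a simplification for illustration (it ignores attribute drift and is not how tofu computes plans).

```python
def planned_action(in_state: bool, create_hosted_zone: bool) -> str:
    """Toy model of the plan outcome for a resource gated by count."""
    count = 1 if create_hosted_zone else 0  # mirrors count = create_hosted_zone ? 1 : 0
    if in_state and count == 0:
        return "destroy"  # the Build #40 situation
    if not in_state and count == 1:
        return "create"
    return "no-op"
```

Only the combination of in_state=False with create_hosted_zone=False corresponds to the external-zone pattern, where the hosted_zone_id fallback supplies the ID.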

F-006 · 2026-06-22 · PKT-BCM-2026-0018

RC Selected Before Pipeline Fixes Merged to Staging (Build 18)

SYMPTOM: Build #18 on production branch: tofu-production ran immediately without waiting for the approve-production gate (bypassed), then failed with 403 Forbidden on S3 state bucket. The block step showed key=null in the API.
ROOT CAUSE: Pipeline fixes (approve-production key, depends_on correction, least-privilege roles) were committed to develop (commit e8640d8) but were never merged into staging. The RC selected for promotion was 338a09d — the staging HEAD — which had the old pipeline.yml with depends_on: web-ci on tofu-production and ZentariBuildkiteBootstrapAdminRole (retired) on both production steps. The gate had no key field so it was unreferenceable. The 403 on S3 was the retired bootstrap admin attempting to read state.
FIX: Pushed develop to staging (fast-forward), triggered Build #19 to validate the full pipeline on staging, then re-promoted the updated staging commit (c8f0f6e) to production. Build #20 PASSED.
LESSON: Before any production content promotion: verify that pipeline.yml at the RC HEAD contains all confirmed production pipeline fixes. Check git log origin/staging..origin/develop for pipeline.yml changes. If develop is ahead, update staging first and run a full staging build before selecting the RC.
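The pre-promotion check could be scripted roughly as follows. The sketch assumes it runs in a clone where origin/staging and origin/develop have been fetched, and that the pipeline file lives at pipeline.yml relative to the repository root; the function names are hypothetical.

```python
import subprocess


def parse_oneline_log(output: str) -> list:
    """Extract abbreviated commit hashes from `git log --oneline` output."""
    return [line.split()[0] for line in output.splitlines() if line.strip()]


def pipeline_commits_missing_from_staging(repo_dir: str = ".", path: str = "pipeline.yml") -> list:
    """List commits on develop that touch the pipeline file and are not yet on staging."""
    result = subprocess.run(
        ["git", "log", "--oneline", "origin/staging..origin/develop", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_oneline_log(result.stdout)


def staging_is_ready(missing_commits: list) -> bool:
    """Staging can supply an RC only when no pipeline commits are missing."""
    return not missing_commits
```

A non-empty result means staging should be updated and a full staging build run before an RC is selected.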

Every entry is sourced from a build packet or evidence file. No reconstructions — the failures documented here were observed in production CI runs.