GitHub Actions Failure Analysis
Structured investigation of GitHub Actions failures. Follows a 6-step methodology: identify what failed, assess flakiness, find the breaking commit, analyze root cause, check for existing fixes, and report.
Works with any GitHub Actions workflow. Requires an authenticated gh CLI.
Mindset: Apply /pb-debug thinking - reproduce before theorizing. Apply /pb-preamble thinking - challenge the obvious explanation. A “flaky test” might be a real race condition. A “random failure” might be a dependency change.
Resource Hint: sonnet - log analysis, pattern matching, and structured investigation
When to Use
- CI pipeline fails and you need to understand why
- Recurring failures that might be flaky vs. genuinely broken
- Pre-release when CI must be green and something is red
- After merging a PR that broke CI on main
Usage
/pb-gha [URL or context]
Examples:
/pb-gha https://github.com/org/repo/actions/runs/12345
/pb-gha  (analyzes the current repo’s latest failed run)
/pb-gha the lint job keeps failing on main
Step 1: Identify the Failure
Figure out exactly what failed. Not the workflow - the specific job and step.
# Get the latest failed run (or use provided URL)
gh run list --status failure --limit 5
# View the specific run
gh run view <run-id>
# Get the logs for the failed job
gh run view <run-id> --log-failed
What to look for:
- The step that actually exited non-zero - the real failure, not warnings
- Error messages vs. noise (deprecation warnings aren’t failures)
- Which step in the job failed (build, test, lint, deploy)
- The commit that triggered this run
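To pin down the failing job and step programmatically, filter the run's JSON metadata. A minimal sketch using gh's --json jobs output (the jq shape is an assumption; verify the field names against your gh version):
# Pin down the failing job and its failing steps from the run metadata
gh run view <run-id> --json jobs --jq '
  .jobs[] | select(.conclusion == "failure")
  | {job: .name, failed_steps: [.steps[] | select(.conclusion == "failure") | .name]}'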
Step 2: Assess Flakiness
Check whether this is a one-off or a pattern. The key is checking the specific failing job, not just the workflow.
# List recent runs of the workflow
gh run list --workflow <workflow-name> --limit 20
# For each run, check whether the specific job passed or failed
for id in $(gh run list --workflow <workflow-name> --limit 20 --json databaseId --jq '.[].databaseId'); do
  gh run view "$id" --json jobs --jq '.jobs[] | select(.name == "<job-name>") | .conclusion'
done
# Look for patterns: always fails? fails on certain branches? intermittent?
Flakiness indicators:
- Same job fails intermittently on the same branch → likely flaky
- Job fails consistently after a specific date → likely a real breakage
- Job fails only on certain branches → likely a code issue
- Job fails at random intervals → timing issue, race condition, or external dependency
Calculate (see the sketch after this list):
- Success rate over last 20 runs
- When it last passed
- When it first started failing
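A minimal sketch of that math, assuming gh's conclusion and createdAt JSON fields. It counts workflow-level conclusions, so pair it with the per-job loop above when only one job is suspect:
# Success rate and most recent pass over the last 20 runs (gh lists newest first)
gh run list --workflow <workflow-name> --limit 20 --json conclusion,createdAt --jq '
  ([.[] | select(.conclusion == "success")] | length) as $ok
  | "\($ok)/20 passed; last pass: \([.[] | select(.conclusion == "success")][0].createdAt // "none in last 20")"'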
Step 3: Find the Breaking Commit
If the failure is consistent (not flaky), pinpoint when it started.
# Find the last passing run
gh run list --workflow <workflow-name> --status success --limit 1
# Find the first failing run (the oldest failure after that success)
gh run list --workflow <workflow-name> --status failure --limit 20
# Compare: what commits landed between the last success and first failure?
# View the commit that introduced the failure
gh run view <first-failing-run-id> --json headSha
git log --oneline <last-good-sha>..<first-bad-sha>
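The comparison scripts cleanly. A sketch, assuming the run IDs found above (the good/bad variable names are illustrative):
# Script the comparison: last-good SHA, first-bad SHA, commits in between
good=$(gh run list --workflow <workflow-name> --status success --limit 1 --json headSha --jq '.[0].headSha')
bad=$(gh run view <first-failing-run-id> --json headSha --jq '.headSha')
git log --oneline "$good".."$bad"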
Verification: The job should pass consistently before the breaking commit and fail consistently after it. If it’s intermittent on both sides, it’s not a clean break - look for a flakiness trigger instead.
Step 4: Analyze Root Cause
With the logs, history, and breaking commit (if found), determine what’s actually going wrong.
Common root causes:
| Category | Examples |
|---|---|
| Code change | Test assertion broken, API contract changed, import error |
| Dependency | Package version bumped with breaking change, lockfile drift |
| Environment | Runner image updated, tool version changed, disk space |
| Timing | Race condition, timeout too short, external service slow |
| Configuration | Workflow syntax, permissions, secrets expired |
Root cause checklist:
- Read the actual error message (not just the job name)
- Check if the failing code was recently modified
- Check if dependencies were updated (lockfile diff - see the sketch after this list)
- Check if the runner environment changed (ubuntu-latest vs pinned)
- Check for external service dependencies (APIs, registries)
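For the dependency and configuration checks, diff the relevant files across the break. A sketch reusing the $good/$bad SHAs from the Step 3 sketch (the lockfile names are placeholders; use whichever your ecosystem actually has):
# Did dependencies or the workflow itself change across the break?
git diff "$good".."$bad" -- package-lock.json Cargo.lock uv.lock
git diff "$good".."$bad" -- .github/workflows/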
Step 5: Check for Existing Fixes
Before writing a fix, check if someone already has one.
# Search open PRs for the error message or affected file
gh pr list --state open --search "<error keyword>"
# Check if there's a related issue
gh issue list --search "<error keyword>"
# Check if main has moved ahead with a fix
git log origin/main --oneline --since="yesterday" -- <affected-file>
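If a candidate PR turns up, vet it before relying on it. A quick sketch (the PR number is a placeholder):
# Does the PR touch the failing file, and is its own CI green?
gh pr diff <pr-number> --name-only | grep <affected-file>
gh pr checks <pr-number>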
Step 6: Report
Synthesize findings into a clear report.
## GHA Failure Report
**Workflow:** [name]
**Job:** [name]
**Step:** [name]
**Run:** [URL]
### Failure
[What specifically failed - the actual error, not the job name]
### Flakiness
[One-off / Intermittent (N/20 failures) / Consistent since [date]]
### Breaking Commit
[SHA and summary, or "N/A - flaky" if intermittent]
### Root Cause
[What's actually wrong and why]
### Existing Fix
[PR link if found, or "None found"]
### Recommendation
[What to do - fix, retry, pin version, skip, etc.]
Quick Mode
For simple “CI is red, what happened?” situations:
# One-liner: show the latest failure's logs
gh run list --status failure --limit 1 --json databaseId --jq '.[0].databaseId' \
| xargs gh run view --log-failed
Then follow up with the full methodology if the cause isn’t obvious.
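If Step 2's numbers already show genuine flakiness and you just need this run green, re-running only the failed jobs is cheap (re-running without that evidence is an anti-pattern - see below):
# Re-run only the failed jobs and watch the result
gh run rerun <run-id> --failed
gh run watch <run-id>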
Integration with Other Commands
| Situation | Follow Up |
|---|---|
| Root cause is a code bug | /pb-debug for systematic fix |
| Root cause is test flakiness | /pb-review-tests for reliability audit |
| Root cause is infra/config | /pb-review-infrastructure for resilience check |
| Blocking a release | /pb-release once green |
| Recurring problem | /pb-review-hygiene for systemic health |
Anti-Patterns
| Don’t | Do Instead |
|---|---|
| Re-run without investigating | Understand the failure first |
| Blame “flaky tests” without data | Check the last 20 runs for actual flakiness rate |
| Fix the symptom (skip test) | Fix the root cause |
| Assume the obvious explanation | Verify with logs and history |
| Ignore intermittent failures | Intermittent = real bug with a timing component |
Related Commands
- /pb-debug - Systematic debugging methodology
- /pb-doctor - Local system health check
- /pb-review-hygiene - Codebase operational health
- /pb-release - Release orchestration (needs green CI)
Last Updated: 2026-02-18
Version: 1.0.0