Incident Response & Recovery

Respond to production incidents quickly and professionally. Clear process, clear communication, minimal impact.

Mindset: Incident response requires both /pb-preamble and /pb-design-rules thinking.

During response: be direct about status (preamble), challenge assumptions about root cause, surface unknowns. Design systems to fail loudly (Repair, Transparency) so incidents are visible immediately. After: conduct honest post-mortems without blame, and improve system robustness.

Resource Hint: opus - critical incident triage requires deep analysis and careful judgment


Purpose

Incidents are inevitable. What matters:

  • Speed: Detect and respond quickly
  • Clarity: Know exactly what’s happening
  • Communication: Keep stakeholders informed
  • Recovery: Get back to normal fast
  • Learning: Prevent repeats through post-incident review

When to Use This Command

  • Production incident occurring - Service degradation or outage
  • Alert fired - Monitoring detected anomaly
  • Customer-reported issue - Users experiencing problems
  • Post-incident - Running retrospective and writing post-mortem
  • Incident prep - Reviewing process before on-call rotation

Incident Severity Levels

Classify incidents to determine response urgency and escalation.

SEV-1 (Critical, Immediate Page)

  • User-facing service completely down
  • Data loss or data integrity risk
  • Security breach active
  • Major revenue impact

  • Response time: Immediate (< 5 minutes)
  • Escalation: Page on-call, VP, customers
  • Communication: Every 15 minutes
  • Resolution target: 1-2 hours

Examples:

  • API servers offline, users can’t access service
  • Database corrupted, data cannot be retrieved
  • Payment processing broken, no transactions processing
  • Authentication system down, users locked out

SEV-2 (High, Urgent Page)

  • User-facing service degraded (slow, errors)
  • Partial functionality broken
  • Workaround exists but poor user experience

  • Response time: 15 minutes
  • Escalation: Page on-call + relevant team lead
  • Communication: Every 30 minutes
  • Resolution target: 4 hours

Examples:

  • API responses 10x slower than normal
  • Search feature broken (but users can browse)
  • Emails not sending (but users can still order)
  • Mobile app crashes on one action (desktop works)

SEV-3 (Medium, No Page)

  • Internal system degraded
  • Non-critical feature broken
  • User workaround available
  • Limited customer impact

  • Response time: Next business day acceptable
  • Escalation: Slack to team, create ticket
  • Communication: Daily update
  • Resolution target: 1-2 days

Examples:

  • Admin dashboard slow
  • Reporting system down (business can continue)
  • Non-critical background job failing
  • One endpoint timeout (alternate exists)

SEV-4 (Low, Future Fix)

  • Documentation issue
  • Minor UI bug
  • Development environment broken
  • No user-facing impact

  • Response time: Next sprint
  • Escalation: Create ticket, no escalation
  • Communication: Team awareness
  • Resolution target: When convenient

Examples:

  • Typo in UI text
  • Help docs incorrect
  • Dev script broken
  • Console warning (no functional impact)

Incident Declaration

Who declares incidents?

  • Anyone can declare an incident (no permission needed)
  • Don’t wait for managers to approve
  • Better to declare and cancel than miss critical issue
  • When in doubt, declare

How to declare

For SEV-1/2: Declare immediately

Slack: #incidents channel
Message: "@incident-commander SEV-1: Users report 503 errors on checkout"
Include: Service affected, symptoms, your name

For SEV-3/4: Create ticket

Jira/GitHub issue with label: incident
Title: [SEV-3] Admin dashboard slow
Include: What's broken, user impact, symptoms

Incident Commander Role

Once incident declared:

  1. Incident Commander assigned (first responder or on-call)
  2. IC decides severity
  3. IC starts bridge call for SEV-1/2
  4. IC starts Slack thread tracking
  5. IC coordinates investigation and communication

On-Call Operations

For on-call setup, scheduling, training, and rotation health, see /pb-sre-practices → On-Call Health section.

This includes:

  • On-call rotation structure and scheduling
  • PagerDuty/Opsgenie setup
  • On-call expectations and boundaries
  • Mock incident training
  • Preventing on-call burnout

This command focuses on incident response - what to do when an incident occurs. On-call operations (how to set up and maintain healthy rotations) are ongoing SRE practices.


Immediate Response (First 5 Minutes)

IC Quick Triage

  1. Is it real? (5 seconds)

    • Check monitoring: Is P99 latency actually up?
    • Check logs: Are errors really happening?
    • Avoid: Chasing false alarms from bad metrics
  2. What’s affected? (30 seconds)

    • Which services? endpoints? regions?
    • How many users impacted? percentage?
    • Is it spreading or stable?
  3. What changed recently? (1 minute; see the sketch after this list)

    • Was there a deployment? (check git log)
    • Configuration change? (check configs)
    • Traffic spike? (check metrics)
    • External dependency failure? (check upstream health)
  4. Initial action (2 minutes)

    • If recent deployment: Consider rollback immediately
    • If configuration change: Revert change
    • If dependency down: Switch to failover/degraded mode
    • Otherwise: Page relevant team for investigation
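
A quick "what changed?" sweep for step 3, as a minimal sketch. It assumes a git-based deploy flow and Kubernetes; the deployment name is an example:

# Recent commits/deploys
git log --oneline --since="30 minutes ago"
# Recent rollouts of the suspect service
kubectl rollout history deployment/api
# Recent cluster events (node pressure, restarts, config changes)
kubectl get events --sort-by=.lastTimestamp | tail -20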

Initial Communication (SEV-1/2)

Send to Slack #incidents:

@channel SEV-1: Checkout failing (503 errors)

Status: Investigating
Symptoms: POST /checkout returning 503 since 14:32 UTC
Affected: ~5% of transactions
Potential causes: Database slow? Payment API down? Recent deploy?

Updates every 15 minutes in thread.

Investigation (5-30 Minutes)

Investigation Team

  • Incident Commander: Coordinates, owns timeline, communicates
  • On-call Engineer: Investigates service, runs commands
  • Subject Matter Expert: Called in if needed (database, payments, etc.)

Diagnostic Checklist

☐ Check recent deployments (git log --since="10 minutes ago")
☐ Check monitoring: latency, errors, resource usage
☐ Check logs: error messages, stack traces
☐ Check external dependencies: Are they healthy?
☐ Check database: Is it responsive? Any locks?
☐ Check traffic: Is there a sudden spike?
☐ Check configuration: Any recent changes?
☐ Check disk space: Are we full? Out of inodes?
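
One way to run through the checks above quickly, as a minimal sketch. It assumes Kubernetes and example service/dependency names; adjust for your stack:

# Resource usage per pod (requires metrics-server)
kubectl top pods -l app=api
# Recent errors in service logs
kubectl logs deploy/api --since=10m | grep -iE "error|exception" | tail -20
# Disk space and inodes
df -h && df -i
# External dependency health (URL is an example)
curl -s -o /dev/null -w "%{http_code}\n" https://payments.example.com/health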

Root Cause Patterns

Deployment-related (50% of incidents)

  • New code has bug
  • Migration script failed
  • Configuration not deployed
  • Infrastructure change

Action: Rollback or hotfix

Database-related (20% of incidents)

  • Slow query locking table
  • Connection pool exhausted
  • Disk full
  • Replication lag

Action: Kill slow query, scale connections, free space

Resource exhaustion (15% of incidents)

  • CPU 100%
  • Memory full
  • Disk full
  • Network bandwidth full

Action: Identify process consuming, kill or scale

External dependency (10% of incidents)

  • API provider down
  • CDN down
  • Payment processor down
  • DNS down

Action: Use fallback, degrade gracefully, wait for recovery

Configuration (5% of incidents)

  • Wrong environment variables
  • SSL certificate expired
  • Feature flag stuck on/off
  • Rate limiting too aggressive

Action: Fix configuration, restart service


Resolution (Immediate Actions)

Recovery Strategies (In Order of Speed)

1. Rollback (Fastest, if recent deploy)

# If incident started after recent deployment
git log --oneline -5  # See recent deploys
git revert <commit-hash>  # Create revert commit
make deploy  # Deploy revert

# Rollback clears issue in minutes
# Then investigate what went wrong later
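
If the service runs on Kubernetes, rolling back the Deployment directly can be even faster than revert-and-redeploy. A minimal sketch (deployment name is an example):

# Revert to the previous ReplicaSet
kubectl rollout undo deployment/api
# Confirm the rollback completed
kubectl rollout status deployment/api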

2. Kill Slow Queries (If database slow)

-- MySQL
SHOW PROCESSLIST;  -- See running queries
-- Find query taking > 30 seconds
KILL <process-id>;  -- Stop it

-- PostgreSQL
SELECT pid, query, state FROM pg_stat_activity WHERE state != 'idle';
-- Terminate only active queries older than 30 seconds (avoid killing idle connections)
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE pid != pg_backend_pid() AND state = 'active' AND query_start < now() - interval '30 seconds';

3. Scale Horizontally (If resource maxed)

# If CPU/memory at 100%
kubectl scale deployment api --replicas=10  # Add more instances
# or
aws autoscaling set-desired-capacity --auto-scaling-group-name <asg-name> --desired-capacity 20

# Service recovers in 30-60 seconds as new instances start

4. Degrade Gracefully (If dependency down)

If payment processor down:
- Return 503 for checkout
- Queue orders for manual processing
- Users can try again in 5 minutes

If search service down:
- Disable search feature
- Show "Search temporarily unavailable"
- Users can browse without search

If cache down:
- Route around cache
- Use slower database directly
- Accept higher latency, avoid errors

5. Feature Flag (If specific feature broken)

If checkout broken but other features OK:
- Kill checkout feature flag
- Users see "Checkout under maintenance"
- Other site functions normally
- Buy time to fix checkout
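
If the flag is backed by an environment variable rather than a flag service, killing it can be as simple as the sketch below. The flag and deployment names are examples; a real flag provider (LaunchDarkly, Unleash, etc.) would use its own dashboard or CLI:

# Disable the broken feature and roll pods to pick up the change
kubectl set env deployment/api FEATURE_CHECKOUT_ENABLED=false
kubectl rollout status deployment/api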

6. Configuration Fix (If config issue)

# If environment variable wrong
kubectl set env deployment api ENV_VAR=correct_value
kubectl rollout restart deployment api

# or if config file
git commit -am "fix: correct environment variable"
make deploy
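
If the suspected configuration issue is an expired TLS certificate (a common case), a quick expiry check, assuming the endpoint is reachable (hostname is an example):

echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -enddate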

Communication During Incident

Rules for Communication

  • Honesty: Tell truth about what’s happening
  • Frequency: Update every 15 min (SEV-1), 30 min (SEV-2)
  • Specificity: Not “we’re investigating” but “database queries slow, killing long-running query”
  • Clarity: Avoid technical jargon, explain impact
  • No blame: Never blame person, focus on recovery

Communication Template

Initial (First 2 min):

SEV-1: Checkout down - 503 errors

What: POST /checkout returning 503 errors
When: Started 14:32 UTC (5 minutes ago)
Impact: ~5% of transactions failing (~$10k/hour)
Status: Investigating root cause
ETA: 15 minutes

Update (Every 15 min during incident):

UPDATE: Found root cause

Root cause: Payment API provider rate limiting us
Evidence: Logs show 429 responses from payment processor
Action: Increasing rate limit quota with provider
ETA: 10 minutes for fix, may need 5 min for orders to catch up

Resolution (When fixed):

RESOLVED: Checkout fully functional again

Root cause: Payment processor temporary rate limiting
Fix applied: Increased our rate limit quota
Time to fix: 27 minutes (14:32 to 14:59)
Impact: ~120 failed transactions (queued for manual processing)
Action: Post-incident review scheduled for tomorrow 10am

Notify Stakeholders

Immediately (if SEV-1):

  • #incidents Slack channel
  • @oncall
  • VP Engineering
  • Customer Success team

Every 15 minutes:

  • Post update in #incidents thread
  • If still ongoing, email major customers

After 1 hour (if still ongoing):

  • Public status page update
  • Email all customers
  • If critical, call major customers

Post-Incident Review

Timing

  • SEV-1: Review within 24 hours
  • SEV-2: Review within 3 days
  • SEV-3/4: Review optional, log lessons

Review Participants

  • Incident Commander
  • Responders (who worked on incident)
  • Service owner
  • One person taking notes

Review Structure (30 min meeting)

1. Timeline (5 min)

14:32 - Incident starts (checkout returns 503)
14:33 - Alert fires, IC pages on-call
14:35 - IC declares SEV-1
14:38 - Team identifies payment processor rate limiting
14:42 - Team increases rate limit quota
14:59 - Incident resolved, checkout working

2. What Went Well (5 min)

  • Fast detection (1 minute)
  • Clear communication
  • Quick escalation
  • Good teamwork

3. What Could Improve (10 min)

  • Didn’t have payment processor limits in runbook (add it)
  • Took 7 minutes to investigate (could have suspected API faster)
  • Didn’t have direct contact for payment processor (get it)

4. Action Items (10 min)

☐ Add payment processor limits to runbook
☐ Get direct contact info for payment processor
☐ Add payment processor rate limits to monitoring alerts
☐ Consider circuit breaker for payment API
☐ Test failover to backup payment processor

Common Incident Runbooks

Incident: Database Slow

Quick diagnosis (2 min):

-- Show slow running queries
SHOW PROCESSLIST;  -- MySQL
-- or
SELECT pid, query, query_start FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;  -- PostgreSQL

-- Show table locks
SHOW OPEN TABLES WHERE In_use > 0;  -- MySQL

Immediate action:

  1. Identify the query taking > 30 seconds
  2. Stop it: KILL <process-id> (MySQL) or SELECT pg_terminate_backend(<pid>) (PostgreSQL)
  3. Service usually recovers immediately

Investigation:

  1. What query was slow? (check logs)
  2. Is it a known slow query?
  3. Missing index?
  4. N+1 query pattern?
  5. Should cache this result?

Resolution:

  • Add index if missing
  • Optimize query
  • Add caching
  • Scale database vertically

Incident: API Server CPU 100%

Quick diagnosis (1 min):

# What process consuming CPU?
top -b -n 1 | head -20

# If Node/Python/Java process:
ps aux | grep node  # See how many processes

# Which endpoint consuming CPU?
curl http://localhost:9000/debug/cpu-profile  # if available

Immediate action:

  1. Scale horizontally: add more instances
  2. Traffic redistributes to the new instances
  3. CPU typically returns to normal within a minute or two

Investigation:

  1. What changed recently? (deployment?)
  2. Is CPU spike legitimate?
  3. Is there a memory leak? (check whether memory grows over time; see the sketch after this list)
  4. Is there a bad query? (database slow too?)
  5. Is there an infinite loop in the code?
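
To check item 3, a minimal sketch for watching whether memory grows steadily. It assumes Kubernetes (the label is an example), or a single host running a Node process:

# Per-pod CPU/memory, refreshed every 30 seconds
watch -n 30 "kubectl top pods -l app=api"
# On a single host, track the process RSS instead
watch -n 30 "ps -o pid,rss,etime,cmd -C node"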

Resolution:

  • Optimize code (cache, fewer DB queries)
  • Increase instance size
  • Scale more instances permanently
  • Add monitoring for CPU spike

Incident: Payment Processor Down

Detection:

  • Checkout returns errors
  • Logs show “Connection refused” to payment processor

Immediate action:

// Pseudo-code for graceful degradation
if (paymentProcessor.unavailable) {
  queueOrderForManualProcessing(order);
  return { success: false, reason: "Processing temporarily unavailable, please try again" };
}

Communication:

  • Tell customers: “Orders temporarily queued, will process shortly”
  • Give ETA (usually 30-60 minutes for processor recovery)

Recovery:

  • If payment processor expected to recover soon (< 1 hour): Wait and communicate
  • If expected long outage (> 1 hour): Activate backup processor if available

Incident: Disk Full

Quick diagnosis (1 min):

df -h  # Show disk usage
# Look for 100% usage

du -sh /*  # Show which directory consuming space
# Usually /var/log if log files not rotated

Immediate action:

  1. Find large log files: ls -lh /var/log/*.log
  2. Compress old logs: gzip /var/log/old.log
  3. Or delete if safe: rm /var/log/debug.log*
  4. Restart the service if it still holds deleted files open (see the check below)
  5. Confirm space is free again: df -h
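
If deleting files does not free space, a process is probably still holding the deleted files open. A quick check:

# List open files whose on-disk entry was deleted (zero remaining links)
lsof +L1 | head -20
# Restarting the owning service releases the space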

Prevention:

  • Enable log rotation (logrotate; see the sketch below)
  • Monitor disk space
  • Set alerts at 80% full
  • Clean up old files regularly
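
A minimal logrotate policy sketch for the first item above. The app name and log path are examples; tune retention to your needs:

cat > /etc/logrotate.d/myapp <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF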

Incident Command Bridge Setup

Before Incident: Prepare

  • Slack #incidents channel exists
  • On-call schedule configured (PagerDuty/etc)
  • Runbooks documented (like above)
  • Stakeholders know to watch #incidents
  • Phone bridge number available if needed

During Incident: IC Opens Bridge

1. IC posts to #incidents: "Starting investigation bridge"
2. IC starts Slack thread in #incidents
3. If SEV-1: Post phone bridge link
4. IC posts updates every 15 minutes
5. IC tracks timeline (start time, diagnosis, actions, resolution time)

Bridge Rules

  • One person talking at a time (IC manages)
  • IC asks questions, delegates tasks
  • Investigators report findings
  • No blame, focus on recovery
  • Keep bridge to 5 people max (core team)
  • Post findings in Slack thread for others to see

Escalation Paths

Who to escalate to (and when)

For database issues:

  • Page database on-call
  • Escalate after 5 minutes if still investigating

For infrastructure issues:

  • Page infrastructure on-call
  • Escalate after 5 minutes if still investigating

For unknown cause after 10 minutes:

  • Page service owner
  • Call VP Engineering
  • This means we’re stumped, need leadership

For external dependency issues:

  • If known contact: Call them
  • Otherwise: Wait or use fallback
  • Post-incident: Get direct contact numbers

Integration with Playbook

Part of deployment and reliability:

  • /pb-guide - Section 7 references incident readiness
  • /pb-observability - Monitoring and alerting enable early incident detection
  • /pb-release - Release runbook includes incident contacts
  • /pb-adr - Architecture decisions affect failure modes
  • /pb-sre-practices - On-call health, blameless culture, toil reduction
  • /pb-dr - Disaster recovery planning for major incidents
  • /pb-logging - Logging strategy for incident investigation
  • /pb-maintenance - Systematic maintenance prevents whole categories of incidents (expired certs, full disks)

Incident Response Checklist

Before Incidents Happen

See /pb-sre-practices for on-call setup, rotation health, and escalation policies.

  • Incident commander role defined
  • #incidents Slack channel created
  • Runbooks written (database, CPU, payment, disk)
  • Post-incident review process defined
  • Monitoring configured (see /pb-observability)

During Incident

  • Incident declared in #incidents within 2 minutes
  • Severity level assigned (SEV-1/2/3/4)
  • IC assigned and acknowledged
  • Investigation started
  • Communications every 15 minutes
  • Root cause identified
  • Action taken to recover
  • Resolution time tracked

After Incident

  • Post-incident review scheduled (within 24 hours)
  • Action items identified and assigned
  • Runbook updated with new learnings
  • Monitoring improved to detect earlier
  • Prevention implemented if applicable
  • All participants thanked

Created: 2026-01-11 | Category: Deployment | Tier: S/M/L