Incident Response & Recovery
Respond to production incidents quickly and professionally. Clear process, clear communication, minimal impact.
Mindset: Incident response requires both /pb-preamble and /pb-design-rules thinking.
During response: be direct about status (preamble), challenge assumptions about root cause, surface unknowns. Design systems to fail loudly (Repair, Transparency) so incidents are visible immediately. After: conduct honest post-mortems without blame, and improve system robustness.
Resource Hint: opus - critical incident triage requires deep analysis and careful judgment
Purpose
Incidents are inevitable. What matters:
- Speed: Detect and respond quickly
- Clarity: Know exactly what’s happening
- Communication: Keep stakeholders informed
- Recovery: Get back to normal fast
- Learning: Prevent repeats through post-incident review
When to Use This Command
- Production incident occurring - Service degradation or outage
- Alert fired - Monitoring detected anomaly
- Customer-reported issue - Users experiencing problems
- Post-incident - Running retrospective and writing post-mortem
- Incident prep - Reviewing process before on-call rotation
Incident Severity Levels
Classify incidents to determine response urgency and escalation.
SEV-1 (Critical, Immediate Page)
- User-facing service completely down
- Data loss or data integrity risk
- Security breach active
- Major revenue impact
- Response time: Immediate (< 5 minutes)
- Escalation: Page on-call, VP, customers
- Communication: Every 15 minutes
- Resolution target: 1-2 hours
Examples:
- API servers offline, users can’t access service
- Database corrupted, data cannot be retrieved
- Payment processing broken, no transactions processing
- Authentication system down, users locked out
SEV-2 (High, Urgent Page)
- User-facing service degraded (slow, errors)
- Partial functionality broken
- Workaround exists but poor user experience
- Response time: 15 minutes
- Escalation: Page on-call + relevant team lead
- Communication: Every 30 minutes
- Resolution target: 4 hours
Examples:
- API responses 10x slower than normal
- Search feature broken (but users can browse)
- Emails not sending (but users can still order)
- Mobile app crashes on one action (desktop works)
SEV-3 (Medium, No Page)
- Internal system degraded
- Non-critical feature broken
- User workaround available
- Limited customer impact
- Response time: Next business day acceptable
- Escalation: Slack to team, create ticket
- Communication: Daily update
- Resolution target: 1-2 days
Examples:
- Admin dashboard slow
- Reporting system down (business can continue)
- Non-critical background job failing
- One endpoint timeout (alternate exists)
SEV-4 (Low, Future Fix)
- Documentation issue
- Minor UI bug
- Development environment broken
- No user-facing impact
- Response time: Next sprint
- Escalation: Create ticket, no escalation
- Communication: Team awareness
- Resolution target: When convenient
Examples:
- Typo in UI text
- Help docs incorrect
- Dev script broken
- Console warning (no functional impact)
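The four severity levels above can be captured in a small lookup table so tooling and humans agree on the policy. A minimal sketch (names and structure are illustrative, not a prescribed schema):

```python
# Response policy per severity level, mirroring the sections above.
SEVERITY_POLICY = {
    "SEV-1": {"page": True,  "update_every_min": 15,   "resolution_target": "1-2 hours"},
    "SEV-2": {"page": True,  "update_every_min": 30,   "resolution_target": "4 hours"},
    "SEV-3": {"page": False, "update_every_min": 1440, "resolution_target": "1-2 days"},
    "SEV-4": {"page": False, "update_every_min": None, "resolution_target": "when convenient"},
}

def should_page(severity: str) -> bool:
    """True when the severity level warrants paging on-call (SEV-1/2)."""
    return SEVERITY_POLICY[severity]["page"]
```

A table like this can drive alert routing directly, so the page/no-page decision never depends on someone's memory at 3am.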
Incident Declaration
Who declares incidents?
- Anyone can declare an incident (no permission needed)
- Don’t wait for managers to approve
- Better to declare and cancel than to miss a critical issue
- When in doubt, declare
How to declare
For SEV-1/2: Declare immediately
Slack: #incidents channel
Message: "@incident-commander SEV-1: Users report 503 errors on checkout"
Include: Service affected, symptoms, your name
For SEV-3/4: Create ticket
Jira/GitHub issue with label: incident
Title: [SEV-3] Admin dashboard slow
Include: What's broken, user impact, symptoms
Incident Commander Role
Once incident declared:
- Incident Commander assigned (first responder or on-call)
- IC decides severity
- IC starts bridge call for SEV-1/2
- IC starts Slack thread tracking
- IC coordinates investigation and communication
On-Call Operations
For on-call setup, scheduling, training, and rotation health, see /pb-sre-practices → On-Call Health section.
This includes:
- On-call rotation structure and scheduling
- PagerDuty/Opsgenie setup
- On-call expectations and boundaries
- Mock incident training
- Preventing on-call burnout
This command focuses on incident response - what to do when an incident occurs. On-call operations (how to set up and maintain healthy rotations) are ongoing SRE practices.
Immediate Response (First 5 Minutes)
IC Quick Triage
1. Is it real? (5 seconds)
- Check monitoring: Is P99 latency actually up?
- Check logs: Are errors really happening?
- Avoid chasing false alarms from bad metrics
2. What’s affected? (30 seconds)
- Which services? Which endpoints? Which regions?
- How many users impacted? What percentage?
- Is it spreading or stable?
3. What changed recently? (1 minute)
- Was there a deployment? (check git log)
- Configuration change? (check configs)
- Traffic spike? (check metrics)
- External dependency failure? (check upstream health)
4. Initial action (2 minutes)
- If recent deployment: Consider rolling back immediately
- If configuration change: Revert the change
- If dependency down: Switch to failover/degraded mode
- Otherwise: Page the relevant team to investigate
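The "initial action" decision above is simple enough to encode directly. A sketch (function and return strings are illustrative):

```python
def initial_action(recent_deploy: bool, config_change: bool, dependency_down: bool) -> str:
    """Map the 'what changed recently?' findings to the first recovery action.

    Order matters: a recent deploy is the most common cause, so check it first.
    """
    if recent_deploy:
        return "rollback"
    if config_change:
        return "revert config"
    if dependency_down:
        return "failover / degraded mode"
    return "page relevant team"
```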
Initial Communication (SEV-1/2)
Send to Slack #incidents:
@channel SEV-1: Checkout failing (503 errors)
Status: Investigating
Symptoms: POST /checkout returning 503 since 14:32 UTC
Affected: ~5% of transactions
Potential causes: Database slow? Payment API down? Recent deploy?
Updates every 15 minutes in thread.
Investigation (5-30 Minutes)
Investigation Team
- Incident Commander: Coordinates, owns timeline, communicates
- On-call Engineer: Investigates the service, runs commands
- Subject Matter Expert: Pulled in as needed (database, payments, etc.)
Diagnostic Checklist
☐ Check recent deployments (git log --since="10 minutes ago")
☐ Check monitoring: latency, errors, resource usage
☐ Check logs: error messages, stack traces
☐ Check external dependencies: Are they healthy?
☐ Check database: Is it responsive? Any locks?
☐ Check traffic: Is there a sudden spike?
☐ Check configuration: Any recent changes?
☐ Check disk space: Are we full? Out of inodes?
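Several checklist items are scriptable. As one example, the disk-space check can use Python's standard library (a sketch; threshold is illustrative):

```python
import shutil

def disk_usage_pct(path: str = "/") -> float:
    """Percent of disk used at `path` -- answers the 'Are we full?' checklist item."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def disk_alert(path: str = "/", threshold_pct: float = 90.0) -> bool:
    """True when usage crosses the threshold and should be flagged during triage."""
    return disk_usage_pct(path) >= threshold_pct
```

Scripting even one or two of these checks shaves minutes off diagnosis when the pressure is on.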
Root Cause Patterns
Deployment-related (50% of incidents)
- New code has bug
- Migration script failed
- Configuration not deployed
- Infrastructure change
Action: Rollback or hotfix
Database-related (20% of incidents)
- Slow query locking table
- Connection pool exhausted
- Disk full
- Replication lag
Action: Kill slow query, scale connections, free space
Resource exhaustion (15% of incidents)
- CPU 100%
- Memory full
- Disk full
- Network bandwidth full
Action: Identify the process consuming the resource; kill it or scale out
External dependency (10% of incidents)
- API provider down
- CDN down
- Payment processor down
- DNS down
Action: Use fallback, degrade gracefully, wait for recovery
Configuration (5% of incidents)
- Wrong environment variables
- SSL certificate expired
- Feature flag stuck on/off
- Rate limiting too aggressive
Action: Fix configuration, restart service
Resolution (Immediate Actions)
Recovery Strategies (In Order of Speed)
1. Rollback (Fastest, if recent deploy)
# If incident started after recent deployment
git log --oneline -5 # See recent deploys
git revert <commit-hash> # Create revert commit
make deploy # Deploy revert
# Rollback clears issue in minutes
# Then investigate what went wrong later
2. Kill Slow Queries (If database slow)
-- MySQL
SHOW PROCESSLIST; -- See running queries
-- Find query taking > 30 seconds
KILL <process-id>; -- Stop it
-- PostgreSQL
SELECT pid, query, state FROM pg_stat_activity WHERE state != 'idle';
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND pid != pg_backend_pid() AND query_start < now() - interval '30 seconds';
3. Scale Horizontally (If resource maxed)
# If CPU/memory at 100%
kubectl scale deployment api --replicas=10 # Add more instances
# or
aws autoscaling set-desired-capacity --auto-scaling-group-name <asg-name> --desired-capacity 20
# Service recovers in 30-60 seconds as new instances start
4. Degrade Gracefully (If dependency down)
If payment processor down:
- Return 503 for checkout
- Queue orders for manual processing
- Users can try again in 5 minutes
If search service down:
- Disable search feature
- Show "Search temporarily unavailable"
- Users can browse without search
If cache down:
- Route around cache
- Use slower database directly
- Accept higher latency, avoid errors
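The cache-down pattern above can be sketched in a few lines (hypothetical `cache`/`db` objects; real client APIs will differ):

```python
def get_product(product_id, cache, db):
    """Serve from cache when possible; route around it when the cache tier is down.

    We accept the higher latency of a direct database read rather than
    surfacing errors to users.
    """
    try:
        value = cache.get(product_id)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache unreachable: fall through to the database
    return db.fetch(product_id)
```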
5. Feature Flag (If specific feature broken)
If checkout broken but other features OK:
- Kill checkout feature flag
- Users see "Checkout under maintenance"
- Other site functions normally
- Buy time to fix checkout
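A kill switch like this is just a guard at the top of the handler. A sketch (in production the flag would live in a flag service, not a module-level dict):

```python
FLAGS = {"checkout_enabled": True}  # illustrative: real flags live in a flag service

def handle_checkout(order):
    """Return a maintenance response when the checkout flag has been killed."""
    if not FLAGS["checkout_enabled"]:
        return {"status": 503, "body": "Checkout under maintenance"}
    # ... normal checkout path ...
    return {"status": 200, "body": "order accepted"}
```

Flipping the flag takes effect on the next request, with no deploy needed, which is exactly why it buys time.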
6. Configuration Fix (If config issue)
# If environment variable wrong
kubectl set env deployment api ENV_VAR=correct_value
kubectl rollout restart deployment api
# or if config file
git commit -am "fix: correct environment variable"
make deploy
Communication During Incident
Rules for Communication
- Honesty: Tell truth about what’s happening
- Frequency: Update every 15 min (SEV-1), 30 min (SEV-2)
- Specificity: Not “we’re investigating” but “database queries slow, killing long-running query”
- Clarity: Avoid technical jargon, explain impact
- No blame: Never blame person, focus on recovery
Communication Template
Initial (First 2 min):
SEV-1: Checkout down - 503 errors
What: POST /checkout returning 503 errors
When: Started 14:32 UTC (5 minutes ago)
Impact: ~5% of transactions failing (~$10k/hour)
Status: Investigating root cause
ETA: 15 minutes
Update (Every 15 min during incident):
UPDATE: Found root cause
Root cause: Payment API provider rate limiting us
Evidence: Logs show 429 responses from payment processor
Action: Increasing rate limit quota with provider
ETA: 10 minutes for fix, may need 5 min for orders to catch up
Resolution (When fixed):
RESOLVED: Checkout fully functional again
Root cause: Payment processor temporary rate limiting
Fix applied: Increased our rate limit quota
Time to fix: 27 minutes (14:32 to 14:59)
Impact: ~120 failed transactions (manual processing queued)
Action: Post-incident review scheduled for tomorrow 10am
Notify Stakeholders
Immediately (if SEV-1):
- #incidents Slack channel
- @oncall
- VP Engineering
- Customer Success team
Every 15 minutes:
- Post update in #incidents thread
- If still ongoing, email major customers
After 1 hour (if still ongoing):
- Public status page update
- Email all customers
- If critical, call major customers
Post-Incident Review
Timing
- SEV-1: Review within 24 hours
- SEV-2: Review within 3 days
- SEV-3/4: Review optional, log lessons
Review Participants
- Incident Commander
- Responders (who worked on incident)
- Service owner
- One person taking notes
Review Structure (30 min meeting)
1. Timeline (5 min)
14:32 - Incident starts (checkout returns 503)
14:33 - Alert fires, IC pages on-call
14:35 - IC declares SEV-1
14:38 - Team identifies payment processor rate limiting
14:42 - Team increases rate limit quota
14:59 - Incident resolved, checkout working
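Time-to-resolution falls straight out of a timeline like this. A small helper (assumes HH:MM timestamps within a single UTC day):

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes from incident start to resolution, e.g. '14:32' -> '14:59' is 27."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)
```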
2. What Went Well (5 min)
- Fast detection (1 minute)
- Clear communication
- Quick escalation
- Good teamwork
3. What Could Improve (10 min)
- Didn’t have payment processor limits in runbook (add it)
- Took 7 minutes to investigate (could have suspected API faster)
- Didn’t have direct contact for payment processor (get it)
4. Action Items (10 min)
☐ Add payment processor limits to runbook
☐ Get direct contact info for payment processor
☐ Add payment processor rate limits to monitoring alerts
☐ Consider circuit breaker for payment API
☐ Test failover to backup payment processor
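The circuit-breaker action item above can be prototyped in a few lines. A minimal sketch, not production-ready (thresholds are illustrative; a real implementation would also need a half-open trial budget and metrics):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for calls to an external API.

    After `max_failures` consecutive failures the circuit opens and calls fail
    fast for `reset_after` seconds, protecting us from a dead dependency.
    """
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```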
Common Incident Runbooks
Incident: Database Slow
Quick diagnosis (2 min):
-- Show slow running queries
SHOW PROCESSLIST; -- MySQL
-- or
SELECT pid, query, query_start FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start; -- PostgreSQL
-- Show table locks
SHOW OPEN TABLES WHERE In_use > 0; -- MySQL
Immediate action:
- Identify the query taking > 30 seconds
- Run KILL <process-id> to stop it
- Service recovers immediately
Investigation:
- What query was slow? (check logs)
- Is it a known slow query?
- Missing index?
- N+1 query pattern?
- Should cache this result?
Resolution:
- Add index if missing
- Optimize query
- Add caching
- Scale database vertically
Incident: API Server CPU 100%
Quick diagnosis (1 min):
# What process consuming CPU?
top -b -n 1 | head -20
# If Node/Python/Java process:
ps aux | grep node # See how many processes
# Which endpoint consuming CPU?
curl http://localhost:9000/debug/cpu-profile # if available
Immediate action:
- Scale horizontally: Add more instances
- Traffic redistributes to new instances
- CPU returns to normal within 1 minute
Investigation:
- What changed recently? (deployment?)
- Is CPU spike legitimate?
- Is there a memory leak? (check memory growing over time)
- Is there a bad query? (database slow too?)
- Is there infinite loop in code?
Resolution:
- Optimize code (cache, fewer DB queries)
- Increase instance size
- Scale more instances permanently
- Add monitoring for CPU spike
Incident: Payment Processor Down
Detection:
- Checkout returns errors
- Logs show “Connection refused” to payment processor
Immediate action:
// Pseudo-code for graceful degradation
if (paymentProcessor.unavailable) {
  queueOrderForManualProcessing(order);
  return { success: false, reason: "Processing temporarily unavailable, please try again" };
}
Communication:
- Tell customers: “Orders temporarily queued, will process shortly”
- Give ETA (usually 30-60 minutes for processor recovery)
Recovery:
- If payment processor expected to recover soon (< 1 hour): Wait and communicate
- If expected long outage (> 1 hour): Activate backup processor if available
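The failover decision above can be sketched as follows (hypothetical `primary`/`backup`/`manual_queue` objects; real processor clients will differ):

```python
def process_payment(order, primary, backup=None, manual_queue=None):
    """Try the primary processor, fail over to backup, else queue for manual processing."""
    for processor in (primary, backup):
        if processor is not None and processor.available():
            return processor.charge(order)
    manual_queue.append(order)
    return {"success": False, "reason": "queued for manual processing"}
```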
Incident: Disk Full
Quick diagnosis (1 min):
df -h # Show disk usage
# Look for 100% usage
du -sh /* # Show which directory consuming space
# Usually /var/log if log files not rotated
Immediate action:
- Find large log files: ls -lh /var/log/*.log
- Compress old logs: gzip /var/log/old.log
- Or delete if safe: rm /var/log/debug.log*
- Restart the service if it holds deleted files open (space is only reclaimed once file handles close)
- Disk space now available
Prevention:
- Enable log rotation (logrotate)
- Monitor disk space
- Set alerts at 80% full
- Clean up old files regularly
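The log-rotation item above might look like this (illustrative path and values; tune per service):

```
# /etc/logrotate.d/myapp  -- illustrative
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
```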
Incident Command Bridge Setup
Before Incident: Prepare
- Slack #incidents channel exists
- On-call schedule configured (PagerDuty/etc)
- Runbooks documented (like above)
- Stakeholders know to watch #incidents
- Phone bridge number available if needed
During Incident: IC Opens Bridge
1. IC posts to #incidents: "Starting investigation bridge"
2. IC starts Slack thread in #incidents
3. If SEV-1: Post phone bridge link
4. IC posts updates every 15 minutes
5. IC tracks timeline (start time, diagnosis, actions, resolution time)
Bridge Rules
- One person talking at a time (IC manages)
- IC asks questions, delegates tasks
- Investigators report findings
- No blame, focus on recovery
- Keep bridge to 5 people max (core team)
- Post findings in Slack thread for others to see
Escalation Paths
Who to escalate to (and when)
For database issues:
- Page database on-call
- 5 min: If still investigating
For infrastructure issues:
- Page infrastructure on-call
- 5 min: If still investigating
For unknown cause after 10 minutes:
- Page service owner
- Call VP Engineering
- This means we’re stumped, need leadership
For external dependency issues:
- If known contact: Call them
- Otherwise: Wait or use fallback
- Post-incident: Get direct contact numbers
Integration with Playbook
Part of deployment and reliability:
- /pb-guide - Section 7 references incident readiness
- /pb-observability - Monitoring enables incident detection
- /pb-release - Release runbook includes incident contacts
- /pb-adr - Architecture decisions affect failure modes
Related Commands
- /pb-observability - Set up monitoring and alerting to detect incidents early
- /pb-sre-practices - On-call health, blameless culture, toil reduction
- /pb-dr - Disaster recovery planning for major incidents
- /pb-logging - Logging strategy for incident investigation
- /pb-maintenance - Systematic maintenance prevents incident categories (expired certs, full disks)
Incident Response Checklist
Before Incidents Happen
See /pb-sre-practices for on-call setup, rotation health, and escalation policies.
- Incident commander role defined
- #incidents Slack channel created
- Runbooks written (database, CPU, payment, disk)
- Post-incident review process defined
- Monitoring configured (see /pb-observability)
During Incident
- Incident declared in #incidents within 2 minutes
- Severity level assigned (SEV-1/2/3/4)
- IC assigned and acknowledged
- Investigation started
- Communications every 15 minutes
- Root cause identified
- Action taken to recover
- Resolution time tracked
After Incident
- Post-incident review scheduled (within 24 hours)
- Action items identified and assigned
- Runbook updated with new learnings
- Monitoring improved to detect earlier
- Prevention implemented if applicable
- All participants thanked
Created: 2026-01-11 | Category: Deployment | Tier: S/M/L