Disaster Recovery
Plan, test, and execute recovery from major system failures. When everything goes wrong, have a plan that works.
Mindset: Disaster recovery embodies /pb-design-rules thinking: Repair (fail noisily, recover quickly), Robustness (design for failure), and Least Surprise (recovery should work as documented). Use /pb-preamble thinking to challenge assumptions about what disasters are “unlikely.”
The best time to plan for disaster is before it happens. The second best time is now.
Resource Hint: opus - disaster recovery planning demands careful architecture and risk analysis
When to Use This Command
- Creating DR plan - Establishing recovery strategy for your system
- Defining RTO/RPO - Setting recovery objectives with stakeholders
- DR testing - Running game days and failover exercises
- After an incident - Reviewing and improving DR procedures
- Compliance requirements - Documenting DR capabilities
Quick Reference
| Term | Definition |
|---|---|
| RTO | Recovery Time Objective - max acceptable downtime |
| RPO | Recovery Point Objective - max acceptable data loss |
| Failover | Switching to backup system |
| Failback | Returning to primary system |
RTO/RPO Definitions
Recovery Time Objective (RTO)
RTO = How long can you be down?
| RTO Target | Meaning | Example |
|---|---|---|
| 0 (zero) | No downtime acceptable | Payment processing |
| < 1 hour | Critical system | Core API |
| < 4 hours | Important system | Admin dashboard |
| < 24 hours | Standard system | Reporting |
| < 1 week | Low priority | Development tools |
Setting RTO:
Questions to ask:
- What is the business impact per hour of downtime?
- Do we have SLA commitments?
- What is our reputation risk?
- What can we realistically achieve?
Recovery Point Objective (RPO)
RPO = How much data can you lose?
| RPO Target | Meaning | Backup Strategy |
|---|---|---|
| 0 (zero) | No data loss | Synchronous replication |
| < 1 minute | Near-zero | Streaming replication |
| < 1 hour | Minimal | Frequent snapshots |
| < 24 hours | Standard | Daily backups |
| < 1 week | Acceptable | Weekly backups |
Setting RPO:
Questions to ask:
- How much work would users lose?
- Can data be reconstructed from other sources?
- What is the regulatory requirement?
- What can we afford to backup?
RTO/RPO Trade-offs
Lower RTO/RPO = Higher cost and complexity
Zero RTO + Zero RPO:
- Active-active multi-region
- Synchronous replication
- Expensive, complex
1 hour RTO + 1 hour RPO:
- Warm standby
- Frequent async replication
- Moderate cost
24 hour RTO + 24 hour RPO:
- Cold standby
- Daily backups
- Low cost
Document your targets:
## Service: Payment Processing
- RTO: 15 minutes
- RPO: 0 (zero data loss)
- Justification: Revenue impact, regulatory requirement
- Strategy: Active-passive with synchronous replication
## Service: Admin Dashboard
- RTO: 4 hours
- RPO: 1 hour
- Justification: Internal tool, can reconstruct recent changes
- Strategy: Backup restore from hourly snapshots
Backup Strategies
The 3-2-1 Rule
- 3 copies of data
- 2 different storage types
- 1 offsite location
Example:
Copy 1: Production database (primary)
Copy 2: Local replica (different disk)
Copy 3: Cloud storage backup (different region/provider)
Immutable Backups
Protect against ransomware and accidental deletion.
# AWS S3 with Object Lock
aws s3api put-object-lock-configuration \
--bucket my-backups \
--object-lock-configuration '{
"ObjectLockEnabled": "Enabled",
"Rule": {
"DefaultRetention": {
"Mode": "GOVERNANCE",
"Days": 30
}
}
}'
# Objects cannot be deleted for 30 days
Immutability options:
- AWS S3 Object Lock
- Azure Immutable Blob Storage
- GCP Bucket Lock
- Air-gapped offline backups
Backup Verification
Backups that haven’t been tested are not backups.
# Monthly backup verification script
#!/bin/bash
echo "=== Backup Verification $(date) ==="
# 1. Download latest backup
aws s3 cp s3://backups/latest/db.sql.gz /tmp/restore-test/
# 2. Restore to test database
gunzip /tmp/restore-test/db.sql.gz
psql -h test-db -U admin -d restore_test < /tmp/restore-test/db.sql
# 3. Verify data integrity
EXPECTED_ROWS=1000000 # Known approximate count
ACTUAL_ROWS=$(psql -h test-db -U admin -d restore_test -t -A -c "SELECT COUNT(*) FROM users")
if [ "$ACTUAL_ROWS" -lt "$EXPECTED_ROWS" ]; then
echo "ERROR: Row count mismatch. Expected ~$EXPECTED_ROWS, got $ACTUAL_ROWS"
exit 1
fi
# 4. Verify application can connect
curl -f http://test-app/health || exit 1
echo "=== Backup verification PASSED ==="
Verification schedule:
- Daily: Automated integrity checks
- Weekly: Restore to test environment
- Monthly: Full recovery drill
- Quarterly: DR test (see below)
Retention Policies
| Backup Type | Retention | Purpose |
|---|---|---|
| Hourly | 24 hours | Point-in-time recovery |
| Daily | 30 days | Short-term recovery |
| Weekly | 3 months | Medium-term recovery |
| Monthly | 1 year | Long-term/compliance |
| Yearly | 7 years | Regulatory (varies) |
Failover Procedures
Manual Failover Steps
When automated failover isn’t possible or appropriate:
## Database Failover Runbook
### Pre-Conditions
- Primary database is unresponsive or corrupted
- Replica has current data (check replication lag)
- You have authority to initiate failover
### Steps
1. **Verify the problem (2 min)**
- Is primary truly down? (not network issue)
- What is replica lag? (acceptable data loss?)
- Notify team in #incidents
2. **Stop writes to primary (1 min)**
- Update application config to reject writes
- Or: Block primary at network level
3. **Promote replica (5 min)**
```bash
# PostgreSQL
pg_ctl promote -D /var/lib/postgresql/data
# Verify promotion
psql -c "SELECT pg_is_in_recovery();" # Should return 'f'
-
Update application config (2 min)
- Point DATABASE_URL to new primary
- Deploy config change
-
Verify application (2 min)
- Check health endpoints
- Verify writes working
- Monitor error rates
-
Communicate (ongoing)
- Update status page
- Notify stakeholders
Post-Failover
- Document what happened
- Schedule postmortem
- Plan failback (when original primary is repaired)
### Automated Failover
For zero/low RTO requirements:
```yaml
# Example: PostgreSQL with Patroni (automated failover)
# patroni.yml
scope: my-cluster
name: node1
restapi:
listen: 0.0.0.0:8008
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB max lag for failover
postgresql:
listen: 0.0.0.0:5432
data_dir: /var/lib/postgresql/data
parameters:
synchronous_commit: "on" # For zero data loss
Automated failover considerations:
- Test failover regularly (it will fail when you need it otherwise)
- Set appropriate lag thresholds
- Have manual override procedures
- Monitor failover events
DNS-Based Failover
For simple active-passive setups:
# Health check fails → update DNS
# Using AWS Route 53 health checks
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "db.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "10.0.2.100"}]
}
}]
}'
DNS failover considerations:
- TTL affects failover time (lower TTL = faster failover, more DNS traffic)
- Clients may cache DNS beyond TTL
- Not suitable for zero-RTO requirements
Recovery Testing
Game Day Exercises
Controlled failure injection to test recovery.
Game day template:
## Game Day: Database Failover Test
### Date: 2026-02-15
### Duration: 2 hours (10am - 12pm)
### Participants: SRE team, Database team, On-call engineer
### Objectives
- Verify automated failover works as documented
- Measure actual RTO
- Identify documentation gaps
### Scenario
Simulate primary database failure during normal traffic.
### Pre-Conditions
- Staging environment configured identically to production
- All participants briefed
- Rollback plan ready
- Status page prepared
### Steps
1. (T+0) Announce game day start
2. (T+5) Inject failure: Stop primary database
3. (T+5) Observe: Does automated failover trigger?
4. (T+10) Measure: Time to full recovery
5. (T+20) Verify: Application functioning correctly
6. (T+30) Restore: Bring original primary back
7. (T+45) Failback: Return to original configuration
8. (T+60) Debrief: What worked, what didn't
### Success Criteria
- RTO < 5 minutes (target: 2 minutes)
- RPO = 0 (synchronous replication)
- No customer-visible errors
### Actual Results
[Fill in after exercise]
- RTO achieved: ___
- RPO achieved: ___
- Issues discovered: ___
- Action items: ___
Chaos Engineering (Lite)
Start simple before full chaos engineering:
Level 1: Planned failures
- Terminate a server during maintenance window
- Failover database on schedule
- Disconnect from external service
Level 2: Automated small failures
- Random pod termination (Kubernetes)
- Inject latency into service calls
- Simulate partial network failures
Level 3: Full chaos engineering
- Netflix Chaos Monkey style
- Production failures
- Requires mature observability and recovery
Start with Level 1. Master each level before advancing.
Tabletop Exercises
Discussion-based DR testing without actual system changes.
## Tabletop Exercise: Ransomware Attack
### Scenario
You arrive Monday morning. All production databases are encrypted.
Attackers demand 10 BTC. Last known good backup was Friday 6pm.
### Discussion Questions
1. Who do you notify first?
2. How do you verify backup integrity?
3. What is your recovery sequence?
4. How do you communicate with customers?
5. What is the estimated recovery time?
6. Do you pay the ransom? (Spoiler: No)
### Expected Outcomes
- Validate contact lists are current
- Identify gaps in backup strategy
- Practice decision-making under pressure
- Update runbooks based on discussion
Data Recovery Workflows
Database Point-in-Time Recovery
# PostgreSQL: Restore to specific timestamp
# Requires WAL archiving enabled
# 1. Stop application
sudo systemctl stop myapp
# 2. Create recovery configuration (PostgreSQL 12+)
# Note: recovery.conf was removed in PostgreSQL 12
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-01-20 14:30:00'
recovery_target_action = 'promote'
EOF
# Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal
# 3. Restore base backup
pg_basebackup -h backup-server -D /var/lib/postgresql/data-new
# 4. Start PostgreSQL (will replay WAL to target time)
sudo systemctl start postgresql
# 5. Verify data
psql -c "SELECT MAX(created_at) FROM transactions;"
File System Recovery
# From snapshot (cloud provider)
aws ec2 create-volume \
--snapshot-id snap-123456 \
--availability-zone us-east-1a
# Mount and verify
sudo mount /dev/xvdf /mnt/recovery
ls -la /mnt/recovery/
# Or from backup
rsync -avz backup-server:/backups/2026-01-20/ /mnt/recovery/
Application State Recovery
Some applications have state that needs recovery beyond database:
- Session data: May need to invalidate all sessions
- Cache data: Rebuild from source of truth
- File uploads: Restore from object storage backup
- Search indexes: Rebuild from database
Recovery sequence matters:
1. Database (source of truth)
2. File storage
3. Application servers
4. Cache/search indexes (rebuild)
5. CDN/edge cache (invalidate)
Communication During Disaster
Status Page Updates
Update template:
## Incident: Database Outage
### [RESOLVED] 15:45 UTC
The database has been restored and all services are operational.
We are monitoring for any residual issues.
### [UPDATE] 15:30 UTC
Database restore in progress. Estimated completion: 15 minutes.
### [UPDATE] 15:00 UTC
We have identified the issue and are restoring from backup.
RTO estimate: 45 minutes.
### [INVESTIGATING] 14:30 UTC
We are experiencing database connectivity issues.
Some users may see errors. We are investigating.
Communication cadence:
- Initial: Within 10 minutes of detection
- Updates: Every 30 minutes (or on significant change)
- Resolution: When fully restored
Stakeholder Communication
Internal escalation:
- On-call engineer
- Team lead
- Engineering manager
- VP Engineering (for major incidents)
- CEO (for customer-facing outages > 1 hour)
External communication:
- Status page (all incidents)
- Email to affected customers (significant incidents)
- Social media (major outages)
- Press (if necessary)
Communication Templates
Customer email template:
Subject: Service Disruption - [Service Name]
Dear Customer,
We experienced a service disruption affecting [specific impact]
between [start time] and [end time] UTC.
What happened:
[Brief, non-technical explanation]
What we're doing:
[Actions taken to prevent recurrence]
Impact to you:
[Specific impact, any data affected]
Next steps:
[Any action required from customer]
We apologize for the inconvenience and appreciate your patience.
[Your name]
[Company name]
Post-Recovery Verification
After recovery, verify before declaring success:
Verification Checklist
## Post-Recovery Verification
### Data Integrity
- [ ] Row counts match expected values
- [ ] Recent transactions present
- [ ] No data corruption detected
- [ ] Referential integrity intact
### Application Function
- [ ] All health checks passing
- [ ] Authentication working
- [ ] Core user flows working
- [ ] Background jobs processing
### Performance
- [ ] Response times normal
- [ ] No error rate elevation
- [ ] Database query times normal
- [ ] No resource exhaustion
### Monitoring
- [ ] All alerts cleared
- [ ] Dashboards show normal
- [ ] Logs show no errors
- [ ] External monitors green
### Communication
- [ ] Status page updated
- [ ] Team notified
- [ ] Stakeholders updated
- [ ] Postmortem scheduled
DR Plan Template
Every critical service needs a DR plan.
# Disaster Recovery Plan: [Service Name]
## Overview
- Service: [Name]
- Owner: [Team]
- Last updated: [Date]
- Last tested: [Date]
## Recovery Objectives
- RTO: [X hours]
- RPO: [X hours]
## Backup Strategy
- Method: [Daily snapshot, continuous replication, etc.]
- Location: [Where backups stored]
- Retention: [How long kept]
- Verification: [How/when tested]
## Failure Scenarios
### Scenario 1: Database Failure
- Detection: [How we know]
- Response: [Steps to recover]
- Runbook: [Link]
### Scenario 2: Complete Region Failure
- Detection: [How we know]
- Response: [Steps to recover]
- Runbook: [Link]
### Scenario 3: Data Corruption
- Detection: [How we know]
- Response: [Steps to recover]
- Runbook: [Link]
## Recovery Procedures
1. [Step 1]
2. [Step 2]
3. [Step 3]
## Contacts
- Primary: [Name, contact]
- Backup: [Name, contact]
- Escalation: [Name, contact]
## Dependencies
- [Service 1]: [Impact if unavailable]
- [Service 2]: [Impact if unavailable]
## Testing Schedule
- Monthly: Backup verification
- Quarterly: Failover test
- Annually: Full DR test
Integration with Playbook
Part of operational excellence:
/pb-hardening- Prevent disasters through security/pb-secrets- Protect credentials/pb-sre-practices- Sustainable operations/pb-dr- Recover when prevention fails (this command)/pb-incident- Respond during disasters
DR testing cadence:
Monthly: Backup verification
Quarterly: Failover testing (game day)
Annually: Full DR simulation
After changes: Verify DR still works
Quick Reference
| Topic | Action |
|---|---|
| Set RTO/RPO | Document for each critical service |
| Verify backups | Monthly restore test |
| Test failover | Quarterly game day |
| Update DR plan | After any infrastructure change |
| Practice communication | Include in tabletop exercises |
Related Commands
/pb-incident- Respond to incidents during disaster scenarios/pb-sre-practices- Sustainable operations and toil reduction/pb-database-ops- Database backup and failover procedures/pb-deployment- Deploy recovery infrastructure/pb-maintenance- Backup verification and ongoing maintenance scheduling
Hope for the best, plan for the worst, test the plan.