SRE Practices
Build sustainable, reliable operations through toil reduction, error budgets, and healthy on-call practices. This command focuses on prevention and culture, complementing /pb-incident (response) and /pb-observability (monitoring).
Mindset: SRE practices embody /pb-preamble thinking: blameless culture, honest assessment of reliability, and challenging “we’ve always done it this way.” Apply /pb-design-rules thinking: Robustness (systems should handle failure gracefully) and Transparency (make operational health visible).
Reliability is a feature. Invest in it deliberately, not reactively.
Resource Hint: opus - SRE strategy requires architectural thinking and reliability trade-off analysis
When to Use This Command
- Reducing toil - Automating repetitive operational tasks
- Setting SLOs - Defining reliability targets and error budgets
- On-call review - Improving rotation health and reducing burnout
- Capacity planning - Preventing resource exhaustion
- Building SRE culture - Establishing sustainable operations practices
Quick Reference
| Practice | Purpose | Frequency |
|---|---|---|
| Toil reduction | Eliminate repetitive manual work | Ongoing |
| Error budgets | Balance reliability vs velocity | Per release |
| Capacity planning | Prevent resource exhaustion | Quarterly |
| Service ownership | Clear accountability | Always |
| On-call health | Sustainable rotations | Weekly review |
Toil Identification & Reduction
What Is Toil?
Toil is work that is:
- Manual - Requires human intervention
- Repetitive - Done over and over
- Automatable - Could be scripted or eliminated
- Reactive - Triggered by events, not planned
- No enduring value - Doesn’t improve the system
Examples of toil:
- Manually restarting crashed services
- Responding to the same alert repeatedly
- Manual deployment steps
- Copying data between systems
- Responding to routine access requests
Not toil:
- On-call incident response (unavoidable, requires judgment)
- Postmortems (creates enduring improvement)
- System design (creates lasting value)
Toil Tracking
Track toil to understand where to invest automation.
Toil log template:
| Date | Task | Time Spent | Frequency | Automatable? | Priority |
|---|---|---|---|---|---|
| 2026-01-20 | Restart API pod after OOM | 15min | 2x/week | Yes | High |
| 2026-01-20 | Generate weekly report | 30min | Weekly | Yes | Medium |
| 2026-01-20 | Provision dev environment | 1hr | 3x/month | Yes | High |
Metrics to track:
- Total toil hours per week
- Toil as percentage of engineering time (target: < 50%)
- Top 5 toil sources
- Toil reduction over time
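A minimal sketch of computing these metrics from a toil log; the entries and the 40-hour engineering week are illustrative assumptions, not prescribed values:
```python
from dataclasses import dataclass

@dataclass
class ToilEntry:
    task: str
    minutes_per_occurrence: float
    occurrences_per_week: float

    @property
    def hours_per_week(self) -> float:
        return self.minutes_per_occurrence * self.occurrences_per_week / 60

# Hypothetical entries matching the log template above
log = [
    ToilEntry("Restart API pod after OOM", 15, 2),
    ToilEntry("Generate weekly report", 30, 1),
    ToilEntry("Provision dev environment", 60, 0.75),  # 3x/month ≈ 0.75x/week
]

engineering_hours_per_week = 40  # assumed available time per engineer
toil_hours = sum(e.hours_per_week for e in log)
print(f"Toil: {toil_hours:.1f}h/week "
      f"({100 * toil_hours / engineering_hours_per_week:.0f}% of engineering time)")

# Top toil sources, largest first
for e in sorted(log, key=lambda e: e.hours_per_week, reverse=True):
    print(f"  {e.task}: {e.hours_per_week:.1f}h/week")
```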
Toil Budget
Rule: Keep toil below 50% of on-call/operations time.
If toil > 50%:
→ Stop new feature work
→ Focus on automation until toil < 50%
→ This is not optional
Why 50%? Engineers need time for:
- Improving systems (not just keeping them running)
- Learning and growth
- Sustainable pace
Prioritizing Automation
| Criteria | Weight |
|---|---|
| Frequency (how often) | High |
| Time per occurrence | High |
| Error-prone when manual | High |
| Blocks other work | Medium |
| Causes context switching | Medium |
Automation ROI formula:
Hours saved = (frequency × time per occurrence × weeks) - automation time
If hours saved > 0 in reasonable timeframe → automate
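A small sketch of the ROI formula; the 26-week horizon and the 8-hour automation estimate are assumptions for illustration:
```python
def automation_roi_hours(freq_per_week: float, minutes_per_occurrence: float,
                         weeks: float, automation_hours: float) -> float:
    """Net hours saved over the horizon: (frequency x time x weeks) - automation time."""
    return freq_per_week * (minutes_per_occurrence / 60) * weeks - automation_hours

# Example: pod restarts, 2x/week at 15 min each, automation estimated at 8 hours
net = automation_roi_hours(freq_per_week=2, minutes_per_occurrence=15,
                           weeks=26, automation_hours=8)
print(f"Net hours saved over 26 weeks: {net:.1f}")  # positive -> automate
```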
Quick wins first: Start with high-frequency, low-complexity tasks.
Error Budget Policies
Error budgets translate SLO targets into actionable decisions. For SLO definition, see /pb-observability.
Understanding Error Budgets
If your SLO is 99.9% availability (43 minutes downtime/month):
- Error budget = 43 minutes of allowed downtime
- Budget consumed = actual downtime this month
- Budget remaining = what you can “spend” on risky changes
SLO: 99.9% availability
Monthly error budget: 43 minutes
Week 1: 10 min downtime → 33 min remaining (77% left)
Week 2: 5 min downtime → 28 min remaining (65% left)
Week 3: 20 min downtime → 8 min remaining (19% left)
Week 4: SLOW DOWN - limited budget for risky deploys
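A minimal sketch of the arithmetic behind this walkthrough:
```python
def monthly_error_budget_minutes(slo: float, days_in_month: int = 30) -> float:
    """Downtime allowed per month by an availability SLO."""
    return (1 - slo) * days_in_month * 24 * 60

budget = monthly_error_budget_minutes(0.999)  # ~43 minutes
downtime_minutes = [10, 5, 20]                # weeks 1-3 from the example above
remaining = budget - sum(downtime_minutes)
print(f"Budget {budget:.0f} min, remaining {remaining:.0f} min "
      f"({100 * remaining / budget:.0f}% left)")  # ~8 min, ~19% left
```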
Error Budget Policy
When budget is healthy (> 50% remaining):
- Deploy new features freely
- Take calculated risks
- Experiment with new technologies
When budget is concerning (25-50% remaining):
- Increase review rigor for changes
- Prioritize reliability fixes
- Reduce deployment frequency
- Add more testing before deploy
When budget is critical (< 25% remaining):
- Freeze non-critical deployments
- Focus exclusively on reliability
- Postmortem recent incidents
- Delay feature work until budget recovers
When budget is exhausted (0% remaining):
- Emergency mode: reliability only
- No new features until SLO is met
- All hands on reliability improvement
- Stakeholder communication required
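These tiers can be encoded so a dashboard or deploy pipeline surfaces the current policy automatically; a sketch using the thresholds above:
```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy tiers above."""
    if remaining_fraction <= 0:
        return "exhausted: emergency mode, reliability work only"
    if remaining_fraction < 0.25:
        return "critical: freeze non-critical deploys"
    if remaining_fraction < 0.50:
        return "concerning: increase review rigor, slow down"
    return "healthy: deploy freely"

print(budget_policy(8 / 43))  # ~19% remaining -> critical
```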
Negotiating with Product
Error budgets create healthy tension between reliability and velocity.
Conversation framework:
Product: "We need to ship feature X this week"
SRE: "Our error budget is at 15%. If we deploy and cause an outage,
we'll miss our SLO commitment.
Options:
1. Wait until budget recovers (2 weeks)
2. Deploy with extra safeguards (canary, feature flag)
3. Accept SLO miss and communicate to customers
Which tradeoff works for the business?"
Document the decision. If product chooses to spend budget, that is a valid business decision, but make it explicit.
Capacity Planning
Prevent resource exhaustion before it becomes an incident.
Capacity Metrics
Track these for critical services:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| CPU utilization | > 60% sustained | > 80% | Scale up |
| Memory utilization | > 70% sustained | > 85% | Scale up or optimize |
| Disk usage | > 70% | > 85% | Expand or clean |
| Database connections | > 70% of pool | > 85% | Increase pool or optimize |
| Request latency | P99 > 2x baseline | P99 > 5x | Investigate |
Forecasting Load
Simple linear projection:
Current: 1000 requests/sec
Growth rate: 10% month-over-month
Capacity limit: 2000 requests/sec
Months until capacity:
1000 × 1.1^n = 2000
n ≈ 7 months
Action: Plan capacity increase by month 5
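The same projection in code, using the example numbers above (a sketch; substitute your own growth rate and capacity limit):
```python
import math

def months_until_capacity(current: float, limit: float, monthly_growth: float) -> float:
    """Solve current x (1 + growth)^n = limit for n."""
    return math.log(limit / current) / math.log(1 + monthly_growth)

n = months_until_capacity(current=1000, limit=2000, monthly_growth=0.10)
print(f"Months until capacity: {n:.1f}")  # ~7.3 -> plan the increase by month 5
```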
Consider:
- Organic growth (user base)
- Seasonal patterns (holidays, events)
- Marketing campaigns
- New feature launches
Capacity Planning Cadence
Quarterly:
- Review current utilization
- Update growth projections
- Plan infrastructure changes for next quarter
Before major launches:
- Load testing at 2x expected traffic
- Pre-scale infrastructure
- Define rollback triggers
Template: Quarterly Capacity Review
## Q1 2026 Capacity Review
### Current State
- API servers: 8 instances, 45% avg CPU
- Database: 16GB RAM, 60% utilized
- Storage: 500GB, 55% used
### Growth Since Last Quarter
- Traffic: +15%
- Storage: +20%
- Users: +12%
### Projections for Q2
- Expected traffic: +15% (based on trend)
- Storage needs: +100GB (based on data growth)
- No CPU concerns (headroom sufficient)
### Actions
- [ ] Increase storage allocation by 200GB (buffer)
- [ ] Monitor database memory (approaching threshold)
- [ ] No immediate scaling needed for compute
Service Ownership Model
Clear ownership prevents “that’s not my job” failures.
What Owners Are Responsible For
Service owners must:
- Maintain SLO compliance
- Respond to pages for their service
- Document runbooks and architecture
- Plan capacity for their service
- Perform regular dependency audits
- Conduct postmortems for incidents
Ownership Documentation
Every service needs:
## Service: Payment Processing
### Owner
- Team: Payments
- Primary contact: @payments-oncall
- Escalation: @payments-lead
### SLOs
- Availability: 99.95%
- Latency P99: < 500ms
- Error rate: < 0.1%
### Dependencies
- Database: PostgreSQL (owned by Data Platform)
- Queue: Redis (owned by Platform)
- External: Stripe API
### Runbooks
- [Payment processing failures](link)
- [High latency investigation](link)
- [Database connection issues](link)
### On-Call
- Rotation: Weekly, Monday handoff
- Contact: PagerDuty "payments" service
Handoff Protocol
When ownership changes (reorg, team changes):
- Documentation audit - Is everything documented?
- Runbook review - Walk through with new owner
- Shadow on-call - New owner shadows for 2 weeks
- Gradual handoff - New owner primary, old owner backup
- Clean handoff - New owner fully responsible
Never abandon a service without explicit handoff.
Blameless Culture & Psychological Safety
Blame prevents learning. Psychological safety enables improvement.
Why Blameless Matters
With blame:
- Engineers hide mistakes
- Root causes stay hidden
- Same incidents repeat
- Team trust erodes
Without blame:
- Engineers report problems early
- Root causes are discovered
- Systems improve
- Team trust grows
Blameless Postmortem Language
Avoid:
- “John caused the outage by…”
- “The mistake was…”
- “They should have known…”
- “Why didn’t anyone…”
Instead:
- “The system allowed…”
- “The process didn’t catch…”
- “The automation was missing…”
- “How might we prevent…”
Creating Psychological Safety
Leaders must:
- Thank people for reporting problems
- Share their own mistakes openly
- Never punish for honest errors
- Focus questions on systems, not people
- Celebrate learning from failures
Indicators of safety:
- People raise concerns early
- Bad news travels fast
- Postmortems are collaborative, not defensive
- Teams voluntarily share failures
The “5 Whys” Without Blame
Incident: Customer data exposed in logs
Why? Logs included full request bodies
Why? Logging configuration didn't exclude sensitive fields
Why? No standard logging template for sensitive services
Why? Each team built their own logging
Why? No central platform team until recently
Action: Create standard logging library with PII redaction
Notice: No individual blamed. Focus on system improvement.
On-Call Scheduling & Setup
Before incidents happen, establish clear on-call coverage. This section covers setup; see “On-Call Health” below for sustainability.
Rotation Structure
Primary On-Call: Responds immediately (paged on SEV-1/2)
- Expected to join call within 5 minutes
- Use 1-week rotations (on-call carries a high interrupt cost, so keep stretches short)
Secondary On-Call: Backup if primary unavailable
- Called if primary doesn't respond in 5 minutes
Weekly Rotation:
- Handoff: Friday 5pm (or end of week)
- Ramp-up: New person shadows for 1 week first
On-Call Tools
PagerDuty / Opsgenie (Recommended):
- Escalation policy: Primary → Secondary (5 min) → Manager (5 min)
- Alert routing: SEV-1/2 page immediately, SEV-3 creates ticket
- Calendar integration for swaps and visibility
Simple Alternative: Google Calendar + Slack bot (/whois-oncall)
On-Call Expectations
During on-call week:
- Respond to SEV-1/2 pages within 5 minutes
- Work from location where you can join calls
- Avoid travel to areas without cell service
Company should:
- Pay on-call stipend
- Limit to 1 week per month if possible
- Provide recovery time after heavy rotations
- Never force on-call against will
Mock Incident Training
Required before first live on-call (30-45 min):
- Scenario: Simulate realistic incident (e.g., API down after deployment)
- Practice: New person declares incident, checks dashboards, identifies root cause
- Debrief: Review decision speed, communication frequency, escalation awareness
This prevents: Chaotic first incidents, decision paralysis under pressure
On-Call Health
Sustainable on-call prevents burnout and maintains quality.
Healthy Rotation Patterns
Good:
- 1 week on, 3+ weeks off
- Defined business hours (primary) vs after-hours (backup)
- Clear escalation paths
- Compensatory time off after heavy rotations
Bad:
- Always-on expectations
- 1 week on, 1 week off (too frequent)
- No backup coverage
- Pages for non-actionable alerts
On-Call Load Metrics
Track per rotation:
| Metric | Healthy | Concerning | Action Needed |
|---|---|---|---|
| Pages per week | < 5 | 5-15 | > 15 |
| Night pages | < 1 | 1-3 | > 3 |
| Time to acknowledge | < 5 min | 5-15 min | > 15 min |
| False positive rate | < 10% | 10-30% | > 30% |
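A small sketch that turns these thresholds into a per-rotation health check (the values are the illustrative ones from the table, not fixed standards):
```python
def rotation_health(pages_per_week: int, night_pages: int,
                    false_positive_rate: float) -> str:
    """Classify a rotation using the thresholds from the table above."""
    if pages_per_week > 15 or night_pages > 3 or false_positive_rate > 0.30:
        return "action needed"
    if pages_per_week >= 5 or night_pages >= 1 or false_positive_rate >= 0.10:
        return "concerning"
    return "healthy"

print(rotation_health(pages_per_week=7, night_pages=2, false_positive_rate=0.20))  # concerning
```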
If metrics are concerning:
- Reduce alert noise (tune thresholds)
- Automate responses where possible
- Add more people to rotation
- Split into sub-rotations by service
Preventing Burnout
Signs of on-call burnout:
- Dreading rotation weeks
- Ignoring or silencing pages
- Decreased quality of incident response
- Increased sick days during rotation
- Team members leaving
Prevention:
- Regular rotation reviews
- Rotate out of on-call for a quarter (recovery)
- Celebrate reliability improvements
- Make on-call load visible to leadership
- Budget time for on-call automation
On-Call Handoff Template
## On-Call Handoff: Jan 20 → Jan 27
### Outgoing (Alice)
- No ongoing incidents
- Known issues:
- API latency spike at 3pm daily (monitoring, not actionable)
- Staging environment flaky (don't page for staging)
### Incoming (Bob)
- Confirmed: I have access to all systems
- Confirmed: PagerDuty is configured correctly
- Questions: None
### Deployment Schedule
- Tuesday: Feature X (low risk)
- Thursday: Database migration (high risk, after-hours)
### Contacts
- Database: @db-oncall
- Infrastructure: @infra-oncall
- Escalation: @engineering-lead
Operational Review Cadence
Regular reviews prevent drift and maintain operational health.
Weekly: Operational Standup (15 min)
- Recent incidents and postmortem status
- Current error budget consumption
- On-call load from last week
- Any blockers or concerns
Monthly: Reliability Review (1 hour)
- SLO compliance for the month
- Error budget trends
- Toil tracking update
- Capacity utilization review
- Action items from postmortems
Quarterly: Operational Planning (2 hours)
- Quarterly capacity planning
- Toil reduction priorities
- On-call rotation health
- SLO adjustments (if needed)
- Training and documentation gaps
Annually: Disaster Recovery Testing
- Full DR test (see /pb-dr)
- On-call process review
- Major incident simulation
- Documentation audit
Server Migration Checklist
Database Migrations
Always use full dump/restore:
# WRONG: Selective table export (misses users, tokens, etc.)
pg_dump -t verses -t cases dbname > partial.sql
# RIGHT: Full database dump
pg_dump -U user dbname > backup.sql
psql -U user dbname < backup.sql  # restore into the target database
Pre-migration:
- Document all table row counts on source
- Verify auth tables included (users, refresh_tokens, sessions)
- Plan for downtime window
Post-migration verification:
SELECT 'users', count(*) FROM users
UNION ALL SELECT 'refresh_tokens', count(*) FROM refresh_tokens
UNION ALL SELECT 'cases', count(*) FROM cases;
- Row counts match source
- Login flow works
- Existing sessions remain valid
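The row-count comparison above can be automated across source and target; a hedged sketch assuming psycopg2 is available and using placeholder connection strings (adjust the table list to your schema):
```python
import psycopg2  # assumption: psycopg2 is installed; DSNs below are placeholders

TABLES = ["users", "refresh_tokens", "cases"]

def row_counts(dsn: str) -> dict:
    """Return {table: row count} for the tables being verified."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        counts = {}
        for table in TABLES:
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
        return counts

source = row_counts("postgresql://user@source-host/dbname")
target = row_counts("postgresql://user@target-host/dbname")
for table in TABLES:
    status = "OK" if source[table] == target[table] else "MISMATCH"
    print(f"{table}: source={source[table]} target={target[table]} {status}")
```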
Rollback plan:
- Keep source database running (read-only) until verification complete
- Document rollback steps before starting migration
- Test rollback procedure in staging first
New Server Security Verification
Before deploying services, verify hardening (Linux servers):
| Item | Command | Expected |
|---|---|---|
| SSH key-only | grep PasswordAuth /etc/ssh/sshd_config | no |
| Root restricted | grep PermitRootLogin /etc/ssh/sshd_config | prohibit-password |
| UFW enabled | ufw status | Status: active |
| Fail2ban running | systemctl status fail2ban | active |
| Auditd running | systemctl status auditd | active |
| Kernel hardened | sysctl net.ipv4.tcp_syncookies | 1 |
| Secrets protected | stat -c %a .env | 600 |
Note: stat syntax varies by platform. Use -c %a on Linux, -f%Lp on macOS.
Integration with Playbook
Complements existing commands:
- /pb-incident - Incident response and postmortems
- /pb-observability - SLO definitions, metrics, alerting
- /pb-deployment - Deployment strategies
- /pb-dr - Disaster recovery planning
Workflow:
Design (/pb-observability - define SLOs)
↓
Operate (this command - sustainable practices)
↓
Respond (/pb-incident - when things break)
↓
Recover (/pb-dr - disaster scenarios)
↓
Improve (back to operate)
Quick Commands
| Topic | Action |
|---|---|
| Track toil | Log time spent on repetitive tasks |
| Check error budget | Compare incidents to SLO allowance |
| Review capacity | Quarterly utilization review |
| Assess on-call health | Track pages per week, night pages |
| Conduct postmortem | Blameless, focus on systems |
Related Commands
- /pb-incident - Respond to production incidents
- /pb-observability - Set up monitoring, SLOs, and alerting
- /pb-dr - Disaster recovery planning and testing
- /pb-team - Build high-performance engineering teams
Reliability is a feature. Invest in it deliberately.