Production Maintenance

Establish systematic maintenance patterns to prevent production incidents. This playbook provides thinking triggers for database maintenance, backup verification, health monitoring, and alerting strategy.

Mindset: Maintenance embodies /pb-design-rules thinking: Robustness (systems fail gracefully when maintenance lapses) and Transparency (make system health visible). Apply /pb-preamble thinking to challenge assumptions about what counts as “good enough” maintenance.

Resource Hint: sonnet - maintenance planning and automation patterns


When to Use This Command

  • New production deployment - Establish maintenance patterns from day one
  • After incidents - Add maintenance tasks that would have prevented the incident
  • Quarterly reviews - Audit and update maintenance schedules
  • Capacity planning - Maintenance is part of resource planning
  • Onboarding - Help new team members understand operational patterns

Quick Reference

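| Tier    | Frequency  | Focus                                     |
|---------|------------|-------------------------------------------|
| Daily   | Every day  | Logs, backups, health checks              |
| Weekly  | Once/week  | Database stats, security updates, reports |
| Monthly | Once/month | Deep cleans, cert audits, DR tests        |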

Philosophy

Production systems accumulate entropy:

  • Databases bloat with dead data
  • Disks fill with logs and artifacts
  • Certificates expire silently
  • Dependencies develop vulnerabilities
  • Backups rot without verification

This playbook provides thinking triggers, not prescriptions. Every project has different needs - use these patterns to ask the right questions about your system.


Core Questions

Before implementing maintenance, answer:

  1. What accumulates? (logs, dead tuples, orphan records, temp files)
  2. What expires? (certificates, tokens, cache entries, sessions)
  3. What drifts? (config, dependencies, schema, data integrity)
  4. What breaks silently? (backups, health checks, alerting itself)

Maintenance Tiers

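| Tier    | Frequency  | Purpose              | Questions to Ask                                 |
|---------|------------|----------------------|--------------------------------------------------|
| Daily   | Every day  | Prevent accumulation | What grows unbounded? What needs rotation?       |
| Weekly  | Once/week  | Catch drift          | What statistics go stale? What reports matter?   |
| Monthly | Once/month | Deep clean           | What requires downtime? What needs verification? |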

Principle: Automate aggressively, monitor passively, intervene rarely.


Database Maintenance

Questions to Ask

  • Does your database have automatic maintenance (autovacuum, etc.)?
  • Is automatic maintenance sufficient, or does your write pattern need manual intervention?
  • How do you detect bloat before it causes problems?
  • What’s your index maintenance strategy?

PostgreSQL Patterns

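| Task           | Purpose                                                                                       | When to Consider                   |
|----------------|-----------------------------------------------------------------------------------------------|------------------------------------|
| VACUUM ANALYZE | Mark dead tuples reusable, update planner statistics                                          | High-write tables, weekly minimum  |
| VACUUM FULL    | Rewrite the table to reclaim disk space (takes an ACCESS EXCLUSIVE lock, blocking reads and writes) | Significant bloat, monthly or less |
| REINDEX        | Rebuild bloated indexes                                                                       | After bulk deletes, schema changes |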

Bloat detection trigger:

-- Adapt this query to your tables
-- dead_pct is the share of dead tuples among all tuples in the table
SELECT relname, n_dead_tup, n_live_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Ask: Which tables in your system have the highest write churn?
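One way to turn a bloat query into an actual trigger is to filter its output against a threshold. The sketch below is an assumption-laden example, not a prescribed tool: the function name `flag_bloated` and the 20% default are hypothetical, and it expects rows shaped like `relname|n_dead_tup|n_live_tup`, which is what `psql -At` produces for a three-column query.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "relname|n_dead_tup|n_live_tup" rows on stdin
# and prints each table whose dead tuples exceed the given percentage of
# all tuples (default 20, an example value only).
flag_bloated() {
    awk -F'|' -v limit="${1:-20}" '
        ($2 + $3) > 0 {
            pct = 100 * $2 / ($2 + $3)
            if (pct > limit) printf "%s %.1f\n", $1, pct
        }'
}
```

A possible invocation, assuming connection settings come from the environment: `psql -At -c "SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables" | flag_bloated 20`. Empty output means nothing crossed the threshold.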

Other Databases

  • MySQL: OPTIMIZE TABLE, ANALYZE TABLE, binary log purging
  • MongoDB: compact, index rebuilds, oplog sizing
  • Redis: Memory monitoring, key expiration policies
  • SQLite: VACUUM, ANALYZE

Ask: What’s the equivalent maintenance for your database?


Backup Strategy

See /pb-dr for comprehensive backup strategy (3-2-1 rule, retention policies, verification procedures).

Key question: When did you last verify a backup by restoring it? If the answer isn’t recent, schedule a restore test now.
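Between restore tests, a cheap freshness check can at least catch backups that are missing, empty, or stale. This is a minimal sketch under assumptions: `backup_is_fresh` is a hypothetical name, the 24-hour default is an example, and passing this check says nothing about whether the backup can actually be restored.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the backup file exists, is non-empty,
# and was modified within the last max_age_min minutes (default 1440 = 24h).
backup_is_fresh() {
    local file="$1" max_age_min="${2:-1440}"
    [ -s "$file" ] || return 1          # missing or zero-byte backup
    # find prints the path only when the mtime is newer than the cutoff
    [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]
}
```

It is meant to complement a scheduled restore test, not replace it: a fresh, non-empty file can still be corrupt.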


Health Monitoring

Questions to Ask

  • What’s the minimum check that proves the system works end-to-end?
  • What dependencies can fail silently?
  • How do you know if monitoring itself is broken?

Health Check Dimensions

| Dimension      | What to Check                                |
|----------------|----------------------------------------------|
| Service health | HTTP endpoints, process status               |
| Dependencies   | Database connections, cache, queues          |
| Resources      | Disk, memory, connections, file descriptors  |
| Certificates   | SSL expiry, API key rotation                 |
| Data integrity | Expected counts, orphan records              |

Pattern: Health checks should be cheap, fast, and actionable.

Ask: If this health check fails, what would you do about it?
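To keep checks cheap and their results actionable, each check can be reduced to one command and one line of output. The runner below is a sketch; `run_check` and `FAILED_CHECKS` are hypothetical names, and the commands you pass in are up to your system.

```bash
#!/usr/bin/env bash
# Hypothetical runner: executes one check command and prints a single
# OK/FAIL line. FAILED_CHECKS counts failures across calls.
# Usage: run_check <name> <command> [args...]
FAILED_CHECKS=0

run_check() {
    local name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $name"
    else
        echo "FAIL $name"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}
```

For example, `run_check api curl -sf --max-time 5 "$HEALTH_URL"` or `run_check db pg_isready -q`, followed by an alert when `FAILED_CHECKS` is non-zero. Calling `run_check` inside `$(...)` runs it in a subshell, so the counter would not carry over in that case.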


Resource Monitoring

Questions to Ask

  • What resources can be exhausted?
  • What are the warning thresholds vs. critical thresholds?
  • Who gets alerted, and can they act on it?

Common Resources

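| Resource         | Warning Sign      | Question                                   |
|------------------|-------------------|--------------------------------------------|
| Disk             | >70% full         | What’s growing? Logs? Data? Uploads?       |
| Memory           | Sustained >85%    | Memory leak? Undersized? Cache unbounded?  |
| Connections      | >70% of pool      | Connection leak? Pool too small?           |
| File descriptors | Approaching limit | Too many open files? Socket leak?          |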

Ask: What’s the first resource that will run out in your system?
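The warning-versus-critical distinction can be expressed as a small classifier. This is a sketch with hypothetical function names; the 70 and 90 defaults are example thresholds to tune per system, and `disk_used_pct` assumes the filesystem name in `df -P` output contains no spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the 70/90 defaults are example values, not advice.

# Print the used-space percentage (integer, no % sign) for the filesystem
# holding the given path, based on POSIX-format `df -P` output.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a percentage to ok / warning / critical.
# A value must be strictly above a threshold to cross it.
classify_usage() {
    local pct="$1" warn="${2:-70}" crit="${3:-90}"
    if [ "$pct" -gt "$crit" ]; then
        echo critical
    elif [ "$pct" -gt "$warn" ]; then
        echo warning
    else
        echo ok
    fi
}
```

Used together: `classify_usage "$(disk_used_pct /var)"`. Routing `warning` to a report and `critical` to a page is one way to keep the two levels distinct.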


Security Hygiene

Questions to Ask

  • When was the last security update applied?
  • What’s your certificate renewal process?
  • How do you detect unauthorized access attempts?
  • What secrets need rotation, and when?

Maintenance Dimensions

| Frequency | Focus                                            |
|-----------|--------------------------------------------------|
| Daily     | Failed login monitoring, intrusion detection     |
| Weekly    | Security update check, audit log review          |
| Monthly   | Dependency vulnerability scan, certificate audit |
| Quarterly | Access review, secret rotation                   |

Ask: What would an attacker target first in your system?
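For the certificate audit, `openssl x509 -checkend` can report whether a certificate file will still be valid a given number of seconds from now. The wrapper below is a sketch: `cert_valid_for_days` is a hypothetical name, the 30-day default is an example, and it only covers PEM certificate files on disk, not certificates served by remote endpoints.

```bash
#!/usr/bin/env bash
# Hypothetical helper around `openssl x509 -checkend`.
# Returns 0 if the certificate will still be valid <days> from now,
#         2 if the file cannot be read,
#         another non-zero status if it expires within <days>, has already
#         expired, or openssl cannot parse it.
cert_valid_for_days() {
    local cert="$1" days="${2:-30}"
    [ -r "$cert" ] || return 2
    openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null 2>&1
}
```

A daily job could call this for each certificate path and alert on any non-zero result, which gives a notice period instead of finding out from users.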


Post-Migration Verification

Critical pattern: After any migration, verify that:

  1. Database records match reality - Rows exist, counts are correct
  2. Generated artifacts exist - Files tracked in DB actually exist on disk
  3. Volumes are mounted correctly - Containers can access expected paths
  4. External dependencies are reachable - APIs, services, storage
  5. Background jobs can run - Workers have access to everything they need

Common trap: Database migrated, but files/volumes weren’t. System looks healthy until something tries to access the missing files.

Ask: What in your system exists both in the database AND on the filesystem? Are both migrated?
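The database-versus-filesystem check can be sketched as a comparison between paths the database knows about and what exists on disk. Everything named here is an assumption: `report_missing_files` is hypothetical, and how you export the path list (one relative path per line) depends on your schema.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads relative paths on stdin (for example, exported
# from the table that tracks generated files) and prints each one that does
# not exist under base_dir. Returns 1 if anything is missing, 0 otherwise.
report_missing_files() {
    local base_dir="$1" missing=0 rel
    while IFS= read -r rel; do
        [ -n "$rel" ] || continue       # skip blank lines
        if [ ! -e "$base_dir/$rel" ]; then
            echo "MISSING $rel"
            missing=1
        fi
    done
    return "$missing"
}
```

Running this right after a migration, before traffic is switched over, surfaces the "database migrated, files were not" case while it is still cheap to fix.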


Alerting Strategy

Questions to Ask

  • Is this alert actionable at 3 AM?
  • What’s the difference between “needs attention” and “wake someone up”?
  • How do you prevent alert fatigue?
  • How do you know if alerting is broken?

Alert Quality Checklist

  • Alert has clear remediation steps
  • Alert fires only when action is needed
  • Alert includes enough context to diagnose
  • Someone is responsible for responding

Pattern: If an alert fires and you snooze it, the alert is wrong.

Ask: How many alerts fired last week that required no action?


Reporting

Questions to Ask

  • What trends matter for capacity planning?
  • What would you want to know before a Monday morning?
  • What metrics indicate system health vs. business health?

Weekly Report Triggers

Consider including:

  • Resource utilization trends (not just current values)
  • Backup status and age
  • Security summary (failed attempts, updates pending)
  • Anything that changed unexpectedly

Ask: What would have prevented your last incident if you’d known it sooner?


Automation Principles

Script Structure Pattern

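#!/bin/bash
# Exit on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Configuration
APP_DIR="/opt/myapp"
LOG_FILE="/var/log/maintenance.log"
WEBHOOK_URL="${WEBHOOK_URL:-}"   # alert webhook, expected from the environment
HEALTH_URL="${HEALTH_URL:-}"     # health endpoint, expected from the environment

# Utility functions
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }
# Alert delivery is best-effort: a failed webhook call must not abort the run
alert() { log "ALERT: $1"; curl -sS -X POST "$WEBHOOK_URL" --data-urlencode "text=$1" >/dev/null 2>&1 || true; }

# Task functions (idempotent, can run multiple times safely)
# "..." marks arguments and logic to fill in for your system
task_backup() { log "Running backup"; pg_dump ...; }
task_health_check() { log "Health check"; curl -sf "$HEALTH_URL" || alert "Health check failed"; }
task_vacuum() { log "Running vacuum"; psql -c "VACUUM ANALYZE;" ...; }
task_report() { log "Generating report"; ...; }

# Main dispatch
case "${1:-daily}" in
    daily)  task_backup; task_health_check ;;
    weekly) task_vacuum; task_report ;;
    *)      echo "Usage: $0 {daily|weekly}" >&2; exit 1 ;;
esac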

Principles

  • Idempotent: Safe to run multiple times
  • Logged: Know when it ran and what happened
  • Alerting: Fail loudly, not silently
  • Documented: Future you will forget why

Ask: Can you run this script twice safely?
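Running twice safely also covers the case where two runs overlap, for example when a slow job is still going as the scheduler fires again. A minimal sketch of a guard, assuming a lock directory is acceptable: `mkdir` either creates the directory or fails, so no extra tooling is needed. The function names and the lock path in the comments are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: a directory used as a lock. mkdir fails if the
# directory already exists, so only one caller can acquire it.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

release_lock() {
    rmdir "$1" 2>/dev/null
}

# Intended use inside a maintenance script (path is an example):
#   LOCK_DIR="/tmp/maintenance.lock"
#   acquire_lock "$LOCK_DIR" || { echo "already running" >&2; exit 0; }
#   trap 'release_lock "$LOCK_DIR"' EXIT
```

One trade-off to be aware of: if the script is killed without running its exit trap, the lock directory stays behind and has to be removed by hand or by a staleness check.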


Cron Scheduling

Pattern

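| Time                  | Task                | Rationale           |
|-----------------------|---------------------|---------------------|
| Low traffic window    | Daily maintenance   | Minimize impact     |
| After daily completes | Weekly maintenance  | Build on daily      |
| After weekly pattern  | Monthly maintenance | Least frequent last |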

Checklist

  • Absolute paths (cron has minimal PATH)
  • Output redirected to logs
  • Wrapper scripts for complex jobs
  • Tested manually before scheduling

Ask: What happens if the cron job fails silently?
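One answer to the silent-failure question is a wrapper that always leaves evidence in a log. The sketch below uses a hypothetical `run_logged` function: it records a start line, captures the command's output, and writes an explicit OK or FAILED line with the exit status.

```bash
#!/usr/bin/env bash
# Hypothetical cron wrapper: runs a command, appends its output to a log,
# and records an explicit OK/FAILED line so a broken job leaves evidence.
# Usage: run_logged <log_file> <command> [args...]
run_logged() {
    local log_file="$1" status=0
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] START $*" >> "$log_file"
    "$@" >> "$log_file" 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] OK $*" >> "$log_file"
    else
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] FAILED (exit $status) $*" >> "$log_file"
    fi
    return "$status"
}
```

A crontab entry would then call the wrapper script by absolute path, for instance `0 3 * * * /opt/myapp/bin/maintenance.sh daily`, where the schedule and script path are assumptions. A separate check that the log contains a recent OK line covers the case where cron never ran the job at all.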


Getting Started Checklist

Use this to audit your current maintenance:

  • Database: Do you have scheduled maintenance? Is it sufficient?
  • Backups: When did you last test a restore?
  • Health: What’s your minimum end-to-end health check?
  • Resources: What will run out first? How will you know?
  • Security: When was the last security update?
  • Certificates: When do they expire? Who gets notified?
  • Alerts: Are they actionable? Is there fatigue?
  • Reports: What trends should you be watching?

Red Flags

Signs your maintenance needs attention:

  • “We’ll deal with it when it becomes a problem”
  • “The backup runs, but we’ve never tested restore”
  • “Alerts fire so often we ignore them”
  • “Disk filled up and we had to emergency clean”
  • “We found out the certificate expired from users”
  • “After migration, we discovered files were missing”

Summary

Maintenance is prevention. The goal isn’t to have impressive automation - it’s to avoid 3 AM incidents.

Ask yourself:

  1. What can fail silently in my system?
  2. What would I want to know before it becomes urgent?
  3. What did the last incident teach me about what to maintain?

Then automate the answers.


Related Commands

  • /pb-observability - Monitoring detects; maintenance prevents
  • /pb-sre-practices - Toil reduction and operational health
  • /pb-incident - Good maintenance reduces incident frequency
  • /pb-dr - Disaster recovery (backups are foundation)
  • /pb-server-hygiene - Periodic server health and hygiene review

Good maintenance is invisible. You only notice its absence.