Production Maintenance

Establish systematic maintenance patterns to prevent production incidents. This playbook provides thinking triggers for database maintenance, backup verification, health monitoring, and alerting strategy.

Mindset: Maintenance embodies /pb-design-rules thinking: Robustness (systems fail gracefully when maintenance lapses) and Transparency (make system health visible). Apply /pb-preamble thinking to challenge assumptions about what’s “good enough” maintenance.

Resource Hint: sonnet - maintenance planning and automation patterns

When to Use This Command

New production deployment - Establish maintenance patterns from day one
After incidents - Add maintenance tasks that would have prevented the incident
Quarterly reviews - Audit and update maintenance schedules
Capacity planning - Maintenance is part of resource planning
Onboarding - Help new team members understand operational patterns

Quick Reference

Tier	Frequency	Focus
Daily	Every day	Logs, backups, health checks
Weekly	Once/week	Database stats, security updates, reports
Monthly	Once/month	Deep cleans, cert audits, DR tests

Philosophy

Production systems accumulate entropy:

Databases bloat with dead data
Disks fill with logs and artifacts
Certificates expire silently
Dependencies develop vulnerabilities
Backups rot without verification

This playbook provides thinking triggers, not prescriptions. Every project has different needs - use these patterns to ask the right questions about your system.

Core Questions

Before implementing maintenance, answer:

What accumulates? (logs, dead tuples, orphan records, temp files)
What expires? (certificates, tokens, cache entries, sessions)
What drifts? (config, dependencies, schema, data integrity)
What breaks silently? (backups, health checks, alerting itself)

Maintenance Tiers

Tier	Frequency	Purpose	Questions to Ask
Daily	Every day	Prevent accumulation	What grows unbounded? What needs rotation?
Weekly	Once/week	Catch drift	What statistics go stale? What reports matter?
Monthly	Once/month	Deep clean	What requires downtime? What needs verification?

Principle: Automate aggressively, monitor passively, intervene rarely.

Database Maintenance

Questions to Ask

Does your database have automatic maintenance (autovacuum, etc.)?
Is automatic maintenance sufficient, or does your write pattern need manual intervention?
How do you detect bloat before it causes problems?
What’s your index maintenance strategy?

PostgreSQL Patterns

Task	Purpose	When to Consider
`VACUUM ANALYZE`	Mark dead tuples reusable, update stats	High-write tables, weekly minimum
`VACUUM FULL`	Reclaim disk space (requires lock)	Significant bloat, monthly or less
`REINDEX`	Rebuild bloated indexes	After bulk deletes, schema changes

Bloat detection trigger:

-- Adapt this query to your tables
SELECT relname, n_dead_tup, n_live_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Ask: Which tables in your system have the highest write churn?

Other Databases

MySQL: OPTIMIZE TABLE, ANALYZE TABLE, binary log purging
MongoDB: compact, index rebuilds, oplog sizing
Redis: Memory monitoring, key expiration policies
SQLite: VACUUM, ANALYZE

Ask: What’s the equivalent maintenance for your database?

Backup Strategy

See /pb-dr for comprehensive backup strategy (3-2-1 rule, retention policies, verification procedures).

Key question: When did you last verify a backup by restoring it? If the answer isn’t recent, schedule a restore test now.

Health Monitoring

Questions to Ask

What’s the minimum check that proves the system works end-to-end?
What dependencies can fail silently?
How do you know if monitoring itself is broken?

Health Check Dimensions

Dimension	What to Check
Service health	HTTP endpoints, process status
Dependencies	Database connections, cache, queues
Resources	Disk, memory, connections, file descriptors
Certificates	SSL expiry, API key rotation
Data integrity	Expected counts, orphan records

Pattern: Health checks should be cheap, fast, and actionable.

Ask: If this health check fails, what would you do about it?

Resource Monitoring

Questions to Ask

What resources can be exhausted?
What are the warning thresholds vs. critical thresholds?
Who gets alerted, and can they act on it?

Common Resources

Resource	Warning Sign	Question
Disk	>70% full	What’s growing? Logs? Data? Uploads?
Memory	Sustained >85%	Memory leak? Undersized? Cache unbounded?
Connections	>70% of pool	Connection leak? Pool too small?
File descriptors	Approaching limit	Too many open files? Socket leak?

Ask: What’s the first resource that will run out in your system?

Security Hygiene

Questions to Ask

When was the last security update applied?
What’s your certificate renewal process?
How do you detect unauthorized access attempts?
What secrets need rotation, and when?

Maintenance Dimensions

Frequency	Focus
Daily	Failed login monitoring, intrusion detection
Weekly	Security update check, audit log review
Monthly	Dependency vulnerability scan, certificate audit
Quarterly	Access review, secret rotation

Ask: What would an attacker target first in your system?

Post-Migration Verification

Critical pattern: After any migration, verify that:

Database records match reality - Rows exist, counts are correct
Generated artifacts exist - Files tracked in DB actually exist on disk
Volumes are mounted correctly - Containers can access expected paths
External dependencies are reachable - APIs, services, storage
Background jobs can run - Workers have access to everything they need

Common trap: Database migrated, but files/volumes weren’t. System looks healthy until something tries to access the missing files.

Ask: What in your system exists both in the database AND on the filesystem? Are both migrated?

Alerting Strategy

Questions to Ask

Is this alert actionable at 3 AM?
What’s the difference between “needs attention” and “wake someone up”?
How do you prevent alert fatigue?
How do you know if alerting is broken?

Alert Quality Checklist

Alert has clear remediation steps
Alert fires only when action is needed
Alert includes enough context to diagnose
Someone is responsible for responding

Pattern: If an alert fires and you snooze it, the alert is wrong.

Ask: How many alerts fired last week that required no action?

Reporting

Questions to Ask

What trends matter for capacity planning?
What would you want to know before a Monday morning?
What metrics indicate system health vs. business health?

Weekly Report Triggers

Consider including:

Resource utilization trends (not just current values)
Backup status and age
Security summary (failed attempts, updates pending)
Anything that changed unexpectedly

Ask: What would have prevented your last incident if you’d known it sooner?

Automation Principles

Script Structure Pattern

#!/bin/bash
set -e

# Configuration
APP_DIR="/opt/myapp"
LOG_FILE="/var/log/maintenance.log"

# Utility functions
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }
alert() { log "ALERT: $1"; curl -X POST "$WEBHOOK_URL" -d "text=$1" 2>/dev/null || true; }

# Task functions (idempotent, can run multiple times safely)
task_backup() { log "Running backup"; pg_dump ... }
task_health_check() { log "Health check"; curl -sf "$HEALTH_URL" || alert "Health check failed"; }
task_vacuum() { log "Running vacuum"; psql -c "VACUUM ANALYZE;" ... }
task_report() { log "Generating report"; ... }

# Main dispatch
case "${1:-daily}" in
    daily)  task_backup; task_health_check ;;
    weekly) task_vacuum; task_report ;;
esac

Principles

Idempotent: Safe to run multiple times
Logged: Know when it ran and what happened
Alerting: Fail loudly, not silently
Documented: Future you will forget why

Ask: Can you run this script twice safely?

Cron Scheduling

Pattern

Time	Task	Rationale
Low traffic window	Daily maintenance	Minimize impact
After daily completes	Weekly maintenance	Build on daily
After weekly pattern	Monthly maintenance	Least frequent last

Checklist

Absolute paths (cron has minimal PATH)
Output redirected to logs
Wrapper scripts for complex jobs
Tested manually before scheduling

Ask: What happens if the cron job fails silently?

Getting Started Checklist

Use this to audit your current maintenance:

Database: Do you have scheduled maintenance? Is it sufficient?
Backups: When did you last test a restore?
Health: What’s your minimum end-to-end health check?
Resources: What will run out first? How will you know?
Security: When was the last security update?
Certificates: When do they expire? Who gets notified?
Alerts: Are they actionable? Is there fatigue?
Reports: What trends should you be watching?

Red Flags

Signs your maintenance needs attention:

“We’ll deal with it when it becomes a problem”
“The backup runs, but we’ve never tested restore”
“Alerts fire so often we ignore them”
“Disk filled up and we had to emergency clean”
“We found out the certificate expired from users”
“After migration, we discovered files were missing”

Summary

Maintenance is prevention. The goal isn’t to have impressive automation - it’s to avoid 3 AM incidents.

Ask yourself:

What can fail silently in my system?
What would I want to know before it becomes urgent?
What did the last incident teach me about what to maintain?

Then automate the answers.

/pb-observability - Monitoring detects; maintenance prevents
/pb-sre-practices - Toil reduction and operational health
/pb-incident - Good maintenance reduces incident frequency
/pb-dr - Disaster recovery (backups are foundation)
/pb-server-hygiene - Periodic server health and hygiene review

Good maintenance is invisible. You only notice its absence.

Keyboard shortcuts

Engineering Playbook