
Server Hygiene

Periodic health and hygiene review for servers and VPS instances. A calm, repeatable ritual for detecting drift, bloat, and silent degradation before they become incidents.

Mindset: Server hygiene embodies /pb-design-rules thinking: Robustness (catch degradation before failure), Transparency (make server state visible and explainable), and Simplicity (predictable cleanups beat clever automation). Apply /pb-preamble thinking to challenge assumptions about what’s “probably fine.”

Resource Hint: sonnet (procedural, well-defined scope)

This is not firefighting. This is the periodic physical exam that prevents the emergency room visit.


When to Use This Command

  • Monthly hygiene pass - Routine review of a running server
  • Quarterly full audit - Deep drift analysis and capacity planning
  • After a period of neglect - Server hasn’t been reviewed in months
  • Before scaling or migration - Understand current state before changes
  • Post-incident verification - Confirm the server is clean after recovery
  • Onboarding to an inherited server - Build a mental model of what’s running

Quick Reference

| Cadence | Scope | Time |
|---|---|---|
| Weekly | Glance: disk, errors, failed jobs | 5 min |
| Monthly | Hygiene: logs, images, packages, access | 30 min |
| Quarterly | Full: drift analysis, capacity, backup test | 1-2 hrs |
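If the cadence lives only in someone's head, it will slip. A crontab sketch pinning the weekly and monthly passes (the script paths are placeholders; quarterly audits are easier to book on a calendar than in cron):

```
# m h dom mon dow  command
0 9 * * 1   /usr/local/bin/hygiene-weekly.sh    # Monday morning glance
0 9 1 * *   /usr/local/bin/hygiene-monthly.sh   # first of the month
```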

Execution Flow

Phase 1: SNAPSHOT ──► Phase 2: HEALTH ──► Phase 3: DRIFT ──► Phase 4: CLEANUP ──► Phase 5: READINESS
  (inventory)         (signals)           (bloat detection)   (safe actions)       (future-proof)
       └── Weekly: phases 2-3 only ──┘
       └── Monthly: phases 1-4 ───────────────────────────┘
       └── Quarterly: all phases ──────────────────────────────────────────────────────────────────┘

Phase 1: Snapshot Reality

Goal: know exactly what the server is today. If you can’t explain the server in 5 minutes, it’s already drifting.

Server Inventory

```shell
# System identity
hostname && uname -a
head -4 /etc/os-release
uptime

# Resources
nproc && free -h && df -h
```

| Item | Command | What to Record |
|---|---|---|
| OS and kernel | `uname -a`, `cat /etc/os-release` | Version, last update date |
| CPU, RAM, disk | `nproc`, `free -h`, `df -h` | Limits and current usage |
| Uptime | `uptime` | Last reboot, load average |
| Users | `grep -v nologin /etc/passwd` | Who has shell access |
| SSH keys | `ls /home/*/.ssh/authorized_keys` | Which keys are present |
| Open ports | `ss -tlnp` | What’s listening, on which interfaces |
| Running services | `systemctl list-units --type=service --state=running` | Active services |
| Containers | `docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'` | Running containers |
| Cron jobs | `crontab -l`, `ls /etc/cron.d/` | Scheduled tasks |
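The inventory commands above can be wrapped into a single snapshot script that writes a dated manifest file. A minimal sketch (the output filename is illustrative; commands that may be absent on minimal systems have their errors suppressed):

```shell
#!/bin/sh
# snapshot.sh - collect a minimal server inventory into a dated manifest.
OUT="manifest-$(date +%F).txt"
{
  echo "## Identity"
  hostname
  uname -a
  uptime
  echo "## Resources"
  nproc
  free -h 2>/dev/null
  df -h
  echo "## Listening ports"
  ss -tlnp 2>/dev/null
  echo "## Running services"
  systemctl list-units --type=service --state=running --no-pager 2>/dev/null
} > "$OUT"
echo "Wrote $OUT"
```

Run it at every review and keep the files: diffing two manifests is the cheapest drift detector there is.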

Application Footprint

| Item | What to Check |
|---|---|
| Deployed apps | Versions, last deploy date |
| Active vs abandoned | Is everything running actually needed? |
| Deployment method | systemd, Docker, PM2, bare process |
| Runtime versions | node, go, python, java - are they current? |

Configuration Sources

| Item | What to Check |
|---|---|
| Environment variables | Where are they defined? (systemd, .env, shell profile) |
| Secrets location | Env files, vaults, or plaintext? |
| Reverse proxy | nginx, caddy, traefik - which sites are configured? |
| TLS certificates | Source (Let’s Encrypt, manual), renewal status, expiry date |

Deliverable: A short server manifest. Write it down - even a few bullet points in a markdown file beats nothing.


Phase 2: Health Signals

Goal: detect slow degradation before users feel it.

Look at trends, not just current values. A server at 60% disk today that was at 40% last month is a problem. Compare with your previous server manifest - if you don’t have one, record today’s numbers. That’s where trends start.

```shell
# Disk usage by mount
df -h

# Largest directories
du -sh /* 2>/dev/null | sort -hr | head -10

# Memory with swap
free -h

# CPU load (1, 5, 15 min averages)
uptime

# Disk IO wait (if iostat available)
iostat -x 1 3 2>/dev/null
```

Thresholds:

| Resource | Healthy | Warning | Critical |
|---|---|---|---|
| Disk | < 70% | 70-85% | > 85% |
| Memory | < 80% | 80-90% | > 90% or swapping |
| CPU load | < cores | 1-2x cores | > 2x cores sustained |
| Swap | None | Any active | Growing over time |
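The disk row of that table turns into a one-screen check with awk; the sketch below flags any mount past the warning or critical line (thresholds mirror the table's defaults, tune for your environment):

```shell
#!/bin/sh
# Flag mounts above the warning/critical disk thresholds.
WARN=70
CRIT=85
df -hP | awk -v warn="$WARN" -v crit="$CRIT" 'NR > 1 {
  use = $5; sub(/%/, "", use)                 # strip the % sign
  level = "ok"
  if (use + 0 >= crit)      level = "CRITICAL"
  else if (use + 0 >= warn) level = "warning"
  printf "%-9s %3d%%  %s\n", level, use, $6
}'
```

The `-P` flag keeps `df` output on one line per mount so the field positions are stable.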

Process Health

```shell
# Long-running processes sorted by memory
ps aux --sort=-%mem | head -15

# Zombie processes
ps aux | awk '$8 ~ /Z/ {print}'

# Failed systemd units
systemctl --failed

# OOM killer history
dmesg | grep -i "out of memory" | tail -5
journalctl -k | grep -i "oom" | tail -5
```

Ask: Is anything slowly leaking memory? Are there zombie processes? Has the OOM killer fired recently?

Application Health

| Signal | How to Check | Red Flag |
|---|---|---|
| Error rates | `journalctl -u <service> --since "1 hour ago" \| grep -i error \| wc -l` | Increasing trend |
| Restart loops | `systemctl show <service> -p NRestarts` | Count > 0 unexpectedly |
| Queue backlog | Application-specific | Growing, not draining |
| DB connections | `ss -tnp \| grep 5432 \| wc -l` | Approaching pool limit |

System Health

```shell
# Kernel warnings
dmesg --level=err,warn | tail -10

# Time sync
timedatectl status | grep "synchronized"

# Pending security updates (Debian/Ubuntu)
apt list --upgradable 2>/dev/null | grep -i security
```

Rule of thumb: If something spikes periodically, find out why. If something slowly rises, that’s a leak or accumulation.


Phase 3: Drift and Bloat Detection

This is where most server rot happens. Things quietly accumulate until one day the disk is full or a forgotten service gets exploited.

Disk Bloat

```shell
# Log sizes
du -sh /var/log/ /var/log/journal/

# Docker waste
docker system df
docker images -f "dangling=true" -q | wc -l
docker volume ls -f "dangling=true" -q | wc -l

# Old build artifacts, temp files, core dumps
find /tmp -type f -mtime +30 | head -20
find / -name "core" -type f 2>/dev/null | head -5
```

| Bloat Source | Where to Look |
|---|---|
| Logs without rotation | `/var/log/`, application log directories |
| Old log archives | `.gz` files never cleaned |
| Docker images and volumes | `docker system df` |
| Build artifacts | `/tmp`, project build directories |
| Core dumps | `/`, `/var/crash/` |
| Package manager cache | `apt clean`, `yum clean all` |

Service Bloat

| Check | Command | Red Flag |
|---|---|---|
| Enabled but unused services | `systemctl list-unit-files --state=enabled` | Services you don’t recognize |
| Stale reverse proxy configs | `ls /etc/nginx/sites-enabled/` | Sites for apps no longer running |
| Unused firewall rules | `ufw status` or `iptables -L` | Rules for decommissioned services |
| Stale cron jobs | `crontab -l` | Jobs for things that moved or stopped |
| Orphaned containers | `docker ps -a --filter status=exited` | Exited containers piling up |
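Stale sites-enabled entries are often dangling symlinks left behind when the underlying config file was deleted. A quick check (the Debian-style nginx path is an assumption; pass your own directory as the first argument):

```shell
#!/bin/sh
# List broken symlinks in a reverse-proxy sites-enabled directory.
# -xtype l matches symlinks whose target no longer exists (GNU find).
DIR="${1:-/etc/nginx/sites-enabled}"
if [ -d "$DIR" ]; then
  find "$DIR" -xtype l -print
else
  echo "no such directory: $DIR"
fi
```

Any path it prints is a config entry pointing at nothing: a candidate for the "safe to remove now" list.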

Config Drift

  • Hand-edited config files with no source of truth
  • Inconsistent environment variables across applications
  • One-off fixes never documented (“I’ll remember why I changed this”)
  • Secrets duplicated in multiple places

Ask: Could you rebuild this server’s configuration from version control alone? If not, what’s missing?
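One lightweight answer is to put the config directory under git, so drift shows up as a diff instead of a mystery. A sketch (the demo directory is a scratch path; on a real server point it at something like `/etc/nginx`, or use etckeeper, which does this properly for all of `/etc`):

```shell
#!/bin/sh
# Snapshot a config directory into a local git repo so every hand edit
# becomes a visible commit instead of silent drift.
CONF_DIR="${1:-${TMPDIR:-/tmp}/conf-demo}"             # demo default
mkdir -p "$CONF_DIR"
echo "server { listen 80; }" > "$CONF_DIR/site.conf"   # stand-in for a real config
cd "$CONF_DIR" || exit 1
[ -d .git ] || git init -q
git add -A
git diff --cached --quiet || git -c user.name=hygiene -c user.email=hygiene@localhost \
    commit -q -m "config snapshot $(date +%F)"
git log --oneline | head -3
```

Run it from cron after each hygiene pass and `git log` becomes the answer to "who changed this and when."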

Security Drift

```shell
# Users with shell access
grep -v "nologin\|false" /etc/passwd

# SSH keys - do you recognize all of them?
for user_home in /home/*/; do
  [ -f "$user_home.ssh/authorized_keys" ] && echo "=== $(basename $user_home) ===" && cat "$user_home.ssh/authorized_keys"
done

# Packages not updated recently
apt list --upgradable 2>/dev/null | wc -l

# TLS certificate expiry
openssl s_client -connect localhost:443 -servername $(hostname) </dev/null 2>/dev/null | openssl x509 -noout -dates
```

| Drift Type | What to Check |
|---|---|
| Unused SSH keys | Keys for people who no longer need access |
| Stale users | Accounts that should have been removed |
| Overly permissive firewall | Rules broader than necessary |
| Outdated TLS | Weak ciphers, approaching expiry |
| Unpatched packages | Security updates pending for weeks |
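Expiry is easier to act on as days remaining than as a raw date. A helper built on the same openssl calls (the demo generates a throwaway self-signed cert so it runs anywhere; on a real server point it at your live chain, e.g. a Let's Encrypt `fullchain.pem`, and note `date -d` is GNU date):

```shell
#!/bin/sh
# Days until a certificate file expires; warn below a threshold.
cert_days_left() {   # usage: cert_days_left <pem-file> [threshold-days]
  end=$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)
  end_epoch=$(date -d "$end" +%s)                       # GNU date parsing
  days=$(( (end_epoch - $(date +%s)) / 86400 ))
  if [ "$days" -lt "${2:-21}" ]; then
    echo "WARNING: $1 expires in $days days"
  else
    echo "OK: $1 valid for $days more days"
  fi
}

# Demo with a throwaway self-signed cert (real usage: your live chain).
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout /tmp/demo.key -out /tmp/demo.pem -subj "/CN=demo" 2>/dev/null
cert_days_left /tmp/demo.pem
```

Wire the WARNING branch into whatever alerting you already have; a cert that expires silently is a self-inflicted outage.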

Deliverable: Two lists: “safe to remove now” and “needs planning before removal.”


Phase 4: Hygiene Actions

Golden rule: no “clever” changes during hygiene. Predictable beats smart. Only safe, reversible actions during routine reviews.

Safe Cleanups

Inspect before acting. Review output, then confirm.

```shell
# Rotate and prune journal logs
journalctl --vacuum-time=30d
journalctl --vacuum-size=500M

# Show removable packages, then clean
apt --dry-run autoremove
apt autoremove && apt clean

# Review what Docker holds (images, containers, build cache), then prune.
# Note: docker system prune has no --dry-run flag; it prompts for
# confirmation instead, so review `docker system df -v` output first.
docker system df -v
docker system prune
```

Requires judgment - these can destroy data belonging to containers or jobs that are only temporarily stopped:

```shell
# Review temp files before deleting
find /tmp -type f -mtime +30 | head -20
# Only delete after reviewing: find /tmp -type f -mtime +30 -delete

# List unused volumes - verify none belong to stopped services you intend to restart
docker volume ls -f "dangling=true"
# Only prune after reviewing: docker volume prune
```

Stability Improvements

| Action | Why |
|---|---|
| Add log rotation where missing | Prevent disk exhaustion from logs |
| Set resource limits on containers | Prevent one service from starving others |
| Add health checks to services | Detect failures before users report them |
| Configure restart policies | `RestartSec=5`, `Restart=on-failure` for systemd |
| Document non-obvious decisions | Future you will forget why that cron job exists |
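The restart-policy row can be applied without editing the main unit file by using a systemd drop-in. A sketch ("myapp" is a placeholder service name; the drop-in is written to a temp dir here so the sketch is safe to run anywhere, while on a real server the directory would be `/etc/systemd/system/myapp.service.d`):

```shell
#!/bin/sh
# Write a restart-policy drop-in for a service without touching its unit file.
# DROPIN_DIR defaults to a scratch path for the demo; on a real server use
# /etc/systemd/system/myapp.service.d ("myapp" is a placeholder name).
DROPIN_DIR="${DROPIN_DIR:-${TMPDIR:-/tmp}/myapp.service.d}"
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
echo "wrote $DROPIN_DIR/restart.conf"
# Then, on the real server: systemctl daemon-reload && systemctl restart myapp
```

Drop-ins survive package upgrades that replace the vendor unit file, which is exactly the kind of silent drift this phase exists to prevent.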

Performance Tuning

Only if measurements justify it. Don’t tune what you haven’t measured.

| Area | Action | Prerequisite |
|---|---|---|
| Worker counts | Adjust based on CPU cores | Know current CPU utilization |
| DB connections | Tune pool size | Know current connection count vs limit |
| Compression | Enable gzip/brotli in reverse proxy | Verify CPU headroom |
| Unnecessary background jobs | Remove or reduce frequency | Know what each job does |

Phase 5: Future Readiness

This is where the ritual pays off long-term.

Backup Verification

The question is not “do you have backups” but “can you restore them.”

| Check | Status |
|---|---|
| What is backed up? | Data, config, secrets, or all three? |
| Backup frequency | Matches your acceptable data loss? |
| Last restore test | If “never,” schedule one now |
| Off-server storage | Backups on the same VPS are not backups |
| Retention and cost | How far back can you go? What does it cost? |
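A restore test does not need to be elaborate to be real: unpack the latest archive somewhere disposable and check that a file you depend on actually comes back. The sketch below drills the mechanics against a toy archive it builds itself; on a real server, point `BACKUP` at your actual archive and `SENTINEL` at a file it must contain:

```shell
#!/bin/sh
# Minimal restore drill: extract a backup into a scratch dir and verify
# a sentinel file survived the round trip. The archive built below is a
# self-contained stand-in; substitute your real backup and sentinel.
WORK="${TMPDIR:-/tmp}/restore-drill"
mkdir -p "$WORK/data"
echo "listen 8080" > "$WORK/data/app.conf"
tar -czf "$WORK/backup.tar.gz" -C "$WORK" data   # stand-in for the real backup

BACKUP="$WORK/backup.tar.gz"
SENTINEL="data/app.conf"
SCRATCH=$(mktemp -d)
tar -xzf "$BACKUP" -C "$SCRATCH"
if [ -s "$SCRATCH/$SENTINEL" ]; then
  echo "restore OK: $SENTINEL present and non-empty"
else
  echo "restore FAILED: $SENTINEL missing or empty"
fi
rm -rf "$SCRATCH"
```

Record the date of the last successful drill in the server manifest; "last restore test: never" is the single most common gap this phase finds.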

For comprehensive backup and recovery planning, see /pb-dr.

Monitoring Coverage

  • Resource metrics (CPU, RAM, disk) - collected and retained
  • Application error rates - visible and trended
  • Uptime checks - external, not self-reported
  • Log visibility - searchable, not just stored
  • Alerts - fire when needed, reach someone who can act

For monitoring design guidance, see /pb-observability.

Scaling Headroom

  • Current capacity: How much headroom before hitting limits?
  • First bottleneck: What resource runs out first?
  • Single points of failure: What has no redundancy?
  • Growth trajectory: At current growth rate, when do you hit limits?

Disaster Questions

Answer honestly:

  1. How long to rebuild this server from scratch?
  2. What steps are manual vs automated?
  3. What secrets would block recovery if lost?
  4. Who else knows how this server works?

If rebuild takes more than a few hours, the system is fragile. See /pb-dr for disaster recovery planning.


Server Manifest Template

Maintain a living document per server. Even a few lines beats nothing.

```markdown
# Server: [hostname]

**Provider:** [e.g., DigitalOcean, Hetzner, AWS]
**Size:** [CPU, RAM, disk]
**OS:** [distro and version]
**Last review:** [date]

## Services Running
- [service 1] - [purpose] - [deployment method]
- [service 2] - [purpose] - [deployment method]

## Access
- SSH: [who has keys]
- Firewall: [ports open]

## Backups
- [what, where, how often, last tested]

## Known Issues
- [things to watch or fix next time]
```

Quick Commands

| Action | Command |
|---|---|
| Largest directories | `du -sh /* 2>/dev/null \| sort -hr \| head -10` |
| Open ports | `ss -tlnp` |
| Running services | `systemctl list-units --type=service --state=running` |
| Failed services | `systemctl --failed` |
| Docker waste | `docker system df` |
| Journal cleanup | `journalctl --vacuum-time=30d` |
| Security updates | `apt list --upgradable 2>/dev/null` |
| TLS expiry | `openssl s_client -connect localhost:443 </dev/null 2>/dev/null \| openssl x509 -noout -dates` |
| OOM history | `dmesg \| grep -i "out of memory"` |

Red Flags

Signs the server needs a hygiene pass now:

  • “We’ll deal with it when it becomes a problem”
  • Deploys are getting slower with no code changes
  • Memory usage “mysteriously” grows between deploys
  • Nobody knows what’s safe to delete
  • A restart broke something that was working
  • Last backup test was “never”

Related Commands

  • /pb-maintenance - Strategic maintenance patterns and thinking triggers
  • /pb-hardening - Initial server security setup (run before first deploy)
  • /pb-dr - Disaster recovery planning and testing
  • /pb-sre-practices - Toil reduction, error budgets, operational culture
  • /pb-observability - Monitoring and alerting design

Last Updated: 2026-02-07 Version: 1.0.0


Production systems accumulate entropy. This ritual is how you pay down the interest before it compounds.