Observability & Monitoring Design
Build visibility into your system’s behavior: metrics, logs, and traces that help you understand what’s happening in production.
Mindset: Observability is multi-perspective understanding. You need metrics, logs, and traces: different views of the same system. This embodies /pb-preamble thinking (no single perspective is complete) and /pb-design-rules thinking (especially Transparency: design for visibility to make debugging easier).
Question your assumptions about what’s happening in production. Systems should be observable; you shouldn’t need to guess.
Resource Hint: sonnet - Observability design follows structured instrumentation patterns.
When to Use
- Designing monitoring and observability for a new service
- Diagnosing gaps in production visibility (missing metrics, logs, or traces)
- Planning instrumentation before a major deployment
Observability vs Monitoring
Monitoring (narrow):
- Check if something is working (alerts on thresholds)
- Passive: respond to alerts
- Example: “CPU is above 80%, send alert”
Observability (broad):
- Understand why it’s happening (diagnose issues)
- Active: explore and investigate
- Example: “CPU is high, let’s trace which requests caused it”
The goal: Observability → Monitoring → Alerting
The Three Pillars of Observability
1. Metrics (Numbers)
What is happening? Volume, rate, performance.
- Request count, latency, error rate
- CPU, memory, disk usage
- Database connections, queue depth
- Business metrics (user signups, transactions)
2. Logs (Events)
What happened? When? Why?
- Request logs (who, what, when)
- Error logs (what went wrong)
- Application events (user actions, state changes)
- Infrastructure events (deployments, failures)
3. Traces (Flows)
How did a request flow through the system?
- Request trace: client → web → database → cache
- Latency breakdown: 100ms total (20ms web, 60ms DB, 10ms cache)
- Failures: where did it break?
Metrics: What to Track
Request Metrics (Always)
Latency (how fast):
- P50 (median), P95, P99 latencies
- By endpoint or operation
- Alert on: P99 > 1000ms (for web API)
Example tracking:
GET /api/users: P99 = 120ms
POST /api/users: P99 = 450ms (includes email send)
GET /api/users/{id}: P99 = 80ms
Throughput (how much):
- Requests per second (RPS)
- By endpoint, status code, method
- Alert on: sudden drop (possible crash)
Example tracking:
Total RPS: 1,200/sec
GET requests: 800/sec (67%)
POST requests: 300/sec (25%)
DELETE requests: 100/sec (8%)
Error Rate (what breaks):
- 4xx errors (client issues): 1% acceptable
- 5xx errors (server issues): <0.1% target
- By endpoint, error type
- Alert on: 5xx > 0.5%
Example tracking:
GET /api/users: 0.02% 5xx (acceptable)
POST /api/users: 0.08% 5xx (elevated relative to other endpoints)
Error breakdown for POST /api/users (all error responses):
- 401 Unauthorized: 45%
- 400 Bad Request: 35%
- 500 Internal Error: 20%
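Instrumenting these request metrics is typically a few lines of code. A minimal sketch using the prometheus_client Python library; the metric names, labels, and the handle_request wrapper are illustrative assumptions rather than part of any existing service:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Request counter labeled by method, endpoint, and status code
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],
)

# Latency histogram; default buckets are fine to start, tune later
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    ["endpoint"],
)

def handle_request(method, endpoint, handler):
    """Wrap a handler so every call records latency, throughput, and errors."""
    status = 500  # assume failure unless the handler completes
    with LATENCY.labels(endpoint=endpoint).time():
        try:
            status = handler()  # handler returns an HTTP status code
            return status
        finally:
            REQUESTS.labels(method=method, endpoint=endpoint, status=status).inc()

# Expose /metrics for Prometheus to scrape (port is an assumption)
start_http_server(8000)
```

Counters and histograms like these are what the latency, throughput, and error-rate queries later in this document read from.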
Resource Metrics
CPU/Memory:
- Usage percentage (alert on >80% sustained)
- By service, pod, host
- Trending (is it growing?)
Database:
- Connection count (alert on >90% of pool)
- Query latency (P95, P99)
- Slow queries (>1s)
- Row counts (growing tables)
Disk:
- Used space (alert on >85%)
- Inode usage
- I/O operations
Business Metrics
Track what matters to business:
- Signups, active users, retention
- Revenue, transactions, conversion rate
- Error impact (transactions failed)
- Feature usage (adoption of new features)
Example:
Signups: 150/day (down 20% from week ago)
Active users: 25,000 (stable)
Failed transactions: 12 (0.03%, acceptable)
→ Investigate signup drop, not necessarily an outage
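Business metrics are usually plain counters incremented at the point where the business event happens; a brief sketch (the metric names and event hooks are assumptions for illustration):

```python
from prometheus_client import Counter

SIGNUPS = Counter("signups_total", "Completed user signups")
FAILED_TRANSACTIONS = Counter("transactions_failed_total", "Failed transactions", ["reason"])

def on_signup_completed(user_id):
    # Increment the business counter at the moment the business event happens
    SIGNUPS.inc()

def on_transaction_failed(reason):
    FAILED_TRANSACTIONS.labels(reason=reason).inc()
```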
Logging: Structured Logs
Anti-pattern: Unstructured Logs
2026-01-11 14:23:45 ERROR User login failed
2026-01-11 14:23:46 User 12345 password incorrect
2026-01-11 14:23:47 WARNING High memory usage
Problems:
- Hard to search (“which users failed to log in today?”)
- Hard to aggregate (metrics require regex parsing)
- Slow (parsing strings is expensive)
Pattern: Structured Logs (JSON)
{
"timestamp": "2026-01-11T14:23:45Z",
"level": "error",
"service": "auth-service",
"event": "user_login_failed",
"user_id": 12345,
"reason": "incorrect_password",
"attempt_number": 3,
"ip_address": "192.168.1.100",
"user_agent": "Mozilla/5.0...",
"duration_ms": 142
}
Benefits:
- Easy to search: user_login_failed AND user_id:12345
- Easy to aggregate: count by reason, by service
- Fast: structured data, not regex parsing
- Queryable: SELECT COUNT(*) WHERE level=error AND duration_ms>1000
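If you would rather not build the JSON by hand at every call site, a custom formatter on Python's standard logging module can emit this shape automatically. A minimal sketch, assuming field names that match the example above:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "service": "auth-service",   # set per service
            "event": record.getMessage(),
        }
        # Merge structured fields passed via the `extra` argument
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)

# Usage: pass structured fields through `extra`
logging.getLogger(__name__).warning(
    "user_login_failed",
    extra={"fields": {"user_id": 12345, "reason": "incorrect_password"}},
)
```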
Log Levels
| Level | When to use | Example / Note |
|-------|-------------|----------------|
| DEBUG | Development, detailed tracing | Don't log at this level in production (too verbose) |
| INFO | Major events (startup, shutdown, deployments) | "User 123 logged in" |
| WARNING | Potentially problematic situations | "Cache miss rate > 20%" |
| ERROR | Something failed, but the system still works | "Failed to send email to user 123, will retry" |
| CRITICAL | System is down or degraded | "Database connection pool exhausted" |
What to Log
[YES] DO Log:
- Errors and exceptions (with stack traces)
- Major state changes (user logged in, order placed)
- Performance concerns (slow queries, timeouts)
- Security events (login attempts, permission denials)
- Debugging info (request IDs, user context)
[NO] DON’T Log:
- User passwords, API keys, tokens
- Full credit card numbers (log last 4 digits only)
- Personally identifiable info (unless required)
- Debug output from third-party libraries
- Everything (too much log = can’t find signal)
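A small redaction helper keeps the DON'T list above enforceable in code. A sketch, where the set of sensitive field names is an assumption you would adapt per service:

```python
SENSITIVE_FIELDS = {"password", "api_key", "token", "authorization"}

def redact(fields):
    """Drop or mask sensitive values before they reach the log pipeline."""
    clean = {}
    for key, value in fields.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"
        elif key == "card_number":
            clean[key] = f"****{str(value)[-4:]}"  # keep last 4 digits only
        else:
            clean[key] = value
    return clean

# Usage: redact before logging
# logger.info(json.dumps({"event": "payment_submitted", **redact(payload)}))
```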
Structured Log Example (Python)
import json
import logging
from datetime import datetime

# Configure structured logging
logger = logging.getLogger(__name__)

# Note: User and create_session are illustrative application helpers
def handle_user_login(username, password, ip_address):
    try:
        user = User.find_by_username(username)
        if not user:
            logger.warning(json.dumps({
                "event": "user_not_found",
                "username": username,  # OK: not sensitive
                "ip_address": ip_address,
                "timestamp": datetime.utcnow().isoformat()
            }))
            return {"error": "Invalid credentials"}

        if not user.verify_password(password):
            logger.warning(json.dumps({
                "event": "invalid_password",
                "user_id": user.id,
                "attempt_number": user.failed_attempts + 1,
                "ip_address": ip_address
            }))
            user.failed_attempts += 1
            return {"error": "Invalid credentials"}

        # Success
        logger.info(json.dumps({
            "event": "user_logged_in",
            "user_id": user.id,
            "ip_address": ip_address
        }))
        return {"success": True, "session_id": create_session(user)}

    except Exception as e:
        logger.error(json.dumps({
            "event": "login_error",
            "error": str(e),
            "error_type": type(e).__name__,
            "username": username
        }))
        return {"error": "Internal error"}
Tracing: End-to-End Visibility
The Problem (Without Tracing)
User reports: “My request takes 30 seconds!”
Without tracing:
Total time: 30 seconds
... but where is it slow?
- API server: ?
- Database: ?
- Cache: ?
- External API: ?
→ Need to guess, investigate each component
The Solution (With Tracing)
Request trace ID: 550e8400-e29b-41d4-a716-446655440000
Timeline:
0ms: HTTP request arrives
5ms: Authentication check (5ms)
10ms: Authorization check (5ms)
200ms: Database query (190ms) ← SLOW!
210ms: Cache update (10ms)
220ms: Format response (10ms)
225ms: HTTP response sent
Result: Database query is the bottleneck (190ms of 225ms)
Action: Optimize slow query or add index
Distributed Tracing (Microservices)
User request to user-service: 100ms total
Breakdown:
0ms: Request arrives at user-service
10ms: Call auth-service (20ms)
  ├─ database lookup (5ms)
  └─ cache lookup (15ms)
30ms: Call order-service (50ms)
  ├─ payments-api call (30ms)
  └─ database query (20ms)
80ms: Format response (20ms)
100ms: Response sent
Result: The slowest downstream call is payments-api (30ms)
Action: Optimize the payments API or add a timeout
Implementing Tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up the tracer provider and send spans to a local Jaeger agent
provider = TracerProvider()
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

# Instrument the HTTP library (outgoing calls made with `requests`)
RequestsInstrumentor().instrument()

# Create tracer
tracer = trace.get_tracer(__name__)

# Use in code: wrap a unit of work in a span and attach attributes
with tracer.start_as_current_span("database_query") as span:
    span.set_attribute("query", "SELECT * FROM users WHERE id = ?")
    user = database.query("SELECT * FROM users WHERE id = ?", user_id)  # `database` is illustrative
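To correlate traces with logs, stamp the active trace and span IDs onto each structured log line. A sketch using the OpenTelemetry API; the logger wiring is assumed to be the structured logger shown earlier:

```python
from opentelemetry import trace

def current_trace_ids():
    """Return the active trace/span IDs as hex strings for log correlation."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }

# Usage: merge into every structured log entry
# logger.info(json.dumps({"event": "order_placed", **current_trace_ids()}))
```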
Alerting: From Metrics to Actions
Alert Philosophy
Good alerts:
- Actionable (not “something might be wrong”)
- Rare (not noisy/flaky)
- Severity-appropriate (critical = page-on-call, warning = slack)
Bad alerts:
- “CPU is above 50%” (not specific, not actionable)
- “Error rate changed” (by how much? is it significant?)
- “Database query took 2 seconds” (sometimes OK, depends on query)
Alert Examples
Alert: API P99 Latency High
Condition: P99 latency > 1 second for >= 5 minutes
Severity: WARNING
Action: Check database/cache metrics, review recent deployments
Alert: Database Connection Pool Critical
Condition: Used connections > 90% for >= 2 minutes
Severity: CRITICAL (pages on-call)
Action: Check slow queries, close abandoned connections, scale up
Alert: Error Rate Spike
Condition: 5xx error rate > 1% for >= 1 minute
Severity: CRITICAL
Action: Check recent deployments, review error logs, rollback if needed
Alert: Disk Space Critical
Condition: Disk usage > 90% for >= 10 minutes
Severity: CRITICAL
Action: Delete old logs, archive data, scale storage
Alert Severity Levels
CRITICAL (page on-call immediately)
- System is down or degraded
- User-facing feature broken
- Data loss risk
- Security incident
WARNING (notify team, can wait)
- Performance issue (but system works)
- Resource usage high (but not critical)
- Unusual patterns (but maybe intentional)
INFO (log for reference)
- Deployments, configuration changes
- Regular maintenance, backups
- Scheduled events
SLI, SLO, and Error Budgets
Definitions
SLI (Service Level Indicator) - A metric that measures performance:
- Example: “API P99 latency is 120ms” or “System uptime is 99.95%”
- Measurable using monitoring data (from metrics/logs)
- You measure the actual SLI value
SLO (Service Level Objective) - A target for your SLI:
- Example: “API P99 latency should be < 200ms” or “System uptime target: 99.95%”
- What you promise to users (in SLA) or commit internally
- SLO is the target; SLI is the measurement against it
SLA (Service Level Agreement) - A contract with customers:
- What happens if you miss SLO (refunds, credits, penalties)
- External promise (affects revenue/reputation)
- Optional: Many internal services don’t have SLAs
Error Budget - How much you can fail and still meet SLO:
- If SLO is 99.9% uptime, error budget is 0.1%
- Over 30 days: 0.1% × (30 days × 24 h × 3,600 s) = 0.001 × 2,592,000 s = 2,592 seconds ≈ 43 minutes of allowed downtime
- Use error budget to decide: Ship risky feature? Take infrastructure down? Run load tests?
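The error budget arithmetic is simple enough to keep in a small helper; a minimal sketch, assuming a 30-day reporting window:

```python
def error_budget_seconds(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime (in seconds) for an availability SLO over a window."""
    window_seconds = window_days * 24 * 3600
    return (1 - slo_percent / 100) * window_seconds

# 99.9% over 30 days  -> ~2,592 s (~43.2 minutes)
# 99.99% over 30 days -> ~259 s  (~4.3 minutes)
print(error_budget_seconds(99.9), error_budget_seconds(99.99))
```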
Setting SLIs & SLOs
Step 1: Identify critical user journeys
- Example: User signup, product search, checkout, payment processing
- Not every endpoint needs an SLO (focus on critical paths)
Step 2: Choose meaningful SLIs for each journey
Critical Journey: User Payment
├─ SLI 1: API latency (P99)
│ └─ SLO: < 500ms for 99.9% of requests
├─ SLI 2: Success rate
│ └─ SLO: > 99.99% (< 0.01% failure)
└─ SLI 3: Data freshness
└─ SLO: Payment recorded within 5 seconds
Critical Journey: Product Search
├─ SLI 1: Search latency (P95)
│ └─ SLO: < 200ms for 95% of requests
├─ SLI 2: Search accuracy
│ └─ SLO: > 95% of results relevant
└─ SLI 3: Availability
└─ SLO: 99.9% uptime
Step 3: Be realistic
- Don’t promise 99.99% if you have external dependencies you don’t control
- Start conservative (99.5%); tighten as confidence grows
- Remember: 99.9% means ~43 minutes downtime/month; 99.99% means ~4 minutes/month
Error Budget Example
SLO: 99.9% uptime for payment processing (0.1% error budget)
Budget allocation over month (30 days × 24h × 3600s = 2,592,000s total):
Total allowed downtime: 0.1% × 2,592,000s = 2,592 seconds ≈ 43.2 minutes
Allocation:
Scheduled maintenance: 15 minutes (35% of budget)
Unplanned incidents: 15 minutes (35% of budget)
Load testing/risky deploys: 13 minutes (30% of budget)
Reserve: 0 minutes (fully allocated)
Decision-making:
- “Should we deploy the risky feature?” → Check error budget
- If budget remaining > 13 min, OK. Otherwise, wait for next month
- “Is this incident worth investigating?” → If it consumed budget, yes
- “Can we do maintenance?” → Only if budget allows
Monitoring SLIs & SLOs
Use alerts to catch SLO violations early:
Alert: Approaching SLO Violation
Condition: If current rate would miss SLO by end of day
Action: Page on-call to prevent further failures
Example: 5xx rate is 0.08% (approaching 0.1% daily limit)
Alert: SLO Violated
Condition: SLI has exceeded SLO for 5 minutes
Action: Immediate incident response
Example: Latency P99 exceeded 500ms for 5+ minutes
Track error budget burn rate:
Prometheus query:
rate(errors_total[5m]) / rate(requests_total[5m]) # Current 5-min error rate
If SLO allows 0.1% errors:
- Current burn rate > 0.1%: Burning budget fast (yellow alert)
- Current burn rate > 0.5%: Burning budget very fast (red alert)
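If you need the burn rate outside a dashboard (for example, as a deploy gate), the Prometheus HTTP query API can be called directly. A sketch; the Prometheus URL and the errors_total/requests_total metric names are assumptions matching the query above:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: your Prometheus endpoint
QUERY = "rate(errors_total[5m]) / rate(requests_total[5m])"

def current_error_rate() -> float:
    """Return the current 5-minute error rate as a fraction (0.0008 = 0.08%)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Deploy gate example: hold risky deploys while the budget is burning fast
if current_error_rate() > 0.001:  # SLO allows 0.1% errors
    print("Error budget burning fast; hold risky deployments")
```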
SLI/SLO Template
Copy this for each critical service:
## Service: [Payment Processing]
### SLOs (What we promise)
| SLI | Target | Why | Owner |
|-----|--------|-----|-------|
| Latency P99 | < 500ms | Users expect responsive checkout | Payments team |
| Success rate | > 99.99% | Failed charges damage trust | Payments team |
| Data freshness | < 5s | Reconciliation depends on accuracy | Finance + Payments |
| Availability | 99.9% | 43 min downtime/month acceptable | Infrastructure |
### Error Budget (monthly)
| Category | Time | % of Budget |
|----------|------|------------|
| Scheduled maintenance | 15 min | 35% |
| Incident response | 15 min | 35% |
| Risky deployments | 13 min | 30% |
| **Total** | **43.2 min** | **100%** |
### Current Status (this month)
| SLI | Target | Actual | Status | Burn |
|-----|--------|--------|--------|------|
| Latency P99 | < 500ms | 185ms | [YES] Green | Good |
| Success rate | > 99.99% | 99.991% | [YES] Green | Good |
| Availability | 99.9% | 99.94% | [YES] Green | Good |
| Budget remaining | 43.2 min | 38 min | ⚠️ Yellow | Normal |
### Actions
- [ ] If budget < 10 min: Freeze risky deployments
- [ ] If any SLI approaching SLO: Incident response
- [ ] Weekly review of burn rate vs. targets
Dashboards: Visualization
Key Metrics Dashboard
┌─ Service Status ─────────────────────┐
│ ✓ API Server (green) │
│ ✓ Database (green) │
│ ⚠ Cache (yellow - slow response) │
│ ✓ Queue Workers (green) │
└─────────────────────────────────────┘
┌─ Request Metrics ────────────────────┐
│ Throughput: 1,200 req/sec │
│ Latency P50: 80ms │
│ Latency P99: 450ms │
│ Error Rate: 0.08% │
│ 5xx Errors: 10/min │
└─────────────────────────────────────┘
┌─ Resources ──────────────────────────┐
│ CPU: 45% (healthy) │
│ Memory: 72% (normal) │
│ Disk: 58% (OK) │
│ Database Connections: 87/100 │
└─────────────────────────────────────┘
Troubleshooting Dashboard
When alert fires, have dashboard that shows:
- Timeline of what happened
- Related metrics (error rate, latency, resources)
- Recent deployments
- Top errors in last hour
- Slow queries
- Resource constraints
On-Call Runbook Template
When an alert fires, the on-call engineer needs a runbook:
# Alert: API P99 Latency High
## Quick Diagnosis (5 min)
1. Check if it's real
   - Is P99 actually > 1s? (might be a metric glitch)
   - Is it affecting real users? (check error logs)
2. Gather context
   - Did we deploy recently? (check deployments)
   - Is the database slow? (check DB metrics)
   - Is the cache down? (check cache metrics)
   - Is there a traffic spike? (check RPS)
## If Database is Slow
1. Connect to the database
   ```sql
   SHOW PROCESSLIST;                   -- see queries running right now
   SELECT * FROM mysql.slow_log
   ORDER BY start_time DESC LIMIT 10;  -- recent slow queries (if log_output='TABLE')
   ```
2. Identify the slow query
   - Look for queries taking > 500ms
   - Check if an index is missing
   - Check for N+1 query patterns
3. Options
   - Kill the long-running query (if safe)
   - Add an index (if appropriate)
   - Scale the database (if overloaded)
## If It's a Traffic Spike
1. Is it legitimate?
   - Check graphs (should match user activity)
   - Check recent marketing (PR, social media)
   - Check competitors (did they mention us?)
2. What to do
   - Scale up (if unexpected)
   - Accept it (if expected/temporary)
   - Optimize (if sustained)
## Escalation
If you can't diagnose in 10 minutes:
- Page a database expert (if the DB is slow)
- Page an infrastructure expert (if resource constrained)
- Declare an incident if customers are affected
---
## Prometheus Query Examples
If using Prometheus, these PromQL queries are commonly useful:
### Request Rate & Errors
```promql
# Request rate per second (5-minute average)
rate(http_requests_total[5m])

# Error rate (5xx only)
rate(http_requests_total{status=~"5.."}[5m])

# Error rate as percentage
(rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) * 100

# 4xx vs 5xx error rates
rate(http_requests_total{status=~"4.."}[5m])  # Client errors
rate(http_requests_total{status=~"5.."}[5m])  # Server errors

# Requests by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Errors by endpoint (find problematic endpoints)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
```
### Latency (Duration)
```promql
# P95 latency (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P99 latency (99th percentile)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Latency by endpoint
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Slow requests (> 1 second; assumes a bucket boundary at le="1")
rate(http_request_duration_seconds_count[5m]) - ignoring(le) rate(http_request_duration_seconds_bucket{le="1"}[5m])
```
### Resource Usage
```promql
# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Database connections in use
pg_stat_activity_count                  # PostgreSQL (postgres_exporter)
mysql_global_status_threads_connected   # MySQL (mysqld_exporter)
```
### Database Performance
```promql
# Query execution rate
rate(mysql_global_status_queries[5m])

# Slow query rate
rate(mysql_global_status_slow_queries[5m])

# Connection pool usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections

# Replication lag (MySQL)
mysql_slave_status_seconds_behind_master
```
### SLO Monitoring
```promql
# Error budget burn rate (5-minute)
rate(errors_total[5m]) / rate(requests_total[5m])

# SLO status: Is P99 latency within SLO? (SLO: 500ms)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) < 0.5

# Availability (uptime) over the last 30 days
avg_over_time(up[30d]) * 100
```
### Useful Query Patterns
```promql
# Alert if any endpoint has > 1% error rate
(rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) > 0.01

# Alert if P99 latency > 1 second
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1

# Alert if CPU > 80% (add a `for: 5m` clause in the alerting rule)
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# Alert if disk > 85%
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85
```
Integration with Playbook
Part of design and planning:
- /pb-plan - Include observability in feature planning
- /pb-guide - Section 4.4 covers monitoring design
- /pb-review-hygiene - Code review checks for logging
- /pb-release - Release checklist includes dashboard setup
Related Commands:
- /pb-plan - Feature planning (include observability)
- /pb-guide - SDLC workflow
- /pb-adr - Architecture decision (monitoring tools)
- /pb-sre-practices - SRE operational practices, error budgets
Observability Checklist
For each new feature:
Planning Phase:
- What metrics matter? (latency, errors, business)
- What events to log? (state changes, errors)
- How to trace? (request flow, external calls)
- What to alert on? (when is this broken?)
Implementation Phase:
- Add metric instrumentation
- Add structured logging
- Add distributed tracing
- Create dashboards
Deployment Phase:
- Verify metrics are flowing
- Test alerts (trigger intentionally, verify notification)
- Create runbooks (for when things break)
- Document dashboards (what does each chart mean?)
Tools (Popular Options)
- Metrics: Prometheus, Datadog, New Relic, CloudWatch
- Logs: ELK Stack, Splunk, Datadog, CloudWatch Logs
- Traces: Jaeger, Datadog, New Relic, Lightstep
- Alerting: PagerDuty, Opsgenie, VictorOps
- Dashboards: Grafana, Kibana, Datadog, New Relic
Related Commands
- /pb-logging - Logging strategy and standards for structured logging
- /pb-incident - Incident response when observability alerts fire
- /pb-sre-practices - SRE operational practices and error budgets
- /pb-performance - Performance optimization using observability data
- /pb-maintenance - Preventive maintenance (monitoring detects; maintenance prevents)
Created: 2026-01-11 | Category: Planning | Tier: M/L