Deployment Patterns & Strategies
Reference guide for deployment strategies, patterns, and best practices. Use this to learn about and plan deployment approaches.
For executing deployments, use /pb-deployment (actionable deployment workflow).
Principle: Every deployment strategy involves trade-offs.
Use /pb-preamble thinking: question your actual risk tolerance before choosing. Use /pb-design-rules thinking: balance Simplicity (don’t use complex strategies you don’t need) with Robustness (design for failure and rollback). Challenge whether you need the complexity of advanced strategies or if simpler approaches work.
Resource Hint: sonnet - Deployment pattern reference; implementation-level release strategy decisions.
When to Use
- Choosing a deployment strategy for a new service or major release
- Evaluating risk tolerance and rollback requirements
- Planning blue-green, canary, or rolling deployments
Purpose
Deployment is an exercise in controlled risk. Goals:
- Zero downtime: Users don’t notice the deployment
- Fast rollback: If something breaks, revert in seconds
- Gradual rollout: Start small, expand to all users
- Safety first: Catch problems before users see them
Deployment Strategies
Choose strategy based on risk and scope.
Strategy 1: Blue-Green Deployment (Safest)
How it works:
- Keep current version running (Blue)
- Deploy new version to separate environment (Green)
- Test Green environment fully
- Switch traffic to Green instantly
- Old Blue stays running for quick rollback
Diagram:
Before:
Users → [Blue - current version running]
Deploy:
Users → [Blue - current version]
[Green - new version deployed, not receiving traffic yet]
After:
Users → [Green - new version live]
[Blue - previous version, ready for rollback]
Advantages:
- Zero downtime (instant switch)
- Fast rollback (switch back to Blue)
- Full testing before traffic switch
- Two environments to compare
Disadvantages:
- Expensive (need 2x resources)
- Database migrations must be compatible
- Can’t test at full production load
When to use:
- Critical systems (payment, auth)
- Zero downtime required
- Budget allows 2x infrastructure
Implementation:
# 1. Deploy new version to green environment
kubectl set image deployment/app-green app=myapp:v2.0
# 2. Wait for green to be ready
kubectl rollout status deployment/app-green
# 3. Test green (health checks pass)
curl http://green.internal/health # Should return 200
# 4. Switch traffic
kubectl patch service app -p '{"spec":{"selector":{"version":"v2.0"}}}'
# 5. If broken, switch back instantly
kubectl patch service app -p '{"spec":{"selector":{"version":"v1.0"}}}'
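The selector patch above assumes a Service that selects pods by a version label, with blue and green running as separate Deployments. A minimal sketch of that Service (names and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: app
    version: v1.0   # flip this label to v2.0 to cut traffic over to green
  ports:
    - port: 80
      targetPort: 8000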
Strategy 2: Canary Deployment (Balanced)
How it works:
- Deploy new version alongside current
- Send small % of traffic to new version (5%)
- Monitor for errors
- Gradually increase % (5% → 25% → 50% → 100%)
- If errors spike, roll back the canary
Diagram:
Phase 1: 5% traffic to v2.0
95% → [v1.0 - stable]
5% → [v2.0 - canary, low traffic]
Phase 2: 50% traffic to v2.0
50% → [v1.0]
50% → [v2.0]
Phase 3: 100% traffic to v2.0
[v2.0 - all traffic, fully rolled out]
Advantages:
- Catch bugs with real traffic (small blast radius)
- Gradual rollout (if errors, affect few users)
- Monitor real user impact
- Easy to roll back (just reduce the canary %)
Disadvantages:
- Longer deployment time (30min - 2 hours)
- Complex monitoring (compare v1 vs v2 metrics)
- Database must be compatible
When to use:
- Medium-risk deployments
- Want real traffic testing
- Can monitor and react quickly
Implementation:
# Kubernetes Canary with Flagger
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:
    port: 80
  # Gradually shift traffic
  skipAnalysis: false
  analysis:
    interval: 1m
    threshold: 5 # Max failed metric checks before rollback
    maxWeight: 50 # Max 50% of traffic to the canary before promotion
    stepWeight: 5 # Increase canary weight by 5% each interval
    metrics:
      - name: error-rate # assumes a custom MetricTemplate of this name
        thresholdRange:
          max: 0.05 # Error rate < 5%
      - name: latency # assumes a custom MetricTemplate of this name
        thresholdRange:
          max: 500 # P99 latency < 500ms
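A usage sketch, assuming Flagger and a supported mesh or ingress controller are already installed in the cluster:
kubectl apply -f canary.yaml
# Flagger reports progress (weight, failed checks, promotion) on the Canary resource
kubectl get canary app -w
kubectl describe canary app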
Manual canary (without Flagger):
# A plain Kubernetes Service can't split traffic by weight, so without a mesh or
# ingress controller, approximate the split with replica ratios: run a second
# Deployment (app-canary, illustrative) whose pods carry the same labels the
# Service selects, and traffic divides roughly by pod count.
# 1. Deploy the canary alongside the stable version
kubectl apply -f app-canary.yaml # 1 replica of myapp:v2.0, same pod labels as app
# 2. 19 stable replicas + 1 canary replica ≈ 5% of traffic to v2.0
kubectl scale deployment/app --replicas=19
kubectl scale deployment/app-canary --replicas=1
# 3. Monitor error rate and latency (should match v1.0)
# Watch metrics dashboard for 5 minutes
# 4. If good, shift the ratio toward the canary (≈ 25%)
kubectl scale deployment/app --replicas=15
kubectl scale deployment/app-canary --replicas=5
# 5. If errors spike, remove the canary entirely
kubectl scale deployment/app-canary --replicas=0
kubectl delete deployment app-canary
Strategy 3: Rolling Deployment (Fast)
How it works:
- Gradually replace old instances with new
- Take down one instance, deploy the new version, bring it back up
- Repeat until all instances are replaced
- If errors are detected, stop and roll back
Diagram:
Phase 1: Replace 1/5 instances
[v1.0] [v1.0] [v1.0] [v1.0] [v2.0]
Phase 2: Replace 2/5 instances
[v1.0] [v1.0] [v1.0] [v2.0] [v2.0]
Phase 3: All replaced
[v2.0] [v2.0] [v2.0] [v2.0] [v2.0]
Advantages:
- No extra infrastructure needed
- Fast (completes in minutes)
- Automatic rollback on error
- Uses existing instance capacity
Disadvantages:
- Temporarily reduced capacity during rollout
- Must support both versions simultaneously (database!)
- Can’t fully test before rolling out
- Harder rollback (must roll back the rollout)
When to use:
- Budget-constrained
- Fast deployments
- Confident in changes
Implementation:
# Kubernetes Rolling Update (default)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # At most 1 extra instance during rollout
      maxUnavailable: 0 # No instance taken down before its replacement is ready
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: myapp:v2.0 # New version
          # Readiness probe: the rollout stalls if new pods never become ready
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
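Triggering and watching the rollout:
# Either apply the updated manifest or set the image directly
kubectl set image deployment/app app=myapp:v2.0
kubectl rollout status deployment/app # blocks until complete (or reports failure)
# If the readiness probe keeps failing, the rollout stalls; undo it
kubectl rollout undo deployment/app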
Feature Flags: Deploy Without Releasing
Problem: Traditionally, deploying code also releases it: every user sees the new behavior at once.
Solution: Feature flags toggle features on/off at runtime, so code can be deployed dark and released (or rolled back) without redeploying.
# Feature flag pattern
def checkout():
    if feature_flag_enabled('new_checkout'):
        return new_checkout()  # New code path (flag ON)
    else:
        return old_checkout()  # Old code path (flag OFF)
Benefits:
- Decouple deployment from release
- Deploy at any time (flag off)
- Release when ready (flag on)
- Instant rollback (flag off)
- A/B testing (flag on for 10% of users; bucketing sketch below)
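Percentage rollouts are typically implemented as a deterministic hash of the user id, so a given user always lands in the same bucket. A minimal sketch (hypothetical helper, not any specific vendor's API):
import hashlib

def flag_enabled(flag_name: str, user_id: str, percentage: int) -> bool:
    # Hash flag+user into a stable bucket in [0, 100)
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percentage

# flag_enabled('new_checkout', 'user-42', 10) returns the same answer on every call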
Implementation:
# Using the LaunchDarkly server-side SDK (ldclient) or similar
import ldclient

def checkout():
    user = get_current_user()  # assumed helper returning a user/context object
    # variation(key, context, default): default is returned if evaluation fails
    # (assumes ldclient.set_config(...) was called at startup)
    if ldclient.get().variation('new-checkout', user, False):
        return new_checkout()
    else:
        return old_checkout()
Deployment with flags:
# Step 1: Deploy with the feature flag OFF
kubectl set image deployment/app app=myapp:v2.0
# Feature is deployed but disabled
# Step 2: Monitor for errors (should be none; the new code path isn't executing)
# Wait 1 hour, no errors
# Step 3: Enable for internal users (~1% of traffic)
# (illustrative call; use your flag provider's API or dashboard)
flag.set_percentage('new_checkout', percentage=1)
# Monitor for 30 minutes
# Step 4: Enable for 10% of users
flag.set_percentage('new_checkout', percentage=10)
# Monitor for 1 hour
# Step 5: Enable for all users
flag.set_percentage('new_checkout', percentage=100)
Cleanup:
# After feature stable for 2 weeks
def checkout():
# Remove feature flag completely
return new_checkout() # Just use new code
Database Migrations: Avoid Data Loss
Problem: Schema changes can break running code.
Solution: Gradual migrations, test thoroughly, rollback plan.
Zero-Downtime Migration Pattern
Step 1: Add new column (backwards compatible)
ALTER TABLE users ADD COLUMN phone_number VARCHAR(20) DEFAULT NULL;
-- Old code: uses email
-- New code: will use phone_number, falls back to email if NULL
-- Both work simultaneously
Step 2: Deploy code that reads new column
# New code reads new column, with fallback
def get_contact_method(user):
if user.phone_number:
return user.phone_number
else:
return user.email # Fallback
Step 3: Deploy code that writes new column
# New code writes both the old and the new column
def update_user(user, new_email, new_phone):
    user.email = new_email          # Old column
    user.phone_number = new_phone   # New column
    user.save()
Step 4: Backfill existing data
-- Backfill old records in the background (non-blocking; batching sketch below)
-- Copying the old column's value is a placeholder; in practice, derive the
-- new column from your real data source
UPDATE users SET phone_number = email WHERE phone_number IS NULL;
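To keep a large backfill non-blocking, one common approach is updating in small batches, each in its own short transaction. A sketch, assuming Postgres and an integer primary key:
-- Run repeatedly (cron or a script loop) until 0 rows are updated
UPDATE users
SET phone_number = email          -- placeholder derivation, as above
WHERE id IN (
    SELECT id FROM users
    WHERE phone_number IS NULL
    LIMIT 1000
);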
Step 5: Remove fallback, use only new column
# Remove fallback after backfill complete
def get_contact_method(user):
return user.phone_number # Just use new column
Step 6: Remove old column (only when truly no longer needed)
-- Keep the old column for 3+ months as an emergency rollback window, then:
ALTER TABLE users DROP COLUMN email;
Why this pattern is safe:
- Each step is backwards compatible
- Can rollback at any step
- No data loss
- No blocking locks on table
- Users not affected
Rollback Strategies
Quick Rollback (Use Feature Flags)
Fastest: Feature flag off (instant)
# Users still get old behavior, no code redeployment
flag.set_percentage('new_checkout', percentage=0)
# Done. Takes 1 second.
Fast Rollback (Use Blue-Green)
Fast: Switch traffic to previous version (seconds)
# Instant traffic switch to previous version
kubectl patch service app -p '{"spec":{"selector":{"version":"v1.0"}}}'
# Takes 1-2 seconds, users see no interruption
Rollback Last Deployment (Kubernetes)
Medium: Rollback last deployment (30 seconds)
kubectl rollout undo deployment/app
# Rolls back to previous version automatically
# Waits for new pods to be ready
# Takes ~30 seconds
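If you need to go back more than one revision, kubectl keeps a rollout history:
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=2 # pick the known-good revision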
Manual Rollback (With Backups)
For data corruption: Restore from backup
# 1. Stop application traffic (prevents writes during the restore)
kubectl scale deployment app --replicas=0
# 2. Restore from backup (custom-format dump; backup sketch below)
pg_restore -d mydb backup_2024_01_11_1400.dump
# 3. Bring the old version back online
kubectl set image deployment/app app=myapp:v1.0
kubectl scale deployment app --replicas=5
# Takes 5-10 minutes; data restored, old version running
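This flow assumes a recent dump exists. A minimal sketch of taking one before deploying (custom format, which pg_restore consumes):
pg_dump -Fc -d mydb -f backup_$(date +%Y_%m_%d_%H%M).dump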
What NOT to Do
[NO] DON’T roll back by leaving both versions serving traffic:
# Bad: Users see inconsistency, data corruption
kubectl patch service app -p '{"spec":{"selector":{"version":"mixed"}}}'
# Some requests go to v1.0, some to v2.0, data gets out of sync
[NO] DON’T deploy a fix immediately after rolling back:
# Bad: Rolled back to v1.0 due to a bug,
# then immediately redeployed v2.0 with a "fix"
# But the "fix" is untested
# Good: Roll back, investigate, fix, test, then deploy
Pre-Deployment Checklist
Code Quality
- All tests passing (unit, integration, E2E)
- Code reviewed and approved
- Linter passing
- Type checking passing (if applicable)
- Security scan passed
- No console.log/print statements left
Database
- Migration tested locally
- Rollback plan documented
- Backward compatible (old code + new schema works)
- Backup taken (or auto backup confirmed)
- Estimated migration time calculated
Configuration
- All environment variables configured
- Secrets not in code (using a secret manager; sketch below)
- Feature flags ready (new features default to off)
- Monitoring/alerts configured
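For the secrets item, a minimal sketch using a plain Kubernetes Secret; names are illustrative, and a real setup would have CI or a secret manager create it:
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:
  DB_PASSWORD: "replace-me"   # injected by CI / secret manager, never committed
---
# Reference from the Deployment's container spec:
# envFrom:
#   - secretRef:
#       name: app-secrets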
Monitoring & Alerts
- Dashboard created (or updated)
- Key metrics monitored (latency, errors, resource usage)
- Alerts configured (error spike, latency spike, resource exhaustion; example rule below)
- On-call engineer assigned
- Runbook prepared (what to do if something breaks)
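As one illustration of the alerts item, a minimal Prometheus alerting rule; it assumes an http_requests_total counter with a status label, so treat the names and the 5% threshold as placeholders to tune:
groups:
  - name: deployment-alerts
    rules:
      - alert: ErrorRateSpike
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% after deployment"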
Communication
- Stakeholders informed (when deployment will happen)
- Maintenance window scheduled (if downtime needed)
- Support team briefed (possible issues)
- Rollback plan communicated (if needed)
Deployment Checklist
Before Deployment (1 hour)
- Check code one more time
- Check if anything changed since last review (git log)
- Verify tests still passing
- Check team is available (for 1-2 hours)
- Check production status (no current incidents)
During Deployment
- Deploy code
- Wait for new instances to be healthy (health checks pass)
- Watch error metrics (should match the pre-deployment baseline)
- Watch latency metrics (should match the pre-deployment baseline; monitoring sketch after this list)
- Wait 5-10 minutes to ensure stable
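A minimal sketch for the watch steps above, assuming kubectl access and a reachable /health endpoint:
kubectl rollout status deployment/app # wait for new pods to pass health checks
kubectl get pods -l app=app -w # watch pod status live
# Spot-check the service while the metrics settle
while true; do
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/health
sleep 5
done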
After Deployment (30 min - 1 hour)
- Monitor error rate (no spike)
- Monitor latency (no spike)
- Monitor resource usage (no spike)
- Check logs for warnings/errors
- Smoke test key user flows
- Wait 1-2 hours before signing off (catch delayed issues)
Post-Deployment
- Create post-deployment issue if any minor issues found
- Update deployment log
- Notify team (Slack message confirming successful deployment)
Smoke Testing: Quick Validation After Deployment
What: Smoke tests are rapid validation checks that verify the system’s core functionality is working right after deployment.
Why: Deploy → immediately test critical paths → catch issues before users do → roll back quickly if needed.
Key difference:
- Unit tests: Verify functions work (in code)
- Integration tests: Verify components work together (in CI/CD)
- Smoke tests: Verify system works end-to-end (after deployment)
Manual Smoke Testing
When to run: Immediately after deployment (first 5-10 minutes).
Timing: 5-15 minutes per deployment.
What to test (critical user paths):
Ecommerce platform:
✓ User can browse products
✓ User can add to cart
✓ User can checkout (full payment flow)
✓ Order confirmation email sent
✓ Admin can view orders
✓ Inventory updated correctly
SaaS application:
✓ User can login
✓ User can create new project/workspace
✓ User can export data
✓ Admin dashboard loads
✓ API endpoints responding
✓ Database queries fast (< 500ms)
API service:
✓ Health check endpoint returns 200
✓ Authentication working
✓ Core endpoint responses correct
✓ Error handling works
✓ Rate limiting functional
✓ Logs capturing requests
Manual smoke test script (Bash):
#!/bin/bash
# smoke-test.sh - Quick validation after deployment
set -e # Exit on first failure
DOMAIN="${SMOKE_TEST_DOMAIN:-https://example.com}"
HEALTH_CHECK_URL="$DOMAIN/health"
TEST_USER_EMAIL="${SMOKE_TEST_EMAIL:-test+smoke@example.com}"
TEST_USER_PASS="${SMOKE_TEST_PASSWORD:-changeme123}" # Set via env var
echo "🔥 Starting smoke tests..."
# 1. Health check
echo "✓ Checking health endpoint..."
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_CHECK_URL")
if [ "$STATUS" != "200" ]; then
echo "[NO] Health check failed: $STATUS"
exit 1
fi
# 2. Login
echo "✓ Testing login..."
LOGIN_RESPONSE=$(curl -s -X POST "$DOMAIN/api/login" \
-H "Content-Type: application/json" \
-d "{\"email\":\"$TEST_USER_EMAIL\",\"password\":\"$TEST_USER_PASS\"}")
if ! echo "$LOGIN_RESPONSE" | grep -q "\"token\""; then
echo "[NO] Login failed"
exit 1
fi
TOKEN=$(echo "$LOGIN_RESPONSE" | grep -o '"token":"[^"]*' | cut -d'"' -f4)
# 3. Core API endpoint
echo "✓ Testing API endpoint..."
API_RESPONSE=$(curl -s -X GET "$DOMAIN/api/user/profile" \
-H "Authorization: Bearer $TOKEN")
if ! echo "$API_RESPONSE" | grep -q "\"email\""; then
echo "[NO] API endpoint failed"
exit 1
fi
# 4. Database connection (query latency)
echo "✓ Checking database performance..."
LATENCY=$(curl -s -X GET "$DOMAIN/api/metrics/db-latency" \
-H "Authorization: Bearer $TOKEN" | grep -o '"latency":[0-9]*' | cut -d':' -f2)
if [ "$LATENCY" -gt 1000 ]; then
echo "⚠️ Database latency high: ${LATENCY}ms (expected < 1000ms)"
fi
echo "[YES] Smoke tests passed!"
Manual test checklist:
- Can login with existing user
- Can create new account
- Can access dashboard/homepage
- Can perform primary action (checkout, submit form, etc.)
- Can access admin panel (if applicable)
- Database responding (queries < 500ms)
- External services working (payment, email, etc.)
- Error messages display correctly
- Logs showing requests (check CloudWatch/ELK/etc.)
Automated Smoke Testing
When to run: In CI/CD pipeline, after deployment.
Tools:
- curl/httpie: Simple HTTP requests
- Selenium/Playwright: Browser-based testing
- k6: Load testing with smoke scenarios
- Postman/Newman: API testing
- Cypress: End-to-end testing
Example: k6 smoke test (lightweight)
// smoke-test.js - k6 script for smoke testing
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
// Smoke test: few users, short duration
vus: 1, // 1 virtual user
duration: '2m', // Run for 2 minutes
thresholds: {
http_req_duration: ['p(99)<500'], // 99% requests < 500ms
http_req_failed: ['rate<0.1'], // Less than 10% failure rate
},
};
export default function() {
const BASE_URL = __ENV.BASE_URL || 'https://api.example.com';
const TEST_EMAIL = __ENV.TEST_EMAIL || 'test@example.com';
const TEST_PASSWORD = __ENV.TEST_PASSWORD || 'changeme123';
// Test 1: Health check
let res = http.get(`${BASE_URL}/health`);
check(res, {
'health: status 200': (r) => r.status === 200,
});
// Test 2: Login
res = http.post(`${BASE_URL}/auth/login`, JSON.stringify({
email: TEST_EMAIL,
password: TEST_PASSWORD,
}), {
headers: { 'Content-Type': 'application/json' },
});
check(res, {
'login: status 200': (r) => r.status === 200,
'login: token received': (r) => r.json('token') !== undefined,
});
const token = res.json('token');
// Test 3: Core endpoint with auth
res = http.get(`${BASE_URL}/api/user/profile`, {
headers: { 'Authorization': `Bearer ${token}` },
});
check(res, {
'profile: status 200': (r) => r.status === 200,
'profile: has email': (r) => r.json('email') !== undefined,
});
sleep(1);
}
Customizing thresholds for your system:
The example above uses illustrative thresholds. Adjust them to your actual system:
Example thresholds:
p(99) < 500ms - assumes a fast backend (yours might sit at 1000-2000ms)
rate < 0.1 - allows a 10% failure rate (too lenient for most production systems)
To set thresholds for your system:
1. Measure baseline: Run smoke test without threshold enforcement
2. Check metrics: What's your typical p99 latency? Error rate?
3. Set threshold: Use baseline + 10% margin
Example for slow system:
// If your baseline is: p99=2000ms, error=5%
export let options = {
vus: 1,
duration: '2m',
thresholds: {
http_req_duration: ['p(99)<2200'], // 2000ms + 10% margin
http_req_failed: ['rate<0.1'], // But keep <10% as safety net
},
};
Run smoke test:
# Run with environment variables for the target and test credentials
k6 run \
--env BASE_URL=https://api.example.com \
--env TEST_EMAIL=test@example.com \
--env TEST_PASSWORD=test_password \
smoke-test.js
Example: GitHub Actions smoke test (after deployment)
name: Deploy & Smoke Test
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Deploy to production
run: |
kubectl set image deployment/app app=myapp:${{ github.sha }}
kubectl rollout status deployment/app --timeout=5m
smoke-test:
needs: deploy
runs-on: ubuntu-latest
steps:
- name: Wait for deployment to stabilize
run: sleep 30
- name: Run smoke tests
env:
SMOKE_TEST_EMAIL: ${{ secrets.SMOKE_TEST_EMAIL }}
SMOKE_TEST_PASSWORD: ${{ secrets.SMOKE_TEST_PASSWORD }}
run: |
#!/bin/bash
set -e
# Test health check
curl -f https://example.com/health || exit 1
# Test login
TOKEN=$(curl -s -X POST https://example.com/api/login \
-H "Content-Type: application/json" \
-d "{\"email\":\"$SMOKE_TEST_EMAIL\",\"password\":\"$SMOKE_TEST_PASSWORD\"}" \
| jq -r '.token')
[ ! -z "$TOKEN" ] || exit 1
# Test core endpoint
curl -f -H "Authorization: Bearer $TOKEN" \
https://example.com/api/user/profile || exit 1
- name: Rollback on failure
if: failure()
run: |
kubectl rollout undo deployment/app
echo "Rollback complete. Smoke test failed."
exit 1
Data Persistence Validation
Critical: HTTP 200 response doesn’t guarantee data was saved.
Example problem:
Deployment breaks database writes silently:
- User clicks "create order" → API returns 200 [YES]
- But order never saved to database [NO]
- User thinks order exists, payment processed
- Real order is missing, customer support nightmare
Solution: Verify data persisted, not just HTTP 200
Bash example (verify order saved):
#!/bin/bash
# smoke-test-data.sh - Verify data actually persisted
DOMAIN="https://example.com"
# Get auth token
TOKEN=$(curl -s -X POST "$DOMAIN/api/login" \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"test"}' \
| jq -r '.token')
echo "Testing data persistence..."
# Test 1: Create order
ORDER_RESPONSE=$(curl -s -X POST "$DOMAIN/api/orders" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"items":[{"id":1,"qty":2}]}')
ORDER_ID=$(echo "$ORDER_RESPONSE" | jq -r '.order_id')
if [ -z "$ORDER_ID" ] || [ "$ORDER_ID" = "null" ]; then
echo "[NO] Create order failed"
exit 1
fi
echo "✓ Order created: $ORDER_ID"
# Wait 1 second for DB write to complete
sleep 1
# Test 2: Verify order is in database
SAVED_ORDER=$(curl -s -X GET "$DOMAIN/api/orders/$ORDER_ID" \
-H "Authorization: Bearer $TOKEN")
ORDER_STATUS=$(echo "$SAVED_ORDER" | jq -r '.status')
if [ "$ORDER_STATUS" != "pending" ]; then
echo "[NO] Order not saved to database (HTTP 200 but no data)"
echo "Response: $SAVED_ORDER"
exit 1
fi
echo "✓ Order saved correctly: status=$ORDER_STATUS"
# Test 3: Verify inventory decremented
INVENTORY=$(curl -s -X GET "$DOMAIN/api/inventory/1" \
-H "Authorization: Bearer $TOKEN")
QUANTITY=$(echo "$INVENTORY" | jq -r '.quantity')
if [ "$QUANTITY" -lt 8 ]; then # Started at 10, ordered 2
echo "✓ Inventory decremented correctly: $QUANTITY remaining"
else
echo "[NO] Inventory not updated (data not persisted)"
exit 1
fi
echo "[YES] All data persistence checks passed"
k6 example (verify response is correct):
// smoke-test-data.js - Verify data state after operations
import http from 'k6/http';
import { check, sleep } from 'k6';
export default function() {
const BASE_URL = 'https://api.example.com';
// Step 1: Create a resource
let res = http.post(`${BASE_URL}/api/orders`, JSON.stringify({
items: [{id: 1, qty: 2}],
customer_id: 'test-customer-1',
}), {
headers: { 'Content-Type': 'application/json' },
});
check(res, {
'create order: status 200': (r) => r.status === 200,
'create order: has order_id': (r) => r.json('order_id') !== undefined,
});
const orderId = res.json('order_id');
// Step 2: Wait for eventual consistency (DB write)
sleep(1);
// Step 3: Verify resource persisted correctly
res = http.get(`${BASE_URL}/api/orders/${orderId}`);
check(res, {
'verify order: status 200': (r) => r.status === 200,
'verify order: status is pending': (r) => r.json('status') === 'pending',
'verify order: has items': (r) => r.json('items').length > 0,
'verify order: customer_id matches': (r) =>
r.json('customer_id') === 'test-customer-1',
});
}
What to verify per application type:
| Application | What to verify | Why |
|---|---|---|
| E-commerce | Order saved, inventory decremented | Financial accuracy |
| SaaS | Workspace created, settings saved | Data loss is deal-breaker |
| API Service | Record persisted with correct values | Silent data loss |
| Messaging | Message in queue/database | Lost messages = lost data |
| Billing | Payment recorded, invoice generated | Revenue impact |
Smoke Test Checklist
Before smoke testing:
- Deployment completed successfully
- All pods/instances are healthy
- Health checks passing
- Wait 30-60 seconds for services to be ready
Smoke test validation:
- Critical user path works (login → action → success)
- API endpoints respond (< 500ms)
- Database queries fast (< 500ms)
- Authentication/authorization working
- External services connected (payment, email, etc.)
- Error handling works (test invalid input)
- Data persisted correctly (not just HTTP 200)
- Logs capturing traffic
- Metrics dashboard updating
- No excessive errors (< 1% error rate)
If smoke test fails:
- Check deployment logs (any deployment errors?)
- Check application logs (what’s the actual error?)
- Check metrics (CPU/memory/disk full?)
- ROLLBACK IMMEDIATELY (don’t wait)
- Investigate root cause (slow database? config wrong? service down?)
Deployment by Strategy Comparison
| Strategy | Time | Risk | Rollback | Cost | Complexity |
|---|---|---|---|---|---|
| Blue-Green | 5-10m | Low | Instant | High | Medium |
| Canary | 30m-2h | Low | Fast | Medium | High |
| Rolling | 5-15m | Medium | Slow | Low | Medium |
| Feature Flag | N/A | Very Low | Instant | Low | Low |
Choose:
- Critical system: Blue-Green
- Want real-traffic validation: Canary
- Budget constraints: Rolling
- Testing new feature: Feature Flag
Integration with Playbook
This is a reference document. For actionable workflows:
- /pb-deployment - Execute deployment (discovery, pre-flight, execute, verify)
- /pb-release - Release orchestrator (readiness gate, version, deploy trigger)
Related pattern references:
- /pb-patterns-core - Core architectural patterns
- /pb-patterns-cloud - Cloud deployment patterns (AWS, GCP, Azure)
- /pb-patterns-db - Database patterns (migrations, pooling)
Related operational commands:
- /pb-observability - Set up monitoring/alerts
- /pb-incident - Recovery if deployment breaks
- /pb-hardening - Infrastructure security before deployment
- /pb-secrets - Secrets management during deployment
- /pb-database-ops - Database migration patterns
- /pb-dr - Disaster recovery planning
Deployment Readiness Checklist
Deployment Strategy
- Strategy chosen (Blue-Green, Canary, Rolling, Feature Flag)
- Deployment plan documented
- Rollback plan documented
- Estimated deployment time defined
- Risk level assessed (Low/Medium/High)
Code & Database
- All tests passing
- Code review complete
- Database migration tested
- Backward compatibility verified
- Backup plan in place
Monitoring
- Dashboard created
- Error rate alert configured
- Latency alert configured
- Resource alert configured
- On-call engineer assigned
Communication
- Team informed (timing, strategy, risks)
- Support team briefed
- Stakeholders aware
- Rollback contact list ready
- Post-incident review time blocked
Related Commands
- /pb-deployment - Execute deployment workflows
- /pb-release - Release orchestration and version management
- /pb-dr - Disaster recovery planning for deployment failures
Category: Patterns | Reference Document | See /pb-deployment for actionable workflow