Advanced Testing Scenarios

Move beyond unit tests. Test behavior, catch mutations, verify contracts, stress systems.

Mindset: Testing embodies /pb-preamble thinking (challenge assumptions, surface flaws) and /pb-design-rules thinking (tests should verify Clarity, verify Robustness, check that failures are loud).

Your tests should challenge assumptions about code behavior. Find edge cases you didn’t think of. Question whether tests are actually testing behavior, not just hitting lines of code. Write tests that surface flawed thinking and verify design rules are honored.

Resource Hint: sonnet - test strategy design and implementation patterns

When to Use

Moving beyond unit tests to property-based, mutation, or contract testing
Designing test strategy for a new service or critical path
Strengthening weak tests identified by code review or mutation analysis

Purpose

Unit tests find bugs in code. Advanced testing finds bugs in:

Property-based tests: Edge cases you didn’t think of
Mutation tests: Tests that are too weak
Contract tests: Integration between services
Chaos tests: Failure scenarios
Performance tests: Degradation under load

Property-Based Testing

The Problem with Example-Based Tests

# Example-based test (traditional)
def test_sort():
    assert sort([3, 1, 2]) == [1, 2, 3]  # One example
    assert sort([]) == []  # Another example

# Problem: What about edge cases you didn't think of?
# - Negative numbers? Duplicates? Very large lists? Mixed types?

Property-Based Testing Solution

Generate many random inputs, verify property holds for all.

from hypothesis import given, strategies as st

# Property: After sorting, all elements in order
@given(st.lists(st.integers()))
def test_sort_property(unsorted_list):
    sorted_list = sort(unsorted_list)
    # Verify property for ANY input
    for i in range(len(sorted_list) - 1):
        assert sorted_list[i] <= sorted_list[i + 1]
    # Hypothesis generates 100+ random inputs automatically

# Hypothesis finds edge cases:
# - Empty list: [] → []
# - Single item: [1] → [1]
# - Duplicates: [1, 1, 2] → [1, 1, 2]
# - Negative: [-5, 0, 3] → [-5, 0, 3]
# - Large list: [9123, -4, ...] → sorted

More Property Examples

# Property: Reversing twice gives original
@given(st.lists(st.integers()))
def test_reverse_twice(lst):
    assert reverse(reverse(lst)) == lst

# Property: Adding to set then checking membership is True
@given(st.lists(st.integers()))
def test_set_membership(lst):
    s = set(lst)
    for item in lst:
        assert item in s

# Property: JSON encode then decode gives original
@given(st.lists(st.dictionaries(st.text(), st.integers())))
def test_json_roundtrip(data):
    json_str = json.dumps(data)
    decoded = json.loads(json_str)
    assert decoded == data

When to Use Property-Based Testing

[YES] DO use for:

Utility functions (sort, parse, format)
Mathematical functions
Data structure operations
Encoding/decoding

[NO] DON’T use for:

Functions with complex business logic
Functions with side effects
Database queries
External API calls

Mutation Testing

The Problem: Weak Tests

# Code being tested
def is_adult(age):
    return age >= 18

# Traditional test (looks good)
def test_is_adult():
    assert is_adult(20) == True
    assert is_adult(10) == False

# Problem: These tests would PASS for ANY implementation
def is_adult_broken(age):
    return True  # Always returns True, test still passes!

def is_adult_broken2(age):
    return age >= 21  # Wrong threshold, test still passes!

Mutation Testing Solution

Mutate code (change >= to >, = to !=, etc) and verify tests fail.

# Mutation testing with mutmut (Python)
# 1. Run tests normally: all pass
pytest

# 2. mutmut finds all code mutations
# 3. Runs tests for each mutation
# 4. Reports which mutations "survived" (tests still pass)

mutmut run

# Results:
# - Mutation: age >= 18 → age > 18
#   Tests: FAIL (good, test caught mutation)
# - Mutation: age >= 18 → age <= 18
#   Tests: FAIL (good, test caught mutation)
# - Mutation: age >= 18 → age >= 17
#   Tests: PASS (BAD, test didn't catch this mutation!)
#   SCORE: 66% (2/3 mutations caught)

Fixing Weak Tests

# Weak test (mutant age >= 17 survives)
def test_is_adult():
    assert is_adult(20) == True
    assert is_adult(10) == False

# Better test (catches age >= 17 mutation)
def test_is_adult():
    assert is_adult(20) == True
    assert is_adult(18) == True   # Boundary: 18 should be True
    assert is_adult(17) == False  # Boundary: 17 should be False
    assert is_adult(10) == False  # Below boundary

# Now mutation age >= 17 is caught!

Mutation Testing Across Languages

JavaScript:

npm install stryker
npx stryker run
# Reports mutation score: % of mutations caught

Python:

pip install mutmut
mutmut run --html-report
# Reports detailed mutations and survival

Java:

mvn install pitest:mutationCoverage
# Generates HTML report of mutations

When to Use Mutation Testing

[YES] DO use for:

Critical code paths
Mathematical/utility functions
Security code
Data validation

[NO] DON’T use for:

Every function (slow, overkill)
Integration tests
UI code

Contract Testing

The Problem: Integration Breaks

Service A (depends on B)
├─ Expects: GET /users returns {"id": int, "name": string}
└─ Tests: Mocks this response, all pass

Service B (provides API)
├─ Implements: GET /users returns {"userId": int, "fullName": string}
└─ Tests: All pass

Problem: Service A calls Service B in production
         API contract changed (id → userId, name → fullName)
         Integration breaks in production
         Tests in both services passed!

Contract Testing Solution

Define contract, both services test against it.

# Shared contract definition
# contracts/user_service_contract.py

USER_CONTRACT = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "email": {"type": "string"}
    },
    "required": ["id", "name"]
}

Service B (Provider) Tests Contract

# Service B: Verify API returns contract
import json_schema_validator

def test_get_user_matches_contract():
    response = client.get('/users/123')
    # Verify response matches contract
    validate(response.json(), USER_CONTRACT)
    # If contract defines: {"id": int, "name": string}
    # But code returns: {"userId": int, "fullName": string}
    # Test FAILS (caught before shipping)

Service A (Consumer) Tests Contract

# Service A: Verify mocks match contract
def test_get_user():
    with mock_user_service(returns=CONTRACT):  # Mock uses contract
        response = user_service.get_user(123)
        # Test uses actual contract, not hand-written mock
        assert response['id'] == 123
        assert response['name'] == 'John'

# If contract changes in Service B, contract definition updates
# Both services see the change, both update their code/tests

Contract Testing Tools

Pact (Most popular):

# Provider: Verify API matches consumer contract
from pact import Provider
pact = Provider("UserService")

# Consumer recorded contract expectations
pact.upon_receiving('a request for user 123') \
    .with_request('get', '/users/123') \
    .will_respond_with(200, body={"id": 123, "name": "John"})

# Provider tests: Does our API match consumer expectations?
pact.verify()  # PASS or FAIL

When to Use Contract Testing

[YES] DO use for:

Microservices communication
Public APIs
Third-party integrations
Service boundaries

[NO] DON’T use for:

Internal functions
Single-service monoliths
Unit tests

Chaos Engineering

The Problem: Untested Failure Modes

def get_user_with_orders(user_id):
    user = user_service.get(user_id)      # What if this fails?
    orders = order_service.get(user_id)   # What if this fails?
    recommendations = ai_service.recommend(user_id)  # What if this fails?
    return {user, orders, recommendations}

# Tests: All services work → all tests pass
# Production: AI service is slow one day → what happens?
# Answer: We don't know (and users find out)

Chaos Testing Solution

Intentionally break things, verify system handles gracefully.

# Chaos test: Order service is slow
@chaos_test(failure_mode='latency', service='order_service', latency=10_000)
def test_order_service_slow():
    response = client.get('/users/123')
    # Service should handle gracefully:
    # - Return user with empty orders (fallback)
    # - OR return user without recommendations
    # - OR timeout after 5 seconds with cached data
    # - NOT return error 500
    assert response.status_code == 200
    assert 'user' in response.json()
    assert response.elapsed < 5  # Timeout after 5 seconds

# Chaos test: Database down
@chaos_test(failure_mode='error', service='database', error='connection refused')
def test_database_down():
    response = client.get('/users/123')
    # Should handle gracefully (use cache, return degraded data, etc)
    assert response.status_code in [200, 503]  # OK or degraded service

# Chaos test: External API returns 500
@chaos_test(failure_mode='error_rate', service='payment', error_rate=0.5)
def test_payment_errors():
    # When payment API fails 50% of time:
    results = [client.post('/checkout') for _ in range(100)]
    # Should handle: retry, fallback, queue for later, etc
    # Not just return 500 errors

Chaos Engineering Tools

Gremlin (Commercial):

Inject failures: latency, packet loss, CPU spike, memory leak
Gradual rollout: 5% of traffic → 25% → 100%
Automated recovery

Chaos Toolkit (Open Source):

# chaos-experiment.yml
title: "Order Service Handles Payment Failures"
description: "Verify orders queue when payment API down"

probes:
- type: http
  name: "Get orders"
  method: GET
  url: http://api/orders

actions:
- type: "latency"
  duration: 5000  # 5 second latency
  target: "payment-api"
  percentage: 100

rollbacks:
- type: "stop"
  target: "payment-api-failure"

When to Use Chaos Testing

[YES] DO use for:

Distributed systems
Microservices
Critical paths
Before major incidents happen

[NO] DON’T use for:

Development environments
Simple systems
Nice-to-have features

Performance Testing

Beyond Load Testing

Load testing answers: “Can it handle 10,000 users?”

Performance testing answers: “Is it still fast with 10,000 users? What breaks first?”

# Load test: Can it handle the load?
wrk -c 1000 http://localhost:8000/
# Result: 100 req/sec, system handling

# Performance test: What degrades first?
# 100 users: P99 = 50ms, CPU 20%, Memory 30%, DB connections 10
# 500 users: P99 = 150ms, CPU 60%, Memory 60%, DB connections 50
# 1000 users: P99 = 800ms, CPU 95%, Memory 85%, DB connections 90
# 1500 users: P99 = 8000ms, CPU 100%, Memory 100%, DB connections 100 (LIMIT!)

# Finding: Database connection pool is bottleneck at 1500 users
# Solution: Increase pool size, use connection pooling, optimize queries

Stress Testing (Finding Breaking Points)

# Start slow, gradually increase load until something breaks
# 10 req/sec → all pass
# 50 req/sec → all pass
# 100 req/sec → all pass
# 500 req/sec → 5% errors (connection pool limit?)
# 750 req/sec → 20% errors
# 1000 req/sec → 50% errors (broken)

# Breaking point found at 500 req/sec (connection pool limit)

Soak Testing (Finding memory leaks)

# Run constant load for long time (hours, days)
# 100 req/sec for 24 hours

# Monitor:
# Hour 0: Memory 500MB
# Hour 6: Memory 550MB
# Hour 12: Memory 650MB
# Hour 24: Memory 950MB (memory leak!)

# Finding: Memory growing 20MB/hour
# Solution: Find memory leak, fix it

Testing in Production

Safe Practices

[YES] Production Testing:

Real traffic reveals real issues
Catch edge cases not seen in tests
Validate actual performance
Test real integrations

[NO] But be careful:

Don’t corrupt user data
Don’t expose security issues
Have rollback ready
Monitor closely

A/B Testing Framework

# Serve two versions, compare metrics
def checkout():
    user = get_user()

    # 50% of users get new checkout, 50% get old
    if user.id % 2 == 0:
        version = 'new_checkout'
        checkout_flow = new_checkout(user)
    else:
        version = 'old_checkout'
        checkout_flow = old_checkout(user)

    # Log which version, then track metrics
    metrics.record('checkout_version', version)
    metrics.record('checkout_success', checkout_flow.succeeded)
    metrics.record('checkout_latency', checkout_flow.duration)

    return checkout_flow

# After 1 week:
# Old: 85% success, 1500ms avg latency
# New: 92% success, 800ms avg latency (BETTER!)
# → Rollout new_checkout to 100%

Synthetic Monitoring (Test Production Regularly)

# Run automated test against production periodically
@schedule(every_5_minutes)
def synthetic_test_production():
    # Test critical user flows
    user = create_test_user()

    # Signup flow
    signup_response = requests.post(
        'https://prod.example.com/api/signup',
        json={'email': user.email, 'password': user.password}
    )
    assert signup_response.status_code == 200

    # Login flow
    login_response = requests.post(
        'https://prod.example.com/api/login',
        json={'email': user.email, 'password': user.password}
    )
    assert login_response.status_code == 200

    # Checkout flow
    checkout_response = requests.post(
        'https://prod.example.com/api/checkout',
        json={'user_id': user.id, 'items': [1, 2, 3]}
    )
    assert checkout_response.status_code == 200

    # If any fail, alert on-call

Test Data Strategies

Problem: Production Data in Tests

# [NO] BAD: Using real production data
def test_checkout():
    user = User.objects.get(id=12345)  # Real user
    checkout = checkout_flow(user)
    # Problem: If test changes data, affects real user

Solution: Test Data Builders

# [YES] GOOD: Build test data on demand
class UserBuilder:
    def __init__(self):
        self.email = f"test_{uuid4()}@example.com"
        self.age = 30
        self.balance = 100

    def with_age(self, age):
        self.age = age
        return self

    def build(self):
        return User.create(**self.__dict__)

def test_checkout():
    user = UserBuilder().with_age(25).build()  # Fresh test user
    checkout = checkout_flow(user)
    assert checkout.succeeded
    # Test data cleaned up after test

Factories for Complex Objects

from factory import Factory, SubFactory

class UserFactory(Factory):
    class Meta:
        model = User

    email = factory.Sequence(lambda n: f"user{n}@example.com")
    age = 30
    balance = Decimal('100.00')

class OrderFactory(Factory):
    class Meta:
        model = Order

    user = SubFactory(UserFactory)
    total = Decimal('50.00')
    status = 'pending'

# Usage:
user = UserFactory(age=25)  # Create user with custom age
order = OrderFactory(user=user)  # Create order linked to user
orders = OrderFactory.create_batch(10)  # Create 10 orders

Testing Pyramid & Strategy

         ┌─────────────────────────┐
         │    E2E Tests (10%)      │  Slow, brittle, but test real flows
         ├─────────────────────────┤
         │ Integration Tests (30%) │  Test component interaction
         ├─────────────────────────┤
         │  Unit Tests (60%)       │  Fast, isolated, unit level
         └─────────────────────────┘

Advanced testing adds:

         ┌──────────────────────────────┐
         │  Chaos/Chaos (5%)            │  Failure scenarios
         ├──────────────────────────────┤
         │  Contract Tests (10%)        │  Integration boundaries
         ├──────────────────────────────┤
         │  Mutation Tests (5%)         │  Test strength
         ├──────────────────────────────┤
         │  Property-Based (10%)        │  Edge cases
         ├──────────────────────────────┤
         │  Synthetic Monitoring (5%)   │  Production health
         ├──────────────────────────────┤
         │  Traditional (65%)           │  Unit/Integration/E2E
         └──────────────────────────────┘