Infrastructure Review: Resilience & Security Focus

Multi-perspective infrastructure code review combining Alex Chen (Infrastructure & Resilience) and Linus Torvalds (Security & Pragmatism) expertise.

When to use: Infrastructure changes, Terraform/Kubernetes configs, deployment pipelines, security configurations, system architecture changes.

Resource Hint: opus - Systems thinking + security hardening. Parallel execution of both agents recommended.

How This Works

Two expert perspectives review in parallel, then synthesize:

Alex’s Review - Resilience lens
- What can fail? How do we recover?
- Is the system designed for failure?
- Can we deploy safely? Monitor effectively?
- Is capacity understood and modeled?
Linus’s Review - Security lens
- What are the threat vectors?
- Are implicit security assumptions correct?
- Is there data exposure risk?
- Are we making assumptions we’ll regret?
Synthesize - Combined perspective
- Identify security-resilience trade-offs
- Surface hidden assumptions
- Ensure robustness without over-engineering
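
The parallel-then-synthesize flow above can be sketched in a few lines. This is a minimal illustration, not the playbook's actual implementation: the `Review` type, the two review functions, and the keys of the `change` dict are all hypothetical stand-ins for the real agent reviews.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str
    lens: str
    findings: list = field(default_factory=list)


def resilience_review(change: dict) -> Review:
    # Hypothetical stand-in for Alex's review: what can fail, how do we recover?
    findings = []
    if not change.get("health_checks"):
        findings.append("No health checks on critical paths")
    if not change.get("rollback_plan"):
        findings.append("No documented rollback plan")
    return Review("Alex", "resilience", findings)


def security_review(change: dict) -> Review:
    # Hypothetical stand-in for Linus's review: threat vectors and assumptions.
    findings = []
    if change.get("hardcoded_secrets"):
        findings.append("Secrets stored in code or config")
    if not change.get("tls"):
        findings.append("No TLS on sensitive connections")
    return Review("Linus", "security", findings)


def synthesize(reviews: list) -> dict:
    # Combine both lenses; approve only when neither reviewer has findings.
    return {
        "approved": all(not r.findings for r in reviews),
        "findings": {r.reviewer: r.findings for r in reviews},
    }


def run_review(change: dict) -> dict:
    # Run both reviews concurrently, then merge the results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(resilience_review, change),
            pool.submit(security_review, change),
        ]
        reviews = [f.result() for f in futures]
    return synthesize(reviews)
```

Calling `run_review({"health_checks": True, "rollback_plan": True, "tls": False})` would return an unapproved result with a single finding under Linus and none under Alex.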

Alex’s Resilience Review

See /pb-alex-infra for the comprehensive infrastructure review framework and checklist.

For infrastructure-specific review, focus on:

Failure Detection: Can we detect component failures before users notice? Are health checks in place?
Graceful Degradation: If one service fails, does the system degrade or cascade?
Deployment Safety: Are rollouts gradual? Can we roll back in < 5 minutes?
Observability: Do dashboards and alerts give actionable insights?
Capacity Planning: Are resource limits set? Load-tested to 10x peak?

Alex’s Red Flags for Infrastructure:

No health checks or monitoring of critical paths
Single point of failure (all-in-one deployment)
Recovery or rollback that depends on manual steps
No resource limits (services can starve each other)
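
Some of these red flags can be caught mechanically before a human review. The sketch below assumes the manifest has already been parsed into a Python dict shaped like a Kubernetes Deployment (`spec.replicas`, `spec.template.spec.containers`); the function name and the message wording are hypothetical, and it covers only three of the flags above.

```python
def find_resilience_red_flags(deployment: dict) -> list:
    """Return a list of resilience red flags found in a Deployment-shaped dict."""
    flags = []
    spec = deployment.get("spec", {})

    # Kubernetes defaults to one replica when the field is omitted.
    if spec.get("replicas", 1) < 2:
        flags.append("single replica: single point of failure")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        # Treat a container with neither probe as having no health checks.
        if "livenessProbe" not in container and "readinessProbe" not in container:
            flags.append(f"{name}: no health checks")
        # Without limits, one container can starve its neighbours.
        if not container.get("resources", {}).get("limits"):
            flags.append(f"{name}: no resource limits")
    return flags
```

A check like this is a pre-filter, not a replacement for the review: it cannot tell whether a probe actually exercises the critical path or whether the limits are sized sensibly.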

Linus’s Security Review

See /pb-linus-agent for the comprehensive security review framework and checklist.

For infrastructure-specific review, focus on:

Attack Surface: What threat vectors exist? Are data in transit and at rest encrypted?
Access Control: Is least privilege enforced? Can we audit who accessed what?
Assumptions: Are we trusting the internal network? Components? User input? Could assumptions be violated?
Secrets Management: Are secrets in a vault (not code)? Rotated? Access logged?
Compliance: Are the applicable GDPR/HIPAA/PCI-DSS requirements met? Retention policies enforced?

Linus’s Red Flags for Infrastructure:

Hardcoded secrets or credentials in code/config
No TLS for sensitive connections or internal services
Over-broad access permissions (all developers as admin)
No audit logging for administrative actions
Sensitive data in logs (credit cards, tokens, PII)
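
The first red flag lends itself to a quick automated scan. The following is a rough heuristic sketch, assuming config is available as plain text; the patterns and the `scan_for_secrets` name are illustrative assumptions, and a dedicated secret scanner will catch far more than this.

```python
import re

# Heuristic patterns only; values that look like ${VAR} references are skipped.
SECRET_PATTERNS = {
    "credential assignment": re.compile(
        r"(?i)(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*['\"]?[^\s'\"$]{4,}"
    ),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_for_secrets(text: str) -> list:
    """Return (line_number, label) pairs for lines that look like hardcoded secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

For example, scanning a file whose second line is `password: hunter2secret` reports `(2, "credential assignment")`, while `api_key: ${API_KEY}` is left alone because the value is an environment reference rather than a literal.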

Combined Perspective: Infrastructure Review Synthesis

When Alex & Linus Agree:

✅ Infrastructure is resilient AND secure
✅ Approve for merging

When They Disagree:

Common disagreement: “Should we add encryption everywhere?”

Linus says: “Encrypt all data at rest and in transit”
Alex says: “Encryption adds latency. Measure first.”
Resolution: Default to secure. Profile to find real bottlenecks. Encrypt what matters.

Trade-offs to Surface:

Security vs Performance
- Encryption adds CPU load
- But data breaches cost more
- Encrypt by default. Measure latency, and make exceptions only where the cost is proven.
Simplicity vs Defense in Depth
- One firewall is simple
- Multiple layers are complex but safer
- Layer defenses where the risk justifies the complexity. Keep each layer simple.
Scalability vs Security
- Autoscaling simplifies operations
- But each new instance is a potential attack surface
- Automate security hardening too.

Review Checklist

Before Review Starts

Infrastructure code change is documented
Threat model (if new infrastructure) documented
Change tested in staging environment
Rollback plan documented

During Alex’s Review

Failure modes identified
Observability sufficient
Deployment plan is safe
Capacity is modeled

During Linus’s Review

Threat vectors identified
Access control follows principle of least privilege
Secrets properly managed
Compliance met

After Both Reviews

Feedback synthesized
Security-resilience trade-offs understood
Assumptions surfaced and challenged
Approval given (or revisions requested)

Review Decision Tree

1. Is infrastructure resilient (Alex)?
   NO → Ask for resilience improvements
   YES → Continue

2. Is infrastructure secure (Linus)?
   NO → Ask for security hardening
   YES → Continue

3. Are there trade-off disagreements?
   YES → Discuss (often about latency vs security)
   NO → Continue

4. Do unexamined implicit assumptions remain?
   YES → Re-examine whether assumptions are safe
   NO → Continue

5. Is infrastructure ready to deploy?
   YES → Approve
   NO → Request specific revisions
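
The decision tree can be written as a short function that returns the next action. This is a sketch under stated assumptions: `ReviewState` and its field names are hypothetical, and step 4 is read here as "do unexamined assumptions remain?".

```python
from dataclasses import dataclass


@dataclass
class ReviewState:
    resilient: bool                # Step 1: Alex's verdict
    secure: bool                   # Step 2: Linus's verdict
    open_tradeoffs: bool           # Step 3: unresolved trade-off disagreements
    unexamined_assumptions: bool   # Step 4: implicit assumptions not yet checked
    ready_to_deploy: bool          # Step 5: final readiness


def next_action(state: ReviewState) -> str:
    """Walk the tree top to bottom and return the first action that applies."""
    if not state.resilient:
        return "Ask for resilience improvements"
    if not state.secure:
        return "Ask for security hardening"
    if state.open_tradeoffs:
        return "Discuss trade-offs"
    if state.unexamined_assumptions:
        return "Re-examine whether assumptions are safe"
    if state.ready_to_deploy:
        return "Approve"
    return "Request specific revisions"
```

The ordering matters: a change that fails the resilience check never reaches the security or trade-off questions, which mirrors the tree's top-down flow.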

Example: Database Cluster Review

Code Being Reviewed: PostgreSQL cluster in Kubernetes

Alex’s Review:

Resilience Check:

✅ Primary + 2 replicas (redundancy)
✅ Health checks configured
❌ Issue: No backup strategy documented
✅ Good: Automatic failover configured
❌ Issue: No capacity planning for disk growth

Alex’s Recommendation:

Document backup strategy (daily + weekly + monthly)
Model disk usage growth
Test failover under load
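
"Model disk usage growth" can start as simple arithmetic. The sketch below assumes linear growth and an alert threshold of 80% of capacity; both are illustrative assumptions, as are the function name and the example figures, and real growth should be fitted from observed metrics.

```python
import math


def days_until_threshold(capacity_gb, used_gb, daily_growth_gb, alert_threshold_pct=80):
    """Days until usage crosses the alert threshold, assuming linear growth.

    Returns None when the volume is not growing, and 0 when usage is
    already at or past the threshold.
    """
    if daily_growth_gb <= 0:
        return None
    headroom_gb = capacity_gb * alert_threshold_pct / 100 - used_gb
    if headroom_gb <= 0:
        return 0
    return math.floor(headroom_gb / daily_growth_gb)


# 500 GB volume, 200 GB used, growing 2.5 GB/day:
# threshold is 400 GB, headroom is 200 GB, so 200 / 2.5 = 80 days.
print(days_until_threshold(500, 200, 2.5))  # 80
```

A projection like this gives the review a concrete number to challenge, such as whether 80 days of headroom is enough lead time to resize the volume.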

Linus’s Review:

Security Check:

❌ Problem: Database password in config
❌ Problem: No encryption in transit (replication between pods)
✅ Good: Access controlled to pod network
❌ Problem: No audit logging of queries
✅ Good: Backups encrypted

Linus’s Recommendation:

Move password to secrets vault
Enable TLS for replication
Enable query audit logging
Define retention policy

Synthesis:

Trade-off Identified:

Alex: “Audit logging might slow queries”
Linus: “But accountability and incident forensics require it”
Resolution: Enable audit logging. Profile to measure impact. Add to monitoring.
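
"Profile to measure impact" can be made concrete by comparing tail latency with and without audit logging. This is a minimal sketch: the sample lists are assumed to come from a load test, and the 10% budget and function names are hypothetical defaults, not values from the playbook.

```python
import statistics


def p95(samples_ms):
    # With n=20, quantiles() returns 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]


def overhead_within_budget(baseline_ms, with_audit_ms, budget_pct=10.0):
    """Compare p95 latency and report whether the overhead fits the budget.

    Returns (within_budget, overhead_pct).
    """
    base = p95(baseline_ms)
    audited = p95(with_audit_ms)
    overhead_pct = (audited - base) / base * 100
    return overhead_pct <= budget_pct, overhead_pct
```

If the overhead exceeds the budget, the resolution above still applies: keep audit logging on, then tune what is logged or where it is written rather than switching it off.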

Approval: Conditional on both Alex’s and Linus’s changes.

Comment Register

Findings posted as PR/issue comments follow ~/.claude/CLAUDE.md § GitHub Artifact Register: one load-bearing observation per comment, one sentence per finding, no narration or severity adjectives.

Related Commands

/pb-review-code – General code review framework both agents apply
/pb-review-backend – Backend service review for infrastructure dependencies
/pb-alex-infra – Alex’s deep dive: systems thinking, failure modes, resilience design
/pb-security – Security review checklist for infrastructure and configuration

When to Escalate

Escalate to Maya (Product) if:

Infrastructure changes impact user experience
Capacity planning affects feature roadmap
Cost/benefit trade-offs matter

Escalate to Jordan (Testing) if:

Failover scenarios need testing
Load testing needed to validate capacity
Chaos engineering needed to verify resilience

Escalate to Sam (Documentation) if:

Runbooks need documentation
Complex infrastructure needs explanation
Team onboarding needs guides

Infrastructure review: Systems that don’t fail + remain secure when attacked
