Core Idea
- Build systems around data flow, not compute power: Design for reliability, scalability, and maintainability; the hard problems are usually data volume, data complexity, and rate of change, plus handling inevitable faults
- Avoid the temptation of "magic bullets": Architecture choices (replication strategy, isolation level, consensus mechanism) are application-specific trade-offs, not universal solutions
Reliability & Operations
- Tolerate faults, don't prevent them: Distinguish faults (a component deviating from its spec) from failures (the system as a whole stops providing service); design systems that keep running despite hardware crashes, software bugs, and human errors
- Measure performance with percentiles, not averages: p95/p99 response times matter more than mean; one slow backend call slows the entire user request (tail latency amplification)
- Minimize human error (the cause of roughly 75% of outages): Use good abstractions, sandboxed testing, gradual rollouts, monitoring, and fast, easy recovery mechanisms
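The percentile point above can be sketched with a small helper. This is an illustrative nearest-rank implementation, not a specific library's API; the sample data and the `percentile` function are assumptions for the example:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank index: ceil(p * n / 100), converted to 0-based and clamped.
    k = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
    return ordered[k]

# Ten requests: nine fast, one slow straggler.
response_times_ms = [12, 14, 11, 13, 15, 12, 14, 13, 12, 950]

mean = sum(response_times_ms) / len(response_times_ms)  # 106.6 ms
p95 = percentile(response_times_ms, 95)                 # 950 ms

# Tail latency amplification: if one user request fans out to 100 backend
# calls, the chance that at least one of them hits its p99 latency is
# 1 - 0.99**100, i.e. about 63% — the slowest call gates the whole request.
```

The mean smears the straggler across all requests; the p95 exposes it, which is why SLOs are usually stated in percentiles.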
Data Architecture Patterns
- Separate write-path from read-path: Pre-compute indexes/caches during writes; keep reads simple; shift the boundary based on access patterns
- Use event logs as the source-of-truth: Route all writes through a partitioned log to prevent race conditions between systems; enables both real-time and historical reprocessing with the same framework
- Compose systems via derived data: Designate one database as source-of-record; treat all others as derived views — avoid dual writes and distributed transactions
- Design for schema evolution without downtime: Maintain old + new schemas in parallel; gradually shift traffic; reprocess data instead of risky migrations
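The log-as-source-of-truth and derived-view bullets above can be sketched in a few lines. This is a deliberately minimal in-memory model (the class names, the `delta` event shape, and the partition count are all made up for illustration); a real system would use a durable partitioned log such as Kafka:

```python
from collections import defaultdict

class EventLog:
    """Append-only log, partitioned by key so each key's writes are totally ordered."""
    def __init__(self, partitions=4):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, event):
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, event))

    def replay(self):
        """Yield every event; consumers rebuild derived views from scratch."""
        for partition in self.partitions:
            yield from partition

class BalanceView:
    """Derived view: current balance per account, rebuildable from the log."""
    def __init__(self):
        self.balances = defaultdict(int)

    def apply(self, key, event):
        self.balances[key] += event["delta"]

log = EventLog()
log.append("alice", {"delta": +100})
log.append("bob",   {"delta": +50})
log.append("alice", {"delta": -30})

view = BalanceView()
for key, event in log.replay():
    view.apply(key, event)
# view.balances: alice=70, bob=50
```

Because the view is derived, a schema change or corrupted index is handled by replaying the same log into a fresh view, rather than by an in-place migration — which is the reprocessing strategy the schema-evolution bullet describes.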
Consistency & Correctness Trade-offs
- Choose isolation level by workload: Read Committed for transactional work; Snapshot Isolation for long queries/backups; Serializable Snapshot Isolation (SSI) for strong guarantees without blocking
- Prevent lost updates with atomic operations: Use INCREMENT/compare-and-set instead of read-modify-write cycles; implement explicit locking or lost-update detection for application-level conflicts
- Avoid Last-Write-Wins conflict resolution: LWW silently discards concurrent writes; use causal tracking or version vectors instead
- Use idempotent writes + deduplication over distributed 2PC: Scales better; pass end-to-end request IDs from client to app to database to suppress duplicates across retries
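The lost-update bullet above can be made concrete with a compare-and-set retry loop. This is an in-process sketch (the `Register` class is hypothetical; in a database the same idea appears as an atomic `UPDATE ... SET x = x + 1` or a CAS/conditional-write primitive):

```python
import threading

class Register:
    """Shared value with compare-and-set; the lock only makes CAS itself atomic."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def safe_increment(reg):
    """Retry until no concurrent writer slipped in between our read and write."""
    while True:
        current = reg.get()
        if reg.compare_and_set(current, current + 1):
            return

reg = Register()
threads = [threading.Thread(target=lambda: [safe_increment(reg) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# reg.get() == 4000: no increments lost, unlike a naive read-modify-write.
```

A naive `reg._value += 1` under the same four threads could lose updates whenever two readers see the same old value; the CAS loop detects the interleaving and retries instead.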
Distributed Systems Realities
- Assume partial failures will occur: Network delays, node crashes, clock skew, and process pauses (GC, VM suspend) WILL happen — design for them explicitly
- Use adaptive timeouts and quorum consensus: Fixed timeouts misfire under variable load; require a majority of nodes to be reachable; accept that network partitions force a choice between consistency and availability
- Implement membership services (ZooKeeper/etcd) for coordination: Outsource leader election, failure detection, and fencing tokens rather than building consensus from scratch
- Monitor clock offsets and remove drifting nodes: Clock synchronization matters for correctness; never rely on wall-clock time for causality
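The quorum idea above reduces to a simple counting rule: a write succeeds only if a strict majority of replicas acknowledge it, so any two quorums overlap in at least one node. A minimal sketch, with a simulated `send` function and node names that are pure invention:

```python
def quorum_write(nodes, key, value, send):
    """Attempt the write on every replica; succeed only if a majority acks."""
    acks = 0
    for node in nodes:
        try:
            if send(node, key, value):
                acks += 1
        except TimeoutError:
            pass  # an unreachable node is a fault, not yet a system failure
    majority = len(nodes) // 2 + 1
    return acks >= majority

# Simulated cluster of three: node "c" is partitioned away.
def send(node, key, value):
    if node == "c":
        raise TimeoutError
    return True

ok = quorum_write(["a", "b", "c"], "x", 1, send)  # 2 of 3 acks -> True
```

With two of three nodes reachable the write commits; lose a second node and the same call returns False — the system chooses consistency over availability during the partition, exactly the trade-off the bullet names.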
Data Integrity & Governance
- Audit and recompute continuously: Don't blindly trust systems; periodically regenerate derived data to detect corruption
- Log all reads and writes: Build deterministic audit trails for debugging, recovery, and time-travel queries ("what did the user see before they acted?")
- Accept eventual constraint enforcement: Allow temporary violations (e.g., duplicate usernames) with compensating transactions; simpler than synchronous coordination
- Minimize data collection: Data is a liability, not just an asset; respect user agency; plan for GDPR deletion from day one
- Audit ML for bias before deployment: Machine learning amplifies historical bias; explicitly check for discrimination in outputs
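The audit-and-recompute bullet above amounts to: periodically rebuild the derived data from the source of record and compare digests. A sketch using an order-independent checksum over canonical JSON (the record shapes are invented for the example):

```python
import hashlib
import json

def checksum(records):
    """Order-independent digest of a record set (sorted canonical JSON lines)."""
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()

source  = [{"user": "alice", "credits": 70}, {"user": "bob", "credits": 50}]
derived = [{"user": "bob", "credits": 50}, {"user": "alice", "credits": 70}]

# Periodic audit: same content in a different order still matches...
assert checksum(source) == checksum(derived)

# ...but silent corruption (bit rot, a missed update) is caught.
derived[0]["credits"] = 49
assert checksum(source) != checksum(derived)
```

Running this comparison on a schedule turns "trust the replication pipeline" into "verify the replication pipeline", so corruption is detected while the log still holds the data needed to repair it.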
Action Plan
- Audit your current architecture: Identify bottlenecks (data volume vs. compute), single points of failure, and manual coordination pain points
- Implement an event log strategy: Choose a partitioned log (Kafka, etc.) as single source-of-truth for writes; design derived views for reads
- Right-size your isolation level: Run workloads under SSI if your DB supports it; otherwise pick explicit trade-offs (Read Committed vs. Snapshot Isolation)
- Add end-to-end request tracking: Pass unique identifiers through your entire stack to enable deduplication and audit trails
- Establish continuous auditing: Schedule periodic recomputation of critical derived data and checksums to catch silent corruption early
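The end-to-end request-tracking step above can be sketched as an idempotent write keyed on a client-chosen ID. Everything here is hypothetical scaffolding for illustration (`PaymentService`, the in-memory `processed` map standing in for a database table):

```python
import uuid

class PaymentService:
    """Suppress duplicate effects by keying each write on a client-supplied request ID."""
    def __init__(self):
        self.processed = {}  # request_id -> recorded result (would be a DB table)
        self.balance = 0

    def charge(self, request_id, amount):
        if request_id in self.processed:       # retry of an earlier request
            return self.processed[request_id]  # replay the recorded result
        self.balance += amount
        self.processed[request_id] = {"ok": True, "balance": self.balance}
        return self.processed[request_id]

svc = PaymentService()
rid = str(uuid.uuid4())   # generated once, at the client, before the first attempt
svc.charge(rid, 25)
svc.charge(rid, 25)       # network retry with the same ID
# svc.balance == 25: the duplicate was absorbed without distributed 2PC.
```

Because the ID travels end-to-end (client, app, database), a retry at any layer hits the dedup check instead of double-charging, and the `processed` table doubles as an audit trail of what was actually applied.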