Summary of "Designing Data-Intensive Applications"

Core Idea

  • Build systems around data flow, not compute power: Design for reliability, scalability, and maintainability by architecting around the volume, complexity, and rate of change of your data, while treating faults as inevitable
  • Avoid the temptation of "magic bullets": Architecture choices (replication strategy, isolation level, consensus mechanism) are application-specific trade-offs, not universal solutions

Reliability & Operations

  • Tolerate faults, don't prevent them: Distinguish faults (component failures) from failures (system unavailability); design systems that keep running despite hardware crashes, software bugs, and human errors
  • Measure performance with percentiles, not averages: p95/p99 response times matter more than mean; one slow backend call slows the entire user request (tail latency amplification)
  • Minimize human error (a leading cause of outages): Use good abstractions, sandboxed testing, gradual rollouts, monitoring, and easy recovery mechanisms
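The percentile point can be illustrated in a few lines of Python. This is a sketch, not anything from the book: the latency samples are invented, and the nearest-rank percentile formula is one common convention among several.

```python
# Sketch: why p95/p99 beat the mean for response times.
# Hypothetical latencies in ms, with one slow outlier.
latencies = [12, 11, 13, 12, 14, 11, 12, 13, 900, 12]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

mean = sum(latencies) / len(latencies)   # 101.0 ms — dominated by the outlier
p50 = percentile(latencies, 50)          # 12 ms — what a typical user sees
p99 = percentile(latencies, 99)          # 900 ms — the tail the mean hides
```

The mean suggests every request takes ~100 ms; the median shows typical requests are fast, and p99 exposes the slow tail that, via tail latency amplification, ends up gating any user request that fans out to many backends.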

Data Architecture Patterns

  • Separate write-path from read-path: Pre-compute indexes/caches during writes; keep reads simple; shift the boundary based on access patterns
  • Use event logs as the source-of-truth: Route all writes through a partitioned log to prevent race conditions between systems; enables both real-time and historical reprocessing with the same framework
  • Compose systems via derived data: Designate one database as source-of-record; treat all others as derived views — avoid dual writes and distributed transactions
  • Design for schema evolution without downtime: Maintain old + new schemas in parallel; gradually shift traffic; reprocess data instead of risky migrations
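The log-as-source-of-truth pattern above can be sketched with an in-memory list standing in for a partitioned log such as Kafka (all names and event shapes here are illustrative assumptions):

```python
# Sketch: an append-only event log as source-of-truth, with derived
# views rebuilt by replaying the log.
log = []  # in a real system: a durable, partitioned log (e.g. Kafka)

def append(event):
    """All writes flow through the log, giving a single total order
    (per partition) that downstream consumers agree on."""
    log.append(event)

def build_user_view(events):
    """A derived view (index/cache): latest profile per user.
    Because it is pure and deterministic, it can be recomputed at
    any time from any prefix of the log."""
    view = {}
    for e in events:
        if e["type"] == "user_updated":
            view[e["user_id"]] = e["data"]
    return view

append({"type": "user_updated", "user_id": 1, "data": {"name": "Ada"}})
append({"type": "user_updated", "user_id": 1, "data": {"name": "Ada L."}})

current = build_user_view(log)        # real-time view over the full log
historical = build_user_view(log[:1]) # reprocessing an earlier prefix
```

The same replay function serves both live consumption and historical reprocessing, which is exactly why the log avoids the dual-write race conditions of updating each downstream system independently.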

Consistency & Correctness Trade-offs

  • Choose isolation level by workload: Read Committed for transactional work; Snapshot Isolation for long queries/backups; Serializable Snapshot Isolation (SSI) for strong guarantees without blocking
  • Prevent lost updates with atomic operations: Use INCREMENT/compare-and-set instead of read-modify-write cycles; implement explicit locking or lost-update detection for application-level conflicts
  • Avoid Last-Write-Wins conflict resolution: Silently loses data; implement causal tracking or version vectors instead
  • Use idempotent writes + deduplication over distributed 2PC: Scales better; pass end-to-end request IDs from client to app to database to suppress duplicates across retries
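The end-to-end request-ID idea can be sketched as follows (the function names and in-memory set are assumptions; in practice the deduplication set would be a unique constraint in the database, enforced in the same transaction as the write):

```python
# Sketch: idempotent writes via end-to-end request IDs.
# A client retry after a timeout reuses the same ID, so the
# operation is applied at most once.
processed = set()            # in practice: a unique index on request_id
balance = {"alice": 100}

def transfer(request_id, account, amount):
    if request_id in processed:
        return "duplicate-suppressed"   # retry of an already-applied write
    processed.add(request_id)           # must commit atomically with the write
    balance[account] += amount
    return "applied"

transfer("req-42", "alice", -30)   # first attempt: balance becomes 70
transfer("req-42", "alice", -30)   # timeout retry: suppressed, no double debit
```

The key point is that the ID is generated once at the outermost layer (the client) and passed through every hop, so retries anywhere in the stack are safe.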

Distributed Systems Realities

  • Assume partial failures will occur: Network delays, node crashes, clock skew, and process pauses (GC, VM suspend) WILL happen — design for them explicitly
  • Use adaptive timeouts and quorum consensus: Fixed timeouts fail under load; require majority of nodes reachable; accept that partitions force a choice between consistency and availability
  • Implement membership services (ZooKeeper/etcd) for coordination: Outsource leader election, failure detection, and fencing tokens rather than building consensus from scratch
  • Monitor clock offsets and remove drifting nodes: Clock synchronization matters for correctness; never rely on wall-clock time for causality
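One coordination primitive worth outsourcing is fencing: the lock service hands out monotonically increasing tokens (e.g. ZooKeeper's zxid), and storage rejects writes bearing a stale token. A minimal sketch, with invented names and a global standing in for the storage node's state:

```python
# Sketch: fencing tokens protect against a paused, stale leader.
# A node that held the lock, paused (GC, VM suspend), and resumed
# cannot corrupt state, because its token is no longer the newest.
highest_token_seen = 0

def storage_write(token, payload):
    """Accept a write only if its fencing token is at least as new
    as any token this storage node has already seen."""
    global highest_token_seen
    if token < highest_token_seen:
        return "rejected-stale"
    highest_token_seen = token
    return "accepted"

storage_write(33, "write from leader A")            # accepted
storage_write(34, "write from new leader B")        # accepted, token advances
storage_write(33, "A resumes after a long GC pause")  # rejected: stale token
```

This works only if the storage system itself checks the token; a lock alone cannot help, because the paused process still believes it holds the lease.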

Data Integrity & Governance

  • Audit and recompute continuously: Don't blindly trust systems; periodically regenerate derived data to detect corruption
  • Log all reads and writes: Build deterministic audit trails for recovery, diagnostics, and time-travel questions ("what did the user see before they acted?")
  • Accept eventual constraint enforcement: Allow temporary violations (e.g., duplicate usernames) with compensating transactions; simpler than synchronous coordination
  • Minimize data collection: Data is a liability, not just an asset; respect user agency; plan for GDPR deletion from day one
  • Audit ML for bias before deployment: Machine learning amplifies historical bias; explicitly check for discrimination in outputs
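The audit-and-recompute bullet can be sketched as a checksum comparison: rederive the view from the source-of-record and compare digests. All names here are assumptions; a real audit job would read from the production store rather than recompute both sides in one process.

```python
import hashlib
import json

# Sketch: detect silent corruption in derived data by recomputing
# it from the source-of-record and comparing checksums.
source_events = [{"user": 1, "delta": 5}, {"user": 1, "delta": -2}]

def derive_balances(events):
    """Deterministic derivation: same events in, same view out."""
    balances = {}
    for e in events:
        balances[e["user"]] = balances.get(e["user"], 0) + e["delta"]
    return balances

def checksum(view):
    """Stable digest of a view: canonical JSON, then SHA-256."""
    blob = json.dumps(view, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

serving_view = derive_balances(source_events)  # what production would serve
audit_view = derive_balances(source_events)    # periodic from-scratch recompute

assert checksum(serving_view) == checksum(audit_view), "derived data corrupted"
```

Scheduled regularly, this turns "trust the pipeline" into a verifiable property, catching bit rot, missed events, and buggy incremental updates.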

Action Plan

  1. Audit your current architecture: Identify bottlenecks (data volume vs. compute), single points of failure, and manual coordination pain points
  2. Implement an event log strategy: Choose a partitioned log (Kafka, etc.) as single source-of-truth for writes; design derived views for reads
  3. Right-size your isolation level: Run workloads under SSI if your DB supports it; otherwise pick explicit trade-offs (Read Committed vs. Snapshot Isolation)
  4. Add end-to-end request tracking: Pass unique identifiers through your entire stack to enable deduplication and audit trails
  5. Establish continuous auditing: Schedule periodic recomputation of critical derived data and checksums to catch silent corruption early
Copyright 2025, Ran Ding