{
    "version": "https://jsonfeed.org/version/1",
    "title": "Gaurav Sarma's Blog",
    "home_page_url": "https://gauravsarma.com",
    "feed_url": "https://gauravsarma.com/rss/feed.json",
    "description": "Thoughts on software engineering, distributed systems, and career growth",
    "icon": "https://gauravsarma.com/images/profile.jpg",
    "author": {
        "name": "Gaurav Sarma",
        "url": "https://gauravsarma.com"
    },
    "items": [
        {
            "id": "https://gauravsarma.com/posts/2026-06-22_optimising-ssd-writes-for-dbms",
            "content_html": "\n![How to Write to SSDs](optimising-ssd-writes-for-dbms-cover.png)\n\nYour database issues a 4 KiB page write. The SSD physically writes 18.85 KiB. That is not a rounding error: on a Samsung PM9A3 filled to 90% capacity, running YCSB-A with a zipfian skew of 0.8, a single logical B-tree page update in LeanStore triggers nearly five times the data you intended to persist. MySQL and PostgreSQL behave similarly.\n\nThe culprit is not one layer. Writes get amplified inside the DBMS, then amplified again inside the SSD. Most systems optimize each layer independently, or not at all. A recent paper from TUM and TigerBeetle argues that the DBMS, not the filesystem or the SSD firmware, is best positioned to coordinate both. Their answer starts with a deceptively simple shift: stop doing in-place updates.\n\n---\n\n## The Problem\n\nModern databases have spent years tuning for SSD reads: exploiting internal parallelism, hiding read latency, prefetching. Writes are a different story.\n\nSSDs wear out. Unlike spinning disks, flash cells tolerate only a finite number of program/erase cycles. Enterprise drives are rated in DWPD (Drive Writes Per Day). A Samsung PM9A3 rated for five years at 1 DWPD on an 894 GB device averages about 11 MB/s of sustainable writes. LeanStore in the in-place configuration from the paper writes around 400 MB/s under YCSB-A. At that rate the drive exhausts its rated endurance in roughly six weeks.\n\nWrite amplification is multiplicative. When the authors benchmarked LeanStore, MySQL, and PostgreSQL on a 90%-full enterprise SSD until cumulative DB writes exceeded four times device capacity, they decomposed every flash write into three categories (excluding WAL, which is comparatively easy to reason about):\n\n1. **User writes**: evictions and checkpoint flushes, the bytes you actually intend to persist.\n2. **Other DBMS writes**: overhead the engine adds on top, most notably doublewrite buffering.\n3. **Internal flash writes**: amplification inside the SSD itself, measured via OCP telemetry when available.\n\nFor in-place LeanStore, a 4 KiB page write becomes 18.85 KiB at the flash layer. DB WAF is about 2.0 (doublewrite). SSD WAF is about 2.36. Total WAF is the product: roughly 4.7×.\n\nThe paper's central claim: **Total WAF = DB WAF × SSD WAF**, and optimizing one layer while ignoring the other can make things worse. An LSM-tree that reduces DB amplification via aggressive size-tiering may consume more SSD space, raise SSD WAF, and end up writing more flash overall.\n\n---\n\n## Prerequisites\n\n- Familiarity with B-tree storage engines, buffer pools, and checkpoint/eviction\n- Basic SSD concepts: pages, erase blocks, overprovisioning, garbage collection\n- Optional but helpful: awareness of ZNS (Zoned Namespace) and FDP (Flexible Data Placement) NVMe extensions\n\n---\n\n## Technical Decisions\n\nThe authors implement their ideas in **ZLeanStore**, a fork of LeanStore (a high-performance B-tree engine for NVMe) extended with vmcache for 1:N page-id-to-offset mappings. The design choices below are the paper's, not generic best practices.\n\n### Why out-of-place instead of in-place?\n\nIn-place B-trees (InnoDB, PostgreSQL heap pages, LeanStore's original design) fix each page at a stable offset. Every update overwrites the same location. That creates two problems on flash:\n\n1. **Doublewrite buffering**: Before overwriting a page, the DBMS copies it to a safe area so a torn write cannot corrupt the database. MySQL's doublewrite buffer, PostgreSQL's full-page writes, and similar mechanisms roughly double DB-issued bytes.\n\n2. **No control over SSD placement**: The DBMS cannot choose where on the device its writes land. SSD WAF becomes whatever the device's internal GC produces for your access pattern, which can be surprisingly bad under skewed workloads.\n\nOut-of-place writes append new versions elsewhere and update a mapping table. The old page stays valid until the new one is durable, so doublewrite goes away. The DBMS also gains freedom to group, compress, and place writes deliberately.\n\nThe trade-off: you now need database-level garbage collection to reclaim space from stale page versions. That GC has its own write amplification. The rest of the paper is about making that GC cheap and shaping the resulting write stream so the SSD stops amplifying it further.\n\n### Why the DBMS, not the filesystem or SSD?\n\nSSDs and filesystems see a stream of writes. They do not know which pages are hot, which will be overwritten in milliseconds, or which belong to the same B-tree index. The DBMS does.\n\nPrior work on SSD WAF mitigation lives mostly at lower layers: F2FS hot/cold separation, SSD firmware lifetime prediction, multi-stream SSDs. These approaches infer workload properties. The DBMS already knows them from transaction semantics, index structure, and page access history.\n\nThe authors' position: treat total WAF as the optimization target and let the DBMS reshape its write pattern with both DB-level and SSD-level amplification in mind.\n\n### Why optimize DB WAF and SSD WAF together?\n\nThis is the counterintuitive part. Naïve out-of-place LeanStore *increases* total WAF before the other optimizations kick in. Removing doublewrite helps, but DB GC copyback dominates, pushing total WAF to roughly 1.66× the in-place baseline on an 800 GB dataset.\n\nCompression alone does not fix SSD WAF. NoWA alone slightly *increases* DB WAF via compensation writes. The full stack is what drives total WAF from 4.72 down to 0.60 on the same benchmark.\n\n---\n\n## Implementation\n\nThe paper's write path has four cooperating components: buffer manager, I/O interface, space manager, and garbage collector. Here is how the optimizations fit together.\n\n### Phase 1: Page-wise compression and page packing\n\n**Compression** reduces the bytes the DBMS actually writes. The authors compress each 4 KiB page independently with LZ4 or ZSTD before persistence. On TPC-C, YCSB, and several real-world datasets, compression ratios range from 14% to 49% of original size depending on algorithm and workload.\n\nIn-place engines struggle here. A compressed page is often smaller than 4 KiB, but SSDs write in 4 KiB units. Overwrite the same offset in place and you still write a full block. Variable-length compressed pages also drift as data changes, making in-place layouts expensive to maintain. PostgreSQL largely skips page-level compression for this reason; MySQL pushes it to the filesystem layer, which does not solve the 4 KiB granularity problem.\n\nOut-of-place writes sidestep this: compress a batch, append sequentially, record new offsets in a PID→offset table.\n\n**Page packing** addresses read amplification. Enterprise SSDs read most efficiently at 4 KiB aligned boundaries. A 3,000-byte compressed page that straddles a 4 KiB boundary costs two physical reads (~2.73× amplification). Page packing uses best-fit bin packing to place compressed pages into 4 KiB slots so each page is fetched with exactly one read. You pay a small amount of internal slack per slot, but reads stay predictable and fast.\n\n### Phase 2: Grouping by deathtime (GDT)\n\nDatabase GC reclaims zones containing a mix of valid and stale pages. If a victim zone is 75% valid, reclaiming 25% of its space requires rewriting the other 75%: WAF of 4× for that cycle.\n\nGreedy victim selection (pick the zone with the fewest valid pages) works under uniform random access. Real OLTP workloads are skewed. Hot and cold pages end up in the same zone, victim zones stay mostly valid, and GC becomes expensive.\n\n**Grouping by Deathtime (GDT)** uses DB semantics to colocate pages that will become invalid around the same time:\n\n- Each page header stores write timestamps from recent persist operations.\n- Expected Deathtime (EDT) extrapolates when the page will next be overwritten: `current_lsn + (WH₀ − WH₁)⁻¹` from the last few write intervals.\n- On flush, pages with similar EDT are packed into the same zone.\n- GC selects victims whose pages have mostly died, minimizing copyback.\n\nInitial writes group by B-tree index ID as a cold-start heuristic. During GC, valid pages are sorted by descending EDT and rewritten into zones with matching temperature. Read-only pages get maximum EDT and are treated as coldest data.\n\nGDT only works paired with GDT-aware GC. If GC interleaves its rewrites with normal writes without respecting deathtime grouping, the placement benefit erodes.\n\n### Phase 3: Aligning DB and SSD garbage collection units\n\nEven with low DB WAF, SSD internal GC can undo your gains. The key insight: **writes from the same DB zone share similar deathtimes**, so if you can make a zone's worth of data land in a single SSD superblock, invalidating that zone can invalidate an entire superblock. SSD GC then reclaims without copying surviving pages.\n\nSSDs append at superblock granularity (grouping erase blocks across planes/dies) but may GC at a coarser or finer unit depending on vendor firmware. Misalignment interleaves pages from different zones inside the same superblock. When one zone is garbage-collected, only half a superblock dies, and the SSD must copy the rest.\n\nHow to pick zone size:\n\n- **FDP-enabled SSDs**: query the Reclaim Unit (RU) size via NVMe and set DB zone size to match.\n- **Standard SSDs**: use a ZNS-like single-active-zone write pattern and increase zone size until measured SSD WAF drops to 1.0. On six enterprise drives tested, the inferred GC unit upper bound typically falls between 4 GB and 8 GB. Without telemetry, 32 GB is a conservative default.\n\n### Phase 4: NoWA on commodity SSDs\n\nWith multiple concurrent zones, write streams **multiplex** across superblocks. Zone A and zone C pages interleave in the same physical blocks. When DB GC reclaims zones at different rates, the SSD inherits partially valid superblocks and must amplify internally.\n\n**NoWA (No Write Amplification)** enforces two rules:\n\n1. Do not open a new active group of zones until all currently open zones are completely full.\n2. Detect frequency imbalance among concurrently written zones and issue **compensation writes** to recompact underrepresented zones before the SSD hits its minimum free-space threshold.\n\nCompensation writes shift a small amount of amplification from the SSD back to the DB layer. The DB GC can avoid counterproductive compensation by checking whether a write would raise valid ratios in future rounds. The net effect on the Samsung PM9A3: SSD WAF drops from 2.36 (in-place) to 1.0 with the full stack, while DB WAF stays near 0.60.\n\nNoWA is not perfect. Wear leveling, open superblock limits, and internal scheduling can still cause minor reordering. But the authors report SSD WAF = 1.0 on six enterprise SSDs from five vendors under YCSB-A.\n\n### Phase 5: ZNS and FDP as first-class backends\n\n**ZNS SSDs** push GC to the host and enforce sequential writes within zones, guaranteeing SSD WAF = 1 by design. ZLeanStore maps DB zones directly to ZNS zones, uses zone-append commands, and resets zones after GC copyback. ZNS also returns overprovisioning space to the host (commodity SSDs hide 7–28% for internal GC), which gives DB GC more headroom and further lowers DB WAF.\n\n**FDP SSDs** expose Reclaim Unit Handles (RUHs). Assign each DB zone a placement ID modulo the RUH count, and writes to different zones land in independent reclaim streams with no multiplexing. When FDP is available, NoWA becomes unnecessary: placement hints achieve SSD WAF = 1 with slightly lower DB WAF (0.54 vs 0.57 on FDP SSD A) because compensation writes are avoided.\n\nThe I/O layer detects device type at startup and selects among standard io_uring, ZNS zone-append, or FDP placement backends.\n\n### Logging and recovery\n\nOut-of-place writes require logging PID→offset mapping changes, not just page contents. ZLeanStore uses per-thread WAL with continuous checkpointing. Page data is persisted before its mapping update commits. Checkpoints snapshot the mapping table and active-group history. Recovery reloads the checkpoint and replays WAL to reconstruct storage layout.\n\nGC shares the buffer pool with worker threads. A reserve of clean frames (integrated with fuzzy checkpointing) prevents eviction from fighting GC reads for buffer space.\n\n---\n\n## How It All Fits Together\n\nThe full write path for a batch of dirty pages flushed on eviction:\n\n```\nBuffer pool                Space manager              SSD\n───────────                ─────────────              ───\nCompute EDT ──► Compress ──► Pick zone by EDT ──► Append packed pages\nfrom WH        + pack         (or trigger GC)        (io_uring / ZNS / FDP)\n     │              │                │\n     └──────────────┴── Update PID→offset map ──► WAL\n```\n\nEnd-to-end amplification breaks down like this on an 800 GB YCSB-A workload (Samsung PM9A3, zipf = 0.8):\n\n| Configuration | OPS (K) | DB WAF | SSD WAF | Total WAF | Flash bytes/op |\n|---|---:|---:|---:|---:|---:|\n| In-place | 229 | 2.00 | 2.36 | 4.72 | 4,378 |\n| Out-of-place (naïve) | 230 | 4.06 | 1.94 | 7.89 | 7,274 |\n| + compression + packing | 380 | 0.62 | 1.95 | 1.21 | 566 |\n| + GDT | 458 | 0.59 | 1.96 | 1.16 | 1,110 |\n| + NoWA + aligned GC unit | 535 | 0.60 | 1.00 | 0.60 | 567 |\n\nThe naïve out-of-place row is the cautionary tale: removing doublewrite without controlling DB GC and SSD placement makes things worse. The final row writes fewer physical bytes per operation than the in-place engine issues logically.\n\nOn TPC-C with 15,000 warehouses on an FDP SSD, the optimized out-of-place configuration completes 2.45× more new-order transactions in the same runtime, while writing 7.2× fewer flash bytes per transaction.\n\nAcross five additional enterprise SSD models (894 GB to 7.2 TB), total WAF improvements range from 6.2× to 9.8× over in-place writes.\n\n---\n\n## Lessons Learned\n\n**In-place is the wrong default for flash-heavy OLTP.** Doublewrite buffering alone costs ~2× DB WAF. That made sense when disks were cheap to rewrite and had no endurance limit. On modern NVMe it is an expensive legacy tax.\n\n**Out-of-place without a plan is worse than in-place.** DB GC copyback can dominate. The paper's incremental evaluation (Figure 13 in the original) shows total WAF rising before falling as optimizations stack. Do not assume append-only writes are automatically SSD-friendly.\n\n**Compression and GC interact.** Compression shrinks the dataset, which increases effective overprovisioning, which lowers GC valid ratios, which reduces DB WAF further. It also widens the timestamp window for deathtime estimation. On compressible OLTP data, page-wise LZ4/ZSTD is not optional in this design.\n\n**SSD WAF = 1 on commodity drives is achievable.** That was surprising to me. Prior work (including the authors' own SSD-iq paper) showed enterprise SSDs with WAF of 4 under simple hot/cold patterns. NoWA plus aligned GC units gets to 1.0 without ZNS hardware, though FDP makes it cleaner.\n\n**The DBMS should own the write pattern.** Filesystems and SSD firmware can shuffle bytes, but they cannot know that these eight pages will die together at the next checkpoint. That knowledge lives in the engine.\n\nCosts the paper is honest about: out-of-place metadata can consume up to ~11 GB at full device utilization, CPU usage rises from 5% to 8.3% (mostly buffer frame management for GC), and buffer hit ratio dips slightly when switching from in-place (93.2% to 91.8%) before recovering with GDT.\n\n---\n\n## What's Next\n\nThe authors identify open directions: extending the techniques to LSM-tree engines, multi-device deployments, and HM-SMR disks. LSM-trees already write out-of-place but reclaim space on compaction schedules that are harder to control than zone-level GC. Bridging GDT and NoWA concepts to leveled compaction is non-trivial.\n\nFor practitioners today, the actionable takeaway is diagnostic: measure total WAF, not just logical write rate. If you have OCP telemetry on your drives, compare host writes to physical writes. If the ratio is above 1.5 under your actual workload, the problem may be write placement, not query plans.\n\nZLeanStore source is available at [github.com/LeeBohyun/ZLeanStore](https://github.com/LeeBohyun/ZLeanStore).\n\n---\n\n## References\n\n- [How to Write to SSDs](https://www.vldb.org/pvldb/vol19/p1469-lee.pdf) (Lee, Ziegler, Leis; PVLDB 2026), the paper this post covers\n- [Extended version on arXiv](https://arxiv.org/abs/2603.09927)\n- [ZLeanStore artifact](https://github.com/LeeBohyun/ZLeanStore)\n- [LeanStore: A High-Performance Storage Engine for NVMe SSDs](https://www.vldb.org/pvldb/vol17/p4536-leis.pdf) (Leis; PVLDB 2024)\n- [SSD-iq: Uncovering the Hidden Side of SSD Performance](https://doi.org/10.14778/3749646.3749694) (Haas, Lee, Bonnet, Leis; PVLDB 2025)\n- [ZNS: Avoiding the Block Interface Tax for Flash-based SSDs](https://www.usenix.org/conference/atc21/presentation/bjorling) (Bjørling et al.; USENIX ATC 2021)\n- [Principles of Database and Solid-State Drive Co-Design](https://doi.org/10.1007/978-3-031-57877-9) (Lerner & Bonnet; Springer 2024)\n",
            "url": "https://gauravsarma.com/posts/2026-06-22_optimising-ssd-writes-for-dbms",
            "title": "How to Write to SSDs - Co-Designing DBMS and Flash Storage",
            "summary": ". [How to Write to SSDs](optimising-ssd-writes-for-dbms-cover...",
            "date_modified": "2026-06-22T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-05-12_where-sqlite-gives-up",
            "content_html": "\n![Where SQLite Gives Up](where-sqlite-gives-up-cover.png)\n\nThe \"just use SQLite\" take has been everywhere lately. For a lot of workloads it's right. SQLite is fast, embedded, zero-ops, and has more rigor in its test suite than most production databases. But if you reach for it without thinking about how it serializes writes, how it picks who gets the lock next, and how you'd grow past one machine, you can hit walls that are obvious in hindsight and painful in production.\n\nThis post walks through three of those walls with a small benchmark for each. None of them are reasons to never use SQLite. They are reasons to ask whether your workload sits inside its operating envelope before defaulting to it.\n\n---\n\n## The Problem\n\nThree things that don't show up in a hello-world benchmark but matter in production:\n\n1. SQLite's lock acquisition is unfair. Multiple waiters do not form a queue.\n2. Write transactions take a global lock on the file. Concurrency for writes is exactly one.\n3. SQLite is an embedded library on a filesystem. You scale a single machine, not a cluster.\n\nPostgres and MySQL handle all three by paying an upfront cost: a server process, a connection protocol, a more involved deployment. That cost is real but not large. The rest of this post tries to make the cost of *not* paying it concrete.\n\n---\n\n## Prerequisites\n\n- Python 3.10+ (the `sqlite3` module is in the stdlib)\n- A Unix-ish machine for the multiprocess scripts\n- A working knowledge of SQL transactions\n\n---\n\n## Technical Decisions\n\nA few notes on how the experiments are set up.\n\n### WAL mode for everything\n\nEvery measurement runs with `PRAGMA journal_mode=WAL` and `PRAGMA synchronous=NORMAL`. WAL mode is the modern default for any SQLite database with concurrent access, because it lets readers and writers proceed in parallel. The single-writer constraint is *still* there, but at least readers aren't blocked. Running this benchmark in legacy rollback-journal mode would make the numbers worse and less representative of a sensibly configured deployment.\n\n### `BEGIN IMMEDIATE` instead of `BEGIN`\n\nSQLite defaults to deferred transactions: the lock is acquired lazily when the first write happens. This is convenient but produces a race where two transactions can both think they're going to be the writer, and one gets aborted with `SQLITE_BUSY`. `BEGIN IMMEDIATE` acquires the write lock at the start of the transaction and is the standard advice for any code that needs to coordinate writers. It also makes lock-acquisition timing easy to measure.\n\n### `busy_timeout` set to 30 seconds\n\nWorkers wait up to 30 seconds for the lock instead of failing fast with `SQLITE_BUSY`. This is what most application code does in practice. It's also what makes the unfairness visible: with a long timeout, no waiter gives up early, and we can see the order in which they actually get the lock.\n\n---\n\n## Demo 1: Lock acquisition is unfair\n\nThe first claim is that SQLite has no queue for waiting writers. When the lock becomes available, any waiting process might get it.\n\nTo make this visible, start 8 workers that each try to acquire `BEGIN IMMEDIATE` at the same time, then look at how many times each worker is the *first*, *middle*, and *last* to acquire it across many trials.\n\n```python\nimport sqlite3, multiprocessing, time, os\n\nDB = \"lock_test.db\"\nWORKERS = 8\nTRIALS = 100\n\ndef worker(wid, barrier):\n    conn = sqlite3.connect(DB, timeout=30.0, isolation_level=None)\n    conn.execute(\"PRAGMA busy_timeout=30000\")\n    conn.execute(\"PRAGMA synchronous=NORMAL\")\n    for trial in range(TRIALS):\n        barrier.wait()                       # all 8 line up\n        t0 = time.perf_counter()\n        conn.execute(\"BEGIN IMMEDIATE\")\n        t1 = time.perf_counter()\n        conn.execute(\"INSERT INTO t (who, trial, ts) VALUES (?, ?, ?)\",\n                     (wid, trial, t1))\n        conn.execute(\"COMMIT\")\n    conn.close()\n```\n\nAfter 100 trials with 8 workers on a stock laptop, the per-worker breakdown looks like this. \"Position 1\" is how many times that worker got the lock first; \"Position 8\" is how many times it got it last. If acquisition were FIFO from the moment all 8 were blocked, the order would be deterministic. If it were perfectly fair-random, every cell would be 12.5.\n\n| Worker | Position 1 | Position 4 | Position 8 | Max wait (ms) |\n|---:|---:|---:|---:|---:|\n| 0 | 10 | 12 | 11 | 79.9 |\n| 1 | 14 | 10 | 14 | 27.1 |\n| 2 | 12 | 13 | 12 | 49.8 |\n| 3 | 13 | 10 | 13 | 27.3 |\n| 4 | 14 |  8 | 14 | 27.4 |\n| 5 | 14 | 16 | 13 | 41.6 |\n| 6 | 10 | 15 | 10 | 27.2 |\n| 7 | 13 | 16 | 13 | **114.8** |\n\nTwo things to notice. First, position counts scatter around 12.5 with no FIFO pattern. Second, max-wait per worker varies wildly. Worker 1 never waited more than 27 ms, while worker 7 waited 115 ms at least once. That's a 4.2x spread for the same workload on the same machine. Some workers consistently get unlucky.\n\nWhy? SQLite's busy-wait is built on top of the OS file-locking primitives. When the lock is released, every blocked waiter is racing to acquire it next. Whoever the kernel schedules first wins. There's no FIFO queue, no priority, nothing. In a quiet system this looks roughly fair on average. Under load, the long tail can blow up, and there's no upper bound on starvation that you can derive from \"I am next in line\", because there is no line.\n\nFor batch jobs this is a non-issue. For an HTTP server with a tail-latency SLO, it matters. p99 is what users feel, and p99 is what gets blown up by random scheduling.\n\n---\n\n## Demo 2: One writer, total\n\nThe second claim is that the write lock is global to the entire database file. Adding more writer processes does not increase write throughput. It can decrease it.\n\n```python\nimport sqlite3, multiprocessing, time, os\n\nDB = \"write_scaling.db\"\nTOTAL_INSERTS = 20_000\n\ndef worker(n_inserts):\n    conn = sqlite3.connect(DB, timeout=30.0, isolation_level=None)\n    conn.execute(\"PRAGMA busy_timeout=30000\")\n    conn.execute(\"PRAGMA synchronous=NORMAL\")\n    payload = \"x\" * 100\n    for _ in range(n_inserts):\n        conn.execute(\"BEGIN IMMEDIATE\")\n        conn.execute(\"INSERT INTO t (payload) VALUES (?)\", (payload,))\n        conn.execute(\"COMMIT\")\n    conn.close()\n```\n\nEach worker does its share of `TOTAL_INSERTS / N` inserts. Sweep N from 1 to 16:\n\n| Writers | Inserts/sec | Speedup |\n|---:|---:|---:|\n| 1 | 122,604 | 1.00x |\n| 2 | 116,545 | 0.95x |\n| 4 | 112,282 | 0.92x |\n| 8 | 96,949 | 0.79x |\n| 16 | 56,033 | 0.46x |\n\nThroughput goes *down* as you add writers, and at 16 it has more than halved. The work is identical (20,000 inserts), but the lock thrash, kernel scheduling, and `BEGIN IMMEDIATE` retries eat into useful time. There's exactly one writer at any moment, by design. Adding processes adds contention without adding parallelism.\n\nThe corresponding Postgres or MySQL number on the same machine grows roughly linearly with writers until you hit CPU or disk limits, because they have row-level locking and a real concurrent-transaction model. SQLite gave that up in exchange for being a library with no server.\n\nThe fix-it advice you'll find online is \"batch your writes\" and \"use a single writer process behind a queue\". Both are correct. Both are also a small operational system that you now have to build, debug, and observe. If your write workload is naturally bursty or comes from independent processes that you don't control (think: many CGI workers, many background jobs), you'll be doing this work yourself.\n\n---\n\n## Demo 3: One file, one machine\n\nThe third claim doesn't need a benchmark, but it's the one with the longest tail of consequences.\n\nA SQLite database is a file. You can open it from one machine. If you want a second machine to read from it, you copy the file (a consistent snapshot) or you put a service in front of SQLite that the second machine talks to. If you want a second machine to write, you can't, end of story.\n\nPostgres and MySQL ship with the primitives for the next step:\n\n- Streaming replication for read scale-out\n- Logical replication for partial replication or version migration\n- Failover tooling, leader election, and broad ecosystem support for both\n- Sharding extensions (Citus, Vitess) for the cases where one node is genuinely not enough\n\nBuilding any of this on top of SQLite is a project. It has been done, and the projects work, but you're now operating a distributed system whose primitives you wrote, configured, or pulled in from a young ecosystem. The reason \"vertical-only scaling\" feels light when you write it down and heavy when you live with it is that the moment you cross the one-machine line, every operational concern you took for granted becomes your problem to solve.\n\nThis is fine for a personal project or a single-tenant SaaS where one big instance is plenty. It's not fine for a multi-region service or anything that needs to survive a single host going down without losing minutes of writes.\n\n---\n\n## How It All Fits Together\n\nThe three limits compound in a specific way:\n\n- The unfair lock means tail latency under contention is unpredictable.\n- The single global writer means you can't fix that by adding processes.\n- The single-file model means you can't fix that by adding machines.\n\nIf your workload doesn't push on any of these (low write rate, single-machine deploy, no tight tail-latency SLO), SQLite is great and the operational simplicity is real. If your workload pushes on any one of them, the workarounds exist but they all cost more than just running Postgres or MySQL would have.\n\n---\n\n## Lessons Learned\n\nA few takeaways from poking at this directly.\n\n**The unfairness is not theoretical.** With 8 contending writers and 100 trials, max-wait per worker varied by 4.2x. That's the kind of variance that ruins p99 latency for any real-time service.\n\n**Adding writer processes makes things worse, not better.** This is the most counterintuitive finding for engineers used to Postgres or MySQL, where adding workers usually helps until you saturate something. With raw SQLite, more writers means more lock thrash and lower throughput. Going from 1 to 16 writer processes more than halved insert/sec on this machine.\n\n**The one-machine ceiling is real, but the more interesting cost is the one *before* the ceiling.** Long before you outgrow a single host, you'll want a read replica, or zero-downtime maintenance, or point-in-time recovery on a hot database. All of those are first-class in Postgres and MySQL. In SQLite they're external projects you bolt on.\n\n**None of this means \"don't use SQLite\".** The ergonomic and operational wins are real. For local-first apps, embedded analytics, single-process services, single-tenant CLIs, and file-format-as-database use cases, SQLite is the right answer. The point is that \"just use SQLite\" should come with the same kind of fit-check you'd apply to any other database: what's the write concurrency, what's the read scale-out story, what's the failover plan?\n\nFor most server-side, multi-tenant, multi-process workloads, raw Postgres or MySQL is still the better default. The setup tax is small and the headroom is enormous.\n\n---\n\n## What's Next\n\nA few directions worth measuring next:\n\n- **The unfairness benchmark with `synchronous=FULL`**, to see if fsync timing reshuffles which worker gets the next lock.\n- **Mixed read/write workloads**, to confirm that WAL's reader-writer parallelism really does isolate read latency from writer contention even at high write rates.\n- **`BEGIN IMMEDIATE` versus an explicit application-level write queue** (a single writer goroutine fed by a channel), to see how much of the throughput loss in Demo 2 is reclaimable by serializing writers in userspace instead of the kernel.\n\n---\n\n## References\n\n- [SQLite WAL mode](https://www.sqlite.org/wal.html)\n- [SQLite file locking](https://www.sqlite.org/lockingv3.html)\n- [SQLite isolation and concurrency](https://www.sqlite.org/isolation.html)\n- [Postgres MVCC and locking](https://www.postgresql.org/docs/current/mvcc.html)\n",
            "url": "https://gauravsarma.com/posts/2026-05-12_where-sqlite-gives-up",
            "title": "Where SQLite Gives Up - Locks, Writers, and the Single-File Problem",
            "summary": ". [Where SQLite Gives Up](where-sqlite-gives-up-cover...",
            "date_modified": "2026-05-12T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-05-07_when-compressed-protobuf-beats-compressed-json",
            "content_html": "\n![When Does Compressed Protobuf Beat Compressed JSON?](when-compressed-protobuf-beats-compressed-json-cover.png)\n\nI did a double-take at the row in my benchmark output. A list of 100 homogeneous records, gzipped. JSON: 1639 bytes. Protobuf: 1690 bytes. *Protobuf was bigger.* That's not how this is supposed to go. \"Protobuf is smaller than JSON\" is one of those facts everyone knows — it shows up in design docs as a one-liner, and the raw-bytes benchmarks behind it are real (50 to 90 percent smaller is typical). And yet here was a case, on a perfectly ordinary payload shape, where gzipped JSON came out ahead.\n\nThe thing is, almost no one ships raw JSON. They ship gzipped JSON. Or zstd. Or brotli. And the moment you put a general-purpose compressor in front of either format, the picture changes. JSON has predictable, repetitive structure (the same keys appear over and over), which is exactly what compressors are good at squeezing out. So the interesting question is not \"is protobuf smaller than JSON\" but \"is *compressed* protobuf smaller than *compressed* JSON, and when?\"\n\nThis post walks through a benchmark that pokes at that question across 19 payload shapes, three compressors, and the spread of cases (tiny records, sparse optionals, long lists, packed ints, UUIDs, binary blobs) where the answer flips.\n\n---\n\n## The Problem\n\nIf you are designing a wire format for an internal service, the standard bake-off looks something like:\n\n1. Pick a representative payload.\n2. Serialize it as JSON. Run gzip on it. Measure.\n3. Serialize it as protobuf. Run gzip on it. Measure.\n4. Pick whichever is smaller.\n\nStep 4 is where it gets interesting, because the answer is not consistent. A protobuf message that is 60 percent smaller than its JSON twin in raw bytes might come out only 1 percent smaller after gzip, or even *larger*. The savings you measured at the schema level can mostly evaporate at the wire level.\n\nWhat I wanted was a single picture: across the kinds of payloads that real APIs actually carry, where does compressed protobuf still win, and where does compressed JSON catch up? The answer turns out to be neither \"always protobuf\" nor \"always JSON\" but a clean set of patterns you can predict from the shape of the data.\n\n---\n\n## Prerequisites\n\nTo follow along you'll want:\n\n- Python 3.10+\n- `protobuf`, `brotli`, `zstandard` from PyPI\n- `protoc` to regenerate `schema_pb2.py` if you change `schema.proto`\n\nThe code lives at [github.com/gsarmaonline/proto-vs-json-compression](https://github.com/gsarmaonline/proto-vs-json-compression). You can run the whole thing in a few seconds with:\n\n```bash\n.venv/bin/python bench.py\n```\n\n---\n\n## Technical Decisions\n\nA benchmark like this lives or dies on the choices you make at the edges. Here are the ones that mattered.\n\n### Comparing apples to apples means generating the data twice\n\nThe most common mistake in proto-vs-JSON benchmarks is to build the protobuf message, then call `MessageToJson` to produce the JSON. That is *not* a fair comparison. `MessageToJson` makes choices (camelCase field names, omitting unset fields, base64 for bytes) that are protobuf's defaults, not JSON's defaults. Real JSON APIs frequently differ.\n\nSo in this benchmark, every scenario builds the protobuf message and the JSON dict from the same source data, independently. That lets us probe a real-world question: when JSON keeps `null` for unset fields versus when it omits them, how does the comparison change? Spoiler: a lot. More on that below.\n\n### Including raw bytes and three compressors\n\nThe columns in the result table are `raw`, `gzip` (level 9), `zstd` (level 19), and `brotli` (quality 11). Raw bytes show the schema-level savings (which are huge and uncontroversial). The three compressors at high settings show what you'd actually ship.\n\nI picked aggressive settings on purpose. The \"compressed JSON catches up\" effect is strongest when the compressor has the most room to work, so this is the steel-man case for JSON. If protobuf still wins at zstd-19, it's a real win.\n\n### Deterministic data\n\nEvery random draw uses a seeded RNG (`random.Random(42)`), so the numbers in the table are reproducible. This matters because some scenarios (UUID lists, binary blobs) are sensitive to entropy, and you don't want to chase a 0.3 percent flutter that's just a different RNG seed.\n\n---\n\n## Implementation\n\n### The schema\n\n`schema.proto` defines 11 message types, each picked to stress a different axis of the proto-versus-JSON comparison:\n\n```proto\n// Numeric-heavy — many ints/doubles, no repetition.\n// Protobuf varints + fixed-width doubles vs JSON text numbers.\nmessage NumericHeavy {\n  uint64 id = 1;\n  uint64 ts_ms = 2;\n  double cpu = 3;\n  // ... 12 more numeric fields\n}\n\n// Sparse optionals — many fields, most unset.\nmessage SparseOptionals {\n  uint64 id = 1;\n  optional string a = 2;\n  optional string b = 3;\n  // ... 19 more optional fields\n}\n\n// Packed int array — protobuf packs into a single tag+length.\nmessage IntArray {\n  repeated uint32 values = 1;     // packed by default in proto3\n}\n```\n\nThe full schema covers small flat objects, numeric-heavy records, long natural-language strings, sparse optionals, deeply nested trees, lists of homogeneous records, low-cardinality enum-like strings, boolean bags, packed integer arrays, UUID lists, and binary blobs.\n\n### The runner\n\nThe runner is straightforward: for each scenario, serialize as JSON (with `json.dumps(..., separators=(',', ':'))` for compact output) and as protobuf, then run each through three compressors and record sizes.\n\n```python\ndef measure(name: str, msg, d) -> Row:\n    js = json.dumps(d, separators=(\",\", \":\")).encode(\"utf-8\")\n    pr = msg.SerializeToString()\n    return Row(\n        name=name,\n        json_raw=len(js),\n        json_gz=gzip_size(js),\n        json_zstd=zstd_size(js),\n        json_br=brotli_size(js),\n        proto_raw=len(pr),\n        proto_gz=gzip_size(pr),\n        proto_zstd=zstd_size(pr),\n        proto_br=brotli_size(pr),\n    )\n```\n\nThe delta column is `(proto - json) / json`. Negative means protobuf is smaller. That sign convention takes a second to internalize, but it's the right one once you start scanning columns.\n\n### Sparse optionals: the most lopsided case\n\nThe single most dramatic result is `sparse_optionals_dense_json`. The scenario: a 21-field message where only 3 fields are set. Two JSON variants: one that emits `null` for unset keys (common in untyped or auto-generated APIs), and one that omits them entirely.\n\n```python\n# Variant 1: dense — emit all keys with nulls.\nd_dense = {\"id\": 1, \"a\": \"hello\", \"b\": None, \"c\": None, ...}\n\n# Variant 2: sparse — drop unset keys.\nd_sparse = {\"id\": 1, \"a\": \"hello\", \"x\": 42}\n```\n\nProtobuf is identical in both cases (it just doesn't write the unset fields), at 11 raw bytes. Dense JSON is 213 raw bytes. After gzip, dense JSON is 104 bytes, protobuf is 31. That's a 70 percent win even after compression, on a payload where you might naively think compression should crush all the repetition. It doesn't, because nullness is structurally encoded in JSON keys, and short distinct keys don't compress well.\n\nSparse JSON cleans this up nicely (47 bytes raw, 47 bytes after gzip, since gzip can't help such a short input) but you only get sparse JSON if your serializer is configured to omit unset fields, which not every codebase does.\n\n### Lists: where JSON catches up and sometimes wins\n\nThe other end of the spectrum is `repeated_orders`. A list of homogeneous records. Each record has six fields. JSON repeats the field names per element. In raw bytes, protobuf is roughly 63 percent smaller across all sizes. After gzip:\n\n| Scenario | jsonGZ | protGZ | ΔGZ |\n|---|---:|---:|---:|\n| repeated_orders×10 | 293 | 267 | −8.9% |\n| repeated_orders×100 | 1639 | 1690 | **+3.1%** |\n| repeated_orders×1000 | 14533 | 14426 | −0.7% |\n\nAt 100 elements, gzipped JSON is *smaller* than gzipped protobuf. Why? Because gzip is exceptionally good at finding the repeated key sequences (`\"order_id\":`, `\"user_id\":`, etc.) and replacing them with backreferences. Once you have enough repetitions, the keys cost almost nothing. Protobuf's per-element overhead (tags, lengths) doesn't compress nearly as well, because it's already information-dense.\n\nAt 10 elements there's not enough repetition for gzip to win. At 1000 elements, the long tail of varint-encoded fields (which compress modestly) tips it back to protobuf. The crossover is in the middle, and it's narrow.\n\n### Why gzip closes the gap: a detour into LZ77\n\nIt's worth understanding *why* this happens, because the answer generalizes. Gzip uses DEFLATE, which is two stages glued together. The first stage is LZ77: scan the input for byte sequences that have already appeared in a sliding window, and replace each repeat with a short (distance, length) pair pointing back to the earlier copy. The second is Huffman coding: take whatever LZ77 emits (literals plus backreference tokens) and assign shorter bit-codes to the more common symbols.\n\nJSON-with-repeated-records is close to a worst case for the *uncompressed* representation and close to a best case for LZ77. The string `\"order_id\":` appears 100 times in a 100-element list. After the first occurrence, every subsequent one collapses to a 2-3 byte backreference. The same for every other field name, every comma between elements, every closing brace. Most of the structural noise gets squeezed out almost entirely. Huffman then mops up the remainder — the leftover literals are heavily skewed toward digits, quotes, and a small alphabet of value characters, so it can assign them short codes.\n\nProtobuf doesn't give LZ77 much to work with. Each field tag is a single byte (for low field numbers). Lengths are 1-2 byte varints. DEFLATE only emits a backreference when it finds a match of at least 3 bytes, so a lone repeating tag byte can't be deduplicated on its own — it would have to land inside a longer matching window, and it usually doesn't, because the value bytes between tags differ from record to record. From a Shannon-entropy standpoint the protobuf bytes are already close to uniformly distributed (varint encoding has that effect: it spreads small integers across the low-bit range), so neither LZ77 nor Huffman can extract much. The format is, in a sense, *already compressed*.\n\nThat's the core asymmetry. JSON carries a heavy load of structural redundancy that LZ77 was practically designed to remove; protobuf carries almost none. When you compress both, you're not really comparing two formats anymore — you're comparing JSON's lower bound (roughly its Shannon entropy after key deduplication) against protobuf's lower bound (roughly the protobuf bytes themselves, plus a small DEFLATE framing overhead). Those bounds are close. They cross around 100 repeated records, where JSON's per-record overhead has been dedup'd down to backreference tokens and protobuf's per-record varint tags haven't been compressed at all.\n\n### Packed int arrays: protobuf's home turf\n\nCompare to `int_array×1000` with small values:\n\n| Format | Raw | Gzip |\n|---|---:|---:|\n| JSON | 3142 | 1374 |\n| Protobuf | 1003 | 929 |\n\nProtobuf packs uint32 values into a tightly varint-encoded sequence with a single tag and length prefix. Each small value takes 1 byte. JSON spells each integer in decimal with a comma separator, which is verbose to begin with and only partly redeemed by gzip (numbers are not as repetitive as field keys). Protobuf wins by 32 percent after gzip, which is one of the larger compressed wins in the whole benchmark.\n\n### Where neither wins meaningfully\n\n`string_heavy` (one long natural-language body) and `uuid_list` (high-entropy strings) are nearly tied. The framing overhead is negligible compared to the payload content, so both formats compress to roughly the same size. If your data is mostly text content or random IDs, the format choice barely matters for size.\n\n---\n\n## How It All Fits Together\n\nAfter running all 19 scenarios, the picture sorts cleanly into three buckets.\n\n**Compressed protobuf still wins (often substantially) when:**\n\n- The payload is small (per-message overhead dominates and protobuf has none).\n- Most fields are unset and the JSON serializer keeps `null` placeholders.\n- You have packed numeric arrays (protobuf's varint + packed encoding is genuinely tighter than text numbers).\n- You have many booleans (1 byte each in protobuf vs `\"key\":true,` in JSON).\n- You have low-cardinality enum-like strings that protobuf encodes as small ints.\n\n**Compressed JSON catches up or wins when:**\n\n- You have long lists of homogeneous records and the per-record key repetition compresses to almost nothing.\n- The payload is dominated by opaque bytes (JSON's base64 overhead, ~33 percent, mostly survives compression but so does protobuf's framing, so they roughly tie).\n\n**Neither wins meaningfully when:**\n\n- The payload is dominated by natural-language text or high-entropy strings (UUIDs). The framing is a rounding error.\n\nThe mental model: protobuf wins where *structure* is the dominant cost. JSON catches up where structure is *repetitive* (compressors love repetition) or where structure is a small fraction of the payload (the data dominates).\n\n---\n\n## Lessons Learned\n\nA few things surprised me, even after staring at the numbers for a while.\n\n**The \"compressed JSON beats compressed protobuf\" case is real, but narrow.** It happens around 100 repeated homogeneous records. Below that, not enough repetition. Above that, varint efficiency pulls protobuf back ahead. If you knew you'd always be in that band, you could ship gzipped JSON and feel fine. Most APIs are not that uniform.\n\n**JSON's worst case is not what people think.** The dramatic 70+ percent wins for protobuf don't come from nested structures or numeric efficiency. They come from sparse optionals with `null` placeholders. If your team uses a JSON serializer that emits nulls for missing fields, that single choice is doing more damage to your wire size than the format itself.\n\n**`zstd` and `brotli` rarely change the qualitative answer.** They consistently produce smaller outputs than gzip for both formats, but the *relative* comparison stays roughly the same. If gzipped protobuf wins, zstd protobuf wins by a similar margin. The exception is `repeated_orders×100`, where zstd gives JSON a bigger lead than gzip does.\n\n**The case for protobuf is rarely \"wire size\" alone.** Once you put a compressor in the path, the bytes-on-the-wire savings shrink to roughly 0 to 35 percent for typical payloads. That's not nothing, but it's not the order-of-magnitude win the raw-bytes column suggests. The stronger arguments for protobuf are schema enforcement, generated client code, faster parsing, and the absence of weird edge cases (no NaN-vs-null debate, no integer-precision-loss-in-JavaScript). Wire size is a tiebreaker, not the headline.\n\n---\n\n## What's Next\n\nA few directions worth poking at:\n\n- **Streaming.** All measurements here are one-shot. For long-lived streams (e.g., NDJSON-style logs), the compressor's dictionary state across records changes the math, and protobuf's per-record overhead becomes more visible.\n- **Schema evolution costs.** Protobuf's reserved field numbers and unknown-field handling have on-the-wire costs that this benchmark doesn't capture. JSON's \"just add a key\" approach has its own, in the form of unbounded growth.\n- **Trained dictionaries.** zstd supports trained dictionaries that dramatically improve compression on small payloads. JSON benefits more than protobuf from this, because the keys are exactly what the dictionary captures. That could shift the small-message numbers significantly.\n\n---\n\n## References\n\n- [Protobuf encoding](https://protobuf.dev/programming-guides/encoding/)\n- [zstd manual](https://facebook.github.io/zstd/)\n- [Repository with the full benchmark](https://github.com/gsarmaonline/proto-vs-json-compression)\n",
            "url": "https://gauravsarma.com/posts/2026-05-07_when-compressed-protobuf-beats-compressed-json",
            "title": "When Does Compressed Protobuf Actually Beat Compressed JSON?",
            "summary": ". [When Does Compressed Protobuf Beat Compressed JSON...",
            "date_modified": "2026-05-07T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-04-15_table-per-tenant-vs-shared-table",
            "content_html": "\nYou are building a SaaS product. Every customer has their own orders, invoices, and documents. A teammate opens a pull request that creates a new table for every customer that signs up: `orders_customer_123`, `orders_customer_124`, and so on. Another teammate comments: \"this is insane, just add a `customer_id` column and an index.\"\n\nBoth of them are right. Both of them are wrong. The answer depends on numbers that neither of them has looked up yet: how many tenants you will have, how large each tenant's data will be, how often schemas change, and whether a regulator will one day ask you to delete exactly one customer's data within 24 hours.\n\nThis is one of the earliest architectural decisions in any multi-tenant system, and it is surprisingly hard to change later. The table structure, the indexes, the migration tooling, the monitoring, the backup strategy, all of it flows from this one choice.\n\n---\n\n## The Problem\n\nMulti-tenancy in a database boils down to one question: how do you isolate one customer's data from another?\n\nThere are two extremes:\n\n1. **Physical isolation:** Each customer gets their own table (or schema, or database). Their data is literally in a different place on disk.\n2. **Logical isolation:** All customers share the same table. A `customer_id` column and a WHERE clause separate them.\n\nPhysical isolation gives you strong boundaries. Logical isolation gives you operational simplicity. The problem is that each approach has costs that only show up at scale, and by the time you hit scale, migrating between strategies is extremely painful.\n\nThe stakes are real. Pick table-per-tenant with 50,000 customers and your Postgres catalog has 50,000 entries per table type. The planner slows to a crawl. Pick shared-table with a few whale customers and one tenant's 10M-row dataset skews the planner statistics for every other tenant with 1,000 rows. Both failure modes are silent until they are not.\n\n---\n\n## Prerequisites\n\n- Working knowledge of Postgres (tables, indexes, schemas, EXPLAIN output)\n- Familiarity with B-tree indexes and how they affect writes\n- Understanding of what VACUUM does and why it matters\n- Experience with at least one multi-tenant application (even a small one)\n\n---\n\n## Technical Decisions\n\n### Strategy 1: Table-Per-Tenant\n\nThe idea is straightforward. When customer 42 signs up, you run:\n\n```sql\nCREATE TABLE orders_42 (\n    id BIGSERIAL PRIMARY KEY,\n    product_id BIGINT NOT NULL,\n    amount NUMERIC(10, 2) NOT NULL,\n    created_at TIMESTAMPTZ DEFAULT now()\n);\n```\n\nEvery query for that customer hits only their table. No WHERE clause needed. No risk of accidentally reading another customer's data.\n\n**Where it shines:**\n\nQueries are fast because each table is small. A customer with 10,000 orders has a 10,000-row table. Sequential scans are cheap. Indexes are tiny. The Postgres planner has accurate statistics for that specific table because the data distribution is homogeneous.\n\nCustomer deletion is instant: `DROP TABLE orders_42`. No DELETE of millions of rows. No VACUUM pass afterward. No dead tuples. The disk space is returned immediately.\n\nPer-customer backup and restore is trivial. You can `pg_dump` a single customer's tables without touching anyone else. If a customer needs their data exported for compliance, you hand them a dump file.\n\n**Where it breaks down:**\n\nSchema migrations multiply. Adding a column to the orders table means running ALTER TABLE on every customer's table. With 1,000 customers, that is 1,000 DDL statements. With 10,000, it is 10,000. Each one takes a lock on that table. If you are doing this in a migration script, a single failure partway through leaves you in a state where some customers have the new column and some do not.\n\n```python\n# The migration nightmare\nfor customer_id in all_customer_ids:\n    cursor.execute(f\"ALTER TABLE orders_{customer_id} ADD COLUMN status TEXT DEFAULT 'pending'\")\n    # What happens if this fails on customer 5,847?\n    # 5,846 customers have the column. 4,154 do not.\n    # Your application code cannot assume the column exists.\n```\n\nPostgres catalog bloat is the silent killer. Every table creates entries in `pg_class`, `pg_attribute`, `pg_statistic`, and other system catalogs. Each table has roughly 1-2 KB of metadata in `pg_class` alone. Each column adds entries to `pg_attribute`. Each index adds another `pg_class` entry plus `pg_statistic` entries.\n\nFor a schema with 5 tables per customer, each with 3 indexes:\n\n| Tenants | Tables | Indexes | pg_class entries | Estimated catalog size |\n|---|---|---|---|---|\n| 100 | 500 | 1,500 | 2,000 | ~4 MB |\n| 1,000 | 5,000 | 15,000 | 20,000 | ~40 MB |\n| 10,000 | 50,000 | 150,000 | 200,000 | ~400 MB |\n| 50,000 | 250,000 | 750,000 | 1,000,000 | ~2 GB |\n\nAt 50,000 tenants, just opening a connection and having Postgres load catalog metadata becomes noticeably slower. The planner, which consults the catalog to build query plans, starts spending more time in catalog lookups than in actual query execution.\n\nConnection pooling also becomes complicated. If each customer's tables are in a separate schema and your application sets `search_path` per request, PgBouncer in transaction mode cannot safely pool those connections because `search_path` is a session-level setting.\n\n### Strategy 2: Shared Table with Index\n\nThe opposite approach: one table for all customers.\n\n```sql\nCREATE TABLE orders (\n    id BIGSERIAL PRIMARY KEY,\n    customer_id BIGINT NOT NULL,\n    product_id BIGINT NOT NULL,\n    amount NUMERIC(10, 2) NOT NULL,\n    created_at TIMESTAMPTZ DEFAULT now()\n);\n\nCREATE INDEX idx_orders_customer_id ON orders (customer_id);\n```\n\nEvery query includes `WHERE customer_id = ?`. One table. One migration. One index. One VACUUM configuration.\n\n**Where it shines:**\n\nOperational simplicity is dramatic. Schema changes are a single ALTER TABLE. Monitoring is one table's metrics. Your ORM does not need dynamic table names. Your migration tool works out of the box.\n\nAdding a new tenant is free: just INSERT with a new `customer_id`. No DDL, no new metadata, no catalog entries.\n\n**Where it breaks down:**\n\nIndex size grows with total data, not per-tenant data. A B-tree index on `customer_id` for 100 million rows is not small. Each index entry is roughly 16 bytes (8 bytes for the indexed value + 8 bytes for the tuple pointer). With some overhead for internal B-tree pages:\n\n| Total rows | Index on customer_id | Index on (customer_id, created_at) |\n|---|---|---|\n| 1M | ~24 MB | ~40 MB |\n| 10M | ~240 MB | ~400 MB |\n| 100M | ~2.4 GB | ~4 GB |\n| 1B | ~24 GB | ~40 GB |\n\nA 2.4 GB index needs to fit in shared_buffers to avoid constant disk reads during index scans. If it does not fit, every query that touches the index generates random I/O.\n\nWrite amplification is the other cost. Every INSERT into the orders table must also update every index on that table. With a primary key index and a `customer_id` index, each insert does three writes: the heap tuple, the primary key B-tree insert, and the customer_id B-tree insert. Add a composite index and it is four writes. At high write throughput, this becomes the bottleneck.\n\n**Planner statistics are table-level, not per-tenant.** This is the subtlest problem. Postgres collects statistics (via ANALYZE) about the distribution of values in each column. The planner uses these statistics to decide whether to use an index scan or a sequential scan.\n\nIf you have 10,000 tenants and one of them has 10 million rows while the rest have 1,000, the planner sees the average: roughly 10,000 rows per customer_id value. For the large tenant, it might choose an index scan when a sequential scan would be faster. For the small tenants, it might choose a sequential scan when an index scan would be faster. Neither plan is optimal.\n\n```sql\n-- The planner sees: avg rows per customer_id ≈ 10,000\n-- For customer with 10M rows: index scan chosen, might be suboptimal\n-- For customer with 500 rows: sequential scan chosen on a 100M row table!\nEXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;\n```\n\n**VACUUM contention scales with table size.** Autovacuum on a 100-million-row table is a heavyweight operation. It must scan the entire table for dead tuples, even if only one tenant's data was recently updated. A busy tenant generating lots of updates creates dead tuples that trigger VACUUM for the entire table, slowing reads for every other tenant.\n\n**Customer deletion is expensive.** Deleting a customer requires `DELETE FROM orders WHERE customer_id = 42`, which generates dead tuples equal to the number of deleted rows. Those dead tuples consume disk space until VACUUM reclaims them. For a large customer, this can be millions of dead tuples, triggering an aggressive VACUUM that affects every other tenant.\n\n**Data leakage risk.** Every query must include the customer_id filter. One missed WHERE clause, one raw SQL snippet without the filter, one debugging session where someone queries without it, and you are reading another customer's data. Row-Level Security (RLS) in Postgres mitigates this:\n\n```sql\nALTER TABLE orders ENABLE ROW LEVEL SECURITY;\nCREATE POLICY tenant_isolation ON orders\n    USING (customer_id = current_setting('app.current_tenant')::BIGINT);\n```\n\nBut RLS adds a filter to every query, and the overhead is measurable: roughly 5-15% on simple queries depending on the policy complexity and table size.\n\n---\n\n## Implementation\n\n### The Numbers at Scale\n\nHere is what each strategy actually costs at different tenant counts, assuming 5 tables per tenant, 3 indexes per table, and an average of 10,000 rows per tenant:\n\n| Metric | 100 tenants | 1,000 tenants | 10,000 tenants |\n|---|---|---|---|\n| **Table-per-tenant** | | | |\n| Total tables | 500 | 5,000 | 50,000 |\n| Catalog overhead | ~4 MB | ~40 MB | ~400 MB |\n| Migration time (add column) | ~1 sec | ~10 sec | ~2 min |\n| Customer deletion | instant (DROP) | instant (DROP) | instant (DROP) |\n| Planner accuracy | excellent | excellent | excellent |\n| **Shared table** | | | |\n| Total tables | 5 | 5 | 5 |\n| Index size (per table) | ~24 MB | ~240 MB | ~2.4 GB |\n| Migration time (add column) | ~1 sec | ~5 sec | ~30 sec |\n| Customer deletion | DELETE + VACUUM | DELETE + VACUUM | DELETE + VACUUM |\n| Planner accuracy | good | degraded | poor (skewed stats) |\n\nThe crossover point is typically around 1,000-5,000 tenants. Below that, table-per-tenant is manageable. Above that, the catalog bloat and migration complexity start to dominate.\n\n### Query Latency Comparison\n\nFor a simple lookup query (`SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2026-01-01' LIMIT 100`):\n\n| Scenario | Table-per-tenant | Shared table (indexed) |\n|---|---|---|\n| Small tenant (1K rows) | ~0.1ms (seq scan) | ~0.3ms (index scan) |\n| Medium tenant (100K rows) | ~0.2ms (index scan) | ~0.5ms (index scan) |\n| Large tenant (10M rows) | ~0.3ms (index scan) | ~1.2ms (index scan, larger tree) |\n| After heavy deletes | unchanged | ~2-5ms (bloated index, dead tuples) |\n\nTable-per-tenant wins on raw query performance because the indexes and tables are smaller, they fit in cache more easily, and the planner has per-tenant statistics. The shared table penalty comes from larger index traversals and potentially suboptimal query plans from skewed statistics.\n\n### The Hybrid: Postgres Partitioning\n\nPostgres native partitioning gives you most of the benefits of table-per-tenant with the manageability of a shared table.\n\n```sql\nCREATE TABLE orders (\n    id BIGSERIAL,\n    customer_id BIGINT NOT NULL,\n    product_id BIGINT NOT NULL,\n    amount NUMERIC(10, 2) NOT NULL,\n    created_at TIMESTAMPTZ DEFAULT now()\n) PARTITION BY LIST (customer_id);\n\n-- Each tenant gets a partition\nCREATE TABLE orders_p42 PARTITION OF orders FOR VALUES IN (42);\nCREATE TABLE orders_p43 PARTITION OF orders FOR VALUES IN (43);\n```\n\n**Partition pruning** is the key feature. When you query `WHERE customer_id = 42`, Postgres eliminates all other partitions before executing the query. The planner only considers `orders_p42`. It is as if the other partitions do not exist.\n\nThe advantages:\n\n- **Single DDL for schema changes.** `ALTER TABLE orders ADD COLUMN status TEXT` propagates to all partitions automatically.\n- **Per-partition statistics.** Each partition has its own `pg_statistic` entries. The planner sees accurate row counts per tenant.\n- **Per-partition VACUUM.** Autovacuum runs on individual partitions. A busy tenant does not trigger VACUUM on quiet tenants' data.\n- **Fast deletion.** `DROP TABLE orders_p42` detaches and drops a single partition. Instant, like table-per-tenant.\n- **Standard query interface.** Application code queries the parent table `orders` with a WHERE clause. No dynamic table names.\n\nThe disadvantages:\n\n- **Partition management overhead.** You still need to CREATE a new partition for each new tenant. This is DDL, which takes a brief lock on the parent table.\n- **Global index limitations.** A unique index across all partitions (e.g., a global primary key) requires the partition key to be part of the index. You cannot have a globally unique `id` column without including `customer_id` in the unique constraint.\n- **Catalog overhead still grows.** Each partition is still a table in `pg_class`. 10,000 partitions means 10,000 catalog entries, better than table-per-tenant (no separate index entries to manage), but not zero.\n- **Planner overhead with many partitions.** Before Postgres 14, the planner scaled poorly with thousands of partitions. Postgres 14+ improved this significantly, but above ~5,000 partitions, planning time can still be measurable (1-5ms added to simple queries).\n\n```sql\n-- Partition pruning in action\nEXPLAIN SELECT * FROM orders WHERE customer_id = 42;\n--  Append\n--    ->  Seq Scan on orders_p42\n--          Filter: (customer_id = 42)\n-- Only orders_p42 is scanned. All other partitions are pruned.\n```\n\n### Schema-Per-Tenant\n\nA middle ground: each customer gets their own Postgres schema, but within the same database.\n\n```sql\nCREATE SCHEMA tenant_42;\nCREATE TABLE tenant_42.orders (\n    id BIGSERIAL PRIMARY KEY,\n    product_id BIGINT NOT NULL,\n    amount NUMERIC(10, 2) NOT NULL,\n    created_at TIMESTAMPTZ DEFAULT now()\n);\n```\n\nThe application sets `search_path` per request:\n\n```sql\nSET search_path TO tenant_42, public;\nSELECT * FROM orders;  -- resolves to tenant_42.orders\n```\n\nThis gives you physical isolation (separate tables, separate indexes, separate statistics) with the logical organization of schemas. Per-customer backup works via `pg_dump -n tenant_42`. Migrations still multiply, but schema-level operations can be scripted more cleanly than raw table name manipulation.\n\nThe main downside: connection pooling. PgBouncer in transaction mode resets session state between transactions, which means `search_path` gets cleared. You either need statement-level pooling (which has its own limitations) or you fully qualify every table name (`tenant_42.orders`), defeating the purpose of `search_path`.\n\n### Citus: Distributed Sharding by Tenant\n\nFor the largest deployments, Citus (now part of Azure Cosmos DB for PostgreSQL) distributes data across multiple Postgres nodes, sharded by `customer_id`.\n\n```sql\n-- Citus: distribute the orders table by customer_id\nSELECT create_distributed_table('orders', 'customer_id');\n```\n\nQueries with `WHERE customer_id = ?` are routed to the correct shard node. Each node holds a subset of tenants. This gives you horizontal scaling beyond what a single Postgres instance can handle.\n\nThe tradeoff: cross-tenant queries (analytics, reporting across all tenants) become distributed queries with network overhead. Citus handles this, but the latency is higher than querying a single local table.\n\n---\n\n## How It All Fits Together\n\nThe decision comes down to four variables:\n\n```\n┌─────────────────────────────────────────────────────────┐\n│                  How many tenants?                        │\n├──────────────────────┬──────────────────────────────────┤\n│  < 100               │  Table-per-tenant works fine     │\n│  100 - 5,000         │  Partitioning is the sweet spot  │\n│  5,000 - 50,000      │  Shared table or partitioning    │\n│                      │  (test planner overhead)         │\n│  > 50,000            │  Shared table + Citus            │\n└──────────────────────┴──────────────────────────────────┘\n\n┌─────────────────────────────────────────────────────────┐\n│           How skewed is the data?                        │\n├──────────────────────┬──────────────────────────────────┤\n│  Uniform (all ~same) │  Shared table is fine            │\n│  Skewed (whale        │  Partitioning or table-per      │\n│  tenants exist)      │  (planner stats matter)          │\n└──────────────────────┴──────────────────────────────────┘\n\n┌─────────────────────────────────────────────────────────┐\n│        Compliance requirements?                          │\n├──────────────────────┬──────────────────────────────────┤\n│  Data residency /    │  Physical isolation required     │\n│  per-tenant deletion │  (table-per, schema-per, or     │\n│  guarantees          │   partitioning with DROP)        │\n│                      │                                  │\n│  No special          │  Shared table is simpler         │\n│  requirements        │                                  │\n└──────────────────────┴──────────────────────────────────┘\n```\n\n| Factor | Table-per-tenant | Shared table | Partitioned | Schema-per-tenant |\n|---|---|---|---|---|\n| Query performance | Excellent | Good (degrades with skew) | Excellent (partition pruning) | Excellent |\n| Schema migrations | O(n) tenants | O(1) | O(1) | O(n) tenants |\n| Catalog overhead | High | Minimal | Moderate | High |\n| Planner accuracy | Per-tenant | Table-level (skewed) | Per-partition | Per-tenant |\n| VACUUM isolation | Full | None | Per-partition | Full |\n| Customer deletion | DROP TABLE | DELETE + VACUUM | DROP partition | DROP SCHEMA CASCADE |\n| Connection pooling | Complex | Simple | Simple | Complex |\n| Max practical tenants | ~5,000 | 100,000+ | ~10,000 | ~5,000 |\n| Data leakage risk | None (separate tables) | High (missing WHERE) | Low (pruning helps) | None (separate schemas) |\n\n---\n\n## Lessons Learned\n\n**Start with shared, move to partitioned.** For most SaaS products, a shared table is the right starting point. It is simpler to operate, simpler to migrate, and simpler to monitor. If you grow to the point where statistics skew or VACUUM contention becomes a problem, converting to a partitioned table is a well-documented migration. Going the other direction, from table-per-tenant back to shared, is much harder.\n\n**Partitioning is the answer more often than people think.** Postgres native partitioning gives you the performance characteristics of table-per-tenant (small indexes, accurate statistics, isolated VACUUM) with the operational characteristics of a shared table (single DDL for migrations, standard query interface). The main reasons not to use it are global unique constraints and planner overhead above ~5,000 partitions.\n\n**The catalog is a database too.** Postgres system catalogs are themselves tables that Postgres queries internally. When you have 200,000 entries in `pg_class`, every `CREATE TABLE`, every `ALTER TABLE`, and every connection establishment pays a cost. Teams that hit this wall are usually surprised because nothing in their application metrics shows it. The slowdown is in Postgres internals, not in query execution.\n\n**RLS is not free but it is worth it.** If you go with a shared table, Row-Level Security adds 5-15% overhead on simple queries. That is a real cost. It is also far less expensive than a data leakage incident. Treat RLS as the seatbelt: you do not negotiate whether to wear it.\n\n**The whale tenant problem is unavoidable.** In every multi-tenant system, eventually one customer has 100x more data than the average. Shared tables suffer the most because statistics and VACUUM are table-level. Partitioning handles it cleanly because each partition has its own statistics and VACUUM schedule. If you are on a shared table and cannot partition, consider moving whale tenants to their own table and routing queries in the application layer. It is ugly, but it works.\n\n**Deletion is an underestimated design constraint.** GDPR Article 17, the right to erasure, gives customers the right to have their data deleted. On a shared table, `DELETE FROM orders WHERE customer_id = 42` on 5 million rows generates 5 million dead tuples, triggers VACUUM, and temporarily bloats the table. With partitioning or table-per-tenant, it is a DROP, instant and clean. If your product serves European customers, factor this into the decision from day one.\n\n---\n\n## What's Next\n\nThis post covered the storage layer. The multi-tenancy decision also affects:\n\n- **Caching:** Do you partition your Redis keyspace by tenant? Use separate Redis instances? A shared instance with key prefixes?\n- **Background jobs:** Does one tenant's heavy job queue starve others? How do you enforce fairness?\n- **Rate limiting:** Per-tenant rate limits require tracking usage per tenant, which has its own data structure decisions\n\nEach of these layers inherits the same tension between isolation and simplicity.\n\n---\n\n## References\n\n- [Postgres Partitioning Documentation](https://www.postgresql.org/docs/current/ddl-partitioning.html)\n- [Row-Level Security in Postgres](https://www.postgresql.org/docs/current/ddl-rowsecurity.html)\n- [Citus Multi-Tenant SaaS Tutorial](https://docs.citusdata.com/en/stable/use_cases/multi_tenant.html)\n- [Postgres System Catalogs](https://www.postgresql.org/docs/current/catalogs.html)\n- [VACUUM Internals](https://www.postgresql.org/docs/current/routine-vacuuming.html)\n",
            "url": "https://gauravsarma.com/posts/2026-04-15_table-per-tenant-vs-shared-table",
            "title": "Table-Per-Tenant vs Shared Table: The Multi-Tenancy Tradeoff in Postgres",
            "summary": "You are building a SaaS product.  Every customer has their own orders, invoices, and documents...",
            "date_modified": "2026-04-15T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-04-07_metric-depth-from-a-single-camera",
            "content_html": "\nTake a photo of a coffee mug on your desk. Now look at that photo. Can you tell whether the mug is 30 centimeters away or 3 meters away? You probably can, because you recognize what a mug looks like and roughly how big they are. But a computer just sees a grid of pixel values. It has no concept of physical distance. A toy car 1 meter away and a real car 100 meters away can produce nearly identical images.\n\nThis is the fundamental challenge for any robot that needs to navigate using a camera. It needs to know not just what things are, but how far away they are. The obvious solution is LiDAR, a sensor that fires laser beams and measures how long they take to bounce back. LiDAR gives you precise distances to everything in the scene. It also costs anywhere from $5,000 to $50,000, draws significant power, and adds bulk. A camera costs $30.\n\nA recent open-source project demonstrated that a single camera can match LiDAR navigation performance in most off-road scenarios. The trick is combining two complementary systems: an AI model that estimates depth from images, and a tracking algorithm that provides real-world scale. Neither works alone. Together, they give you dense, metric depth from a single lens.\n\n---\n\n## The Problem\n\nA robot driving through a forest needs to answer three questions on every frame of video, roughly 10 to 30 times per second:\n\n1. **What is around me?** Trees, rocks, paths, grass\n2. **How far away is each thing?** That tree is 3 meters away, that rock is 7 meters away\n3. **Where can I drive?** The ground ahead is flat and obstacle-free for 5 meters, then there is a rock to the left\n\nQuestion 1 is a perception problem. Questions 2 and 3 are geometry problems. You need 3D information about the world, and a 2D image does not contain it directly.\n\n**LiDAR solves this by brute force.** A LiDAR sensor spins a laser around, measures the return time for each beam, and produces a \"point cloud\": a set of 3D coordinates (x, y, z) for every surface the laser hits. A high-end unit produces hundreds of thousands of points per scan. The robot gets an accurate 3D model of its surroundings every few milliseconds.\n\n**A single camera gives you none of this.** The image is a 2D projection of a 3D world. The projection destroys depth information. Every pixel tells you what color something is and what direction it is in, but not how far away it is.\n\nThis is called **scale ambiguity**. Think of it like a function that is not invertible:\n\n```\nproject(3D world) → 2D image     // this works\ninvert(2D image) → 3D world      // this is impossible without extra information\n```\n\nMultiple 3D scenes can produce the exact same 2D image. A hallway that is 2 meters wide and 10 meters long looks identical to a hallway that is 4 meters wide and 20 meters long if you scale everything proportionally. The projection is the same.\n\nSo how do you recover depth from a single camera? You need two ingredients, and the key insight is that each one has exactly what the other is missing.\n\n---\n\n## Prerequisites\n\n- Basic programming familiarity (variables, functions, arrays). No computer vision experience needed.\n- A rough mental model of what a 2D image is: a grid of pixels, each with color values\n- Comfort with simple arithmetic (division, averages)\n\nThat is it. Everything else is explained from scratch.\n\n---\n\n## Technical Decisions\n\n### Ingredient 1: AI Depth Estimation (Dense but Scaleless)\n\nClose one eye and look around the room. You can still judge depth reasonably well, even without stereo vision. How? Your brain uses cues it has learned from a lifetime of visual experience:\n\n- **Relative size:** A person who appears small in your field of view is probably far away\n- **Occlusion:** If object A blocks part of object B, A is in front\n- **Texture gradient:** The floor tiles get denser as they recede into the distance\n- **Atmospheric haze:** Distant mountains look hazier and bluer than nearby trees\n- **Perspective lines:** Parallel lines converge toward a vanishing point\n\nAn AI model called **Depth Anything V2** has learned these same cues by training on millions of images paired with depth information. You feed it any image, and it outputs a **depth map**: a grayscale image where every pixel's brightness represents its estimated distance. Bright = far, dark = close (or vice versa depending on the convention).\n\n```\nInput:    A photo of a forest path\nOutput:   A map where every pixel has a depth VALUE\n\n           Nearby tree trunk  →  depth value: 0.15\n           Ground 5m ahead   →  depth value: 0.42\n           Distant treeline   →  depth value: 0.91\n```\n\nThe catch: these values are **relative**, not metric. The model tells you \"this pixel is about 3 times farther than that pixel,\" but it does not tell you \"this pixel is 4.7 meters away.\" The values have no physical unit. They are like percentages without knowing what 100% represents.\n\nThis is useful but not sufficient. A robot needs to know \"that rock is 3 meters ahead, I should turn.\" Not \"that rock has depth value 0.35.\"\n\nThink of it as a function that returns a sorted array of distances but without the actual numbers:\n\n```python\n# What Depth Anything V2 gives you:\nrelative_depths = [0.12, 0.35, 0.42, 0.67, 0.91]\n# You know the ORDER: item 0 is closest, item 4 is farthest\n# You know the RATIOS: item 4 is ~7.5x farther than item 0\n# You do NOT know: item 0 is 1.2 meters away\n\n# What you need for navigation:\nmetric_depths = [1.2, 3.5, 4.2, 6.7, 9.1]  # actual meters\n```\n\nThe model gives you a dense depth map (every single pixel gets a value) but with an unknown scale factor. Now you need something that can provide that scale factor.\n\n### Ingredient 2: Visual SLAM (Sparse but Metric)\n\n**SLAM** stands for Simultaneous Localization and Mapping. It solves a chicken-and-egg problem: to know where you are, you need a map. To build a map, you need to know where you are.\n\nIf you are a software developer, here is an analogy. Imagine you are building a version control system, but instead of tracking lines of code, you are tracking visual features across video frames.\n\nA \"visual feature\" is a distinctive pattern in the image that is easy to find again in the next frame. Think: the corner of a window, the tip of a rock, a high-contrast spot on bark. These are like unique identifiers that the algorithm can grep for across frames.\n\nHere is what SLAM does, frame by frame:\n\n```\nFrame 1:  Detect 500 features (corners, edges, distinctive patterns)\n          Store their pixel positions: [(x1, y1), (x2, y2), ...]\n\nFrame 2:  Detect features again\n          MATCH them to Frame 1's features: \"Feature A moved 12 pixels right\"\n          From how features moved, INFER how the camera moved\n          From camera motion + feature motion, TRIANGULATE 3D positions\n\nFrame 3:  Repeat. Now you have camera poses for frames 1, 2, 3\n          and 3D positions for tracked features\n          This IS the map. The camera poses ARE the localization.\n```\n\nThe specific SLAM system used here is called **VINS-Mono**. The \"VI\" stands for Visual-Inertial, meaning it also uses an **IMU** (Inertial Measurement Unit), which is a chip with an accelerometer and a gyroscope. Your phone has one. It measures acceleration in meters per second squared.\n\nThe IMU is critical because it gives SLAM something pure vision cannot: **real-world scale**. The accelerometer measures in physical units (m/s^2), so when SLAM integrates those measurements over time, the resulting positions are in real meters. Without the IMU, visual-only SLAM has the same scale ambiguity as the depth model. With it, you get metric positions.\n\nThe result: SLAM gives you a few hundred 3D points per frame, each with a real-world position in meters. But only a few hundred, not the millions you get from LiDAR, and only where distinctive visual features happen to exist. Smooth walls, flat ground, and uniform surfaces produce no features.\n\n```\nSLAM output per frame:\n  Camera position: (x=2.3m, y=0.1m, z=1.4m)\n  Camera orientation: (roll, pitch, yaw)\n  3D feature points: [(3.1m, 0.5m, -0.2m), (4.7m, 1.2m, 0.3m), ...]\n                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n                      Only ~100-500 points. Sparse. But in real meters.\n```\n\n### Why Neither Works Alone\n\nNow the situation is clear:\n\n| System | Coverage | Scale |\n|---|---|---|\n| Depth Anything V2 | Every pixel (dense) | Unknown (relative only) |\n| VINS-Mono SLAM | A few hundred points (sparse) | Real meters (metric) |\n\nThe depth model sees everything but does not know how far anything is in real units. SLAM knows exact distances but only for a handful of points. Each system has precisely what the other lacks.\n\n---\n\n## Implementation\n\n### Phase 1: The Rescaling Trick\n\nThis is the core idea, and it is surprisingly simple.\n\nFor each SLAM point, you know two things:\n1. Its **metric depth** from SLAM (e.g., 4.2 meters)\n2. Its **pixel position** in the image (e.g., pixel 340, 220)\n\nYou also have the depth model's output for that same pixel:\n3. Its **relative depth** from Depth Anything V2 (e.g., 0.38)\n\nFrom these, you can compute a scale factor:\n\n```python\n# For one SLAM point:\nscale = metric_depth / relative_depth\nscale = 4.2 / 0.38\nscale = 11.05\n\n# Apply to ALL pixels:\nfor every pixel in depth_map:\n    metric_depth[pixel] = scale * relative_depth[pixel]\n```\n\nIn practice, you have hundreds of SLAM points, and each gives a slightly different scale estimate due to noise. So you take a robust estimate across all of them, typically a median or a least-squares fit:\n\n```python\ndef compute_scale(slam_points, depth_map):\n    scales = []\n    for point in slam_points:\n        pixel_x, pixel_y = project_to_image(point.position_3d)\n        relative_d = depth_map[pixel_y][pixel_x]\n        metric_d = point.depth\n        if relative_d > 0:  # avoid division by zero\n            scales.append(metric_d / relative_d)\n\n    # Median is more robust to outliers than mean\n    return median(scales)\n```\n\nNow multiply the entire depth map by this single scale factor, and every pixel has a metric depth estimate. You just turned a relative depth map into a metric one.\n\n```\nBefore rescaling:                    After rescaling:\n  Nearby tree  →  0.15              Nearby tree  →  1.7m\n  Ground ahead →  0.42              Ground ahead →  4.6m\n  Far treeline →  0.91              Far treeline →  10.1m\n```\n\nThe result: dense metric depth from a single camera. Every pixel now has a distance in meters, just like LiDAR would give you, except computed from a $30 camera and some clever math.\n\n### Phase 2: Fixing the Jitter (Temporal Smoothing)\n\nThe rescaling trick works in theory. In practice, the scale factor jumps around.\n\nSLAM points come and go as the robot moves. In one frame, you might have 300 well-tracked features with consistent scale estimates. In the next frame, the robot turns sharply, SLAM loses half its features, and the remaining ones give a different scale. The computed scale might jump from 11.0 to 14.5 between two frames.\n\nFrom the robot's perspective, a tree that was 4 meters away just teleported to 5.3 meters away for one frame, then snapped back. The path planner sees the obstacle map changing erratically and makes bad decisions.\n\nThe fix is an **exponential moving average**, which is the same technique used in financial charts to smooth stock price jitter:\n\n```python\n# alpha controls how much smoothing: \n#   alpha = 1.0 → no smoothing, use raw value\n#   alpha = 0.0 → infinite smoothing, never update\n#   alpha = 0.2 → heavy smoothing (the paper uses this)\n\nstable_scale = None\n\ndef smooth_scale(new_raw_scale, alpha=0.2):\n    global stable_scale\n    if stable_scale is None:\n        stable_scale = new_raw_scale\n    else:\n        stable_scale = (1 - alpha) * stable_scale + alpha * new_raw_scale\n    return stable_scale\n```\n\nEach frame, the stable scale moves 20% toward the new measurement and keeps 80% of the previous value. Quick noise gets dampened out. Real changes (the robot drives closer to something) still come through, just slightly delayed.\n\n```\nFrame 1: raw_scale = 11.0  →  stable_scale = 11.0\nFrame 2: raw_scale = 14.5  →  stable_scale = 11.0 * 0.8 + 14.5 * 0.2 = 11.7\nFrame 3: raw_scale = 10.8  →  stable_scale = 11.7 * 0.8 + 10.8 * 0.2 = 11.5\nFrame 4: raw_scale = 11.2  →  stable_scale = 11.5 * 0.8 + 11.2 * 0.2 = 11.4\n```\n\nThe spike at Frame 2 barely registers. The stable scale stays near 11, which is where it should be.\n\n### Phase 3: Removing Phantom Obstacles (Edge Masking)\n\nThis one is subtle and took real debugging to identify.\n\nDepth Anything V2 produces excellent depth estimates in the middle of objects. But at the edges, where a tree trunk meets the sky, the depth map does not produce a sharp cutoff. It creates a smooth gradient from \"3 meters\" (the tree) to \"50 meters\" (the sky) across several pixels.\n\nWhen you convert these depth values to 3D points, those gradient pixels become points floating in mid-air between the tree and the sky. To the robot's navigation system, those floating points look like invisible walls behind every obstacle.\n\n```\nWhat the camera sees:\n\n    |  sky (far)  | tree (near) |  sky (far)  |\n    |             |             |             |\n\nWhat the depth model produces at the tree edge:\n\n    | 50m | 50m | 35m | 12m | 3m | 3m | 3m | 15m | 40m | 50m |\n                  ^^^^^^^^^^^                  ^^^^^^^^^^^\n                  Blurry transition             Blurry transition\n                  These become phantom          These too\n                  obstacles in 3D\n\nWhat the robot's planner sees:\n\n    [ clear ] [WALL??] [ tree ] [WALL??] [ clear ]\n                ↑                  ↑\n        \"I can't go          \"I can't go\n         behind the           behind the\n         tree!\"               tree!\"\n```\n\nThe robot refuses to navigate around obstacles because it thinks there are walls behind them.\n\nThe fix: detect these blurry edges and remove them. The paper uses a **Sobel filter**, which is a simple 3x3 kernel that computes how fast pixel values change in the x and y directions. High change rate = an edge.\n\n```python\ndef create_edge_mask(depth_map, threshold=0.1):\n    # Compute gradients (how fast depth changes between neighboring pixels)\n    gradient_x = depth_map[:, 1:] - depth_map[:, :-1]  # horizontal change\n    gradient_y = depth_map[1:, :] - depth_map[:-1, :]  # vertical change\n\n    # Combine into gradient magnitude\n    gradient = sqrt(gradient_x**2 + gradient_y**2)\n\n    # Where gradient is high, depth is changing fast = a blurry edge\n    edge_pixels = gradient > threshold\n\n    # Dilate the mask by 5 pixels in each direction\n    # (the blur extends a few pixels around the actual edge)\n    edge_mask = dilate(edge_pixels, radius=5)\n\n    return edge_mask\n\n# Usage: zero out depth values at edges before converting to 3D points\ndepth_map[edge_mask] = 0  # these pixels won't become 3D points\n```\n\nYou lose a thin strip of depth information around every object boundary. But you eliminate the phantom walls entirely. The robot can now see that there is clear space behind a tree trunk, just like it would with LiDAR.\n\n### Phase 4: From Depth Map to Navigation\n\nOnce you have a clean metric depth map, the rest of the pipeline converts it into something a path planner can use:\n\n1. **Depth map to point cloud:** Each pixel with a metric depth value becomes a 3D point. A 640x480 image with valid depth at 80% of pixels gives you ~245,000 3D points. That is comparable to a mid-range LiDAR.\n\n2. **Ground segmentation:** Separate \"this is the ground\" from \"this is an obstacle.\" The system uses a technique called Cloth Simulation Filter: imagine dropping a stiff cloth onto an upside-down version of your point cloud. The cloth settles on what would be the ground surface. Points below the cloth are ground, points above are obstacles.\n\n3. **Elevation map:** Convert the obstacle points into a 2D grid where each cell stores the height of whatever is there. Flat ground cells have near-zero height. A rock might be 0.5 meters. A tree trunk might be 2 meters. The path planner treats any cell above a threshold as impassable.\n\n4. **Path planning:** A standard A*-style planner finds a route through the elevation map from the robot's current position to the goal, avoiding cells marked as obstacles.\n\n```\nCamera image\n    ↓\nDepth Anything V2 → relative depth map\n    ↓\nSLAM rescaling → metric depth map\n    ↓\nEdge masking → clean metric depth map\n    ↓\nProject to 3D → point cloud (~200k points)\n    ↓\nGround segmentation → ground vs obstacle points\n    ↓\nElevation map → 2D grid with heights\n    ↓\nPath planner → \"go left, then straight for 5m\"\n```\n\n---\n\n## How It All Fits Together\n\nThe full system runs as a loop, processing each camera frame through the pipeline:\n\n```\n┌──────────────────────────────────────────────────────┐\n│                    Camera Frame                       │\n└──────────────────┬───────────────────────────────────┘\n                   │\n        ┌──────────┴──────────┐\n        ▼                     ▼\n┌───────────────┐    ┌──────────────────┐\n│ Depth Anything│    │    VINS-Mono     │\n│     V2        │    │    (SLAM)        │\n│               │    │                  │\n│ Dense relative│    │ Camera pose +    │\n│ depth map     │    │ sparse metric    │\n│ (every pixel) │    │ 3D points       │\n└───────┬───────┘    └────────┬─────────┘\n        │                     │\n        └──────────┬──────────┘\n                   ▼\n        ┌─────────────────────┐\n        │   Scale Rescaling   │\n        │ + Temporal Smoothing│\n        │ + Edge Masking      │\n        └──────────┬──────────┘\n                   ▼\n        ┌─────────────────────┐\n        │  Dense metric depth │\n        │  (every pixel, in   │\n        │   real meters)      │\n        └──────────┬──────────┘\n                   ▼\n        ┌─────────────────────┐\n        │  3D Point Cloud     │\n        │  → Ground Segment.  │\n        │  → Elevation Map    │\n        └──────────┬──────────┘\n                   ▼\n        ┌─────────────────────┐\n        │    Path Planner     │\n        │  → Motor Commands   │\n        └─────────────────────┘\n```\n\nThe entire loop runs at about 10 frames per second on an NVIDIA Jetson ORIN (a GPU board roughly the size of a credit card). For comparison, the same pipeline with a LiDAR sensor instead of the camera runs at 20 FPS, because you skip the depth estimation step entirely.\n\n---\n\n## Lessons Learned\n\n**The results are surprisingly competitive.** In simulation tests with photorealistic environments (trees, rocks, uneven terrain), the monocular camera achieved a 93% success rate at reaching navigation goals, while LiDAR achieved 63-67%. The camera setup actually outperformed LiDAR in most scenarios. In real-world tests, both hit 100% success rates, though the camera paths were about 22% longer due to slower obstacle detection.\n\n| Configuration | Success Rate (sim, medium) | Path Efficiency | Perception Speed |\n|---|---|---|---|\n| LiDAR (128-channel) | 63-67% | 0.49-0.60 SPL | 20 Hz |\n| Monocular camera | 93% | 0.83 SPL | 10 Hz |\n\n**Edge masking is the unsung hero.** Without it, the robot gets stuck constantly because it sees phantom walls behind every obstacle. The ablation study showed that enabling both edge masking and temporal smoothing improved path efficiency from 0.39 to 0.58, a 49% improvement. Most of that gain comes from edge masking enabling the robot to navigate around obstacles rather than treating them as impassable walls.\n\n**It breaks on soft obstacles.** High grass is the worst case. The depth model sees grass as a surface at a certain height and the system has no way to know that you can drive through it. LiDAR has the same problem to some degree, but its higher resolution and update rate help. This is a fundamental limitation: you need a separate system that classifies whether an obstacle is traversable, not just where it is.\n\n**It cannot see holes.** A depression in the ground, a ditch, a hidden drop-off: these produce no signal in the depth map because they are below the ground plane. LiDAR can sometimes detect these from the right angle. A single forward-facing camera cannot. This is a real safety concern for off-road applications.\n\n**Speed matters more than accuracy.** The monocular system is accurate enough for navigation, but it runs at half the frame rate of the LiDAR pipeline. In real-world tests, this showed up as the robot reacting more slowly to obstacles, producing longer, more hesitant paths. The depth estimation model (Depth Anything V2) is the bottleneck. A faster model with slightly worse accuracy would likely improve overall navigation performance.\n\n**The combination is more powerful than either component.** This is the central lesson. Neither Depth Anything V2 nor VINS-Mono alone can navigate a robot through a forest. The depth model gives you dense coverage with no scale. SLAM gives you metric scale with sparse coverage. The rescaling trick, which is just a division, bridges the gap. The engineering insight is recognizing that these two systems complement each other and combining them with minimal machinery.\n\n---\n\n## What's Next\n\nThe paper identifies several natural extensions:\n\n- **Traversability classification:** Instead of treating everything above ground level as an obstacle, train a model to distinguish \"hard obstacle (rock)\" from \"soft obstacle (tall grass you can drive through)\"\n- **Negative obstacle detection:** Detecting holes and drop-offs likely requires either a different sensor angle, a second camera, or temporal analysis of how the ground plane changes as the robot approaches\n- **Faster depth models:** Depth Anything V2 is accurate but slow. Lighter models exist that trade some accuracy for 2-3x speed. At 20+ FPS, the monocular pipeline would match LiDAR reaction times\n\nThe broader takeaway is that foundation models, large AI models pretrained on massive datasets, are becoming practical building blocks for robotics. Depth Anything V2 was not trained for off-road robot navigation. It was trained on general images from the internet. Yet it works well enough, out of the box, with zero fine-tuning, to navigate a robot through a forest. That is a shift in how robotic systems get built.\n\n---\n\n## References\n\n- [An Open-Source LiDAR and Monocular Off-Road Autonomous Navigation Stack (arXiv 2604.03096)](https://arxiv.org/abs/2604.03096)\n- [Depth Anything V2: Fine-tuned Large-Scale Monocular Depth Estimation](https://depth-anything-v2.github.io/)\n- [VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator](https://github.com/HKUST-Aerial-Robotics/VINS-Mono)\n- [LARIAD/Offroad-Nav: Open-source code from the paper](https://github.com/LARIAD/Offroad-Nav)\n- [Cloth Simulation Filter for Ground Segmentation (Zhang et al., 2016)](https://www.mdpi.com/2072-4292/8/6/501)\n",
            "url": "https://gauravsarma.com/posts/2026-04-07_metric-depth-from-a-single-camera",
            "title": "Metric Depth from a Single Camera: How Robots See Without LiDAR",
            "summary": "Take a photo of a coffee mug on your desk.  Now look at that photo...",
            "date_modified": "2026-04-07T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-04-06_when-cdc-becomes-a-crutch",
            "content_html": "\nYour orders service needs the user's email to send a confirmation. The payments service needs the user's tier to apply a discount. The analytics pipeline needs the user's signup date for cohort analysis. The search service needs the user's display name for autocomplete.\n\nSo someone sets up a CDC pipeline from the users table, and now four services each have their own copy of every user. A column gets renamed in the source. The CDC connector picks up the new schema. Three consumers handle it. One doesn't. A customer sees stale pricing for six hours, and the on-call engineer spends a full shift figuring out which copy of the data diverged.\n\nCDC didn't cause this. But it made it effortless, and effortless replication without a clear philosophy behind it is how you end up with 12 copies of the same table, zero confidence in any of them, and a reconciliation job that runs every night \"just in case.\"\n\n---\n\n## The Problem\n\nChange Data Capture is a genuinely useful pattern. It captures row-level changes from a database's write-ahead log (or oplog, or binlog) and streams them to downstream consumers. Tools like Debezium, Kafka Connect, and managed offerings from every major cloud provider have made it trivially easy to set up.\n\nThe problem is not CDC itself. The problem is that CDC reduces the cost of replication to near zero without reducing the cost of maintaining replicas. Setting up a CDC connector takes an afternoon. Debugging a consistency issue across six consumers of that connector takes a week.\n\nHere is what typically happens:\n\n1. Service A needs one field from Service B's database\n2. Someone suggests an API call, but there are latency concerns, or Service B's API doesn't expose that field, or \"what if Service B is down?\"\n3. Someone else says: \"just set up CDC, replicate the table, query it locally\"\n4. It works. It's fast. Nobody thinks about it for six months\n5. The source schema changes. Or the CDC pipeline lags. Or a consumer's local copy drifts because of a bug in its ingestion logic\n6. Now you have a distributed data consistency problem that didn't need to exist\n\nThe pattern repeats across organizations because the incentives are misaligned: the team setting up the CDC pipeline bears almost no cost. The cost lands on whoever debugs the divergence months later.\n\n---\n\n## Prerequisites\n\n- Familiarity with CDC concepts (WAL-based change capture, Debezium, Kafka Connect, or similar)\n- Experience with microservice architectures where services own their own databases\n- General understanding of eventual consistency and its operational implications\n\n---\n\n## Technical Decisions\n\n### Why Teams Reach for CDC\n\nBefore defining when not to replicate, it helps to understand why replication feels like the obvious answer.\n\n**Latency.** A local query is faster than a network call. If the payments service needs the user's tier on every transaction, a local lookup in its own database avoids a synchronous dependency on the users service.\n\n**Availability.** If the users service goes down, every service that depends on its API also degrades. A local copy means the payments service can keep processing even during an outage upstream.\n\n**Query flexibility.** The users service exposes a REST API with specific endpoints. The analytics team needs to join users with events in SQL. CDC lets them replicate the users table into their warehouse and query it however they want.\n\nThese are all real concerns. The mistake is treating them as universal justifications. Not every consumer needs sub-millisecond latency. Not every service needs to survive an upstream outage. Not every query pattern requires a local copy.\n\n### The Three Modes of Data Consumption\n\nEvery time a service needs data from another service, the interaction falls into one of three categories:\n\n**1. Replicate: the consumer needs the data at rest, in a different shape**\n\nThis is the legitimate core use case for CDC. The consumer is not just mirroring the source, it is transforming the data to serve a fundamentally different access pattern.\n\nExamples:\n- Replicating a normalized relational table into Elasticsearch for full-text search\n- Feeding transactional data into a columnar warehouse for analytical queries\n- Building a materialized view that pre-joins three tables for a read-heavy dashboard\n\nThe key signal: the consumer's schema looks nothing like the source. It has different indexes, different denormalization, maybe even a different data model entirely. You cannot serve this use case with an API call because the consumer needs to query the data in ways the source was never designed for.\n\n**2. Query the source: the consumer needs fresh data in the same shape**\n\nIf the consumer is essentially mirroring the source table, same columns, same structure, just in a different database, ask why it needs a copy at all.\n\nExamples:\n- The orders service needs the user's email to send a confirmation (one field, one lookup per order)\n- A dashboard needs the current count of active users (a single aggregation query)\n- A service needs to validate that a product ID exists before creating an order\n\nThe key signal: the consumer's query would work fine against the source database. The data is small, the access pattern is simple, and freshness matters. A synchronous API call or a thin caching layer is simpler, cheaper, and always consistent.\n\nThe \"but what if the source is down?\" objection is real but often overstated. If the users service is down, should the payments service really continue processing with stale user data? Sometimes yes, but often the correct behavior is to degrade gracefully, not to silently use a copy that might be hours behind.\n\n**3. React and discard: the consumer needs to respond to a change, not store it**\n\nMany CDC consumers don't actually need the data at rest. They need to do something when the data changes.\n\nExamples:\n- Send a welcome email when a new user is created\n- Invalidate a cache entry when a product price changes\n- Trigger a fraud check when a transaction is created\n- Update a counter or metric in a monitoring system\n\nThe key signal: the consumer processes the event and is done. It doesn't need to query the data later. It doesn't build a local copy. The event is a trigger, not a data transfer.\n\nThis is often the most over-engineered pattern. Teams set up full CDC replication when all they needed was an event bus. The consumer ends up with a complete replica of the users table just to detect new signups.\n\n---\n\n## Implementation\n\n### The Decision Framework\n\nBefore setting up a CDC pipeline, run through these five questions:\n\n**Question 1: Does the consumer need the data at rest, or does it just need to react to changes?**\n\nIf the answer is \"react,\" you don't need CDC replication. You need an event. Publish a domain event (\"user.created,\" \"order.completed\") from the source service, let the consumer subscribe, process, and move on. No local copy, no sync lag, no schema coupling.\n\n**Question 2: Can the consumer tolerate staleness?**\n\nCDC is eventually consistent. Depending on your pipeline, the lag can range from milliseconds to minutes. If the consumer cannot tolerate any staleness (e.g., checking a user's balance before authorizing a payment), a local replica is the wrong answer. You need a synchronous read from the source of truth.\n\n**Question 3: Is the consumer reshaping the data or mirroring it?**\n\nThis is the most important question. If the consumer's table is structurally identical to the source, you have a mirror, not a materialized view. Mirrors are almost always a sign of a missing API or an over-cautious availability concern.\n\nReshaping is legitimate. Mirroring is a code smell.\n\n**Question 4: Who owns the schema?**\n\nWhen the source team renames a column, adds a field, or changes a type, what happens downstream? If 12 CDC consumers ingest that table, each one needs to handle the schema change. You've built a distributed monolith: tightly coupled systems connected by a log instead of an API.\n\nAt least with an API, the source team can version it, deprecate fields gracefully, and maintain a contract. With raw CDC, the contract is the database schema itself, and database schemas were never designed to be public interfaces.\n\n**Question 5: What is the blast radius of divergence?**\n\nWhen (not if) the local copy drifts from the source, what breaks? If the answer is \"a customer sees the wrong price\" or \"a payment is authorized against stale data,\" the operational risk of maintaining a replica outweighs the convenience.\n\n### The Decision Table\n\n| Signal | Replicate | Query Source | React & Discard |\n|---|---|---|---|\n| Consumer schema differs from source | Yes | | |\n| Consumer needs sub-millisecond reads | Yes | | |\n| Consumer must survive source outages | Yes | | |\n| Consumer mirrors source schema | | Yes | |\n| Consumer needs strong consistency | | Yes | |\n| Access is infrequent or low-volume | | Yes | |\n| Consumer processes event then is done | | | Yes |\n| Consumer doesn't query the data later | | | Yes |\n| Consumer only needs to trigger a side effect | | | Yes |\n\n### The Distributed Monolith Antipattern\n\nThe most dangerous failure mode of CDC overuse is the distributed monolith. It looks like this:\n\n```\nUsers DB (source of truth)\n    │\n    ├── CDC → Kafka → Orders Service (local users table)\n    ├── CDC → Kafka → Payments Service (local users table)\n    ├── CDC → Kafka → Analytics Warehouse (local users table)\n    ├── CDC → Kafka → Search Service (users in Elasticsearch)\n    ├── CDC → Kafka → Notifications Service (local users table)\n    └── CDC → Kafka → Fraud Service (local users table)\n```\n\nSix consumers. Five of them have a structurally identical copy of the users table. When the source team adds a `phone_verified` boolean:\n\n- The Kafka connector picks up the new column\n- Analytics handles it fine (their ingestion is schema-flexible)\n- Search re-indexes (Elasticsearch is schema-flexible)\n- Orders, Payments, and Notifications have rigid table schemas. Their CDC consumers fail to deserialize the new column. Events back up in Kafka. A lag alert fires at 3am.\n\nThis is tight coupling with extra steps. The teams thought they were decoupled because there's no synchronous API call. But they're coupled to the schema, coupled to the CDC pipeline's uptime, and coupled to Kafka's consumer group coordination. The coupling just moved from the request path to the data path, which is harder to see and harder to debug.\n\nThe fix is not \"make the CDC pipeline more resilient.\" The fix is to ask: do Orders, Payments, and Notifications actually need a full copy of the users table? Usually the answer is no. Orders needs the user's email. Payments needs the user's tier. Notifications needs the user's preferences. These are API calls, not replication use cases.\n\n### CDC Used Well\n\nFor contrast, here are patterns where CDC genuinely earns its complexity:\n\n**Search indexing.** Elasticsearch needs a denormalized, full-text-indexed copy of your data in a fundamentally different structure. You cannot serve this with API calls. CDC into Elasticsearch (or OpenSearch, or Typesense) is one of the cleanest uses of the pattern.\n\n**Analytics and data warehousing.** Your warehouse needs historical, append-only data in a columnar format for analytical queries that your OLTP database was never designed to serve. CDC into BigQuery, Snowflake, or Redshift is the standard pattern here, and it works because the consumer is reshaping, not mirroring.\n\n**Materialized views across service boundaries.** A dashboard needs to show data that joins across three services' databases. Rather than making three API calls on every page load, you CDC the relevant tables into a read-optimized store and materialize the join. The consumer's schema is a purpose-built denormalization, not a copy.\n\n**Event sourcing integration.** CDC from an existing database into an event store lets you incrementally adopt event-driven patterns without rewriting the source. The events are derived from real state changes, not synthetic.\n\nIn each case, the consumer is transforming the data to serve a purpose the source cannot.\n\n---\n\n## How It All Fits Together\n\nThe philosophy boils down to one principle: **replicate shape, not data.**\n\nIf the consumer needs the data in a different shape (different indexes, different joins, different query patterns), replication is justified because no amount of API design can bridge the gap between an OLTP row store and a full-text search engine.\n\nIf the consumer needs the same data in the same shape, you don't have a replication problem. You have a service boundary problem. Fix the boundary: expose an API, add a caching layer, or reconsider whether the data should live in that service at all.\n\nIf the consumer doesn't need the data at rest, you don't have a replication problem either. You have an eventing problem. Publish domain events, not database changelogs.\n\n```\n                  ┌─────────────────────┐\n                  │  Does the consumer  │\n                  │  need data at rest? │\n                  └─────────┬───────────┘\n                       ┌────┴────┐\n                      Yes       No\n                       │         │\n              ┌────────┴──┐   React &\n              │ Different │   Discard\n              │  shape?   │  (events)\n              └────┬──────┘\n              ┌────┴────┐\n             Yes       No\n              │         │\n          Replicate   Query\n           (CDC)     Source\n                     (API)\n```\n\n---\n\n## Lessons Learned\n\n**The cost of replication is not in the setup.** Setting up a Debezium connector takes an afternoon. Maintaining schema compatibility across consumers, debugging lag-induced inconsistencies, and running reconciliation jobs to detect drift, that's where the real cost lives. Teams consistently underestimate this because the feedback loop is months long.\n\n**\"What if the source is down?\" is not always a replication argument.** Sometimes the correct behavior during an upstream outage is to degrade, not to silently serve stale data. A payment authorized against a cached user tier that changed two hours ago is worse than a payment that fails gracefully and retries.\n\n**Mirrors masquerade as materialized views.** The most common CDC antipattern is a consumer that replicates a table into an identical schema in its own database. If you can describe the consumer's data model by saying \"it's the same as the source, but local,\" you almost certainly don't need CDC.\n\n**Schema coupling is still coupling.** Moving from API coupling to schema coupling via CDC doesn't decouple your services. It makes the coupling implicit, which is worse. At least APIs have versioning, contracts, and deprecation policies. A database schema has none of those when it's being consumed through a WAL stream.\n\n**CDC pipelines need SLOs.** If you do replicate, treat the pipeline as a production system. Define acceptable lag (e.g., p99 under 30 seconds), monitor consumer offsets, alert on schema changes, and have a runbook for when the pipeline breaks. Most teams set up CDC and forget it, then discover during an incident that the pipeline has been broken for days.\n\n---\n\n## What's Next\n\nThis post focused on the when and why of data replication. A natural follow-up is the mechanics of keeping replicas consistent once you've decided replication is justified:\n\n- **Schema evolution strategies:** How do you handle source schema changes without breaking consumers? Avro with a schema registry, Protobuf, or JSON Schema each have different guarantees.\n- **Lag monitoring and SLOs:** How do you define and measure \"fresh enough\" for each consumer? What's the operational playbook when lag spikes?\n- **Reconciliation patterns:** When you do detect drift between a replica and its source, how do you fix it without a full re-sync?\n\n---\n\n## References\n\n- [Debezium: Change Data Capture for Databases](https://debezium.io/)\n- [Turning the Database Inside-Out (Martin Kleppmann, 2015)](https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-kafka-samza/)\n- [Designing Data-Intensive Applications (Martin Kleppmann, O'Reilly)](https://dataintensive.net/)\n- [The Log: What Every Software Engineer Should Know About Real-Time Data (Jay Kreps)](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)\n- [Implement your own CDC using Kafka (gauravsarma.com)](https://www.gauravsarma.com/posts/custom-cdc-using-kafka-5ca716634126)\n",
            "url": "https://gauravsarma.com/posts/2026-04-06_when-cdc-becomes-a-crutch",
            "title": "When CDC Becomes a Crutch: A Philosophy on Data Replication",
            "summary": "Your orders service needs the user's email to send a confirmation.  The payments service needs the user's tier to apply a discount...",
            "date_modified": "2026-04-06T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-04-05_text-editor-data-structures",
            "content_html": "\nOpen a text editor, type a character, and it appears on screen. That single keystroke triggers a surprisingly deep question: how does the editor represent your document in memory so that insertions, deletions, and cursor movements all feel instant, even on a file with millions of lines?\n\nThe naive answer, a flat string or array of characters, falls apart fast. Insert a character at position 0 of a 10MB file and you are copying 10MB of data to make room. Real editors cannot afford this. The data structure behind the text buffer is one of the most consequential architectural decisions an editor makes, and different editors have made radically different choices.\n\n---\n\n## The Problem\n\nText editing looks simple but has a hostile access pattern for data structures. The operations are:\n\n- **Insert** a character or string at an arbitrary position\n- **Delete** a character or range at an arbitrary position\n- **Read** a range of text (for rendering, search, syntax highlighting)\n- **Navigate** to a line number or byte offset quickly\n\nThe challenge is that these operations happen at random positions throughout the document, often thousands of times per second (think: holding down backspace, or a find-and-replace across a large file). The user expects every operation to feel instantaneous.\n\nA flat array handles reads beautifully, O(1) random access, but insertions and deletions in the middle are O(n) because everything after the edit point must shift. For a 100-line config file this is invisible. For a 500,000-line log file, it is a visible stutter on every keystroke.\n\nThis is the core tension: data structures that are great for sequential reads tend to be terrible for random inserts, and vice versa. Every text editor buffer is a different answer to this trade-off.\n\n---\n\n## Prerequisites\n\n- Familiarity with basic data structures: arrays, linked lists, balanced binary trees\n- Understanding of big-O complexity notation\n- General awareness of how text editors render documents (viewport, cursor, selections)\n- For the CRDT section: a rough idea of what eventual consistency means in distributed systems\n\n---\n\n## Technical Decisions\n\n### The Four Contenders\n\nThe history of text editor buffers is essentially four major data structures, each born from a different era and set of constraints:\n\n| Data Structure | Notable Users | Era |\n|---|---|---|\n| Gap Buffer | Emacs, Scintilla | 1970s-present |\n| Piece Table | VS Code, AbiWord, original Word for Windows | 1980s-present |\n| Rope | Xi Editor, Crop, some game engines | 1990s-present |\n| CRDT (Yjs, Automerge, etc.) | Google Docs (OT variant), Zed, various collaborative editors | 2010s-present |\n\nEach one made a bet about what matters most: simplicity, memory efficiency, worst-case latency, or multi-user concurrency. None of them won universally.\n\n### Why Not Just Use a Linked List of Lines?\n\nEarly editors like ed and vi used arrays of lines, where each line was a separate string. This works reasonably well for line-oriented editing, but it has two fatal problems for modern use. First, operations that span lines (multi-cursor edits, block selections, large pastes) become complicated because you are working across array boundaries. Second, very long lines (minified JavaScript, for example) degenerate to the flat array problem within a single line. Most modern editors operate on a flat character sequence internally and derive line information separately through a line index.\n\n### Why Undo is a Data Structure Concern\n\nThe choice of buffer data structure deeply affects how undo/redo works. A gap buffer must snapshot or diff the content to support undo, because edits modify the buffer in place. A piece table, by contrast, is append-only: the original file content is never mutated, and every edit just adds a new piece descriptor. This means you can implement undo by simply removing the most recent piece table entries, essentially rewinding the edit history. This difference alone was a major reason the VS Code team chose piece tables.\n\n---\n\n## Implementation\n\n### The Gap Buffer: Emacs's Workhorse\n\nThe gap buffer is the oldest trick in the editor book and possibly the most elegant for its simplicity. The idea: store the document in a single contiguous array, but maintain an empty \"gap\" at the cursor position. When the user types, characters fill the gap. When the gap runs out, you grow the array (typically doubling, amortized O(1) for sequential inserts). When the cursor moves, you shift the gap to the new position by copying text across it.\n\nThe memory layout looks like this for the text \"Hello World\" with the cursor after \"Hello\":\n\n```\n[ H | e | l | l | o | _ | _ | _ | _ | _ | W | o | r | l | d ]\n                      ^                   ^\n                   gap start           gap end\n```\n\nTyping a space fills one gap slot. Moving the cursor to the end of \"World\" means copying \"World\" from after the gap to before it, then the gap sits at the end.\n\n**Where it shines:** The gap buffer is fast when edits are localized. Most human editing happens near the cursor: you type a line, maybe backspace a few characters, type some more. For this access pattern, the gap buffer is extremely fast, often just a single array write per keystroke with no allocations.\n\n**Where it breaks down:** Moving the cursor a large distance and then editing requires shifting the gap, which copies data proportional to the distance moved. On a 50MB file, jumping from the top to the bottom and inserting a character copies 50MB. This is why Emacs can sometimes pause noticeably on very large files when you jump around. Multi-cursor editing is also painful because you either need multiple gaps (complicated) or you are constantly shifting a single gap between cursor locations.\n\nThe gap buffer also has poor cache behavior when the gap is large. Modern CPUs love sequential memory access, and the gap creates a discontinuity that the prefetcher cannot bridge.\n\n**Complexity summary:**\n\n| Operation | Cost |\n|---|---|\n| Insert at cursor | O(1) amortized |\n| Insert at arbitrary position | O(n) to move gap |\n| Delete at cursor | O(1) |\n| Read across gap | O(1) but two-part copy |\n| Line index lookup | Requires auxiliary structure |\n\nEmacs has used a gap buffer since the late 1970s. Despite its limitations, it has survived because most editing really is local, and the simplicity means fewer bugs and easier maintenance than fancier structures.\n\n### The Piece Table: VS Code's Append-Only Log\n\nThe piece table was described by J Strother Moore in 1981, but it reached mainstream attention when the VS Code team published their analysis of why they chose it. The core idea is deceptively simple: never modify the original file content. Instead, maintain two buffers:\n\n1. The **original buffer**: the file as it was read from disk, immutable\n2. The **add buffer**: an append-only buffer where all new text goes\n\nThe document is described by a table of \"pieces,\" each of which points to a span in either the original buffer or the add buffer:\n\n```\nOriginal buffer: \"Hello World\"\nAdd buffer:      \" Beautiful\"\n\nPiece table:\n  [original, 0, 5]   → \"Hello\"\n  [add, 0, 10]       → \" Beautiful\"\n  [original, 5, 6]   → \" World\"\n\nLogical document: \"Hello Beautiful World\"\n```\n\nInserting text means: append the new text to the add buffer, then split the piece that contains the insertion point into two pieces and insert a new piece between them pointing to the appended text. Deletion means adjusting piece boundaries (or removing pieces entirely). The original buffer and previously appended text are never touched.\n\n**Where it shines:** Memory efficiency is outstanding for typical editing sessions. Opening a 10MB file uses ~10MB for the original buffer (which can be memory-mapped directly from disk), and the add buffer only grows by the amount of text you actually type, often a few KB. The piece table itself is a small list of descriptors.\n\nUndo is almost free: since the original buffer and add buffer are never modified, you can undo by reverting piece table entries. VS Code exploits this heavily.\n\nThe piece table also handles large file operations well. Deleting a 1MB block is just adjusting a few piece boundaries, not moving any text. Copy-paste of a large block within the same file can reference the same underlying buffer spans.\n\n**Where it breaks down:** Reading text is no longer a simple array index. To read a range, you must walk the piece table to find which pieces contain the range, then concatenate slices from potentially different buffers. For syntax highlighting and rendering, this means the editor must materialize text into a contiguous buffer for the renderer, or the renderer must understand the piece table abstraction.\n\nSequential character-by-character reads (like a regex engine scanning the file) pay overhead per piece boundary. If the piece table has thousands of entries after heavy editing, this cost adds up. VS Code mitigates this by storing the piece table in a balanced binary tree (a red-black tree) indexed by both offset and line number, giving O(log n) access to any position where n is the number of pieces.\n\n**Complexity summary (VS Code's tree-based implementation):**\n\n| Operation | Cost |\n|---|---|\n| Insert | O(log n) where n = piece count |\n| Delete | O(log n) |\n| Read at offset | O(log n) to find piece, then O(1) within piece |\n| Line number lookup | O(log n) via augmented tree |\n| Memory overhead | Original file + typed text + piece descriptors |\n\nOne subtle advantage: because the original file buffer is immutable, VS Code can detect external file modifications by comparing the on-disk content to the original buffer. If they match, the piece table is still valid. If not, the file was modified externally.\n\n### The Rope: Xi Editor's Balanced Tree of Strings\n\nThe rope data structure, formalized by Boehm, Atkinson, and Plass in their 1995 paper \"Ropes: an Alternative to Strings,\" takes a different approach entirely. Instead of one buffer with clever indexing, a rope breaks the text into chunks stored at the leaves of a balanced binary tree. Internal nodes store metadata: the total length of their left subtree (and often line counts, Unicode code point counts, or other metrics).\n\nA rope representing \"Hello Beautiful World\" might look like:\n\n```\n           [21]\n          /    \\\n       [5]     [16]\n       /       /    \\\n   \"Hello\"  [10]   [6]\n            /       |\n   \" Beautiful\"  \" World\"\n```\n\nEach leaf holds a chunk of text (typically 64 to 1024 bytes). Internal nodes cache aggregate information. To find the character at position 7, you walk from the root: the left subtree has 5 characters, so position 7 is at offset 2 in the right subtree. Walk right: the left child has 10 characters, so offset 2 is in that leaf. The character is 'e' (the 'e' in \"Beautiful\"). This walk is O(log n) where n is the number of leaves.\n\n**Where it shines:** Ropes have excellent worst-case behavior. Every operation, insert, delete, concatenation, split, is O(log n) regardless of where in the document it happens. There is no gap to move, no piece table to fragment. This makes ropes predictable, which matters for real-time editors that need consistent frame times.\n\nConcatenation is particularly cheap: to join two ropes, create a new root node with the two ropes as children. This is O(1) if you defer rebalancing (or O(log n) if you rebalance immediately). This makes operations like \"paste a 100MB chunk\" essentially instant.\n\nRopes also compose well with functional programming patterns. Since nodes are immutable (you create new nodes for edits rather than modifying in place), you get persistent data structures for free. Xi editor used this for its undo system: each edit creates a new rope that shares most of its nodes with the previous version. The memory overhead is proportional to the edit, not the document size.\n\nThe chunk-based structure also maps well to parallel processing. Syntax highlighting, word counting, and search can operate on chunks independently and combine results.\n\n**Where it breaks down:** Ropes have higher constant factors than gap buffers and piece tables for small documents. Each node is a heap allocation (or arena allocation), and tree traversal involves pointer chasing, which is hostile to CPU caches. For a 200-line file, a gap buffer will outperform a rope on every operation simply because the gap buffer is one contiguous allocation.\n\nThe implementation complexity is also significantly higher. Balancing the tree, managing chunk sizes (too small and you have overhead, too large and you lose the benefits), and maintaining augmented metadata through rotations requires careful engineering. Xi editor's rope implementation in Rust is roughly 2,000 lines of non-trivial code. Emacs's gap buffer logic is a fraction of that.\n\nReading a contiguous range of text requires visiting multiple leaves and copying their contents. For a renderer that needs a screen's worth of text (say, 80 columns by 50 rows = 4,000 characters), this might touch 4 to 60 leaves depending on chunk size and edit history, with a memory copy for each.\n\n**Complexity summary:**\n\n| Operation | Cost |\n|---|---|\n| Insert | O(log n) |\n| Delete | O(log n) |\n| Concatenate two ropes | O(log n) with rebalance |\n| Split at position | O(log n) |\n| Read at offset | O(log n) to find leaf |\n| Index by line number | O(log n) with augmented nodes |\n\nXi editor, which was developed at Google as an experimental high-performance editor, chose ropes precisely because of this predictable worst-case behavior. The project also used ropes as the wire format between the front-end and back-end processes, serializing rope diffs rather than full text snapshots. The project was archived in 2023, but its rope library (xi-rope) influenced several subsequent editors.\n\n### CRDTs: When Multiple People Edit the Same Document\n\nEverything above assumes a single user. The moment two users edit the same document simultaneously, the problem changes fundamentally. It is no longer enough to have an efficient buffer. You need the buffer to support concurrent, potentially conflicting edits and converge to the same state on all replicas without central coordination.\n\nThis is the domain of Conflict-free Replicated Data Types (CRDTs) and their predecessor, Operational Transformation (OT).\n\n**Operational Transformation** was the first approach, used by Google Docs and earlier collaborative editors. OT works by transforming operations against each other: if user A inserts at position 5 and user B deletes at position 3, then by the time A's operation reaches B's replica, position 5 is now position 4 (because B's deletion shifted everything). OT defines transformation functions that adjust operations to account for concurrent edits.\n\nOT works, Google Docs proves it at scale, but it has a painful property: the transformation functions must be correct for every possible pair of concurrent operations, and proving correctness is notoriously hard. The original OT paper had bugs. Many subsequent papers also had bugs. OT also typically requires a central server to determine the total ordering of operations, which limits architectural flexibility.\n\n**CRDTs** take a different approach. Instead of transforming operations after the fact, CRDTs design the data structure itself so that concurrent operations commute: applying them in any order produces the same result. For text editing, this usually means assigning a globally unique, ordered identifier to every character.\n\nIn a sequence CRDT like Yjs or Automerge, each character gets an ID that encodes both its position in the sequence and which replica created it. These IDs are designed so that the intended ordering can always be reconstructed, regardless of the order in which replicas receive operations.\n\nFor example, if user A types \"Hello\" and user B concurrently types \"World\" after the same anchor point, the CRDT's ID scheme ensures a deterministic merge: maybe \"HelloWorld\" or \"WorldHello\" depending on the tiebreaking rule, but always the same result on every replica.\n\n**Where CRDTs shine:** No central server required. Replicas can work offline, sync later, and converge automatically. This enables true peer-to-peer collaboration and offline-first editing. Zed, the collaborative code editor built in Rust, uses a CRDT for its buffer. So does the Ink & Switch research lab's suite of local-first applications.\n\nCRDTs also compose well: you can have a CRDT for the text content, a separate CRDT for cursor positions, another for comments or annotations, and they all merge independently.\n\n**Where CRDTs break down:** Memory overhead. Every character that has ever existed in the document (including deleted ones, which must be retained as \"tombstones\" for convergence) needs a unique ID. For a document with heavy editing, the metadata can exceed the text content by 2-10x. Yjs is remarkably efficient here, compressing runs of sequential inserts by the same user, but the overhead is still real.\n\nPerformance of the merge operation can also be surprising. While single-character inserts are typically O(log n), merging two replicas that have diverged significantly can be expensive as the CRDT must integrate many concurrent operations and resolve their ordering.\n\nThe \"intention preservation\" problem is also fundamental. When user A selects a word and bolds it while user B deletes that word, what should happen? CRDTs guarantee convergence (all replicas agree) but not necessarily that the result matches anyone's intention. These semantic conflicts still require application-level resolution.\n\n**Complexity summary:**\n\n| Operation | Cost | Notes |\n|---|---|---|\n| Local insert | O(log n) typical | Depends on CRDT implementation |\n| Local delete | O(log n) | Tombstone created, not actually removed |\n| Merge with remote | O(k log n) | k = number of remote operations |\n| Memory per character | ID + tombstone flag | 2-10x overhead vs plain text |\n| Convergence | Guaranteed | By mathematical construction |\n\n---\n\n## How It All Fits Together\n\nThe choice of buffer data structure ripples through the entire editor architecture:\n\n**Rendering pipeline:** A gap buffer can hand the renderer a near-contiguous block of memory (just skip the gap). A piece table requires materializing text from multiple pieces. A rope requires walking leaves. A CRDT must filter tombstones. The renderer's complexity is inversely proportional to the buffer's simplicity.\n\n**Syntax highlighting:** Modern editors use incremental parsing (often via tree-sitter) which needs to efficiently re-parse only the changed region. Ropes and piece tables naturally track edit boundaries, making incremental parse tree updates easier. Gap buffers require the editor to separately track what changed.\n\n**Undo/redo:** Piece tables and ropes (when used persistently) get undo nearly for free because they preserve history structurally. Gap buffers require an explicit undo stack with either snapshots or inverse operations. CRDTs can treat undo as a new operation (insert what was deleted, delete what was inserted) that propagates to all replicas, though this interacts complexly with concurrent edits.\n\n**Memory mapping:** Piece tables are uniquely suited to memory-mapped file I/O. The original buffer can be an mmap'd view of the file, meaning the OS handles paging and the editor uses no additional memory for unchanged content. Gap buffers cannot be mmap'd because they mutate the buffer in place. Ropes could theoretically mmap leaf nodes, but the chunk structure rarely aligns with file layout.\n\n**Large file performance:** For files over 100MB, piece tables and ropes maintain consistent performance because they avoid O(n) copies. Gap buffers become impractical unless the user only edits near the cursor. CRDTs are generally not optimized for large-file single-user editing.\n\nThe tradeoff space in summary:\n\n| Dimension | Gap Buffer | Piece Table | Rope | CRDT |\n|---|---|---|---|---|\n| Local edit speed | Excellent (at cursor) | Good | Good | Good |\n| Worst-case edit | Poor (O(n) gap move) | Good (O(log n)) | Good (O(log n)) | Varies |\n| Memory efficiency | Very good | Excellent | Good | Poor |\n| Undo complexity | High (explicit stack) | Low (structural) | Low (persistent) | Medium (operational) |\n| Implementation complexity | Low | Medium | High | Very high |\n| Multi-cursor support | Poor | Good | Good | Native |\n| Collaborative editing | Not supported | Not supported | Possible but complex | Native |\n| Cache friendliness | Good (near cursor) | Moderate | Poor (pointer chasing) | Poor |\n\n---\n\n## Lessons Learned\n\n**Locality of edits is the key insight.** The gap buffer survives because it exploits a deep truth about human editing: most edits happen near the cursor. Data structures that optimize for the common case rather than the worst case often win in practice. Emacs has been fast enough for 45 years on a data structure with O(n) worst-case behavior.\n\n**Immutability is a superpower for editors.** Both piece tables and persistent ropes demonstrate that never mutating existing content simplifies undo, crash recovery, and change detection. VS Code's piece table literally keeps the original file intact in memory. If the editor crashes, the original file is untouched on disk.\n\n**CRDTs change the question, not just the answer.** Moving from single-user to collaborative editing is not about finding a faster data structure. It is about accepting fundamentally different constraints: you must handle concurrent operations, you must preserve causality, and you must accept that \"correct\" sometimes means \"deterministically resolved but not what either user intended.\"\n\n**There is no universal winner.** The Zed editor uses CRDTs even for single-user editing because they built for collaboration from the start. Emacs uses a gap buffer because its extension system and editing model assume contiguous memory. VS Code uses a piece table because it handles large files and undo elegantly. Xi used ropes because it wanted predictable latency. Each choice reflects the editor's values and constraints, not a ranking of data structures.\n\n**The markdown-specific problem is actually the rendering problem.** Markdown editors like Obsidian, Typora, and notable do not typically use exotic buffer data structures. Their technical challenge is live preview: parsing markdown to an AST, rendering it, and keeping the rendered view synchronized with edits. Tree-sitter's incremental parsing makes this tractable by re-parsing only the changed subtree, but the complexity lives in the rendering pipeline, not the buffer.\n\n---\n\n## What's Next\n\nThis post covered the buffer, the core data structure that stores text. But a modern editor has at least three other data-structure-heavy subsystems worth exploring:\n\n- **The line index:** How do you go from byte offset to line:column and back? Augmented balanced trees, Fenwick trees, and cached newline arrays all appear in the wild.\n- **The syntax tree:** Tree-sitter builds an incremental concrete syntax tree that survives edits. How it does incremental re-parsing is its own deep topic.\n- **The selection model:** Multiple cursors, rectangular selections, and folded regions all need their own data structures that compose with the buffer.\n\nEach of these interacts with the buffer choice and inherits its trade-offs.\n\n---\n\n## References\n\n- [Ropes: An Alternative to Strings (Boehm, Atkinson, Plass, 1995)](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.9450&rep=rep1&type=pdf)\n- [Text Buffer Reimplementation (VS Code blog, 2018)](https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation)\n- [Data Structures for Text Sequences (Charles Crowley, 1998)](https://www.cs.unm.edu/~crowley/papers/sds.pdf)\n- [Xi Editor Rope Science (Raph Levien)](https://xi-editor.io/docs/rope_science_00.html)\n- [Yjs: A CRDT Framework for Shared Editing](https://github.com/yjs/yjs)\n- [Automerge: A JSON-like CRDT](https://automerge.org/)\n- [Tree-sitter: Incremental Parsing](https://tree-sitter.github.io/tree-sitter/)\n- [Zed Editor Architecture](https://zed.dev/blog)\n- [The Gap Buffer Data Structure (Emacs Internals)](https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html)\n",
            "url": "https://gauravsarma.com/posts/2026-04-05_text-editor-data-structures",
            "title": "The Data Structures Behind Text Editors: Gap Buffers, Piece Tables, Ropes, and CRDTs",
            "summary": "Open a text editor, type a character, and it appears on screen.  That single keystroke triggers a surprisingly deep question: how does the editor represent your document in memory so that insertions, deletions, and cursor movements all feel instant, even on a file with millions of lines...",
            "date_modified": "2026-04-05T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-31_split-brain-in-distributed-databases",
            "content_html": "\n![How Split Brain Happens in Distributed Databases and How It Gets Fixed](split-brain-in-distributed-databases-cover.png)\n\nYou have a three-node database cluster running in production. A network switch fails and the nodes can no longer talk to each other. Both sides of the partition think the other side is dead. Both promote themselves to primary. Both start accepting writes. When the network heals, you have two divergent histories of your data and no automatic way to reconcile them. This is split brain, and it has caused real outages at every scale, from startups to GitHub to the entire AWS us-east-1 region.\n\n---\n\n## The Problem\n\nDistributed databases replicate data across multiple nodes to survive hardware failures. In normal operation, one node (the primary, leader, or master) accepts writes and replicates them to followers. If the primary fails, a follower takes over. This is straightforward when failures are clean: the primary crashes, followers detect it, one gets elected.\n\nThe problem is that real failures are not clean. Network partitions do not announce themselves. From node A's perspective, node B might be dead, unreachable, or perfectly healthy but separated by a failed switch. Node A cannot tell the difference. If both sides of a partition independently decide \"I am the primary now,\" you get split brain: two nodes accepting conflicting writes to the same data.\n\nThe consequences are severe. Imagine two clients updating the same bank account balance on two different primaries. When the partition heals, which balance is correct? The answer is neither, both, or \"it depends on your conflict resolution strategy,\" none of which inspire confidence in a financial system.\n\nSplit brain is not a theoretical concern. It is the failure mode that motivates most of the complexity in distributed consensus protocols.\n\n---\n\n## Prerequisites\n\n- Understanding of primary/replica (leader/follower) replication at a conceptual level\n- Basic familiarity with what a network partition is\n- Awareness of CAP theorem helps but is not required\n- Knowing what a distributed consensus protocol does (not how it works internally)\n\n---\n\n## Technical Decisions\n\n### Why Not Just Use Timeouts?\n\nThe naive failure detector is a heartbeat with a timeout: if the primary does not respond within N seconds, declare it dead and elect a new one. This is exactly what causes split brain.\n\nThe primary might be alive but slow. A garbage collection pause, a saturated network link, or a CPU spike can all cause heartbeat delays without the node actually failing. If the timeout is too aggressive, you get false positives: a new primary is elected while the old one is still running and accepting writes. If the timeout is too conservative, you get long periods of unavailability while the system waits to be sure the primary is gone.\n\nThere is no timeout value that eliminates split brain. The fundamental issue is that failure detection in an asynchronous distributed system is inherently uncertain. This is the insight behind the FLP impossibility result: in an asynchronous system, you cannot distinguish a crashed process from a slow one.\n\n### Why Quorum-Based Approaches Win\n\nThe key insight that prevents split brain is: do not let any single node make unilateral decisions. Instead, require a majority (quorum) of nodes to agree before any state change takes effect.\n\nIn a cluster of N nodes, a quorum is `floor(N/2) + 1`. For a three-node cluster, the quorum is 2. For five nodes, it is 3. The critical property: if the cluster splits into two groups, at most one group can contain a majority. The minority side literally cannot form a quorum, so it cannot elect a leader or commit writes.\n\nThis is why distributed databases run on odd numbers of nodes. A two-node cluster has a quorum of 2, meaning both nodes must agree for anything to happen. A network partition between them brings the entire cluster down, which defeats the purpose. A three-node cluster only needs two nodes, so it survives a single-node failure or partition.\n\n---\n\n## Implementation\n\n### Phase 1: How Split Brain Actually Happens\n\nLet us trace through the failure step by step with a three-node PostgreSQL cluster using streaming replication and a failover manager like Patroni.\n\n```\nNormal operation:\n  Node A (primary) ──replication──> Node B (sync replica)\n                   ──replication──> Node C (async replica)\n\n  Clients ──writes──> Node A\n```\n\nA network partition isolates Node A from Nodes B and C, but B and C can still talk to each other:\n\n```\nAfter partition:\n  [Partition A]          |          [Partition B]\n  Node A (thinks it's    |    Node B ──── Node C\n   still primary)        |    (elect new primary?)\n                         |\n  Clients on this side   |    Clients on this side\n  still writing to A     |    can't reach A\n```\n\nNode B and Node C detect that Node A's heartbeat has stopped. After the configured timeout, they initiate a leader election. Node B wins. It promotes itself to primary and starts accepting writes.\n\nMeanwhile, Node A has no idea this happened. It never received a \"you are no longer primary\" message because the network is partitioned. It continues accepting writes from clients on its side of the partition.\n\nNow both Node A and Node B are accepting writes. You have split brain.\n\n```\nSplit brain state:\n  Node A: INSERT INTO orders (id, amount) VALUES (1001, 50.00);\n  Node B: INSERT INTO orders (id, amount) VALUES (1001, 99.00);\n\n  Same primary key, different data. Which one is right?\n```\n\n### Phase 2: Prevention with Fencing\n\nThe first line of defense is fencing: ensuring the old primary cannot accept writes after a new primary is elected. There are several mechanisms:\n\n**STONITH (Shoot The Other Node In The Head)**\n\nThe most aggressive approach. When the new primary is elected, it sends a hardware-level command to power off the old primary. This is common in traditional HA clusters using Pacemaker/Corosync.\n\n```bash\n# Pacemaker fencing agent example\nstonith_admin --fence node-a\n# This physically powers off node-a via IPMI/iLO/DRAC\n```\n\nSTONITH is effective but brutal. It requires out-of-band management hardware (IPMI, iLO) and introduces its own failure modes: what if the fencing command itself fails to reach the old primary?\n\n**Fencing Tokens (Logical Fencing)**\n\nA more elegant approach used by systems like ZooKeeper and etcd. Each leader election produces a monotonically increasing token (epoch number, term number, or lease version). Every write request must include the current fencing token. Storage systems reject writes with stale tokens.\n\n```\nLeader election 1: Node A gets token 42\nLeader election 2: Node B gets token 43\n\nNode A sends: WRITE(key=balance, value=50, token=42)\nStorage sees token 42 < current token 43 → REJECTED\n\nNode B sends: WRITE(key=balance, value=99, token=43)\nStorage sees token 43 = current token 43 → ACCEPTED\n```\n\nThis works even if Node A is still alive and thinks it is the primary. The storage layer enforces the invariant that only the most recently elected leader's writes are accepted. The old leader's writes silently fail.\n\n**Lease-Based Fencing**\n\nThe primary holds a time-limited lease. It can only accept writes while the lease is valid. To renew the lease, it must contact a quorum. If it is partitioned from the quorum, its lease expires and it stops accepting writes.\n\n```\nTimeline:\n  T=0:  Node A acquires lease (valid for 10s)\n  T=5:  Network partition happens\n  T=8:  Node A tries to renew lease, cannot reach quorum\n  T=10: Lease expires, Node A stops accepting writes\n  T=12: Node B acquires new lease from quorum, becomes primary\n```\n\nThe gap between T=10 and T=12 is intentional unavailability. The system chooses to be unavailable rather than risk split brain. This is the CP side of the CAP theorem in practice.\n\n### Phase 3: Prevention with Consensus Protocols\n\nModern distributed databases avoid split brain by design using consensus protocols. The two most widely deployed are Raft and Multi-Paxos.\n\n**Raft (used by etcd, CockroachDB, TiKV, Consul)**\n\nIn Raft, every write must be replicated to a majority of nodes before it is considered committed. A leader that is partitioned from the majority cannot commit any writes because it cannot get quorum acknowledgment.\n\n```\nNormal write in Raft (3-node cluster):\n\n  Client ──write──> Leader (Node A)\n  Node A ──AppendEntries──> Node B  ✓ (ACK)\n  Node A ──AppendEntries──> Node C  ✓ (ACK)\n  Quorum reached (2/3 including leader): COMMIT\n\nAfter partition (A isolated):\n\n  Client ──write──> Leader (Node A)\n  Node A ──AppendEntries──> Node B  ✗ (unreachable)\n  Node A ──AppendEntries──> Node C  ✗ (unreachable)\n  Cannot reach quorum: WRITE BLOCKS / TIMES OUT\n\nMeanwhile, B and C elect a new leader with a higher term:\n  Node B becomes leader (term 2)\n  Node B can reach Node C → quorum of 2/3 → writes succeed\n```\n\nWhen the partition heals, Node A discovers that a new leader with a higher term exists. It steps down, discards any uncommitted entries in its log, and replicates from Node B. No split brain, by construction.\n\nThe critical invariant in Raft is: a leader must have been elected by a majority, and every committed entry must be stored on a majority. Since any two majorities overlap in at least one node, a new leader is guaranteed to know about all previously committed entries.\n\n**Multi-Paxos (used by Google Spanner, variations in many systems)**\n\nMulti-Paxos works on a similar quorum principle but separates the concern differently. A leader is elected via a Paxos round (Phase 1), and then can issue writes without repeating Phase 1 for each operation (Phase 2 only). If the leader is partitioned, its Phase 2 messages will not reach a quorum, and a new leader will be elected via a new Phase 1 round with a higher ballot number.\n\nThe math is the same: two quorums always overlap, so you cannot have two leaders that can both commit writes.\n\n### Phase 4: Recovery After Split Brain\n\nDespite all prevention mechanisms, split brain can still happen in practice, especially in systems that prioritize availability over consistency (AP systems). When it does, you need a recovery strategy.\n\n**Last-Writer-Wins (LWW)**\n\nThe simplest approach: attach a timestamp to every write, and when conflicts are detected, keep the write with the latest timestamp.\n\n```\nNode A: SET balance = 50  (timestamp: 1711872000001)\nNode B: SET balance = 99  (timestamp: 1711872000002)\n\nAfter merge: balance = 99 (higher timestamp wins)\n```\n\nThis is simple but dangerous. It silently discards writes. If Node A processed a deposit and Node B processed a withdrawal, you just lost the deposit. DynamoDB and Cassandra both support LWW, but the documentation is very clear about the trade-off.\n\nClock skew makes LWW even worse. If Node A's clock is ahead, its writes always win regardless of when they actually happened. This is why Spanner uses TrueTime (GPS-synchronized clocks with bounded uncertainty) instead of relying on system clocks.\n\n**CRDTs (Conflict-free Replicated Data Types)**\n\nCRDTs are data structures designed so that concurrent updates can always be merged without conflicts. A G-Counter (grow-only counter), for example, tracks increments per node and sums them on read:\n\n```\nNode A counter: {A: 5, B: 0}  (Node A saw 5 increments)\nNode B counter: {A: 0, B: 3}  (Node B saw 3 increments)\n\nMerged: {A: 5, B: 3} → total = 8\n```\n\nNo data is lost, but CRDTs only work for data structures that have a natural merge operation. A counter merges easily. A bank account balance does not, because you need to enforce constraints (balance >= 0) that require coordination.\n\nRiak was the most prominent database to build around CRDTs. Redis also supports CRDT-based conflict resolution in its active-active geo-replication.\n\n**Application-Level Resolution**\n\nSome systems punt the problem to the application. CouchDB stores all conflicting revisions and lets the application decide which one to keep. This is maximally flexible but puts the burden on the developer, and in practice many applications simply pick a winner arbitrarily, which is LWW with extra steps.\n\n### Phase 5: Rollback Mechanics After Split Brain\n\nConflict resolution picks a winner. Rollback is the harder problem: undoing the loser's writes without corrupting the data that survived. The mechanics differ significantly between systems.\n\n**Raft Log Truncation**\n\nIn Raft-based systems, rollback is baked into the protocol. When a partitioned leader (Node A, term 1) rejoins the cluster and discovers a new leader (Node B, term 2), it compares logs. Any entries in Node A's log that are not present in Node B's log (the authoritative leader) are _uncommitted_ by definition, because they never reached a quorum. Node A truncates its log back to the point where it diverges from Node B's log, then replays Node B's entries forward.\n\n```\nNode A log (stale leader, term 1):\n  [1:1] [1:2] [1:3] [1:4] [1:5]\n                      ↑ diverges here\n\nNode B log (current leader, term 2):\n  [1:1] [1:2] [1:3] [2:1] [2:2] [2:3]\n\nAfter rollback on Node A:\n  [1:1] [1:2] [1:3] [2:1] [2:2] [2:3]\n                      ↑ entries [1:4] and [1:5] are discarded\n```\n\nThe key safety property: entries [1:4] and [1:5] were never committed (never ACKed to clients as durable), so discarding them does not violate any promise the system made. Clients that sent those writes received timeouts or errors, not success responses. This is why Raft-based systems only acknowledge a write after quorum replication, never before.\n\nIn CockroachDB, this log truncation happens at the Raft layer, but there is an additional concern: those uncommitted writes may have partially applied side effects in the storage engine (RocksDB/Pebble). CockroachDB handles this with its MVCC (multi-version concurrency control) layer. Uncommitted writes exist as intents, which are cleaned up during the rollback process. No committed data is affected.\n\n**PostgreSQL: Timeline Divergence and pg_rewind**\n\nPostgreSQL does not use a consensus protocol for replication. When split brain happens in a PostgreSQL HA cluster (two nodes both acting as primary), the divergence is at the WAL (write-ahead log) level. Both nodes generated WAL records from the same starting point but with different content.\n\nAfter the partition heals, the old primary cannot simply reconnect as a replica. Its WAL has diverged, it has data pages on disk that reflect writes the new primary never saw. You have three options:\n\n1. **Rebuild from scratch**: `pg_basebackup` the entire database from the new primary. Safe but slow, especially for large databases (hours for terabyte-scale).\n\n2. **pg_rewind**: A targeted rollback tool. It reads the new primary's WAL to find the exact point of divergence, then copies only the changed data pages from the new primary to the old one. The old primary's divergent WAL is discarded.\n\n```bash\n# On the old primary (Node A), after it has been stopped:\npg_rewind --target-pgdata=/var/lib/postgresql/data \\\n          --source-server=\"host=node-b port=5432 user=rewind_user\"\n\n# pg_rewind does:\n# 1. Finds the timeline divergence point in the WAL\n# 2. Reads all WAL records on the new primary since divergence\n# 3. Identifies which data pages were modified\n# 4. Copies those pages from the new primary to the old one\n# 5. Old primary can now start as a replica of Node B\n```\n\nThe critical requirement for `pg_rewind` is that `wal_log_hints` or `data_checksums` must be enabled. Without these, `pg_rewind` cannot reliably identify which pages changed. Patroni enables `wal_log_hints` by default for exactly this reason.\n\n3. **Manual WAL inspection**: In the worst case, a DBA can use `pg_waldump` to inspect the divergent WAL records on both sides, identify what writes were lost, and manually reconcile them. This is a last resort, but it is sometimes the only option when the lost writes had real-world side effects (emails sent, payments initiated).\n\n```bash\n# Inspect divergent WAL on the old primary\npg_waldump /var/lib/postgresql/data/pg_wal/000000020000000000000042 \\\n  --start=0/4200000 --end=0/4300000\n\n# Output shows individual record types:\n# rmgr: Heap    len: 54  tx: 1234  INSERT off 3 blk 0: rel 1663/16384/16385\n# rmgr: Btree   len: 64  tx: 1234  INSERT_LEAF off 42 blk 0: rel 1663/16384/16389\n```\n\n**MySQL/MariaDB with GTID-Based Rollback**\n\nMySQL's Global Transaction Identifiers (GTIDs) make divergence detection straightforward. Each transaction gets a unique ID in the format `server_uuid:sequence_number`. After split brain, the two primaries have GTID sets that diverged:\n\n```\nNode A GTID set: uuid-a:1-100, uuid-b:1-50\n  (Node A originated transactions 1-100, replicated B's 1-50 before split)\n\nNode B GTID set: uuid-a:1-80, uuid-b:1-70\n  (Node B only saw A's first 80, then originated its own 51-70)\n\nDivergent on Node A: uuid-a:81-100 (writes A made during partition)\nDivergent on Node B: uuid-b:51-70  (writes B made during partition)\n```\n\nTo roll back Node A and rejoin it as a replica of Node B, you need to undo transactions `uuid-a:81-100`. MySQL does not have a built-in \"undo these GTIDs\" command. The options are:\n\n- **mysqlbinlog with --exclude-gtids**: Extract the divergent binlog events, generate reverse SQL statements, and apply them. Tools like `gh-ost` or `pt-online-schema-change` can help, but this is manual and error-prone.\n- **Clone plugin**: MySQL 8.0+ can clone a fresh copy of the data from the new primary, similar to `pg_basebackup`. Faster than a full dump/restore but still requires downtime on the rejoining node.\n- **Group Replication automatic rollback**: If you are using MySQL Group Replication (InnoDB Cluster) instead of async replication, the rejoining node automatically rolls back divergent transactions using the `group_replication_applier` channel. This is the closest MySQL gets to Raft-style automatic rollback.\n\n**Cassandra: Rollback by Convergence**\n\nCassandra does not roll back in the traditional sense. As an AP system, it accepts that both sides of a split brain produced valid writes. Instead of picking a winner and discarding the loser, it converges through read repair and anti-entropy repair:\n\n- **Read repair**: When a client reads a key, the coordinator queries multiple replicas. If they disagree, the most recent value (by timestamp) wins, and stale replicas are updated in the background.\n- **Anti-entropy repair** (`nodetool repair`): A background process that compares Merkle trees of data ranges across replicas and synchronizes any differences.\n\nThe \"rollback\" in Cassandra is really \"eventual overwrite.\" Old values are not explicitly undone. They are superseded by newer values during the repair process. Tombstones (deletion markers) ensure that deletes on one side of the partition are not undone by stale reads from the other side.\n\n```\nDuring partition:\n  Node A: DELETE FROM users WHERE id = 42;  (tombstone at T=100)\n  Node B: SELECT * FROM users WHERE id = 42; → returns row (stale)\n\nAfter partition heals + read repair:\n  Tombstone (T=100) > row's last write (T=90)\n  → DELETE wins, row is removed from Node B\n  → Without tombstones, the delete would be \"resurrected\"\n```\n\nThis is why Cassandra has `gc_grace_seconds` (default 10 days): tombstones must survive long enough for all replicas to see them. If a node is down for longer than `gc_grace_seconds`, tombstones may be garbage collected before that node sees them, and deleted data can reappear. This is one of the most common operational surprises in Cassandra.\n\n---\n\n## How It All Fits Together\n\nThe defenses against split brain form layers:\n\n```\nLayer 1: Consensus Protocol (Raft, Paxos)\n  → Prevents split brain by requiring quorum for all commits\n  → A partitioned leader cannot commit writes\n\nLayer 2: Fencing (tokens, leases, STONITH)\n  → Prevents stale leaders from interacting with storage\n  → Even if consensus has a bug, the storage layer rejects stale writes\n\nLayer 3: Conflict Resolution (LWW, CRDTs, app-level merge)\n  → Handles the aftermath if split brain occurs despite layers 1 and 2\n  → Trade-offs between simplicity, correctness, and data loss\n```\n\nCP systems (etcd, ZooKeeper, CockroachDB, Spanner) invest heavily in layers 1 and 2 and aim to never reach layer 3. They accept temporary unavailability during partitions as the cost of avoiding split brain.\n\nAP systems (Cassandra, DynamoDB, Riak) accept that split brain will happen during partitions and invest in layer 3. They remain available but require careful application design to handle conflicts.\n\nThe choice between these is not a technical one. It is a product decision: is it worse for your users to see stale or conflicting data, or to see an error page? For a shopping cart, stale data is fine. For a wire transfer, an error page is the only safe option.\n\n---\n\n## Lessons Learned\n\n**Split brain is a spectrum, not a binary.** Partial partitions, where some nodes can reach some but not all other nodes, create scenarios that are harder to reason about than a clean two-way split. The \"Byzantine\" failure modes (nodes lying about their state) are even harder. Most production systems only handle crash-stop failures and clean partitions.\n\n**Testing split brain is harder than preventing it.** You can reason about Raft's correctness on paper, but you also need to verify that your specific implementation handles edge cases: clock skew, disk full, partial network failures, and leader elections during compaction. Tools like Jepsen have found split-brain bugs in almost every distributed database they have tested, including etcd, CockroachDB, and MongoDB.\n\n**Monitoring matters as much as prevention.** If split brain does happen, fast detection limits the damage. Track metrics like the number of active leaders (should always be 0 or 1), replication lag across replicas, and fencing token monotonicity. Alert on any of these violating expectations.\n\n**Operator error causes more split brain than software bugs.** Misconfigured timeouts, manual failovers without proper fencing, and \"temporary\" firewall rules that partition the cluster are far more common than actual consensus protocol bugs. The most common cause of split brain in PostgreSQL HA setups is someone manually promoting a replica without first shutting down the old primary.\n\n---\n\n## What's Next\n\nIf you want to go deeper, Jepsen's analysis reports are the gold standard for understanding how real distributed databases handle (or fail to handle) partitions. The Raft paper by Ongaro and Ousterhout is surprisingly readable and covers the leader election and log replication mechanisms in enough detail to implement them. For a more formal treatment, Lamport's \"Paxos Made Simple\" is the canonical reference, though \"simple\" is doing heavy lifting in that title.\n\n---\n\n## References\n\n- [In Search of an Understandable Consensus Algorithm (Raft paper)](https://raft.github.io/raft.pdf)\n- [Paxos Made Simple, Leslie Lamport](https://lamport.azurewebsites.net/pubs/paxos-simple.pdf)\n- [Jepsen: Distributed Systems Safety Research](https://jepsen.io/analyses)\n- [Designing Data-Intensive Applications, Martin Kleppmann, Chapter 8-9](https://dataintensive.net/)\n- [How to do distributed locking (Fencing tokens), Martin Kleppmann](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)\n- [CockroachDB Architecture: Replication Layer](https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html)\n- [Spanner: Google's Globally-Distributed Database](https://research.google/pubs/pub39966/)\n",
            "url": "https://gauravsarma.com/posts/2026-03-31_split-brain-in-distributed-databases",
            "title": "How Split Brain Happens in Distributed Databases and How It Gets Fixed",
            "summary": ". [How Split Brain Happens in Distributed Databases and How It Gets Fixed](split-brain-in-distributed-databases-cover...",
            "date_modified": "2026-03-31T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-15_checkpointing-without-stopping-the-world",
            "content_html": "\nYour database has gigabytes of dirty pages in memory. At some point they need to hit disk. The naive approach is to pause all writes, flush everything cleanly, and resume. It works, but it means your p99 latency spikes every few minutes, your write throughput drops to zero for hundreds of milliseconds, and your on-call team gets paged. Every major storage system has had to solve this. The solutions are more varied than you'd expect.\n\n---\n\n## The Problem\n\nA checkpoint has one job: produce a consistent snapshot of the database on disk so that, after a crash, recovery does not have to replay the entire write-ahead log from the beginning.\n\nThe tricky part is \"consistent.\" If you flush page 42 at time T1 and page 43 at time T2, and a transaction modified both between T1 and T2, you now have a disk image that never existed in memory. Recovering from that image gives you a corrupted database.\n\nThe brute-force solution is a \"sharp checkpoint\": freeze all writes, flush everything, unfreeze. You get a provably consistent image, but you also get a multi-hundred-millisecond stall. For a OLTP system doing 50,000 writes per second, that stall shows up as a cliff in your latency histogram every time the checkpoint fires.\n\nThe alternatives, used by virtually every production database, are collectively called \"fuzzy\" or \"online\" checkpointing. The core insight: you do not need to freeze the world if you have a way to reconstruct what the state _was_ at a specific point in time, even while the state continues to change.\n\n---\n\n## Prerequisites\n\n- Familiarity with write-ahead logging (WAL) at a conceptual level\n- Basic understanding of buffer pool management in databases\n- Knowing what \"LSN\" (Log Sequence Number) means helps for the PostgreSQL section\n- Awareness of what copy-on-write semantics are at the OS level\n\n---\n\n## The Approaches\n\n### Fuzzy Checkpointing with WAL Replay (PostgreSQL)\n\nPostgreSQL's checkpoint does not stop writes. Instead it does this:\n\n1. Record the current WAL position as the \"checkpoint start LSN\" (redo point).\n2. Begin scanning the buffer pool and writing dirty pages to disk in the background, via the `bgwriter` and `checkpointer` processes.\n3. While this is happening, normal write traffic continues. Pages that were already flushed can get dirtied again. That is fine.\n4. When all pages that were dirty at step 1 have been flushed, write a `CHECKPOINT` record to the WAL with the redo point from step 1.\n5. Update `pg_control` to record the new checkpoint location.\n\nThe result is not a clean snapshot. Some pages on disk reflect state after the redo point. But that is acceptable, because on crash recovery, PostgreSQL replays the WAL forward from the redo point. Any page written after the redo point will be overwritten with the correct version from the WAL. Pages written before the redo point are already durable.\n\nThe key invariant is not \"all pages are consistent with each other.\" It is \"all pages are at least as old as the redo point, and the WAL from the redo point forward is complete.\" Recovery corrects everything else.\n\n```\nTimeline:\n  LSN 1000: dirty pages start flushing  <-- redo point\n  LSN 1020: page 42 flushed (state from LSN 1005)\n  LSN 1040: page 43 flushed (state from LSN 1038, after redo point -- this is fine)\n  LSN 1050: CHECKPOINT record written\n\nCrash at LSN 1045:\n  Recovery replays WAL from LSN 1000 forward.\n  Page 42 gets replayed to its correct state.\n  Page 43 is already current.\n```\n\nOne subtlety: `full_page_writes`. The first time a page is modified after a checkpoint starts, PostgreSQL writes the _entire_ page image into the WAL, not just the change. This guards against partial writes: if the OS crashes mid-page-write, the full-page image in the WAL can restore the page before replaying the diff. It costs WAL volume but eliminates a whole class of corruption.\n\nThe cost of fuzzy checkpointing in PostgreSQL is I/O spread: the checkpointer deliberately throttles its write rate (controlled by `checkpoint_completion_target`, default 0.9) to avoid a burst of I/O that would starve foreground queries. You trade a short pause for a longer, gentler I/O ramp.\n\n---\n\n### Shadow Paging with WAL Checkpointing (SQLite WAL Mode)\n\nSQLite's WAL mode flips the architecture. Instead of writing to the main database file and logging changes separately, it writes _only_ to the WAL file during transactions. The main database file is the \"checkpoint,\" and it is always consistent because it is only updated during an explicit checkpoint operation.\n\nReads check the WAL first. If a page appears in the WAL, that version is used. Otherwise the main file is read. This means readers never block writers and writers never block readers, which is the headline feature of WAL mode.\n\nA checkpoint copies pages from the WAL back to the main database file. The tricky part: you cannot overwrite a WAL page that a current reader might still need. SQLite tracks this with \"read marks,\" a small array of frame numbers indicating the WAL position at which each active reader started. A checkpoint can only copy WAL frames up to the minimum read mark.\n\n```c\n// Simplified: SQLite WAL checkpoint logic\nfor (frame = 0; frame < wal->nBackfill; frame++) {\n    if (frame >= minReadMark) break;  // don't overwrite frames active readers need\n    copyFrameToDatabase(wal, frame);\n}\nwal->nBackfill = frame;\n```\n\nThe checkpoint is non-blocking by default (PASSIVE mode): it copies as many frames as it can without waiting for readers. Frames that active readers are sitting on get left in the WAL. The WAL never truncates until all frames can be checkpointed (or you use TRUNCATE mode and accept that readers might have to block briefly).\n\nThis means in write-heavy workloads, the WAL can grow unboundedly if a long-running reader is holding back the checkpoint. This is the main operational footgun in SQLite WAL mode.\n\n---\n\n### Fork-Based Snapshot (Redis BGSAVE)\n\nRedis keeps its entire dataset in memory. Persisting it to disk (the RDB file) requires serializing potentially gigabytes of data. Redis's answer: `fork()`.\n\n```\n$ redis-cli BGSAVE\nBackground saving started\n```\n\nWhen `BGSAVE` runs, Redis calls `fork()` to create a child process. The child gets a copy-on-write view of the parent's memory at the exact moment of the fork. The child then walks all the data structures and writes them to a new RDB file sequentially.\n\nThe parent continues serving writes. When the parent modifies a memory page, the OS creates a private copy for the parent, leaving the child's view (the original page) intact. The child always sees the consistent snapshot from the fork point, regardless of what the parent does.\n\n```\nParent process (writes continue):\n  [page A] -> modified, OS creates copy, parent gets new page\n  [page B] -> unmodified, parent and child share the same physical page\n\nChild process (reads from fork-point snapshot):\n  [page A] -> reads original version (before parent's write)\n  [page B] -> reads shared page (same as parent, no copy needed)\n```\n\nThe cost is memory. In the worst case, if every page is written during the fork, memory usage doubles. Redis exposes this as `rdb_changes_since_last_save` and `used_memory_rss`, and it is the reason why Redis instances need headroom above their working set size. A 16 GB Redis instance on a 20 GB host will run out of memory during a checkpoint under heavy write load.\n\nThe RDB file is written atomically: the child writes to a temp file and renames it over the old RDB on completion. If the child crashes, the old RDB is intact.\n\n---\n\n### Memtable Flush and Compaction Pipeline (RocksDB)\n\nRocksDB does not have a traditional checkpoint in the database sense. Writes go to a MemTable (an in-memory skip list), and when the MemTable reaches a size threshold, it is converted to an immutable MemTable and a new active MemTable is allocated. A background thread then flushes the immutable MemTable to an SSTable file on disk (Level 0).\n\n```\nWrite path:\n  WAL append (synchronous, configurable) --> MemTable insert\n                                              |\n                                    [MemTable full]\n                                              |\n                              Rotate to immutable MemTable\n                              Allocate new active MemTable\n                                              |\n                           [Background flush thread]\n                                              |\n                              Write L0 SSTable to disk\n```\n\nThe flush itself never blocks writes because the active MemTable is separate from the immutable one being flushed. Writes accumulate in the new active MemTable while the flush proceeds. The WAL guarantees durability: even if the flush has not finished, a crash can be recovered by replaying the WAL.\n\nRocksDB also supports `GetLiveFiles()` for point-in-time snapshots. This is used by tools like `rocksdb_checkpoint` and by TiKV for consistent backups. It works by flushing the MemTable to L0, then hardlinking all current SSTable files into a new directory. Hardlinks are instantaneous and the files are immutable once written, so this is a consistent snapshot with no write stall.\n\n```cpp\n// RocksDB checkpoint: flush memtable, then hardlink all SSTables\nStatus Checkpoint::CreateCheckpoint(const std::string& checkpoint_dir) {\n    // 1. Flush memtable to L0 SSTable\n    db_->Flush(FlushOptions());\n\n    // 2. Get list of all live SSTable files\n    std::vector<std::string> live_files;\n    db_->GetLiveFiles(live_files, &manifest_file_size);\n\n    // 3. Hardlink each SSTable into the checkpoint directory\n    for (const auto& file : live_files) {\n        env_->LinkFile(db_dir + file, checkpoint_dir + file);\n    }\n    // Hardlinks are atomic at the filesystem level -- no partial state possible\n}\n```\n\nThe compaction process (merging L0 through LN SSTables) runs entirely in the background and never blocks reads or writes. Reads consult all levels concurrently using a consistent view; the old SSTable files are not deleted until all active iterators pointing to them have been released.\n\n---\n\n### WiredTiger's Hazard Pointers and Checkpoint Cursor (MongoDB)\n\nWiredTiger, the storage engine behind MongoDB since 3.0, uses a B-tree structure with a checkpoint mechanism that is closer to PostgreSQL's fuzzy checkpoint but implemented with its own concurrency primitives.\n\nWiredTiger maintains two \"checkpoints\" at all times: the last durable checkpoint (on disk) and the in-progress one being built. When a checkpoint starts, it records the current \"stable timestamp\" (in MongoDB, this is coordinated with the replication system so only majority-committed writes are checkpointed). It then walks all modified B-tree pages and writes them to disk.\n\nConcurrent readers use \"hazard pointers\": before reading a page, a thread registers the page's address. The checkpoint process checks hazard pointers before evicting or overwriting a page, ensuring it does not free memory that a reader is actively using. This is a form of lock-free synchronization that avoids any global pause.\n\nThe checkpoint writes to a new location on disk rather than overwriting the old pages (WiredTiger uses append-only writes). When the checkpoint completes, it updates a small metadata file atomically. The old pages become garbage and are reclaimed on the next pass. If the process crashes mid-checkpoint, the metadata file still points to the previous valid checkpoint, and recovery replays the journal (WiredTiger's WAL) from that point.\n\n```\nDisk layout during checkpoint:\n  [checkpoint N: pages A, B, C at offset 0x1000, 0x2000, 0x3000]\n  [in-progress writes: pages A', B' at offset 0x8000, 0x9000]\n\n  Crash mid-checkpoint:\n    metadata.json still points to checkpoint N\n    Recovery replays journal from checkpoint N timestamp\n    Pages A', B' at 0x8000 are ignored (never committed)\n```\n\nMongoDB exposes the checkpoint interval via `storage.syncPeriodSecs` (default: 60 seconds). The checkpoint does not stall writes, but it does consume I/O bandwidth. On heavily loaded systems, this can cause latency spikes if the disk is saturated; the fix is usually faster storage or more aggressive `wiredTigerCacheSizeGB` tuning to reduce the dirty page ratio.\n\n---\n\n## How It All Fits Together\n\nEvery non-blocking checkpoint strategy reduces to one of three primitives, or a combination:\n\n```\n1. Record where you are, flush async, replay the log forward from that point\n   (PostgreSQL fuzzy checkpoint, WiredTiger)\n\n2. Write to a side channel, checkpoint = merge side channel back to main store\n   (SQLite WAL, RocksDB L0 flush)\n\n3. Fork the process to get a copy-on-write snapshot, serialize from the child\n   (Redis BGSAVE)\n```\n\n```\nThe trade-offs follow directly from the primitive:\n\n| System | Primitive | Write stall | Memory overhead | Recovery cost |\n|--------|-----------|-------------|-----------------|---------------|\n| PostgreSQL | WAL + fuzzy flush | None (I/O spread) | Low | Replay from redo point |\n| SQLite WAL | Side-channel merge | Brief (TRUNCATE mode) | Low (WAL file) | WAL replay |\n| Redis BGSAVE | fork() | None | Up to 2x RSS | None (RDB is full snapshot) |\n| RocksDB | Immutable flush | None | MemTable per flush | WAL replay to L0 |\n| WiredTiger | Append-only + hazard ptrs | None (I/O bound) | Low | Journal replay |\n```\n\n---\n\n## Lessons Learned\n\nThe \"no stall\" claim in most systems documentation is technically true but practically incomplete. PostgreSQL does not pause writes during a checkpoint, but it does throttle them via `checkpoint_completion_target` to spread I/O. Redis does not stall the parent, but the child's memory pressure can trigger OOM or swap thrashing. RocksDB flushes do not stall unless you hit the write buffer limit and the flush thread falls behind.\n\nThe practical lesson: checkpoint behavior is only observable under load. A system that checkpoints cleanly at 10% write saturation may stall badly at 80% because the background flush cannot keep up with the incoming write rate. Tuning checkpoint aggressiveness (frequency, write rate, buffer size) is always workload-specific.\n\nThe other non-obvious cost is recovery time. A fuzzy checkpoint is cheap to produce but more expensive to recover from, because recovery must replay the WAL forward. A full snapshot (Redis RDB, RocksDB checkpoint via `GetLiveFiles()`) has a higher upfront cost but zero WAL replay on restart. For systems with multi-hour WAL streams, the recovery time difference matters a lot.\n\n---\n\n## What's Next\n\nThe next layer of this problem is distributed checkpointing: how do you produce a consistent snapshot across multiple nodes without a global pause? Chandy-Lamport gives you the theoretical model, but systems like Flink (asynchronous barrier snapshotting) and Spanner (TrueTime-based snapshot reads) have had to bend those ideas considerably to make them work at production scale. That is a different post.\n\n---\n\n## References\n\n- [PostgreSQL Documentation: WAL Configuration](https://www.postgresql.org/docs/current/wal-configuration.html)\n- [SQLite WAL Mode](https://www.sqlite.org/wal.html)\n- [Redis Persistence](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/)\n- [RocksDB Wiki: Checkpoints](https://github.com/facebook/rocksdb/wiki/Checkpoints)\n- [WiredTiger: Checkpoint Overview](https://source.wiredtiger.com/develop/checkpoint.html)\n- [The Chubby Lock Service (Google)](https://research.google/pubs/the-chubby-lock-service-for-loosely-coupled-distributed-systems/)\n- [Flink Asynchronous Barrier Snapshotting](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/)\n",
            "url": "https://gauravsarma.com/posts/2026-03-15_checkpointing-without-stopping-the-world",
            "title": "How Databases Checkpoint to Disk Without Stopping the World",
            "summary": "Your database has gigabytes of dirty pages in memory.  At some point they need to hit disk...",
            "date_modified": "2026-03-15T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-11_cursor-pagination-vs-offset-pagination",
            "content_html": "\n![Cursor vs Offset Pagination](cursor-pagination-vs-offset-pagination-cover.png)\n\nYour API returns the first page of results in 12ms. Page 10 takes 45ms. Page 100 takes 800ms. The query hasn't changed. The table hasn't grown. The only thing that changed is the offset. This is not a mystery once you understand what the database is actually doing, and it is entirely avoidable.\n\n---\n\n## The Problem\n\nMost APIs are built with offset pagination first because it maps naturally to how humans think about pages. \"Give me items 0 through 10, then 10 through 20.\" It also maps naturally to the SQL you already know:\n\n```sql\nSELECT * FROM posts ORDER BY created_at DESC LIMIT 10 OFFSET 100;\n```\n\nThe problem is what happens inside the database when you run this. The query planner cannot seek directly to row 100. It must scan the index from the beginning, count 100 rows, discard them, and then return the next 10. At offset 100 that cost is small. At offset 100,000 on a busy table, you are discarding 100,000 rows per request, every request, for every user sitting on a late page.\n\nThere is a second problem: **drift**. If a new row is inserted into the table between the time a client fetches page 1 and page 2, every subsequent page shifts by one. Items get duplicated or silently skipped. This is not theoretical. It happens on any live table with ongoing writes.\n\n---\n\n## Prerequisites\n\n- Familiarity with SQL: SELECT, WHERE, ORDER BY, indexes\n- Basic understanding of how database indexes work (B-tree lookup vs sequential scan)\n- Some exposure to building or consuming paginated REST APIs\n\n---\n\n## The Three Approaches\n\n### Offset Pagination\n\n```sql\nSELECT id, title, created_at\nFROM posts\nORDER BY created_at DESC\nLIMIT 10 OFFSET $offset;\n```\n\nThe client tracks a page number or offset integer and increments it on each request. Simple to implement, simple to reason about, and completely broken at scale.\n\n**Cost**: O(offset + page_size). The database must touch every row before the offset to count past them.\n\n**Drift**: any write between pages can shift results.\n\n**Random access**: works. You can jump to page 47 with `OFFSET 470`.\n\n### Cursor Pagination\n\nThe client instead passes the position of the last item it saw, and the server fetches rows *after* that position:\n\n```sql\n-- First page (no cursor)\nSELECT id, title, created_at\nFROM posts\nORDER BY id DESC\nLIMIT 10;\n\n-- Subsequent pages (cursor = last seen id)\nSELECT id, title, created_at\nFROM posts\nWHERE id < $last_seen_id\nORDER BY id DESC\nLIMIT 10;\n```\n\nThe `WHERE id < $last_seen_id` clause turns this into an index seek. The database goes directly to the position in the B-tree and reads forward. Cost is O(log N + page_size) regardless of how far into the dataset you are.\n\n**Cost**: O(log N + page_size). Constant with respect to pagination depth.\n\n**Drift**: none. The cursor encodes an absolute position, not a relative count.\n\n**Random access**: impossible. You cannot jump to page 47 without traversing pages 1 through 46 first.\n\n### Keyset Pagination\n\nKeyset pagination is the generalization of cursor pagination for arbitrary sort orders. When your sort column is not unique (common in practice: `created_at`, `score`, `price`), you add a tiebreaker:\n\n```sql\n-- Sort by created_at DESC, id DESC (stable, unique composite key)\nSELECT id, title, created_at\nFROM posts\nWHERE (created_at, id) < ($last_ts, $last_id)\nORDER BY created_at DESC, id DESC\nLIMIT 10;\n```\n\nThe tuple comparison `(created_at, id) < ($last_ts, $last_id)` matches PostgreSQL and most other databases' row value comparisons. The composite index on `(created_at, id)` makes this an index seek rather than a scan.\n\n---\n\n## Technical Decisions\n\n### Cursor column requirements\n\nNot every column works as a cursor. The requirements are strict:\n\n1. **Indexed**: the column must be part of an index the database can seek on.\n2. **Monotonic or stable for your sort order**: the cursor encodes a position in the sort order, so the sort must be deterministic.\n3. **Unique (or made unique via tiebreaker)**: if two rows have the same cursor value, the `WHERE` clause will skip all of them or return duplicates.\n\nAuto-increment integer IDs satisfy all three naturally. UUIDs do not work unless they are time-ordered (UUIDv7, ULID). `created_at` timestamps are not unique, so you always need `(created_at, id)` as a composite cursor.\n\n### Opaque cursors\n\nExposing raw `id` or `timestamp` values as the cursor leaks schema internals to clients and creates fragile contracts. If you later switch from integer IDs to UUIDs, every client breaks.\n\nThe standard practice is to base64-encode the cursor value:\n\n```go\n// Encode: serialize the cursor payload and base64 it\ntype CursorPayload struct {\n    CreatedAt time.Time `json:\"created_at\"`\n    ID        int64     `json:\"id\"`\n}\n\nfunc EncodeCursor(p CursorPayload) string {\n    b, _ := json.Marshal(p)\n    return base64.StdEncoding.EncodeToString(b)\n}\n\nfunc DecodeCursor(s string) (CursorPayload, error) {\n    b, err := base64.StdEncoding.DecodeString(s)\n    if err != nil {\n        return CursorPayload{}, err\n    }\n    var p CursorPayload\n    return p, json.Unmarshal(b, &p)\n}\n```\n\nThe API response includes the cursor for the next page:\n\n```json\n{\n  \"items\": [...],\n  \"next_cursor\": \"eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xMVQxMjowMDowMFoiLCJpZCI6NDJ9\"\n}\n```\n\nThe client passes `?cursor=eyJ...` on the next request. You can change the internal encoding at any time without breaking the contract, as long as you version or gracefully handle old cursors.\n\n### Forward-only is a real constraint\n\nCursor pagination does not support backward navigation or random page access without significant additional complexity. If your product has \"page N of M\" UI with a page number input, cursor pagination forces you to either:\n\n- Drop the random-access feature\n- Pre-paginate results and cache page cursors server-side\n- Accept offset pagination's costs for this specific use case\n\nMany consumer products (Twitter, Instagram, GitHub notifications) use cursor-based infinite scroll precisely because the UX does not require random page access.\n\n---\n\n## Implementation\n\n### Setting up the index\n\nBefore writing any application code, make sure the index exists. A missing index turns a cursor seek into a full table scan:\n\n```sql\n-- For cursor on id only (simpler case)\nCREATE INDEX IF NOT EXISTS posts_id_desc ON posts (id DESC);\n\n-- For cursor on (created_at, id) composite\nCREATE INDEX IF NOT EXISTS posts_created_id ON posts (created_at DESC, id DESC);\n```\n\nPostgreSQL's query planner will use these for the tuple comparison `WHERE (created_at, id) < ($1, $2)`.\n\n### The query\n\n```sql\n-- No cursor (first page)\nSELECT id, title, body, created_at\nFROM posts\nORDER BY created_at DESC, id DESC\nLIMIT $1;\n\n-- With cursor\nSELECT id, title, body, created_at\nFROM posts\nWHERE (created_at, id) < ($1, $2)\nORDER BY created_at DESC, id DESC\nLIMIT $3;\n```\n\nIn Go with `database/sql`:\n\n```go\nfunc ListPosts(ctx context.Context, db *sql.DB, cursor *CursorPayload, limit int) ([]Post, *CursorPayload, error) {\n    var (\n        rows *sql.Rows\n        err  error\n    )\n\n    if cursor == nil {\n        rows, err = db.QueryContext(ctx, `\n            SELECT id, title, body, created_at\n            FROM posts\n            ORDER BY created_at DESC, id DESC\n            LIMIT $1\n        `, limit+1) // fetch one extra to detect if there's a next page\n    } else {\n        rows, err = db.QueryContext(ctx, `\n            SELECT id, title, body, created_at\n            FROM posts\n            WHERE (created_at, id) < ($1, $2)\n            ORDER BY created_at DESC, id DESC\n            LIMIT $3\n        `, cursor.CreatedAt, cursor.ID, limit+1)\n    }\n    if err != nil {\n        return nil, nil, err\n    }\n    defer rows.Close()\n\n    var posts []Post\n    for rows.Next() {\n        var p Post\n        if err := rows.Scan(&p.ID, &p.Title, &p.Body, &p.CreatedAt); err != nil {\n            return nil, nil, err\n        }\n        posts = append(posts, p)\n    }\n\n    // If we got limit+1 results, there is a next page\n    var nextCursor *CursorPayload\n    if len(posts) > limit {\n        last := posts[limit-1]\n        nextCursor = &CursorPayload{CreatedAt: last.CreatedAt, ID: last.ID}\n        posts = posts[:limit] // trim the extra row\n    }\n\n    return posts, nextCursor, rows.Err()\n}\n```\n\nThe `limit+1` trick avoids a separate `COUNT(*)` query to determine whether a next page exists. You fetch one more than you need: if you get it, there is a next page and the cursor points to the last item you actually return.\n\n### The HTTP handler\n\n```go\nfunc (h *Handler) ListPostsHandler(w http.ResponseWriter, r *http.Request) {\n    limit := 20\n    var cursor *CursorPayload\n\n    if raw := r.URL.Query().Get(\"cursor\"); raw != \"\" {\n        decoded, err := DecodeCursor(raw)\n        if err != nil {\n            http.Error(w, \"invalid cursor\", http.StatusBadRequest)\n            return\n        }\n        cursor = &decoded\n    }\n\n    posts, nextCursor, err := ListPosts(r.Context(), h.db, cursor, limit)\n    if err != nil {\n        http.Error(w, \"internal error\", http.StatusInternalServerError)\n        return\n    }\n\n    resp := map[string]any{\"items\": posts}\n    if nextCursor != nil {\n        resp[\"next_cursor\"] = EncodeCursor(*nextCursor)\n    }\n\n    w.Header().Set(\"Content-Type\", \"application/json\")\n    json.NewEncoder(w).Encode(resp)\n}\n```\n\n---\n\n## How It All Fits Together\n\nA client fetches the first page with no cursor. The server returns items plus `next_cursor`. The client stores the cursor and passes it as `?cursor=...` on the next request. The server decodes the cursor, uses it in a `WHERE (created_at, id) < (...)` index seek, returns the next page plus a new cursor. This continues until `next_cursor` is absent from the response, signalling the last page.\n\n```\nClient                          Server                      DB\n  |                               |                          |\n  |-- GET /posts ----------------->|                          |\n  |                               |-- SELECT ... LIMIT 21 -->|\n  |                               |<-- 21 rows --------------|\n  |<-- {items, next_cursor} ------|                          |\n  |                               |                          |\n  |-- GET /posts?cursor=eyJ... --->|                          |\n  |                               |-- SELECT ... WHERE (created_at,id) < (...) -->|\n  |                               |<-- 21 rows -------------------------------|\n  |<-- {items, next_cursor} ------|                          |\n```\n\nEvery request is an index seek at the same cost, regardless of which page you are on.\n\n---\n\n## Lessons Learned\n\n**Offset pagination is fine for small, stable datasets.** If your table has fewer than 10,000 rows and write volume is low, the offset cost is negligible and the simplicity is worth it. Optimise when you have a measured problem, not before.\n\n**Composite cursors are the rule, not the exception.** Pure `id`-based cursors only work when sorting by ID. The moment a client wants to sort by `created_at`, `score`, or any non-unique column, you need a composite cursor. Build the infrastructure for it once and all sort orders become easy.\n\n**The limit+1 trick is underused.** Many implementations do a separate `SELECT COUNT(*)` to determine if a next page exists. That count query is expensive on large tables and becomes a bottleneck as the table grows. Fetching one extra row is always cheaper.\n\n**Backwards pagination is genuinely hard.** If you need \"previous page\", you either need to store the cursor history client-side (feasible) or add a second query that reverses the sort direction. Neither is terrible, but neither is as clean as forward-only. Design your UX around this constraint early.\n\n**Do not sort by `RANDOM()` with cursor pagination.** Cursor pagination requires a stable, deterministic sort order. Randomised feeds need a different approach entirely (pre-generated feed tables, snapshot isolation, or accepting that cursor pagination does not apply).\n\n---\n\n## What's Next\n\nIf your dataset is large enough that even keyset pagination struggles (extremely high-cardinality columns, cross-shard queries), the next step is usually **pre-materialized feed tables** or **seek-based pagination with snapshot reads**. These are common patterns in high-scale social feeds but add significant infrastructure complexity.\n\nFor most APIs, keyset pagination on a composite index is the right answer and the ceiling for when it stops being enough is very high.\n\n---\n\n## References\n\n- [Markus Winand: \"We Need Tool-Support for Keyset Pagination\"](https://use-the-index-luke.com/no-offset)\n- [PostgreSQL Documentation: Row Comparisons](https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON)\n- [Slack Engineering: Evolving API Pagination at Slack](https://slack.engineering/evolving-api-pagination-at-slack/)\n- [Stripe API Pagination](https://stripe.com/docs/api/pagination)\n",
            "url": "https://gauravsarma.com/posts/2026-03-11_cursor-pagination-vs-offset-pagination",
            "title": "Cursor Pagination vs Offset Pagination: Which One Should You Use?",
            "summary": ". [Cursor vs Offset Pagination](cursor-pagination-vs-offset-pagination-cover...",
            "date_modified": "2026-03-11T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-09_mongodb-wiredtiger-vs-sqlite-storage",
            "content_html": "![MongoDB WiredTiger vs SQLite Storage](mongodb-wiredtiger-vs-sqlite-storage-cover.png)\n\n\nYou migrate a collection from MySQL to MongoDB expecting simpler operations and schema flexibility. Reads are fast at first. Then, as documents grow with nested arrays and embedded objects, some queries start taking ten times longer than expected. The collection isn't huge. The indexes are there. `explain()` shows the index is being used. What's happening underneath is that MongoDB's storage engine is managing pages, overflow references, reconciliation, and cache pressure in ways that have real costs, and those costs are invisible until you understand the storage model.\n\nThis post covers how WiredTiger, MongoDB's default storage engine since version 3.2, actually stores data on disk and in memory. It then compares that model directly to SQLite's fixed-page B-tree, which we covered in detail in the [SQLite overflow pages post](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages).\n\n---\n\n## Prerequisites\n\n- Familiarity with the SQLite storage model: fixed pages, B-trees, overflow chains. Read [SQLite Overflow Pages - When Your Rows Don't Fit](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) first if you haven't.\n- Basic understanding of what BSON is and how MongoDB documents are structured\n- Awareness of B-trees and B+ trees as data structures\n- Awareness of I/O and cache as performance concerns in databases\n\n---\n\n## SQLite's Model in One Paragraph\n\nSQLite stores everything in a flat array of fixed-size pages, 4KB by default. Each page is a node in a B-tree. Leaf pages hold rows packed as cells from the bottom up. When a row's data exceeds the per-cell threshold (roughly 4057 bytes on a 4KB page), SQLite stores the first portion inline and chains the remainder through a linked list of overflow pages. The entire file, from header to last page, uses the same fixed page size. There is no compression. The on-disk format and the in-memory format are essentially the same: the page cache holds exact copies of on-disk pages, unmodified in structure.\n\nThat model is simple and predictable. WiredTiger is neither.\n\n---\n\n## WiredTiger's Foundations\n\nWiredTiger is a general-purpose key-value storage engine. MongoDB uses it to store each collection as a WiredTiger B-tree, where the key is the document's `_id` field (serialized as BSON) and the value is the full document serialized as BSON. Indexes are stored as separate WiredTiger B-trees.\n\nTwo things set WiredTiger apart from SQLite's storage model from the start:\n\n**Variable-size pages.** WiredTiger does not use a single fixed page size for the entire database. Internal (non-leaf) pages default to 4KB. Leaf pages default to 32KB. These are configured per collection, not globally, and can be changed at collection creation time. The larger default leaf page size reflects WiredTiger's expectation that documents are bigger and more varied than SQLite rows.\n\n**Separate in-memory and on-disk formats.** SQLite's page cache holds exact copies of on-disk pages. WiredTiger does not. When a page is read from disk into the WiredTiger cache, it is decompressed and transformed into an in-memory representation that is structurally different from what's on disk. When a dirty page needs to be written back, it goes through a process called **reconciliation** that re-serializes and re-compresses the in-memory state into the on-disk format. This split is fundamental to how WiredTiger achieves its performance characteristics.\n\n---\n\n## The On-Disk Page Format\n\nEvery WiredTiger page on disk starts with two headers, then contains a sequence of cells.\n\nThe **page header** is 28 bytes and contains: the page type, the number of entries, the page's logical record count, and two checksums. The **block header** immediately follows and is 12 bytes, containing the on-disk size, the uncompressed size (for decompression), and a checksum.\n\nAfter the headers comes the cell data. Each cell encodes one key or one value using a compact variable-length format. A 1-byte cell descriptor encodes the cell type and, for short values, the length inline. Longer values use additional length bytes.\n\nA WiredTiger leaf page for a MongoDB collection looks like this:\n\n![WiredTiger leaf page on-disk format](mongodb-wiredtiger/page-format.mp4)\n\n```\n┌──────────────────────────────────────────────────────────────────┐\n│ Page Header (28 bytes)                                           │\n│   type=WT_PAGE_ROW_LEAF, entries=N, checksum=...                 │\n├──────────────────────────────────────────────────────────────────┤\n│ Block Header (12 bytes)                                          │\n│   disk_size, memsize, checksum                                   │\n├──────────────────────────────────────────────────────────────────┤\n│ Cell 0: key   [_id of document 0, BSON ObjectId, 12 bytes]      │\n│ Cell 1: value [full BSON document 0, variable length]           │\n│ Cell 2: key   [_id of document 1]                               │\n│ Cell 3: value [full BSON document 1]                            │\n│ ...                                                              │\n│ Cell N-1: key   [_id of document N/2]                           │\n│ Cell N:   value [full BSON document N/2]                        │\n└──────────────────────────────────────────────────────────────────┘\n         (entire page compressed on disk with snappy by default)\n```\n\nUnlike SQLite, which packs cells from the bottom of the page upward with a pointer array at the top, WiredTiger writes cells sequentially from the start of the data area. There is no pointer array. To find a specific key within a page, WiredTiger scans cells linearly or uses an in-memory search structure built when the page is loaded into cache.\n\n---\n\n## Compression: A Fundamental Difference\n\nWiredTiger compresses pages before writing them to disk. The default compression algorithm is Snappy, which is fast and achieves roughly 2:1 compression on typical BSON data. zlib and zstd are also available for higher compression ratios at a greater CPU cost.\n\nThe consequence is that the on-disk size of a page has no fixed relationship to its in-memory size. A 32KB leaf page in the WiredTiger cache might occupy only 14KB on disk. Reading that page from disk means reading 14KB, then decompressing to 32KB in memory. Writing means compressing from 32KB to some smaller size.\n\nSQLite has no equivalent. Its pages are written to disk exactly as they exist in the page cache: 4KB in, 4KB out. What you see on disk is what you get in memory. This makes SQLite's I/O model simpler to reason about but means it cannot reduce storage or I/O volume through compression.\n\nThe compression has a direct effect on the overflow threshold.\n\n---\n\n## Large Document Handling: Overflow in WiredTiger\n\nWiredTiger uses overflow pages for values that are too large to store on a leaf page without making that page unwieldy. The overflow threshold in WiredTiger is configurable but defaults to roughly one-quarter of the maximum leaf page size. For a 32KB leaf page, values larger than approximately 8KB are candidates for overflow storage.\n\nWhen a value exceeds the threshold, WiredTiger does not store any portion of it inline on the leaf page. The entire value is written to one or more dedicated overflow pages, and the leaf page stores a compact overflow reference: a 12-byte token that encodes the address of the overflow page on disk.\n\n![Overflow comparison: SQLite vs WiredTiger](mongodb-wiredtiger/overflow-comparison.mp4)\n\n```\nWiredTiger leaf page (32KB max)\n┌────────────────────────────────────────────────────────────┐\n│ Cell: key   [ObjectId]                                     │\n│ Cell: value [overflow ref → page offset 0x3A200, len=52KB] │ ← 12 bytes\n│ Cell: key   [ObjectId]                                     │\n│ Cell: value [full BSON document, 4KB]                      │ ← inline\n│ ...                                                         │\n└────────────────────────────────────────────────────────────┘\n                │\n                ▼\nWiredTiger overflow page\n┌────────────────────────────────────────────────────────────┐\n│ Page Header + Block Header                                 │\n│ Raw value data (52KB uncompressed, ~24KB compressed)       │\n└────────────────────────────────────────────────────────────┘\n```\n\nThis is the opposite of SQLite's behavior. SQLite always stores the first ~4057 bytes of an overflowing cell inline and chains the rest. WiredTiger stores nothing of the overflowing value inline: the leaf page holds only the 12-byte reference and the overflow page holds the complete value.\n\nThe practical implication: in SQLite, reading a row that overflows requires at least two reads: one for the leaf page (which gives you the first portion of the data) and one or more for the overflow pages. In WiredTiger, reading a document that overflows requires at least two reads too: one for the leaf page (which gives you the reference) and one for the overflow page (which gives you the complete document). For very large documents spread across multiple overflow pages, WiredTiger chains them similarly to SQLite's linked list of overflow pages.\n\nMongoDB enforces a 16MB limit on individual documents. This means overflow chains in WiredTiger are bounded: even in the worst case, a document requires at most a few hundred overflow pages. In practice, most documents that trigger overflow are in the tens to hundreds of kilobytes and occupy a single overflow page.\n\n---\n\n## The In-Memory Format and Reconciliation\n\nThis is where WiredTiger diverges most significantly from SQLite.\n\nWhen SQLite reads a page from disk into its page cache, the in-memory representation is the page itself: the same 4KB block, unmodified. Modifications are made directly to the in-memory page. When the page needs to be written back (during a checkpoint or WAL flush), the modified page is written as-is.\n\nWiredTiger is different at every step.\n\nWhen a page is read from disk, it is decompressed and then **split into separate in-memory structures**. The key-value pairs are unpacked from their compact cell encoding into a format that supports fast in-memory search and update. Specifically, each row on a leaf page becomes a skip list entry in memory, allowing O(log n) search within the page.\n\n![WiredTiger reconciliation: on-disk to in-memory and back](mongodb-wiredtiger/reconciliation.mp4)\n\n```\nOn disk (compressed, sequential cells):\n┌─────────────────────────────────────────────────────┐\n│ [cell: key0][cell: val0][cell: key1][cell: val1]... │\n│ (Snappy compressed, 14KB on disk)                   │\n└─────────────────────────────────────────────────────┘\n             │\n             │  read + decompress + unpack\n             ▼\nIn memory (skip list, uncompressed):\n┌─────────────────────────────────────────────────────┐\n│ WT_ROW entries:                                     │\n│   [key0 → val0 + update chain]                     │\n│   [key1 → val1 + update chain]                     │\n│   ...                                               │\n│   (32KB+ in memory, skip list indexed)             │\n└─────────────────────────────────────────────────────┘\n```\n\nUpdates to documents do not modify the in-memory value in place. Instead, WiredTiger prepends an **update** to a linked list hanging off that row's entry in the skip list. The update list is the in-memory MVCC mechanism: each update carries a transaction ID and a timestamp, and reads select the appropriate version by walking the update list.\n\nWhen the page is eventually written back (during eviction or checkpoint), it goes through **reconciliation**: WiredTiger walks every row in the in-memory page, selects the committed version visible to the checkpoint, serializes it into compact cell encoding, compresses the result, and writes the final block to disk. The on-disk page that results may be completely different in size and layout from the one that was originally read.\n\nThis reconciliation cost is real. A page with many small updates accumulates a long update chain. Reconciliation must walk the entire chain for every row on the page to determine the visible version. This is work that SQLite's WAL model avoids: in SQLite, the WAL contains the complete modified page, and a checkpoint simply copies WAL pages to the main database file.\n\n---\n\n## MVCC: Update Chains vs WAL Pages\n\nBoth MongoDB and SQLite support multi-version concurrency control, but they implement it differently.\n\nSQLite's WAL-mode MVCC works at the page level. When a page is modified, the new version of the entire page is written to the WAL file. Readers that started before the write see the original page (in the main database file). Readers that started after see the WAL version. Checkpointing copies WAL pages back to the main file. The page is the unit of versioning.\n\nWiredTiger's MVCC works at the row level. Each row in the in-memory skip list carries an update chain: a linked list of modifications ordered by transaction timestamp. A read at a given timestamp walks the update chain to find the first version visible at that timestamp. A write appends to the front of the chain. The row is the unit of versioning.\n\nThe row-level MVCC has implications:\n\nFor **write-heavy workloads** with many concurrent transactions updating different rows on the same page, WiredTiger's row-level MVCC is more efficient than page-level MVCC. In SQLite, any modification to a page causes the entire page to be written to WAL, which serializes writers at the page level. WiredTiger allows concurrent row-level updates within the same page.\n\nFor **long-running reads**, WiredTiger's update chains grow unbounded in memory until the old versions are no longer needed. A slow read transaction holds a timestamp that prevents older updates from being discarded, causing memory pressure. SQLite's page-level MVCC has the same problem in a different form: old WAL pages cannot be checkpointed until all readers that started before the corresponding write have finished.\n\n---\n\n## How the Page Cache Differs\n\nSQLite's page cache is simple: a fixed-size pool of 4KB slots. Each slot holds one page. When the cache is full, a least-recently-used page is evicted by writing it to disk (if dirty) and reclaiming the slot.\n\nWiredTiger's cache is more complex. The cache holds in-memory pages in their uncompressed, unpacked form. Because the in-memory representation is larger than the on-disk representation (due to unpacking and decompression), cache occupancy is measured in uncompressed bytes, not page count. A collection with 32KB leaf pages and 2:1 compression uses roughly twice as much cache space per page as disk space.\n\nWiredTiger monitors cache pressure through two thresholds: the **eviction trigger** (default 80% of configured cache size) and the **eviction target** (default 95%). When the cache exceeds the trigger, background eviction threads begin reconciling and evicting dirty pages. When the cache exceeds the target, foreground operations begin participating in eviction, which directly adds latency to reads and writes.\n\nThe configured cache size matters significantly. The default is the larger of 256MB or half of system RAM. For workloads with large documents, this can be consumed quickly by a relatively small number of pages.\n\n---\n\n## Comparison Summary\n\n```\n| Dimension                    | SQLite                                    | MongoDB (WiredTiger)                            |\n| ---------------------------- | ----------------------------------------- | ----------------------------------------------- |\n| Page size                    | Fixed, 4KB default (512B to 64KB)         | Variable; 4KB internal, 32KB leaf (configurable)|\n| On-disk format               | Same as in-memory (no transformation)     | Compressed; different from in-memory format     |\n| In-memory format             | Page cache holds exact disk pages         | Decompressed skip list with update chains        |\n| Overflow threshold           | ~4057 bytes (table leaf, 4KB page)        | ~8KB (one-quarter of 32KB leaf page)            |\n| Inline partial storage       | First ~4057 bytes inline, rest chained    | Nothing inline; full value on overflow page      |\n| Overflow chain structure     | Linked list of 4KB pages                  | Linked overflow pages (full value, compressed)  |\n| Compression                  | None                                      | Snappy by default (zlib, zstd available)        |\n| MVCC granularity             | Page-level (WAL pages)                    | Row-level (in-memory update chains)             |\n| Write-back mechanism         | Copy modified page to WAL                 | Reconciliation: re-serialize + compress page    |\n| Max storable value           | No hard limit (page size configurable)    | 16MB per document                               |\n| Document size limit          | Unlimited (row size = overflow chain)     | 16MB hard limit enforced by MongoDB layer       |\n\n```\n\n---\n\n## How It All Fits Together\n\n![Read path comparison: MongoDB WiredTiger vs SQLite](mongodb-wiredtiger/read-path-comparison.mp4)\n\nThe core structural difference is the extra transformation layer in WiredTiger. SQLite's page cache is a transparent mirror of the disk. WiredTiger's cache is a different data structure that happens to represent the same logical data as what's on disk, with the reconciliation process bridging the two forms.\n\n---\n\n## Lessons Learned\n\n**WiredTiger's larger default page size shifts the overflow threshold.** At 32KB leaf pages, documents up to ~8KB stay inline. Most MongoDB documents in typical workloads (user records, order documents, event logs) are well under 8KB in BSON and never trigger overflow. SQLite's 4KB page and ~4057 byte threshold means even moderately sized rows hit overflow much sooner. This is not an accident: MongoDB was designed with richer, larger documents in mind.\n\n**Compression changes the I/O math.** A 32KB in-memory page might write as 14KB to disk. SQLite's 4KB page always writes as 4KB. For storage-bound workloads, WiredTiger can achieve significantly higher effective throughput despite its larger page size. For CPU-bound workloads (compression and decompression on every page boundary), the calculus reverses.\n\n**The in-memory vs on-disk split makes cache sizing critical.** WiredTiger's cache holds pages in their uncompressed form. If your compression ratio is 2:1 and you have 100GB of data on disk, the effective working set in memory can be up to 200GB of uncompressed pages. Running WiredTiger with too small a cache means constant eviction, reconciliation overhead on every write path, and cache misses on reads.\n\n**Row-level MVCC is powerful but has hidden costs.** Long-running transactions that hold an old read timestamp prevent WiredTiger from discarding update chain entries. On a busy write workload, this can cause in-memory update chains to grow very long, making reconciliation expensive and increasing cache pressure. MongoDB's session timeout and transaction timeout settings exist partly to prevent this.\n\n**SQLite's simplicity is a genuine advantage for its use case.** The fixed-page, no-compression, page-is-the-unit-of-everything model is trivially understandable. You can reason about exactly how many I/Os a query will cost. You can compute overflow page counts with arithmetic. WiredTiger's model is more capable but significantly harder to reason about: cache occupancy, compression ratios, update chain lengths, and reconciliation timing all interact in ways that make performance prediction difficult without measurement.\n\n---\n\n## References\n\n- [SQLite Overflow Pages - When Your Rows Don't Fit](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) - SQLite storage internals, previous post in this series\n- [WiredTiger Storage Engine Documentation](https://source.wiredtiger.com/develop/arch-page.html)\n- [WiredTiger Architecture Guide - Pages](https://source.wiredtiger.com/develop/arch-page.html)\n- [MongoDB WiredTiger Storage Engine](https://www.mongodb.com/docs/manual/core/wiredtiger/)\n- [WiredTiger Source - Page Format](https://github.com/wiredtiger/wiredtiger/blob/develop/src/include/btree.h)\n- [SQLite File Format - Overflow Pages](https://www.sqlite.org/fileformat2.html#overflow_pages)\n- [MongoDB Production Notes - WiredTiger Cache](https://www.mongodb.com/docs/manual/administration/production-notes/#wiredtiger-cache)\n\n## Conclusion\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2026-03-09_mongodb-wiredtiger-vs-sqlite-storage",
            "title": "How MongoDB Stores Data - WiredTiger Pages vs SQLite",
            "summary": ". [MongoDB WiredTiger vs SQLite Storage](mongodb-wiredtiger-vs-sqlite-storage-cover...",
            "date_modified": "2026-03-09T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-08_sqlite-query-optimisation",
            "content_html": "![SQLite Query Optimisation](sqlite-query-optimisation-cover.png)\n\n\nYou write a query against a table with 500,000 rows and an index on the column you're filtering by. The query takes 800ms. You check the index is there. It is. You run `EXPLAIN QUERY PLAN` and see \"SCAN table\" where you expected \"SEARCH table USING INDEX\". The index exists but the planner chose not to use it. Why? The answer is almost never \"SQLite is broken\". It is almost always something you did that made the index inaccessible, or that made a full scan look cheaper than an index seek to the planner.\n\nThis post covers how SQLite's query planner works, how it makes decisions, and the specific patterns that cause it to make the wrong ones.\n\n---\n\n## The Problem\n\nSQLite's query optimiser is simpler than PostgreSQL's or MySQL's. It doesn't have a full cost-based planner with table statistics informing every decision. It uses a rule-based approach for a lot of choices, supplemented by lightweight statistics when `ANALYZE` has been run. This means the planner is predictable, but it also means the responsibility for giving it the right conditions falls more squarely on you.\n\nUnderstanding the planner well enough to avoid its blind spots is the difference between queries that run in single-digit milliseconds and queries that silently scan millions of rows every time they execute.\n\n---\n\n## Prerequisites\n\n- Familiarity with the previous posts in this series: [SQLite overflow pages](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) and the general concept of B-tree indexes\n- Basic understanding of what an index is and why it's faster than a full scan\n- Comfort reading SQL queries in prose form\n\n---\n\n## The Schema Used Throughout This Post\n\nRather than switching examples with every section, every concept in this post is grounded in one schema: a simple e-commerce database with four tables.\n\n`users` has 1,000,000 rows. Each row has an `id`, `email`, `name`, `status`, and `created_at`. Status values are `'active'` (900,000 rows), `'suspended'` (80,000), and `'deleted'` (20,000).\n\n`orders` has 5,000,000 rows. Each row has an `id`, `user_id`, `status`, `created_at`, and `total_amount`. Status values are `'delivered'` (3,500,000 rows), `'shipped'` (1,000,000), `'pending'` (400,000), and `'cancelled'` (100,000).\n\n`order_items` has 20,000,000 rows. Each row has an `id`, `order_id`, `product_id`, `status`, `quantity`, and `unit_price`. Status values are `'active'` (18,000,000 rows), `'returned'` (1,500,000), and `'refunded'` (500,000).\n\n`products` has 50,000 rows. Each row has an `id`, `name`, `category`, and `price`.\n\n```\nusers        (1,000,000 rows)   id | email | name | status | created_at\norders       (5,000,000 rows)   id | user_id | status | created_at | total_amount\norder_items  (20,000,000 rows)  id | order_id | product_id | status | quantity | unit_price\nproducts     (50,000 rows)      id | name | category | price\n```\n\nThis schema is intentionally skewed. The `orders.status` distribution is lopsided: most orders are delivered and a small fraction are pending or cancelled. That skew will matter when we get to statistics.\n\n---\n\n## How SQLite Executes a Query\n\nBefore getting into optimisation, it helps to understand what the planner is actually doing.\n\nSQLite compiles each SQL statement into a program for a register-based virtual machine called the **VDBE** (Virtual Database Engine). The compilation step is where the planner operates: it takes the parsed query and decides how to satisfy it: which indexes to use, what order to join tables, whether to sort or use an indexed order.\n\nThe compiled program is a sequence of opcodes. Each opcode manipulates a small set of registers and drives the cursor(s) that walk through the B-tree. There is no vectorised execution, no parallel workers, no pre-fetching pipeline. SQLite processes one row at a time, in order, using a nested-loop model. The efficiency of a query comes almost entirely from whether the planner can avoid reading rows that don't contribute to the result.\n\n![SQLite query lifecycle: from SQL text to result rows via VDBE](sqlite-query-optimisation/query-lifecycle.mp4)\n\n```\nQuery lifecycle:\n\nSQL text\n   │\n   ▼\nParser (builds AST)\n   │\n   ▼\nQuery planner (chooses access paths and join order)\n   │\n   ▼\nCode generator (emits VDBE opcodes)\n   │\n   ▼\nVDBE (executes opcodes, drives B-tree cursors)\n   │\n   ▼\nResult rows\n```\n\nThe planner's job is to answer one question for each table in the query: \"which rows do I need, and what is the cheapest way to find them?\"\n\n---\n\n## Indexes in SQLite\n\nSQLite indexes are B-trees. Each entry in an index B-tree stores the indexed column values followed by the rowid of the corresponding table row. The index entries are sorted by the indexed values.\n\nWhen a query filters on an indexed column, the planner can do a **binary search** through the index to find the starting point, then walk forward through the leaf pages to collect matching entries. Each entry gives a rowid; the planner then uses that rowid to do a point lookup in the table B-tree to fetch the full row.\n\nCompare that to a full table scan, which visits every leaf page of the table B-tree in order. For `orders` with 5,000,000 rows spread across tens of thousands of pages, a full scan reads every page. An index-driven lookup reads only the index pages needed to find the matching rowids, then one table page per matching row.\n\nThe relative cost depends on **selectivity**: how large a fraction of the table matches the filter condition. An index is useful when only a small fraction matches. When most rows match, the index lookup costs more than a scan, because each rowid match requires a random read into the table B-tree rather than the sequential reads a scan produces.\n\n---\n\n## Reading EXPLAIN QUERY PLAN\n\nBefore debugging a slow query, the first tool to reach for is `EXPLAIN QUERY PLAN`. It shows the access strategy the planner chose, without executing the query.\n\nThe output is a short table with one row per table access in the plan. The important column is `detail`, which describes the access method. The three things you'll see most:\n\n**SCAN table**: A full table scan. Every row is visited. This is not always wrong (for small tables or high-selectivity conditions, it may be the right choice), but it is the first thing to look for when a query is slow.\n\n**SEARCH table USING INDEX index_name**: An index seek. The planner found a usable index, performed a binary search to the starting point, and walked forward to collect matching rows.\n\n**SEARCH table USING COVERING INDEX index_name**: The planner found not just a usable index but one that contains all the columns the query needs. The table B-tree is never touched. This is the fastest possible access path for a filtered read.\n\nWhen you see SCAN where you expected SEARCH, that is the signal to investigate why the index was not used.\n\n---\n\n## When the Planner Uses an Index\n\nSQLite can use an index when the filter condition on the indexed column is one the planner knows how to translate into a range search on the index B-tree.\n\n### Equality conditions\n\nAn equality filter on an indexed column is the clearest case. Filtering `orders` where `user_id = 42` on a table with an index on `user_id` allows the planner to binary-search the index to the first entry where `user_id = 42`, walk forward collecting all matching entries, and stop when the value changes. For a user with 200 orders in a 5,000,000-row table, that is roughly 200 rows read instead of 5,000,000.\n\n### Range conditions\n\nRange filters on an indexed column also work. Filtering `orders` where `created_at > '2026-01-01'` on an indexed `created_at` column allows the planner to binary-search to the first entry at or after that timestamp and walk forward to the end of the index. The planner can use the index for any of the standard comparison operators: `<`, `>`, `<=`, `>=`, and `BETWEEN`.\n\n### Multi-column indexes and the left-prefix rule\n\nA multi-column index on `(status, user_id, created_at)` on `orders` sorts entries first by `status`, then by `user_id` within each status value, then by `created_at` within each user. The planner can use this index for any filter that references a **prefix** of the indexed columns from left to right.\n\nA filter on `status = 'pending'` uses the index, narrowing from 5,000,000 rows to roughly 400,000. A filter on `status = 'pending' AND user_id = 42` uses the index more selectively, narrowing to the handful of pending orders for that user. A filter on `status = 'pending' AND user_id = 42 AND created_at > '2026-01-01'` uses it most selectively of all.\n\nA filter on only `user_id = 42` or only `created_at > '2026-01-01'` cannot use this index. Without constraining `status` first, the matching entries are scattered throughout the entire index: there is no contiguous range to search.\n\nThis is the left-prefix rule. The index is useful for any prefix of its column list, from left to right, as long as each column in the prefix has an equality or range constraint. Once you have a range constraint on a column, the planner can use subsequent columns in the index for filtering but not for narrowing the initial search range.\n\n---\n\n## What Prevents Index Use\n\n![When SQLite uses an index: filter patterns that work vs patterns that don't](sqlite-query-optimisation/index-decision.mp4)\n\n### Functions applied to the indexed column\n\nThe most common mistake. Applying a function to the column value in a filter condition makes the index inaccessible.\n\nThe index on `users.email` stores raw email values in sorted order. A filter like `lower(email) = 'alice@example.com'` wraps the column in a function. The index stores `'Alice@example.com'`, not `'alice@example.com'`. The planner cannot binary-search the index for a value that doesn't appear in it, so it falls back to a full scan of 1,000,000 users.\n\nSimilarly, filtering `orders` with `strftime('%Y', created_at) = '2026'` wraps `created_at` in a function. The index on `created_at` stores raw timestamps; it has no entries for computed year values. A full scan of 5,000,000 orders follows.\n\nThe fix in each case is to move the transformation to the other side of the comparison. Instead of `lower(email) = 'alice@example.com'`, enforce lowercase at write time and filter where `email = 'alice@example.com'`. Instead of `strftime('%Y', created_at) = '2026'`, filter where `created_at >= '2026-01-01' AND created_at < '2027-01-01'`. Both rewrites operate on raw column values, which the index does store.\n\nWhen the function cannot be moved to the other side, the index cannot be used. A **computed column** with its own index is the escape hatch: store the computed value explicitly and index that column instead.\n\n### LIKE with a leading wildcard\n\nA LIKE filter with a leading wildcard (`name LIKE '%alice%'` on `users`) cannot use a B-tree index for the same reason as a function: there is no contiguous range of matching entries in the sorted index. The planner must scan all 1,000,000 users and apply the pattern to each name.\n\nA LIKE filter with a trailing wildcard (`name LIKE 'alice%'`) can use a B-tree index. The matching entries form a contiguous range in the sorted index: everything from `'alice'` (inclusive) to `'alicf'` (the next string after all strings starting with `'alice'`). The planner can binary-search to the start of that range and walk forward.\n\nThis is why \"starts with\" searches are fast and \"contains\" searches are not, in SQLite's standard indexes. Full-text search (FTS5) is the correct tool for contains searches on `users.name` or `products.name`.\n\n### Type affinity mismatches\n\nSQLite has a type affinity system rather than strict types. Each column has an affinity (NUMERIC, INTEGER, TEXT, REAL, BLOB), and SQLite applies affinity rules when comparing values. A mismatch between the stored affinity and the type of the comparison value can prevent index use.\n\nThe most common case on this schema: `orders.user_id` is defined with INTEGER affinity. Filtering where `user_id = '42'` (a text literal) may or may not use the index on `user_id` depending on the affinity rules applied. In SQLite's type system, the integer 42 and the text `'42'` are different values. The planner may correctly choose not to use the index when it cannot determine that the affinity conversion will produce a match.\n\nThe fix is to always compare with values of the correct type: integer literals for `user_id`, `order_id`, and `product_id`; quoted strings for `status`, `email`, and `name`.\n\n### OR conditions\n\nA filter with OR between conditions on different columns, for example filtering `orders` where `status = 'pending' OR user_id = 42`, is a pattern the planner handles inconsistently. For an OR condition across two indexed columns, the planner needs to perform two separate index lookups and merge the results, deduplicating rowids.\n\nSQLite does support this in some cases through the **OR optimisation**: if both conditions are on indexed columns in the same table, the planner can use both indexes, collect the rowids from each, sort and merge them, and then fetch the matching rows. The result of `EXPLAIN QUERY PLAN` will show two SEARCH steps followed by a merge.\n\nBut this only works when each branch of the OR is independently satisfiable by an index. If `status` is indexed but `user_id` is not, the condition `status = 'pending' OR user_id = 42` requires a scan of all 5,000,000 orders regardless of the index on `status`, because there is no efficient way to find all rows where `user_id = 42`.\n\nThe reliable fix is to rewrite as a UNION: one query filtered by `status = 'pending'`, another filtered by `user_id = 42`, combined with `UNION` to deduplicate. This gives the planner two simple, independently indexable queries.\n\n### NOT and inequality filters on low-selectivity conditions\n\nThe planner will use an index for inequality filters (`!=`, `NOT IN`, `NOT LIKE`), but only when it estimates the filter to be selective enough. A condition like `orders.status != 'cancelled'` eliminates only 100,000 rows from a 5,000,000-row table. The filter matches 98% of rows. A scan is almost certainly cheaper than an index seek that produces 4,900,000 rowids each requiring a random table lookup.\n\nWithout `ANALYZE` data, the planner uses hard-coded heuristics for selectivity. It tends to overestimate the selectivity of inequality filters, sometimes choosing an index-driven path when a scan would be faster. Running `ANALYZE` gives the planner actual row count estimates to work with.\n\n---\n\n## The Role of ANALYZE\n\n`ANALYZE` scans each index and builds summary statistics stored in the `sqlite_stat1` table. For each index, it records the approximate number of rows per unique combination of indexed values.\n\nThese statistics change the planner's cost estimates significantly. Without them, the planner uses hard-coded heuristics. With them, it can compare the estimated number of rows returned by each index and choose the most selective one.\n\nThe `sqlite_stat1` table has one row per index. The `stat` column is a space-separated list of integers: the total row count in the table, followed by the average number of rows per unique value of the first column in the index, then per unique pair of the first two columns, and so on.\n\nFor an index on `(status, user_id)` on `orders`, after `ANALYZE`, the stat might read `5000000 1250000 5`. This tells the planner: there are 5,000,000 rows total; on average, 1,250,000 rows share each value of `status` (roughly right for four distinct values); on average, 5 rows share each `(status, user_id)` pair (roughly right for a million users each with a few orders per status). A filter on `status = 'pending' AND user_id = 42` is estimated at 5 rows, highly selective, and the planner will strongly prefer the index.\n\nRun `ANALYZE` after bulk loads, significant inserts, or schema changes. It is not run automatically. The statistics become stale as data changes; if the ratio of pending to delivered orders shifts significantly, the planner's estimates will be wrong until `ANALYZE` is run again.\n\n### The limit of sqlite_stat1: averages hide skew\n\n`sqlite_stat1` stores averages. For a column with a uniform distribution, an average is a reasonable proxy for any specific value's selectivity. But the `orders` schema is explicitly skewed, and that skew exposes the limitation directly.\n\nWith four distinct status values and 5,000,000 rows, the average is 1,250,000 rows per status. But the actual distribution is: `'delivered'` (3,500,000 rows), `'shipped'` (1,000,000), `'pending'` (400,000), and `'cancelled'` (100,000). The planner uses 1,250,000 as its estimate for every equality filter on `status`, regardless of which value is being queried.\n\nA filter on `status = 'cancelled'` matches 100,000 rows. The planner thinks it matches 1,250,000. It may conclude the index is not selective enough to be worth using and fall back to a full scan of 5,000,000 rows. A filter on `status = 'delivered'` matches 3,500,000 rows. The planner still thinks it matches 1,250,000, so it may use the index when a scan would actually be cheaper.\n\nThe average is wrong in both directions simultaneously. This is the fundamental limitation of `sqlite_stat1` for skewed data.\n\n### sqlite_stat4: sample-based estimates\n\n`sqlite_stat4` addresses this by storing actual sample rows from each index rather than just averages. When `ANALYZE` runs with stat4 enabled, it samples up to 24 representative rows from each index, recording their key values and the number of rows that fall between consecutive samples. This gives the planner a histogram it can interpolate against.\n\nFor the `status = 'cancelled'` filter, the planner can find `'cancelled'` among the samples, read the associated row count (100,000), and produce an accurate estimate rather than the global average. For a range query on `created_at` (say, all orders in January 2026), `sqlite_stat1` has no way to estimate how many rows fall within that window. stat4's samples let the planner interpolate between known sample points and scale by the inter-sample density.\n\n### The compile-time caveat\n\n`sqlite_stat4` is not enabled in all SQLite builds. It requires the `SQLITE_ENABLE_STAT4` flag at compile time. Most prebuilt SQLite distributions (including the one embedded in Python, Android, iOS, and the majority of Linux packages) do not enable it. You are almost certainly running with stat1 only unless you compiled SQLite yourself or are using a distribution that explicitly enables stat4.\n\nYou can check by querying the stat4 table directly. If it raises an error, stat4 is not compiled in. If it returns rows after `ANALYZE`, it is.\n\nThe practical implication: on standard SQLite builds, the planner's estimates for range queries and skewed equality filters are always based on averages. For tables like `orders` with a heavily skewed `status` distribution, `ANALYZE` alone may not be enough to prevent a bad plan. Schema changes (partial indexes on the sparse values, separating hot rows into a smaller table) often matter more than statistics in these cases.\n\n---\n\n## Join Optimisation\n\nSQLite uses a **nested-loop join** strategy. There are no hash joins, no merge joins, no parallel join workers. For every pair of tables in a query, SQLite picks one as the **outer** table and one as the **inner** table, then for each row produced by the outer side, it performs a lookup into the inner table for matching rows. If there are three tables, the result of the first two becomes the outer input for the third, and so on.\n\nThis model is simple and predictable, but its cost is entirely determined by two things: which table is outer and whether the inner table has an index on the join column. Get both of those right and the join is fast. Get either one wrong and the cost compounds badly.\n\n![SQLite join order: optimal vs wrong, with row counts](sqlite-query-optimisation/join-order.mp4)\n\n### The cost model\n\nConsider fetching all order items for a specific user. The query joins `orders` to `order_items` on `orders.id = order_items.order_id`, with a filter on `orders.user_id = 42`. User 42 has 200 orders.\n\nIf `orders` is outer (filtered to 200 rows via an index on `user_id`) and `order_items` has an index on `order_id`, the planner does one binary search into `order_items` per order: roughly 24 comparisons to reach the matching entries (log₂ of 20,000,000), then fetches the matching rows. For 200 orders that is about 4,800 index operations plus the actual row reads.\n\nIf the join order is reversed, with the unfiltered `order_items` (20,000,000 rows) as the outer table and no index on `orders.user_id` for the inner lookup, each of the 20,000,000 item rows triggers a lookup into `orders`. Without an index on `user_id`, each lookup is a full scan of the 5,000,000-row orders table. The numbers do not need to be computed to understand the problem.\n\nThe difference is not subtle. The variables are: which table is outer, how many rows it produces after filtering, and whether the inner join column is indexed.\n\n### How the planner chooses join order\n\nSQLite's planner enumerates all possible join orderings for queries with up to about 7 tables and estimates the cost of each. The cheapest ordering wins.\n\nThe cost estimate for a join ordering is based on:\n\n1. The estimated number of rows produced by the outer side (after applying any WHERE filters on that table)\n2. The cost of each inner lookup (index seek vs. full scan)\n3. Multiplied together, then summed across all tables\n\nWithout `ANALYZE` data, the planner estimates table sizes from page counts and uses fixed selectivity estimates for filter conditions. With `ANALYZE`, it uses actual row distributions from `sqlite_stat1` to estimate how many rows survive each filter.\n\nThe implication is direct: a query that joins `orders` filtered to `status = 'pending'` (400,000 real rows) with `users` (1,000,000 rows) should put `orders` as the outer table. Without statistics, the planner may estimate the `status = 'pending'` filter at the global average of 1,250,000 rows, making `orders` appear larger than `users` and potentially reversing the join order. Running `ANALYZE` gives it the real number of 400,000.\n\n### Join order with a WHERE filter on the inner table\n\nA WHERE filter on the inner table changes the dynamics. Consider joining `orders` to `order_items` on `orders.id = order_items.order_id`, with a filter on `order_items.status = 'returned'`.\n\nIf `order_items` is inner and has an index on `(order_id, status)`, the planner can binary-search to the exact range of entries where `order_id = ?` and then apply the `status = 'returned'` filter while walking that range. The combined index means both the join condition and the WHERE filter are applied in a single index traversal.\n\nIf the index only covers `order_id`, the planner can use the index for the join but must then apply `status = 'returned'` as a row-by-row filter after fetching each row from the table. For an order with 10 items where only 1 is returned, that means 10 table lookups to find 1 matching row. The `(order_id, status)` index produces 1 lookup directly.\n\nThe general rule: for inner tables that have both a join condition and a WHERE filter, an index that covers the join column first and the filter column second will almost always outperform an index on either column alone.\n\n### The wrong join order is silent\n\nWhen the planner chooses the wrong join order, no error is raised, no warning is emitted. The query returns the correct result. It just takes longer than it should. The only way to detect it is to run `EXPLAIN QUERY PLAN` and check which table appears as the outer table in each join step.\n\n`EXPLAIN QUERY PLAN` reports join steps as nested SCAN or SEARCH entries. The first table listed at each nesting level is the outer table. If you see the large, unfiltered table as the outer input feeding into the small, filtered table as the inner lookup, the order is wrong.\n\nThe fix is either to run `ANALYZE` so the planner has accurate row counts, or to force the join order by rewriting the query. SQLite respects the order of tables in a CROSS JOIN as a hint: `FROM orders CROSS JOIN order_items` forces `orders` to be outer even if the planner would have chosen otherwise.\n\n### Three-table join example\n\nConsider a query that fetches the names of products in orders placed by user 42. This joins `orders`, `order_items`, and `products`.\n\nUser 42 has 200 orders. Each order has on average 4 items. Those 800 items reference various products.\n\nThe optimal join order is:\n\n1. Start with `orders` filtered by `user_id = 42` using an index on `orders.user_id`: 200 rows\n2. For each order, look up its items in `order_items` via an index on `order_items.order_id`: ~800 total rows across all orders\n3. For each item, look up the product via `order_items.product_id` referencing the primary key of `products`: 1 lookup per item\n\nTotal inner lookups: 200 (order_items) + 800 (products) = 1,000 indexed lookups.\n\nThe worst join order would be to start with the unfiltered `products` table (50,000 rows outer), then look up matching `order_items` for each product, then check whether those items belong to orders from user 42. Even with indexes, this visits far more rows than the optimal order because the highly selective `user_id = 42` filter is applied last rather than first.\n\nThe planner will find the good ordering if indexes exist on `orders.user_id`, `order_items.order_id`, and `order_items.product_id`, and if `ANALYZE` has been run so it knows that `user_id = 42` produces 200 rows rather than an estimated fraction of 5,000,000.\n\n### Subqueries vs. joins\n\nIn SQLite, a correlated subquery in the WHERE clause is executed as a nested loop: once per row in the outer query. A query that finds all `users` who have at least one `'pending'` order, written as a correlated subquery against `orders`, has the same cost profile as a nested-loop join: once per user, the subquery scans or seeks into `orders`.\n\nWhether a correlated subquery or an explicit JOIN performs better depends on the planner's ability to push the filter down. SQLite's planner does flatten some subqueries into joins, but not all. If `EXPLAIN QUERY PLAN` shows a correlated subquery being evaluated as a full scan of `orders` for every user row, rewriting it as an explicit JOIN on `users.id = orders.user_id` with a filter on `orders.status = 'pending'` often gives the planner more flexibility to choose an efficient order and use indexes correctly.\n\n### Self-joins\n\nA self-join (joining a table to itself) follows the same rules. Suppose `users` had a `referred_by_user_id` column tracking which user referred each new signup. Walking one level of the referral tree (finding all users referred by user 42) joins `users` to itself on `users.referred_by_user_id = users.id` with a filter on the outer alias's `id = 42`. An index on `referred_by_user_id` makes the inner lookup fast; without it, every outer row triggers a full scan of all 1,000,000 users.\n\n---\n\n## Covering Indexes\n\nA **covering index** is an index that contains all the columns a query needs: filter columns, projected columns, and any columns used in ORDER BY or GROUP BY. When the planner can satisfy the entire query from the index alone, it never touches the table B-tree.\n\nConsider a query that fetches the `email` of all `'active'` users. With an index only on `users.status`, the planner binary-searches the index to find rows where `status = 'active'` (900,000 of them), then does a table lookup for each rowid to fetch `email`. That is 900,000 random reads into the `users` heap.\n\nIf the index is extended to `(status, email)`, the index entry already contains `email`. The planner can return it directly without touching the table. Those 900,000 random reads become sequential reads through the index. For a hot query running thousands of times a day, this is a significant difference.\n\nThe trade-off is index size and write overhead. An index on `(status, email)` is significantly larger than one on `status` alone. Every insert and update to `users` must maintain the additional index entry. Covering indexes are worth building for the queries that run most frequently on the largest tables; they are not worth it for every query.\n\n---\n\n## ORDER BY and Avoiding Sorts\n\nWhen a query has an ORDER BY clause, SQLite must either produce rows in the required order directly or collect all rows and sort them at the end. The sort is an additional pass over the result set and is expensive when the result set is large.\n\nAn index whose column order matches the ORDER BY column order eliminates the sort. The planner can walk the index in order and produce rows already sorted, without materialising and sorting the entire result set.\n\nConsider fetching a user's recent orders sorted by `created_at DESC`. An index on `(user_id, created_at)` on `orders` allows the planner to binary-search to user 42's entries, then walk backward through `created_at` values in descending order. No sort step is needed.\n\nWithout this index, the planner fetches all of user 42's orders using whatever index it has, collects them, and sorts. When `EXPLAIN QUERY PLAN` shows `USE TEMP B-TREE FOR ORDER BY`, a sort is happening. That is the signal that the index structure does not match the ORDER BY requirement.\n\nThe same principle applies when filtering and ordering together. An index on `(status, created_at)` on `orders` with a filter on `status = 'pending'` and ORDER BY `created_at` allows the planner to binary-search to the pending entries and walk forward in `created_at` order without a sort: all 400,000 pending orders delivered in chronological order directly from the index.\n\n---\n\n## Partial Indexes\n\nSQLite supports **partial indexes**: indexes that cover only a subset of rows, defined by a WHERE clause on the index. A partial index on `(user_id, created_at) WHERE status = 'pending'` on `orders` contains entries only for the 400,000 pending orders, not the full 5,000,000. It is smaller than a full index, faster to maintain, and available to the planner for any query whose WHERE clause is compatible with the index condition.\n\nThis is particularly useful given the skewed `status` distribution. Most operational queries care about pending and shipped orders, the small, active portion of the table. An index covering only those rows is a fraction of the size of a full index and avoids the planner's confusion about how selective `status` filters are: a partial index on `WHERE status = 'pending'` is definitionally selective, because it only contains 400,000 rows.\n\nThe planner will use it automatically when the query's WHERE clause implies the index's WHERE clause. A query that filters `status = 'pending' AND user_id = 42` can use the partial index on `(user_id, created_at) WHERE status = 'pending'` because the query's condition implies the index's condition.\n\nThe catch is specificity: the index is only usable for queries whose filter is compatible with the index condition. The partial index on `WHERE status = 'pending'` cannot help a query filtering for `status = 'shipped'`.\n\n---\n\n## How It All Fits Together\n\n```\nA query arrives:\n\nParse and bind parameters\n      │\n      ▼\nEnumerate candidate indexes for each table\n      │\n      ├── Check: is the index column used in a WHERE clause?\n      ├── Check: is the comparison operator compatible (=, <, >, LIKE prefix)?\n      ├── Check: is the column free of wrapping functions?\n      └── Check: does the index satisfy the left-prefix rule for multi-column indexes?\n      │\n      ▼\nEstimate cost of each candidate (rows read, table lookups required)\n      │  - Uses sqlite_stat1 if ANALYZE has been run\n      │  - Falls back to hard-coded heuristics otherwise\n      │\n      ▼\nChoose lowest-cost access path per table\n      │\n      ▼\nChoose join order (all permutations for <= 7 tables, heuristics beyond)\n      │\n      ▼\nEmit VDBE opcodes\n      │\n      ▼\nExecute: walk B-trees, filter rows, project columns, sort if needed\n```\n\nEvery step in this pipeline has a way to go wrong. Wrapping `orders.created_at` in `strftime()` removes the index from consideration in the first pass. A stale stat that no longer reflects the real pending/cancelled distribution leads to a wrong cost estimate in the second. A missing index on `order_items.order_id` makes the join to `order_items` scan 20,000,000 rows per order. A missing `created_at` suffix on the user's orders index forces a sort that could have been avoided.\n\n---\n\n## Lessons Learned\n\n**The planner only knows what you tell it.** If `ANALYZE` has not been run, the planner is guessing selectivity with heuristics. On the `orders` table with its skewed `status` distribution, those heuristics will be wrong for `'cancelled'` queries (100K rows, estimated at 1.25M) and wrong in the other direction for `'delivered'` queries (3.5M rows, also estimated at 1.25M). Run `ANALYZE` after any significant data change.\n\n**Functions on indexed columns are silent index killers.** A filter on `lower(email)` scans all 1,000,000 users even if `email` has an index. The query returns the right result; the planner simply does far more work than necessary. Enforce normalisation at write time and filter on the raw column value.\n\n**OR across columns is a signal to reconsider.** The condition `status = 'pending' OR user_id = 42` on `orders` is harder for the planner than two independent queries combined with UNION. When an OR condition causes a scan where you expected index use, the UNION rewrite is the most reliable fix.\n\n**A covering index is the ceiling, not the floor.** The goal of index design is not just \"don't scan\": it is \"don't even touch the table\". For the query that fetches active users' emails thousands of times a day, the 900,000 table lookups after the index seek cost more than the index traversal itself. An index on `(status, email)` eliminates them entirely.\n\n**Join column indexes on the inner table are not optional.** A missing index on `order_items.order_id` turns a join from `orders` into 5,000,000 full scans of a 20,000,000-row table. The cost compounds in a way that is easy to underestimate until you see the query plan.\n\n**`EXPLAIN QUERY PLAN` is cheap to run and expensive to ignore.** Add it to your development workflow for any query that touches more than a few thousand rows. The output is terse and readable in seconds. The queries it would flag can silently degrade for months in production before someone notices.\n\n---\n\n## References\n\n- [SQLite Overflow Pages - When Your Rows Don't Fit](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) - previous post in this series on SQLite storage internals\n- [SQLite Query Planner Overview](https://www.sqlite.org/optoverview.html)\n- [SQLite EXPLAIN QUERY PLAN](https://www.sqlite.org/eqp.html)\n- [SQLite ANALYZE](https://www.sqlite.org/lang_analyze.html)\n- [SQLite Query Optimiser Tracing](https://www.sqlite.org/queryplanner-ng.html)\n- [SQLite Partial Indexes](https://www.sqlite.org/partialindex.html)\n- [SQLite Expression Indexes](https://www.sqlite.org/expridx.html)\n- [Use The Index, Luke - Index Selectivity](https://use-the-index-luke.com/sql/where-clause/functions/user-defined-functions)\n\n## Conclusion\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2026-03-08_sqlite-query-optimisation",
            "title": "SQLite Query Optimisation - How the Planner Thinks and Where It Goes Wrong",
            "summary": ". [SQLite Query Optimisation](sqlite-query-optimisation-cover...",
            "date_modified": "2026-03-08T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-07_postgres-toast",
            "content_html": "\n![How PostgreSQL Handles Large Values](postgres-toast-cover.png)\n\nIn the [previous post](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) we looked at how SQLite handles rows that don't fit in a page: it stores the first portion of the overflowing value inline in the B-tree leaf cell and chains the rest through a linked list of overflow pages. The mechanism is simple, predictable, and, if you're not careful about schema design, quietly expensive.\n\nPostgreSQL has the same constraint at its core: a row must fit in a page. Its solution, however, is architecturally different in almost every respect. It compresses values before deciding whether to move them. When it does move them, the entire value goes to a separate storage table, not a linked list. And the cost model at read time is inverted from SQLite's in one critical way.\n\nThis post covers how PostgreSQL's mechanism (called **TOAST**) works under the hood, what it costs, and how it compares to SQLite's approach.\n\n---\n\n## The Constraint\n\nPostgreSQL stores table rows (called **tuples**) in fixed-size 8KB pages. Every tuple must fit within a single page. Unlike SQLite's 4KB default, PostgreSQL's larger page gives more headroom, but TEXT, BYTEA, JSONB, and other variable-length types can still grow far beyond it. Something has to give.\n\nPostgreSQL doesn't reject the write. It doesn't chain overflow pages either. It applies **TOAST** (The Oversized-Attribute Storage Technique), which compresses and/or moves large values to a completely separate table. Transparently. At read time it fetches and reconstructs them.\n\nTOAST is not optional. Every PostgreSQL table with any variable-length column automatically has a TOAST table created alongside it. You won't see it in `\\dt`, but it is there.\n\n---\n\n## What TOAST Is\n\nWhen a value is too large to fit inline, PostgreSQL moves it out-of-line and stores a small pointer in its place. The \"out-of-line\" destination is a **TOAST table**: a system-managed heap table with a fixed schema:\n\n```\n| Column       | Type    | Description                                        |\n| ------------ | ------- | -------------------------------------------------- |\n| `chunk_id`   | OID     | Identifies which large value this chunk belongs to |\n| `chunk_seq`  | integer | Ordering of this chunk within the value            |\n| `chunk_data` | bytea   | Up to 2000 bytes of the actual value               |\n```\n\nA single large value is split into chunks of up to 2000 bytes each. Each chunk becomes one row in the TOAST table. The TOAST table has its own heap pages and its own B-tree index on `(chunk_id, chunk_seq)`.\n\nYou can find the TOAST table for any user table:\n\n```sql\nSELECT relname, reltoastrelid::regclass AS toast_table\nFROM pg_class\nWHERE relname = 'your_table' AND reltoastrelid != 0;\n```\n\n---\n\n## The TOAST Threshold\n\nPostgreSQL applies TOAST when a row's total size would exceed **~2KB**. But before moving anything out-of-line, it tries compression first:\n\n1. **Compress**: try to shrink the value with pglz (or lz4 if configured). If the compressed form is below ~2KB, keep it inline in compressed form; no TOAST table involved at all.\n2. **Move out-of-line**: if compression doesn't bring it below the threshold, move the entire value to the TOAST table.\n\nThis is the first major departure from SQLite. SQLite always stores the first ~4057 bytes of an overflowing value inline and chains the rest. PostgreSQL keeps nothing of the value inline: when it goes out-of-line, the heap tuple holds only an **18-byte TOAST pointer**. The whole value, compressed or not, lives in the TOAST table.\n\n![10 KB row: SQLite inline chain vs PostgreSQL all out-of-line](postgres-toast/toast-vs-sqlite.mp4)\n\n```\nSQLite (10KB value):\n  B-tree leaf cell: [4057 bytes inline] + [pointer → overflow chain]\n  Overflow page 1:  [4092 bytes]\n  Overflow page 2:  [remaining bytes]\n\nPostgreSQL (10KB value, EXTENDED):\n  Heap tuple: [18-byte TOAST pointer]\n  TOAST table: [chunk 0: 2000B] [chunk 1: 2000B] [chunk 2: 2000B] [chunk 3: 2000B] [chunk 4: ~KB]\n```\n\n---\n\n## The Four TOAST Storage Strategies\n\nPostgreSQL lets you configure per column how its values are handled when large. There are four strategies:\n\n### PLAIN\n\nNo compression, no out-of-line storage. The value is always stored inline. If the tuple won't fit in a page, the write fails. This is used for fixed-width types like `INTEGER` and `DATE`, types that can't grow large. You cannot set PLAIN on a variable-length column that can actually exceed the page limit.\n\n### EXTENDED (default for most variable-length types)\n\nCompress first, then move out-of-line if still too large. This is the default for `TEXT`, `BYTEA`, `JSON`, `JSONB`, and most other variable-length types. PostgreSQL:\n\n1. Tries to compress the value.\n2. If the compressed form is below ~2KB, stores it inline in compressed form.\n3. If still too large, moves the entire value to the TOAST table.\n\nMost values in practice are handled by EXTENDED without you thinking about it. A 5KB JSON document that compresses to 1.8KB never touches the TOAST table at all.\n\n### EXTERNAL\n\nNo compression, but move out-of-line when large. The value is stored uncompressed in the TOAST table. This is useful when:\n\n- The value is already compressed (images, video) and pglz won't help.\n- You need to run substring operations efficiently. With EXTERNAL, PostgreSQL can fetch only the relevant chunks without decompressing the full value.\n\n```sql\nALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;\n\n-- Reads only the first chunk(s), no decompression needed\nSELECT substring(body, 1, 200) FROM documents WHERE id = 42;\n```\n\n### MAIN\n\nCompress first, prefer to stay inline even if compressed. Out-of-line storage is a last resort. PostgreSQL will still move the value out-of-line if there's no other way to fit the row, but it will try MAIN columns last when deciding what to evict from the page.\n\nYou can change strategy at any time:\n\n```sql\nALTER TABLE events ALTER COLUMN metadata SET STORAGE EXTENDED;\nALTER TABLE events ALTER COLUMN payload SET STORAGE EXTERNAL;\n```\n\nThe change applies to future writes only. To apply it to existing rows you need a full table rewrite (`UPDATE t SET col = col`, or `pg_repack`).\n\n---\n\n## How the TOAST Pointer Works\n\nWhen a value goes out-of-line, the heap tuple stores an **18-byte TOAST pointer** (`varattrib_1b_e` in the source). It encodes:\n\n- Whether the value is compressed or not\n- The OID of the TOAST table\n- The `chunk_id` identifying the value\n- The original uncompressed length\n- The on-disk length\n\nWhen PostgreSQL reads a tuple and encounters a TOAST pointer, it scans the TOAST table's B-tree index on `(chunk_id, chunk_seq)`, retrieves all chunks in order, decompresses if needed, and reconstructs the full value before returning it to the query.\n\n![PostgreSQL TOAST: heap tuple with pointer to TOAST table chunks](postgres-toast/toast-pointer.mp4)\n\n```\nHeap Page (main table)\n┌──────────────────────────────────────────────────────────────┐\n│ Tuple                                                        │\n│   id:       42                                               │\n│   name:     \"Alice\"                                          │\n│   metadata: [TOAST ptr → chunk_id=8817, len=52000]          │\n└──────────────────────────────────────────────────────────────┘\n                              │\n                              ▼\nTOAST Table (pg_toast_16384)\n┌──────────────────────────────────────────────────────────────┐\n│ chunk_id=8817, chunk_seq=0,  chunk_data=[2000 bytes]        │\n│ chunk_id=8817, chunk_seq=1,  chunk_data=[2000 bytes]        │\n│ ...                                                          │\n│ chunk_id=8817, chunk_seq=25, chunk_data=[remaining bytes]   │\n└──────────────────────────────────────────────────────────────┘\n         ↑ indexed on (chunk_id, chunk_seq)\n```\n\nA 52KB value that doesn't compress well becomes 26 chunks. Reading it requires one TOAST index lookup plus 26 chunk reads: 27 I/Os beyond the initial heap tuple read, repeated for every row in the result set that has a TOASTed column.\n\n---\n\n## Performance Implications\n\n### Queries that don't need the column pay nothing\n\nThis is the most important difference from SQLite's model.\n\nIn SQLite, the record format stores column values end-to-end. To locate column N, the parser must walk through all preceding columns in order. If column 2 is a 100KB TEXT field with 24 overflow pages, a query selecting only column 5 still follows those 24 overflow pages, because it needs to know where column 2 ends to find where column 3 begins.\n\nPostgreSQL's heap tuple format avoids this entirely. Any column can be accessed without reading the others. When a query doesn't project a TOASTed column, PostgreSQL reads the TOAST pointer, recognizes it as a pointer, and discards it: zero additional I/Os.\n\n```sql\n-- Assuming metadata is TOASTed (50KB):\n\n-- Triggers TOAST reads: 27 I/Os per row\nSELECT id, name, metadata FROM events WHERE id = 42;\n\n-- No TOAST reads at all: 1 heap page I/O\nSELECT id, name FROM events WHERE id = 42;\n```\n\nColumn ordering in the schema has no effect. A TOASTed column at position 2 in a 10-column table costs nothing when queries only project other columns.\n\n### `SELECT *` is expensive\n\nSince PostgreSQL skips TOAST unless the column is projected, `SELECT *` forces a TOAST lookup for every TOASTed column in every returned row. On a table with one 50KB TOASTed column returning 10,000 rows, `SELECT *` does roughly 270,000 I/Os that `SELECT id, name` does not. This cost doesn't appear clearly labeled in `EXPLAIN ANALYZE`: the heap scan time looks reasonable, but the TOAST access happens silently on top.\n\n![PostgreSQL TOAST storage strategies: PLAIN, EXTENDED, EXTERNAL, MAIN](postgres-toast/toast-strategies.mp4)\n\n```\n| Scenario                                | Heap I/O | TOAST I/O |\n| --------------------------------------- | -------- | --------- |\n| Scan 1M rows, project non-TOAST columns | ~1M      | 0         |\n| Scan 1M rows, project 50KB TOAST column | ~1M      | ~26M      |\n| Point lookup, project non-TOAST columns | 1        | 0         |\n| Point lookup, project 50KB TOAST column | 1        | ~27       |\n```\n\n### Transparent decompression has a CPU cost\n\nEXTENDED values that compress below the threshold stay inline (no TOAST table access), but every read incurs a decompression step. For values that compressed from 500KB to 90KB, this is measurable.\n\nThe LZ4 compression algorithm (available since PostgreSQL 14) is dramatically faster to decompress than the default pglz, at a modest reduction in compression ratio. For read-heavy workloads, lz4 usually wins:\n\n```sql\nALTER TABLE events ALTER COLUMN metadata SET COMPRESSION lz4;\n```\n\nOr globally:\n\n```sql\nSET default_toast_compression = lz4;\n```\n\n### VACUUM must process the TOAST table too\n\nPostgreSQL uses MVCC: when a row is updated, the old version persists until VACUUM removes it. If the row has TOASTed columns, the old version's TOAST chunks also persist. In tables with frequently updated TOASTed columns, the TOAST table accumulates dead rows alongside live ones. Autovacuum handles both, but if it can't keep up, the TOAST table grows without bound.\n\nYou can monitor it:\n\n```sql\nSELECT\n  relname,\n  n_live_tup,\n  n_dead_tup,\n  pg_size_pretty(pg_total_relation_size(relid)) AS total_size\nFROM pg_stat_user_tables\nWHERE relname LIKE 'pg_toast%';\n```\n\nThis is a different failure mode than SQLite. SQLite's fragmentation problem (overflow pages scattering throughout the file) is solved by `VACUUM` rewriting the file sequentially. PostgreSQL's fragmentation problem (dead TOAST rows accumulating) is solved by VACUUM removing dead rows, but live rows on disk are not reordered. Compacting fragmented live data in PostgreSQL requires `CLUSTER` or `pg_repack`.\n\n---\n\n## What You Can Do\n\n### Never use `SELECT *` on tables with TOASTed columns\n\nAudit your queries and project only the columns you need. This is the single highest-impact change and requires no schema work.\n\n```sql\n-- Fetches 50KB of TOAST per row\nSELECT * FROM events WHERE created_at > now() - interval '1 day';\n\n-- Skips TOAST entirely\nSELECT id, event_type, created_at FROM events WHERE created_at > now() - interval '1 day';\n```\n\n### Use JSONB instead of JSON for large documents\n\n`JSON` stores raw text. `JSONB` stores a parsed binary representation that is smaller and compresses significantly better. A document that would be 50KB as raw JSON might be 30KB as JSONB, then compress to 8KB, which is the difference between 25 TOAST chunks and 4. JSONB also enables GIN indexing for containment queries (`@>`, `?`, `?|`) without reading the full value.\n\n```sql\nALTER TABLE events ALTER COLUMN metadata TYPE JSONB USING metadata::JSONB;\n```\n\n### Switch to lz4 for frequently read large values (Postgres 14+)\n\npglz prioritizes compression ratio. lz4 prioritizes speed. For any workload that reads large TOASTed values frequently, lz4 reduces decompression CPU by an order of magnitude at a modest cost in compression ratio.\n\n```sql\nALTER TABLE events ALTER COLUMN metadata SET COMPRESSION lz4;\n```\n\nThe change applies to future writes. Existing rows keep pglz until rewritten.\n\n### Use EXTERNAL for columns you substring frequently\n\nEXTERNAL disables compression but enables partial chunk retrieval. If your application frequently reads just the first few hundred bytes of a large column, PostgreSQL can fetch only the relevant chunks, with no decompression or full value reconstruction.\n\n```sql\nALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;\n\nSELECT substring(body, 1, 200) FROM documents WHERE id = 42;\n```\n\nThis is only worth it when your access pattern genuinely reads partial values. For workloads that read the full value every time, lack of compression means more chunks and more I/O.\n\n### Apply the same schema patterns as SQLite\n\nThe schema-level mitigations from the [SQLite overflow post](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) apply here too: move large columns to a separate table so they can't be accidentally projected, and store actual binary objects in an object store instead of the database. The reasoning is the same: isolate the cost so it's only paid when explicitly needed.\n\n---\n\n## How It All Fits Together\n\n```\nA query that reads a row with a TOASTed column:\n\nHeap scan / index lookup\n      │\n      ▼\nHeap page (1 I/O)\n      │\n      │  tuple → [18-byte TOAST ptr: chunk_id=8817, len=52000]\n      ▼\nTOAST index lookup on (chunk_id=8817, chunk_seq)\n      │  (B-tree traversal, likely cached after first access)\n      ▼\nTOAST heap reads (26 chunk pages)\n      │\n      ▼\nDecompress if EXTENDED (pglz or lz4)\n      │\n      ▼\nReturn reconstructed value to query engine\n\nA query that omits the TOASTed column:\n\nHeap page (1 I/O)\n      │\n      │  tuple → sees TOAST ptr, does not dereference it\n      ▼\nReturn other column values directly\n      (0 additional I/Os)\n```\n\nThe 18-byte TOAST pointer is what enables this selectivity. PostgreSQL reads it on every tuple access, but follows it only when the query needs the value, something SQLite's inline-then-chain design cannot do.\n\n---\n\n## Comparison Summary\n\n```\n| Dimension                     | SQLite                              | PostgreSQL                                    |\n| ----------------------------- | ----------------------------------- | --------------------------------------------- |\n| Page size (default)           | 4KB                                 | 8KB                                           |\n| Overflow threshold            | ~4057 bytes per cell                | ~2KB (triggers compression); ~8160 (go OOL)  |\n| Inline portion when OOL       | First ~4057 bytes stay inline       | Nothing; whole value moves out                |\n| Built-in compression          | None                                | pglz or lz4 (EXTENDED strategy)               |\n| Storage structure             | Linked list of overflow pages       | Separate heap table with B-tree index         |\n| Column selectivity at read    | Must parse preceding columns        | Skip non-projected columns at zero cost       |\n| Index overflow threshold      | ~1007 bytes (much tighter)          | No index overflow; long keys are truncated    |\n| Fragmentation mechanism       | Overflow pages scatter in file      | Dead TOAST rows accumulate (bloat)            |\n| Defragmentation               | VACUUM rewrites file sequentially   | VACUUM removes dead rows; CLUSTER reorders    |\n| Tunable storage strategies    | Page size only (global)             | Per-column: PLAIN, EXTENDED, EXTERNAL, MAIN  |\n```\n\n---\n\n## Lessons Learned\n\n**Compression often eliminates TOAST I/O entirely.** A 5KB JSON document that compresses to 1.8KB stays inline. No TOAST table is ever consulted. SQLite has no equivalent. This is why PostgreSQL often handles large values more gracefully than SQLite even though the mechanisms look similar on the surface.\n\n**`SELECT *` is expensive in a way that is hard to see.** TOAST access does not show up separately in `EXPLAIN ANALYZE`. The heap scan time looks fine; the silent TOAST overhead does not. Explicit column projections are the most important habit to form when working with tables that have large values.\n\n**TOAST table bloat is easy to overlook.** The TOAST table is invisible in normal tooling. Heavy updates to TOASTed columns can let it grow to multiples of the main table size. Check it explicitly in `pg_stat_user_tables`.\n\n**lz4 is almost always the right choice for large values on Postgres 14+.** pglz dates from the 1990s and prioritizes compression ratio over speed. For read-heavy workloads, lz4's faster decompression is worth the marginally larger compressed size.\n\n**PostgreSQL's approach is more sophisticated but harder to reason about.** SQLite's overflow is mechanical: cells too large chain to overflow pages, you can calculate the cost exactly. PostgreSQL's TOAST involves compression decisions, per-column strategies, a separate heap table with its own B-tree, and MVCC interactions. The transparency is usually helpful. When something is slow, you have to know to look for it.\n\n---\n\n## References\n\n- [SQLite Overflow Pages - When Your Rows Don't Fit](https://www.gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages) - the previous post in this series\n- [PostgreSQL Documentation - TOAST](https://www.postgresql.org/docs/current/storage-toast.html)\n- [PostgreSQL Documentation - Storage Layout](https://www.postgresql.org/docs/current/storage-page-layout.html)\n- [PostgreSQL Documentation - ALTER TABLE SET STORAGE](https://www.postgresql.org/docs/current/sql-altertable.html)\n- [PostgreSQL Documentation - default_toast_compression](https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-DEFAULT-TOAST-COMPRESSION)\n- [SQLite File Format - Overflow Pages](https://www.sqlite.org/fileformat2.html#overflow_pages)\n- [pg_repack - Online table reorg for PostgreSQL](https://github.com/reorg/pg_repack)\n\n## Conclusion\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2026-03-07_postgres-toast",
            "title": "How PostgreSQL Handles Large Values - TOAST and What It Costs You",
            "summary": ". [How PostgreSQL Handles Large Values](postgres-toast-cover...",
            "date_modified": "2026-03-07T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages",
            "content_html": "\n![SQLite Overflow Pages](sqlite-overflow-pages-cover.png)\n\n\n\nYou add a `bio` column to your users table. A few months later, some users have written\nessays in there: 50KB, 100KB. Queries that used to return in single-digit milliseconds\nare now taking hundreds. The table hasn't grown that much. The indexes are fine. Nothing\nobvious in `EXPLAIN QUERY PLAN`. What's happening is that SQLite has been quietly\nallocating overflow pages every time one of those large rows was written, and reading\nthem back is costing you far more than you'd expect.\n\n---\n\n\n## The Problem\n\nSQLite stores everything (rows, index entries, internal metadata) in fixed-size pages.\nThe default page size is 4096 bytes. Every row in a table is a \"cell\" within one of\nthose pages. The rule is simple: a cell must live within a page.\n\nBut what happens when a row's data is larger than a page? SQLite doesn't reject the\nwrite. It doesn't split the row across two B-tree pages either. Instead, it takes the\nportion of the row that doesn't fit and stores it in a completely separate chain of pages\ncalled **overflow pages**. These are then stitched back together at read time by\nfollowing a linked list.\n\nThis mechanism is necessary; SQLite wouldn't be useful without it. But overflow pages\ncome with a performance cost that compounds quietly in the background until, one day,\na query that scans your table takes 25 times longer than it should.\n\n---\n\n\n## Prerequisites\n\n- Basic familiarity with how SQLite stores data (pages, B-trees)\n- Understanding of what a table scan and an index seek are\n- Awareness of I/O as a performance concern in databases\n\n---\n\n\n## How SQLite Organizes Data\n\nBefore getting into overflow, it helps to understand what SQLite is working with normally.\n\nA SQLite database file is a flat array of fixed-size pages. Every page is the same size,\ntypically 4096 bytes, configurable from 512 to 65536 bytes at database creation time.\nPages are numbered starting from 1. The entire database (table rows, indexes, internal\ntree nodes) is expressed through these pages.\n\nThe primary data structure is a B-tree. For each table, SQLite maintains a B-tree where\nleaf nodes hold the actual row data. Interior nodes hold keys and child pointers to\nnavigate the tree. For indexes, there's a separate B-tree where leaves hold the indexed\nvalue alongside the rowid of the corresponding table row.\n\nWithin each page, individual rows (or index entries) are called **cells**. A page has a\nsmall header, an array of pointers to cells sorted by key, and the cell data itself packed\nfrom the bottom of the page upward. The layout looks roughly like this:\n\n```\n┌─────────────────────────────────────────────────┐\n│ Page Header (8-12 bytes)                        │\n│ Cell Pointer Array [ ptr1, ptr2, ptr3, ... ]    │\n│                                                 │\n│            (free space in the middle)           │\n│                                                 │\n│      [ Cell 3 ] [ Cell 2 ] [ Cell 1 ]          │  ← packed from bottom\n└─────────────────────────────────────────────────┘\n```\n\nEach cell holds the complete payload for one row: all column values concatenated\ntogether, preceded by a small header describing the type and length of each column. This\nis the normal, fast case: one page read gets you the entire row.\n\n---\n\n\n## The Overflow Threshold\n\nSQLite doesn't let a cell grow arbitrarily. When a row's payload would make the cell too\nlarge to fit within a single page, SQLite stores only the first portion of the payload in\nthe cell itself and puts the rest into overflow pages.\n\nThe amount stored locally depends on the page size and whether it's a table or index page.\nFor a 4096-byte page:\n\n- **Table leaf pages**: roughly 4057 bytes stored locally per cell\n- **Index pages**: roughly 1007 bytes stored locally per cell\n\nIndex pages have a much tighter limit because index pages need to fit many entries for\nthe B-tree to remain shallow and fast. If you push a large value through an index (say,\nindexing a long JSON column), you'll hit overflow at around 1KB rather than 4KB.\n\nThe cell itself stores the total payload size, the local portion of the data, and a\n4-byte pointer to the first overflow page. From the outside, the row looks complete:\nSQLite reconstructs it transparently. From a performance perspective, that transparency\ncomes at a cost.\n\n---\n\n\n## The Overflow Chain\n\nWhen a row overflows, the excess data is stored as a **linked list of overflow pages**.\nThese pages are completely separate from the B-tree. They have no page type flag, no\ncell pointers, no B-tree semantics at all. Each overflow page has an extremely simple\nstructure:\n\n```\nBytes 0-3:   pointer to the next overflow page (0 if this is the last)\nBytes 4-end: raw payload data\n```\n\nFor a 4096-byte page, that's 4092 bytes of payload per overflow page. A 10KB blob stored\nin a table with the default page size produces roughly:\n\n- 4057 bytes stored in the cell (in the B-tree leaf page)\n- 3943 bytes remaining, covered by 1 overflow page\n\nA 100KB blob produces:\n\n- 4057 bytes in the cell\n- ~95943 bytes remaining, spread across 24 overflow pages chained together\n\nThe full picture for a row that overflows looks like this:\n\n![SQLite overflow chain: B-tree leaf page linked to 24 overflow pages](sqlite-overflow-pages/overflow-chain.mp4)\n\n```\nB-tree Leaf Page\n┌──────────────────────────────────────┐\n│ Cell                                 │\n│   total_payload_size: 100000         │\n│   local_data: [first 4057 bytes]     │\n│   overflow_ptr: ──────────────────── ┼──┐\n└──────────────────────────────────────┘  │\n                                          │\n                          ┌───────────────┘\n                          ▼\n              Overflow Page 1\n              ┌────────────────────────────┐\n              │ next: ──────────────────── ┼──┐\n              │ data: [4092 bytes]         │  │\n              └────────────────────────────┘  │\n                                              │\n                          ┌───────────────────┘\n                          ▼\n              Overflow Page 2\n              ┌────────────────────────────┐\n              │ next: ──────────────────── ┼──┐\n              │ data: [4092 bytes]         │  │\n              └────────────────────────────┘  │\n                          ...               ...\n                          ▼\n              Overflow Page 24\n              ┌────────────────────────────┐\n              │ next: 0 (end of chain)     │\n              │ data: [remaining bytes]    │\n              └────────────────────────────┘\n```\n\nOne important property: overflow pages are never shared between cells. Each row that\noverflows gets its own private chain. When a row is deleted, the entire chain is freed\nback to a freelist inside the database file and can be reused by future writes.\n\n---\n\n\n## When You'll Run Into Overflow\n\nOverflow isn't limited to obvious cases like storing files in a database. Here are the\nsituations that produce overflow pages in practice.\n\n### Large BLOBs or TEXT columns\n\nThe most direct case. Any column storing images, PDFs, long documents, or large\nserialised payloads will overflow the moment its value exceeds ~4KB. This is the\nscenario most developers recognise, though many don't know the internal mechanism\nthey've triggered.\n\n### JSON columns\n\nJSON has become a common pattern in SQLite, used for storing semi-structured data without\ndefining a rigid schema. A small JSON document is fine. But JSON documents that embed\narrays of objects, nested structures, or any significant amount of text will routinely\nexceed 4KB. A user preferences blob, an API response cached to disk, a config document:\nall of these can quietly tip into overflow territory.\n\n### Long free-text fields\n\nBios, descriptions, notes, comments. Fields where the application enforces no length\nlimit and users can write as much as they want. An uncapped `TEXT` column in a user-\ngenerated content context is an overflow waiting to happen.\n\n### Index overflow\n\nThis one surprises people. Even a moderate-length text value (around 1KB) will overflow\nwhen indexed. Index B-tree pages have a max local payload of roughly 1007 bytes (on\na 4096-byte page), which is a quarter of the table leaf limit. If you index a URL,\na hashed value, a short description, or any string that can be a few hundred characters\nlong, you may be creating overflow on the index side. The query will still be correct;\nit will just require more I/O than you'd expect when the optimizer uses that index.\n\n### Rows with many small columns\n\nThis is the least obvious case. Overflow is triggered by the total cell size, not by\nany individual column. A row with 40 columns each containing a modest amount of data\ncan collectively exceed 4057 bytes and trigger overflow even though no single column\nis especially large. Wide tables with lots of columns are worth watching.\n\n---\n\n\n## Performance Implications\n\n### Every overflow adds I/O\n\nReading a row that doesn't overflow costs one I/O (assuming the page isn't already\ncached). Reading a row with a 100KB value costs 25 I/Os: one for the leaf page, 24\nfor the overflow chain. At that ratio, a query that would touch 10,000 rows with no\noverflow reads 10,000 pages. With overflow, it reads 250,000 pages. The query is doing\n25 times more work.\n\nThe relationship is linear: the more overflow pages a row has, the more I/Os are\nrequired to materialise it.\n\n### Overflow pages are scattered on disk\n\nOverflow pages are allocated from the freelist (pages reclaimed from previous deletes)\nor appended to the end of the file. They are not adjacent to the B-tree page they belong\nto, and they are not guaranteed to be adjacent to each other in the chain.\n\nA freshly created database with no deletes will have overflow pages that are roughly\nsequential on disk, which is a tolerable access pattern. But after any significant\nwrite/delete churn, overflow pages fragment throughout the file:\n\n![Disk fragmentation: overflow pages scattered after write churn](sqlite-overflow-pages/disk-fragmentation.mp4)\n\n```\nDisk layout after churn:\nPage 5:    B-tree leaf    (overflow_ptr → 1023)\nPage 47:   Overflow #3    (next → 0)\nPage 891:  Overflow #2    (next → 47)\nPage 1023: Overflow #1    (next → 891)\n\nReading this row requires: seek to 5, seek to 1023, seek to 891, seek to 47\nFour random seeks instead of one sequential read.\n```\n\nOn a spinning disk this is a genuine disaster. On SSD the cost is lower but not zero:\nrandom reads still consume more bandwidth and IOPS than sequential reads, and every\nhop through the chain is a separate read request that can't be merged or prefetched by\nthe OS.\n\n### You can't skip a large column to reach the next one\n\nSQLite's record format stores column values end-to-end, with a header that describes\nthe type and length of each value. To find where column N starts, you must know where\ncolumn N-1 ends. If column N-1 has overflow, you must follow the overflow chain just\nto discover the starting offset of column N.\n\nThis means that even a query that only projects small columns pays the overflow cost\nfor any large column that appears earlier in the row:\n\n```sql\n-- Table: users(id INT, name TEXT, bio TEXT, email TEXT)\n-- bio is 100KB\n\nSELECT email FROM users WHERE name = 'Alice';\n```\n\nTo read `email`, SQLite must parse the record header and locate `email`'s starting\noffset. To know where `email` starts, it must know where `bio` ends. To know where\n`bio` ends, it must follow `bio`'s entire overflow chain. Even though the query never\nasked for `bio`, 24 extra I/Os happen per matching row.\n\nColumn ordering matters. A 100KB column early in the schema taxes every query that reads\nany column appearing after it, regardless of whether those queries need the large column.\n\n### Table scans with overflow are compounding\n\nA table scan visits every leaf page in the B-tree. If rows have overflow, the scan also\nvisits every overflow page for every row. There's no way to scan just the B-tree portion\nand skip the overflow chains; the record format requires following them.\n\n![I/O cost comparison: overflow impact on a 1 million row scan](sqlite-overflow-pages/io-cost-comparison.mp4)\n\n```\n| Scenario                       | I/O per row | 1M row table scan |\n| ------------------------------ | ----------- | ----------------- |\n| Small rows (< 4KB)             | 1           | ~1M reads         |\n| 10KB rows (2 overflow pages)   | 3           | ~3M reads         |\n| 100KB rows (24 overflow pages) | 25          | ~25M reads        |\n```\n\nFull table scans are already a last resort. Overflow makes them dramatically worse.\n\n---\n\n\n## What You Can Do\n\n### Move large columns to a separate table\n\nThe most effective schema-level fix. If you isolate overflow-prone columns into their\nown table, queries that don't need those columns never touch overflow pages.\n\n```sql\n-- Before\nCREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT, email TEXT);\n\n-- After\nCREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);\nCREATE TABLE user_bios (user_id INTEGER PRIMARY KEY, bio TEXT);\n```\n\n`SELECT name FROM users WHERE name = 'Alice'` now reads nothing but compact rows.\n`SELECT bio FROM user_bios WHERE user_id = 42` still pays the overflow cost, but only\nwhen you explicitly asked for the bio. Overflow is now a cost you opt into per query\nrather than one you pay on every access.\n\n### Use covering indexes for hot query paths\n\nA covering index includes all the columns a query needs. SQLite can satisfy the query\nentirely from the index without touching the table rows, and therefore without following\nany overflow chains in the table.\n\n```sql\nCREATE INDEX idx_users_name_email ON users(name, email);\nSELECT email FROM users WHERE name = 'Alice';\n```\n\nThe index entry for this query contains both `name` and `email`. SQLite traverses the\nindex B-tree, finds the matching entry, and returns the result without ever reading the\ntable row. If `bio` is in the table and causing overflow, this query is completely\nunaffected.\n\nThe catch: index entries can themselves overflow, as noted earlier. Don't index large\ncolumns. Keep covering indexes to small, fixed-width columns.\n\n### Don't store large BLOBs in the database\n\nThe cleanest fix when you're storing actual binary content: images, documents, audio.\nStore the object in an object store (S3, GCS, a local filesystem) and keep only the\nreference in SQLite:\n\n```sql\nCREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, storage_key TEXT);\n```\n\n`storage_key` is a few dozen bytes at most. No overflow, ever. The large content is\nfetched separately when actually needed. This also moves retrieval of large objects\noff the database hot path entirely.\n\n### Increase the page size\n\nA larger page size raises the overflow threshold. With a 16384-byte page, the local\npayload limit for a table leaf is roughly 16357 bytes. Rows up to ~16KB will now fit\nwithout overflow.\n\n```sql\nPRAGMA page_size = 16384;\nVACUUM;\n```\n\nThis must be done before writing any data, or on an empty database. On an existing\ndatabase, you need to run `VACUUM` after changing `page_size` to rebuild the file with\nthe new page layout. Larger pages mean more data is read even when only part of a page\nis needed, so this is a trade-off: you reduce overflow but increase the cost of random\nrow lookups that would have been satisfied by a smaller page.\n\n### Run VACUUM to defragment overflow chains\n\n`VACUUM` rebuilds the database file from scratch. Pages are written sequentially in\nB-tree order, and overflow pages are written immediately after the cell they belong to.\nThe result is a file where overflow chains are as contiguous as they can be:\n\n```sql\nVACUUM;\n```\n\nAfter a `VACUUM`, a row and its overflow pages are adjacent on disk. Sequential reads\ncan now be served by the OS read-ahead buffer rather than individual random seeks. On\nboth SSDs and spinning disks, this is meaningfully faster.\n\n`VACUUM` is not a permanent fix; churn gradually re-fragments the file. On write-heavy\nworkloads with frequent deletes, the benefit decays over time and you'd need to run it\nperiodically.\n\n### Put large columns last in the schema\n\nSince column offsets are computed by parsing preceding columns in order, placing large\noverflow columns at the end of the row means that any query touching only early columns\nnever needs to follow the overflow chain. The record parser stops as soon as it has the\ncolumns it needs.\n\nThis is a low-cost mitigation when schema changes are difficult. It won't eliminate\noverflow, but it limits the blast radius to queries that actually need those columns.\n\n---\n\n\n## How It All Fits Together\n\n```\nA query that reads a row with overflow:\n\nB-tree traversal\n      │\n      ▼\nLeaf page (1 I/O)\n      │\n      │  cell → local_data + overflow_ptr\n      ▼\nOverflow page 1 (1 I/O, likely random seek)\n      │\n      ▼\nOverflow page 2 (1 I/O, likely random seek)\n      │\n      ...\n      ▼\nOverflow page N (1 I/O, likely random seek)\n      │\n      ▼\nReconstruct full row from all chunks\n      │\n      ▼\nReturn to query engine\n```\n\nOverflow pages are not part of the B-tree. They carry no keys, no pointers relevant\nto tree traversal, no page type metadata. They are pure sequential storage: a linked\nlist stitched onto the B-tree at the cell level. The B-tree gets you to the right leaf.\nThe overflow chain gets you the rest of the row. Both are necessary. Only the overflow\nchain is under your control via schema design.\n\n---\n\n\n## Lessons Learned\n\n**The overflow threshold for index pages is a quarter of the table threshold.** 1007\nbytes vs. 4057 bytes on a 4096-byte page. This surprises most people who think of\noverflow as a \"large BLOB\" problem. Indexing a moderately long text column (a URL,\nan address, a product name with Unicode characters) can push an index entry into\noverflow territory.\n\n**Column order is a real performance variable.** A large column early in the schema\nforces every query reading any later column to pay the overflow I/O cost, even if\nthe query never touches the large column directly. This is easy to miss because the\nquery result is correct regardless.\n\n**VACUUM helps more than people expect.** After heavy write churn, overflow chains\nthat were once sequential become scattered randomly across the file. A VACUUM restores\nlocality. On databases that are read-heavy but written in batches, scheduling periodic\nVACUUMs can recover significant read performance without any schema changes.\n\n**Increasing page size is not free.** A 16KB page means SQLite reads at least 16KB\nwhen accessing any row: useful if rows are large, wasteful if they're small and many.\nThe right page size depends on your typical row size. For a workload with mostly\ncompact rows and occasional large ones, normalising the large columns out is usually\nbetter than inflating the page size for everyone.\n\n**Overflow is not a bug.** It is SQLite's correct and necessary mechanism for handling\npayloads larger than a page. The goal isn't to eliminate overflow entirely. The goal is\nto understand which queries are paying the cost and whether schema or query design can\navoid paying it unnecessarily.\n\n---\n\n\n## References\n\n- [SQLite File Format — B-tree Pages](https://www.sqlite.org/fileformat2.html#b_tree_pages)\n- [SQLite File Format — Overflow Pages](https://www.sqlite.org/fileformat2.html#overflow_pages)\n- [SQLite PRAGMA page_size](https://www.sqlite.org/pragma.html#pragma_page_size)\n- [SQLite VACUUM](https://www.sqlite.org/lang_vacuum.html)\n- [SQLite EXPLAIN QUERY PLAN](https://www.sqlite.org/eqp.html)\n- [Use The Index, Luke — Clustering and Index-Only Scans](https://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index)\n\n## Conclusion\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2026-03-06_sqlite-overflow-pages",
            "title": "SQLite Overflow Pages - When Your Rows Don't Fit",
            "summary": ". [SQLite Overflow Pages](sqlite-overflow-pages-cover...",
            "date_modified": "2026-03-06T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-04_single-node-k3s-gitops-on-a-budget",
            "content_html": "\n![Single-Node Kubernetes GitOps on a Budget](single-node-k3s-gitops-on-a-budget-cover.png)\n\nYou want GitOps. You don't want to pay $70/month for a managed cluster, operate an etcd\nquorum, or maintain a fleet of nodes for a side project or small team. You want to push\nto `main` and have your app update itself. This is how I built exactly that — a\nfully-automated, single-node Kubernetes platform that goes from a fresh Ubuntu VM to a\nrunning ArgoCD-synced cluster in under ten minutes, with TLS and a locked-down firewall.\n\n---\n\n## The Problem\n\nManaged Kubernetes (EKS, GKE, DOKS) is great when you need multi-zone HA and have the\nbudget. For everything else — internal tools, staging environments, small production\nservices — it's overkill. But self-managed Kubernetes has historically meant:\n\n- Hand-rolling kubeadm configs\n- Managing etcd backups\n- Writing custom bootstrap scripts that bitrot\n- No clean local dev story\n\nk3s solves most of that. What it doesn't give you out of the box is GitOps, TLS, secret\nmanagement, or a sensible firewall. That's the gap this project fills.\n\nThe goal: one command on a fresh VM, everything bootstrapped, ArgoCD watching git, apps\nlive at `https://<app>.<ip>.nip.io`.\n\n---\n\n## Prerequisites\n\n- A Linux VM (Ubuntu 22.04) with a public IP — DigitalOcean, Hetzner, EC2, anything works\n- A GitHub repo (public or private) to store your manifests\n- For local dev: macOS 13+ with `brew install lima`\n- Basic Kubernetes familiarity (what a Deployment is, what a namespace is)\n\n---\n\n## Technical Decisions\n\n### k3s over kubeadm or managed Kubernetes\n\nk3s is a CNCF-graduated Kubernetes distribution that ships as a single binary. It bundles\nTraefik (ingress), CoreDNS, and local-path storage — everything you need for a functional\ncluster. Installation is one `curl` command and the node is ready in under a minute.\n\nThe trade-off is that it's opinionated: you get Traefik whether you want it or not, and\nthe single-node model means no HA. That's exactly the trade-off we want here.\n\n### ArgoCD app-of-apps for GitOps\n\nArgoCD's [app-of-apps pattern](https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/)\nlets a single root Application discover and manage child Applications by scanning a\ndirectory in git. Any subdirectory under `k8s/apps/` that contains an `application.yaml`\nbecomes a managed app automatically — no manual ArgoCD registration needed.\n\n```yaml\n# k8s/system/argocd/app-of-apps.yaml\nspec:\n  source:\n    path: k8s/apps\n    directory:\n      recurse: true\n      include: '**/application.yaml'   # only pick up app registrations, not workload manifests\n  syncPolicy:\n    automated:\n      prune: true      # remove resources deleted from git\n      selfHeal: true   # revert manual changes made in the cluster\n```\n\nOne subtlety: without `include: '**/application.yaml'`, ArgoCD would try to apply every\nYAML file in `k8s/apps/` — including Deployment and Service manifests — and then conflict\nwith the child apps that own those same resources. The `include` filter keeps the root app\nonly aware of Application objects.\n\n### Plain Kubernetes Secrets over Infisical + ESO\n\nThe original design used [External Secrets Operator](https://external-secrets.io) pulling\nfrom a self-hosted [Infisical](https://infisical.com) instance. After several iterations,\nthis was scrapped entirely.\n\nThe problem: Infisical's Machine Identity authentication uses SRP (Secure Remote Password)\n— a challenge-response protocol that can't be scripted without reimplementing the crypto.\nEvery attempt to automate the `infisical login` step hit this wall. The result was a\nbootstrap sequence that required manual intervention at exactly the wrong moment: when\nyou're trying to bring up a fresh server unattended.\n\nBeyond the auth issue, running Infisical inside the cluster meant managing PostgreSQL,\nRedis, and the Infisical pod itself — roughly 2GB of RAM just for secret synchronisation\non a node where that RAM is needed for actual workloads.\n\nThe replacement: plain Kubernetes Secrets created with `kubectl`. Apps reference them via\n`envFrom` with `optional: true`, so pods start even if the secret hasn't been created yet.\n\n```yaml\nenvFrom:\n  - secretRef:\n      name: example-app-secrets\n      optional: true\n```\n\nThis is the right trade-off for a single-node setup. The secrets are in etcd. The threat\nmodel for a single-node cluster where you control the machine is different from a\nmulti-tenant environment. Zero extra pods, zero extra failure modes.\n\n### Baseline firewall that always runs\n\nThe original firewall script only activated when `VPN_SUBNET` was set. That meant servers\ndeployed without a VPN had port 6443 (k3s API) open to the internet. The fix: unconditional\nbaseline rules, with VPN restrictions layered on top.\n\n```bash\n# Always applied\nufw default deny incoming\nufw allow 80/tcp   # Traefik\nufw allow 443/tcp  # Traefik\nufw allow from 127.0.0.1 to any port 6443  # k3s API: localhost only\n\n# Conditional: restrict SSH to VPN peers\nif [ -n \"${VPN_SUBNET:-}\" ]; then\n  ufw allow from \"${VPN_SUBNET}\" to any port 22\n  ufw allow from \"${VPN_SUBNET}\" to any port 6443\nelse\n  ufw allow 22/tcp  # SSH open (lock down manually if needed)\nfi\n```\n\nPort 6443 is never exposed to the public internet regardless of VPN configuration. That's\nthe important invariant — even if you forget to set `VPN_SUBNET`, the API server isn't\nreachable from outside.\n\n### nip.io for zero-config DNS\n\nEvery ingress hostname uses [nip.io](https://nip.io): a free wildcard DNS service where\n`<anything>.<ip>.nip.io` resolves to `<ip>`. This gives you real hostnames (required for\nTLS, required for HTTP-01 ACME challenges, required for Traefik's host-based routing)\nwithout touching a DNS zone.\n\n`init.sh` auto-detects the node's public IP from cloud instance metadata (AWS IMDSv1,\nDigitalOcean metadata, or `api.ipify.org` as a fallback) and patches ingress hostnames at\nbootstrap time.\n\n---\n\n## Implementation\n\n### Phase 1: Bootstrap script (`init.sh`)\n\nThe entire server bootstrap is a single idempotent script. Terraform passes it as\n`user_data` via `cloud-init.sh.tpl` on first boot; for local dev, `local-setup.sh`\ntransfers it to a Lima VM and runs it there.\n\nSteps in order:\n\n1. Auto-detect git remote URL, convert SSH to HTTPS, patch `repoURL` placeholders in all `application.yaml` files\n2. Auto-detect node public IP from metadata endpoints\n3. Install k3s (`curl -sfL https://get.k3s.io | sh -s - server`)\n4. Wait for node Ready, then pause 10 seconds for API server internal init\n5. Install ArgoCD and wait for `argocd-server` deployment to become available\n6. Apply `app-of-apps.yaml` — from this point, ArgoCD manages everything in `k8s/apps/`\n7. Apply cert-manager Application and wait for both the controller and webhook deployments\n8. Apply ClusterIssuers with the Let's Encrypt email patched in\n9. Apply baseline firewall rules\n\nThe cert-manager wait step is worth explaining. `kubectl wait --watch` drops the TLS watch\nstream on resource-constrained nodes under load. The script uses a poll loop instead:\n\n```bash\nfor i in $(seq 1 120); do\n  READY=$(kubectl get deployment cert-manager -n cert-manager \\\n    -o jsonpath='{.status.availableReplicas}' 2>/dev/null || true)\n  [ \"${READY:-0}\" -ge 1 ] && break\n  sleep 5\ndone\n```\n\n120 iterations × 5 seconds = 10 minutes max wait. The webhook must also be ready before\n`ClusterIssuer` CRs can be accepted — cert-manager validates them via webhook, so applying\nClusterIssuers before the webhook is up causes a confusing 503 error.\n\n### Phase 2: GitOps structure\n\n```\nk8s/\n├── system/          # bootstrapped manually with kubectl apply\n│   ├── argocd/      # app-of-apps + ingress + VPN middleware\n│   └── cert-manager/\n└── apps/            # auto-discovered by app-of-apps\n    └── example-app/ # deployment, service, ingress, secret\n```\n\nSystem components (`argocd`, `cert-manager`) are applied once by `init.sh`. Everything\nunder `k8s/apps/` is discovered and synced by ArgoCD on every push to `main`.\n\n### Phase 3: The example-app\n\nThe repo ships with a working `example-app` — an nginx container wired up with a\nDeployment, Service, Ingress, and a placeholder Secret. It's live at\n`https://example-app.<node-ip>.nip.io` immediately after bootstrap.\n\nThe deployment is deliberately minimal:\n\n```yaml\ncontainers:\n  - name: example-app\n    image: nginx:stable-alpine\n    ports:\n      - containerPort: 80\n    envFrom:\n      - secretRef:\n          name: global-secrets\n          optional: true\n      - secretRef:\n          name: example-app-secrets\n          optional: true\n```\n\nTwo things worth noting. First, `optional: true` on both secret refs — the pod starts\ncleanly even before any secrets exist. Without this flag, a missing Secret causes the\npod to stay in `Pending` with a somewhat cryptic event message.\n\nSecond, the placeholder `secret.yaml` committed alongside the app:\n\n```yaml\napiVersion: v1\nkind: Secret\nmetadata:\n  name: example-app-secrets\n  namespace: example-app\ntype: Opaque\nstringData: {}\n```\n\nAn empty Secret might seem pointless, but it prevents ArgoCD from showing the app as\n`Degraded` when it first syncs and the secret hasn't been populated yet. ArgoCD sees the\nresource exists; the pod sees `optional: true` and ignores the empty data. To inject real\nvalues later without putting them in git:\n\n```bash\nkubectl create secret generic example-app-secrets -n example-app \\\n  --from-literal=MY_KEY=value \\\n  --dry-run=client -o yaml | kubectl apply -f -\n```\n\nTLS is handled automatically by cert-manager. The ingress starts with\n`cert-manager.io/cluster-issuer: letsencrypt-staging` — staging certs aren't browser-trusted\nbut don't burn through Let's Encrypt's rate limits while you're validating the setup. Once\nthe staging cert appears as `Ready`, switch to `letsencrypt-prod` and delete the old cert\nto trigger reissuance:\n\n```bash\nkubectl annotate ingress example-app -n example-app \\\n  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite\nkubectl delete certificate -n example-app <cert-name>\n```\n\n### Phase 4: App scaffold\n\nNew apps are scaffolded from the `example-app` template:\n\n```bash\nAPP_NAME=my-api IMAGE=ghcr.io/org/my-api:latest bash setup/new-app.sh\n# Optional: PORT=8080  DOMAIN=api.example.com\n```\n\nThis copies `k8s/apps/example-app/` into `k8s/apps/my-api/`, substitutes all\nnames/image/port/domain, and prints the git commands to push and trigger a sync.\n\nThe scaffold creates five files: `application.yaml` (ArgoCD Application),\n`deployment.yaml`, `service.yaml`, `ingress.yaml`, and `secret.yaml` (empty placeholder).\n\n### Phase 4: Local dev parity\n\n`local-setup.sh` mirrors production in a Lima VM on macOS. The key differences:\n\n- Replaces Let's Encrypt with a self-signed `ClusterIssuer` (HTTP-01 can't validate\n  private `192.168.x.x` addresses)\n- Skips the VPN firewall (no `VPN_SUBNET` set)\n- Transfers the local repo via tarball rather than cloning from git\n\nThe transfer approach means you can test uncommitted changes locally. ArgoCD inside the VM\nstill syncs from git (the pushed commit), but `init.sh` itself runs from the transferred\nfiles — useful for iterating on bootstrap scripts without pushing every change.\n\n---\n\n## How It All Fits Together\n\n```\nGitHub repo (main branch)\n        │\n        │  git push\n        ▼\n   ArgoCD (app-of-apps)\n        │\n        │  discovers k8s/apps/**/application.yaml\n        ├──► example-app Application\n        ├──► my-api Application\n        └──► ...\n              │\n              │  applies manifests to cluster\n              ▼\n         k3s cluster\n              │\n              ├── Traefik (ingress, routes by Host header)\n              ├── cert-manager (issues Let's Encrypt certs)\n              └── app pods (read secrets from etcd via envFrom)\n```\n\n![ArgoCD dashboard showing app-of-apps sync](argocd.png)\n\nTraffic flow for an incoming request:\n\n1. DNS: `my-api.<ip>.nip.io` resolves to the node's public IP\n2. UFW: allows port 443, forwards to Traefik\n3. Traefik: matches `Host: my-api.<ip>.nip.io`, terminates TLS (cert from cert-manager), proxies to `my-api` Service\n4. Pod: reads `my-api-secrets` Secret via `envFrom`\n\n---\n\n## Lessons Learned\n\n**Infisical looked great until it didn't.** SRP authentication is a reasonable security\nchoice for interactive logins. It's a disaster for automation. The lesson: when evaluating\nsecret management tools, test the machine-to-machine auth flow first, not the UI.\n\n**`kubectl wait --watch` is unreliable on constrained nodes.** It opens a long-lived TLS\nwatch stream, which drops silently when the API server is under memory pressure. Polling\nwith a loop is less elegant but more reliable in practice.\n\n**The firewall baseline matters more than the VPN restriction.** Not running the firewall\nat all when `VPN_SUBNET` isn't set was the bigger risk — port 6443 open to the internet\nis a real problem. The VPN restriction is a nice-to-have. The baseline deny-incoming is\nnot.\n\n**The app-of-apps `include` filter prevents a subtle footgun.** Without it, ArgoCD\nattempts to own every YAML file in `k8s/apps/`, then conflicts with child apps over the\nsame resources. The `SharedResourceWarning` is confusing to diagnose.\n\n**A 10-second pause after `node Ready` is necessary.** The k3s node reports Ready before\nthe API server has fully initialized its internal state. Sending large `kubectl apply`\npayloads immediately after Ready causes transient errors that look like cert or auth\nproblems.\n\n---\n\n## What's Next\n\n- **VPN-gated ArgoCD UI**: when `VPN_SUBNET` is set, `init.sh` exposes ArgoCD via Traefik\n  with an IP-allowlist middleware — the infrastructure is there, just not the default.\n- **DNS-01 for private ingresses**: HTTP-01 ACME challenges require public internet access.\n  Ingresses behind a VPN need DNS-01 with a supported provider (Cloudflare, Route53).\n- **Multi-node**: `worker-init.sh` exists and joins additional k3s agents — but the\n  storage (local-path) and networking (no CNI overlay) assumptions need revisiting for\n  real multi-node setups.\n\n---\n\n## References\n\n- [k3s documentation](https://docs.k3s.io)\n- [ArgoCD app-of-apps pattern](https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/)\n- [cert-manager ACME HTTP-01 challenge](https://cert-manager.io/docs/configuration/acme/http01/)\n- [nip.io wildcard DNS](https://nip.io)\n- [RFC 6598 — Shared Address Space (100.64.0.0/10)](https://datatracker.ietf.org/doc/html/rfc6598)\n- [Tailscale install](https://tailscale.com/download/linux)\n",
            "url": "https://gauravsarma.com/posts/2026-03-04_single-node-k3s-gitops-on-a-budget",
            "title": "Single-Node Kubernetes GitOps on a Budget",
            "summary": ". [Single-Node Kubernetes GitOps on a Budget](single-node-k3s-gitops-on-a-budget-cover...",
            "date_modified": "2026-03-04T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-03-02_building-gigaboy-autonomous-software-engineering-agent",
            "content_html": "\n![Building Gigaboy: An Autonomous Agent Orchestrator](building-an-ai-agent-orchestrator-cover.png)\n\n_2026-03-02_\n\nYour issue tracker has 40 tickets in the backlog. Twelve of them are well-defined,\nself-contained, and have been sitting there for three sprints because nobody has the\nbandwidth to pick them up. You know exactly what needs to happen — you've even written\nthe acceptance criteria — but the work never starts.\n\nGigaboy is an attempt to close that gap: an orchestrator that watches your Linear\nworkspace, picks up tickets when they move to \"Todo\", and drives them to a merged PR\nwith no human in the loop unless the agent gets stuck.\n\n---\n\n## The Problem\n\nCopilots and chat-based coding assistants are useful for in-editor autocomplete and\nshort Q&A loops, but they still require a human to drive every step: open the file,\npaste context, review the output, commit, push, open the PR. The feedback cycle is\nfaster, but the human bottleneck is the same.\n\nWhat we actually want is an agent that can own a task end-to-end:\n\n1. Read the ticket and understand what needs to change\n2. Explore the repository, form a plan\n3. Make the code changes on a branch\n4. Open a pull request and report back\n5. Respond to review comments and iterate\n6. Merge when approved\n\nAnd do all of that without requiring a developer to babysit the process — while still\nletting humans stay in the loop when the agent genuinely doesn't know what to do.\n\n---\n\n## Prerequisites\n\n- Familiarity with Go (the codebase is Go 1.22 throughout)\n- Basic understanding of Linear, GitHub, and Telegram APIs\n- Redis and PostgreSQL familiarity (the event bus and store)\n- Some exposure to LLM tool-use / function-calling APIs\n\n---\n\n## Technical Decisions\n\n### Linear as the command interface, not a custom dashboard\n\nThe most consequential design choice was using Linear as the primary UI.\n\nThe alternative was to have a separate web dashboard with agent controls which would have taken weeks\nto build and would have introduced yet another tool that engineering teams need to adopt.\nLinear is already where engineering teams manage their work. Issues have descriptions,\nlabels, comments, and a defined state machine. All the inputs and outputs an agent needs\nare already there.\n\nThe tradeoff: the agent's behavior is constrained to what Linear's model expresses. You\ncan't do rich interactive UIs. You can't show diffs inline. But for the vast majority of\nasynchronous delegation workflows, Linear's comment thread is exactly the right surface.\n\nConcretely: when an issue moves to \"Todo\", the orchestrator picks it up. When the agent\nneeds clarification, it posts a comment and marks the ticket \"Blocked\". When a human\nreplies, the agent resumes. When the PR is ready, a `/merge` comment in Linear triggers\nthe merge. No dashboard. No custom CLI.\n\n### Redis Streams for webhook fan-out\n\nWebhooks arrive from three sources, mainly Linear, GitHub, and Telegram and need to be\nrouted to the orchestrator for processing. The naive approach is to call the orchestrator\nhandler directly from the webhook HTTP handler. This works until you need at-least-once\ndelivery, retry on failure, or the ability to replay events from a crash.\n\nRedis Streams give us all three. The gateway publishes events to three named streams\n(`gigaboy:stream:linear`, `gigaboy:stream:github`, `gigaboy:stream:telegram`). The\norchestrator consumes them via a consumer group. If processing fails, the message stays\nin the pending list and gets retried. On restart, pending messages from prior runs are\nreplayed before new ones are consumed.\n\n```go\n// gateway.go — publish an event and return immediately\nraw, _ := json.Marshal(payload)\nevt := events.Event{\n    Type:    fmt.Sprintf(\"linear.%s.%s\", payload.Type, payload.Action),\n    Payload: raw,\n}\ng.bus.Publish(c.Request.Context(), events.StreamLinear, evt)\nc.JSON(http.StatusOK, gin.H{\"ok\": true})\n```\n\nThe webhook handler returns 200 in milliseconds regardless of how long processing takes.\nLinear doesn't time out, doesn't retry unnecessarily, and the orchestrator works through\nevents at its own pace.\n\n### asynq for agent work queue\n\nSpawning a long-running LLM tool-use loop inline in the orchestrator's event handler\nwould block the consumer goroutine for minutes. We need to hand off agent execution to\na separate worker.\n\nasynq (backed by Redis) handles this. The orchestrator enqueues a typed task payload;\nthe worker picks it up in a separate goroutine pool and runs the agent. asynq also\ngives us deduplication, retry scheduling, and a task inspector without building any of\nthat ourselves.\n\nThe `ResumeKind` field on the task payload lets the worker distinguish between a fresh\nstart, a clarification reply, a change-request resume, a merge request, and a crash\nrecovery:\n\n```go\ntype AgentTaskPayload struct {\n    SessionID   string\n    WorkspaceID string\n    ProjectID   string\n    IssueID     string\n    ResumeKind  ResumeKind // New | Clarification | Merge | Recover\n}\n```\n\n### Claude tool-use API, not a subprocess\n\nEarly versions considered shelling out to `claude` CLI as a subprocess. The appeal:\nyou get streaming output, process isolation, and the full tool-execution environment\nthat the Claude CLI provides.\n\nThe problem: subprocess invocation makes it hard to inject per-session context, capture\nstructured results, enforce token limits, and intercept tool calls. The tool-use API\ngives the agent the same capabilities but with the orchestrator fully in control of the\nmessage loop.\n\nThe agent loop is a straightforward `for i < maxToolIterations` cycle: send messages to\nClaude, execute any tool calls, append results, repeat until `end_turn` or a terminal\ntool result (`CLARIFICATION_REQUESTED:` or `LEARNING_SUBMITTED:`).\n\n```go\n// executor.go — the core loop\nfor i := 0; i < maxToolIterations; i++ {\n    resp, err := a.deps.Anthropic.Messages.New(ctx, anthropic.MessageNewParams{\n        Model:     anthropic.F(claudeModel),     // claude-opus-4-6\n        MaxTokens: anthropic.F(int64(8192)),\n        System:    anthropic.F([]anthropic.TextBlockParam{...}),\n        Tools:     anthropic.F(a.tools.AnthropicTools()),\n        Messages:  anthropic.F(messages),\n    })\n    // ... execute tool calls, append results ...\n}\n```\n\nA cap of 50 iterations prevents runaway loops. The agent terminates cleanly when it\nhits `end_turn` or calls one of the terminal tools.\n\n### pgvector for cross-session learning\n\nEach agent session ends with a `finish_learning` tool call. The agent summarises what\nit learned about the codebase — conventions, patterns, known pitfalls — and submits\nthem as structured entries with categories (e.g., `\"testing\"`, `\"database\"`, `\"api\"`).\n\nThese entries are embedded with OpenAI's `text-embedding-ada-002` and stored in\nPostgreSQL via pgvector. The next time an agent works in the same project, a similarity\nsearch over its issue title and description retrieves the top-10 most relevant chunks\nand injects them into the system prompt.\n\n```go\n// context/manager.go — retrieve relevant context\nfunc (m *Manager) GetContext(ctx context.Context, projectID uuid.UUID, query string) ([]*db.ContextChunkResult, error) {\n    embedding, err := m.embed(ctx, query)\n    // ...\n    return m.queries.SearchContextChunks(ctx, projectID, 10, pgvector.NewVector(embedding))\n}\n```\n\nOver time the agent accumulates a project-specific memory: which files are important,\nwhat test patterns the team uses, what caused past bugs. This is not a RAG system over\nthe full codebase — the codebase itself is accessed via the GitHub API at read time.\nThe vector store holds only synthesised lessons from prior runs.\n\n### AES-256-GCM credential storage\n\nEvery workspace stores API keys (Linear, GitHub, Telegram bot token) encrypted at rest.\nThe encryption key is a 32-byte AES-256-GCM key supplied as a 64-character hex env var\n(`ENCRYPTION_KEY`). Credentials are encrypted on write (at onboarding) and decrypted\non demand before each API call.\n\nThis is a deliberate trade-off. Storing credentials in Postgres (vs. AWS Secrets\nManager or HashiCorp Vault) keeps the deployment simple and self-contained. The\noperative risk model is: if the database is compromised without the encryption key, the\ncredentials are opaque ciphertext. If both are compromised, nothing protects them,\nbut that's a property of any key-material-in-env-var approach.\n\n---\n\n## Implementation\n\n### Phase 1: Webhook gateway and event fan-out\n\nThe Gin HTTP server exposes four routes:\n\n```\nGET  /health\nPOST /onboarding\nPOST /webhooks/linear\nPOST /webhooks/github\nPOST /webhooks/telegram\n```\n\nLinear and GitHub webhooks are validated with HMAC-SHA256 signatures before the body\nis parsed. The middleware reads and buffers the body (since `io.ReadAll` consumes it),\nvalidates the MAC, and replaces `c.Request.Body` with a re-readable buffer:\n\n```go\nbody, _ := io.ReadAll(c.Request.Body)\nc.Request.Body = io.NopCloser(newBytesReader(body))\n\nsig := c.GetHeader(\"Linear-Signature\")\nif !validateHMAC(body, sig, g.cfg.LinearWebhookSecret) {\n    c.AbortWithStatusJSON(http.StatusUnauthorized, ...)\n}\n```\n\nValidated payloads are serialised to JSON and published to the appropriate Redis Stream.\nThe webhook handler returns immediately — event processing is fully decoupled.\n\n### Phase 2: Orchestrator FSM\n\nThe orchestrator consumes events from all three streams in separate goroutines. Linear\nevents drive the main agent lifecycle:\n\n| Linear event                            | Action                                                  |\n| --------------------------------------- | ------------------------------------------------------- |\n| `Issue.update` → state type `unstarted` | Spawn agent session                                     |\n| `Issue.update` → state type `cancelled` | Mark session FAILED                                     |\n| `Comment.create`                        | Resume session (if awaiting clarification or PR review) |\n| `Project.create/update`                 | Auto-register project + seed GitHub repo                |\n\nThe session state machine is:\n\n```\nINITIALIZING → CODING → PR_OPEN → AWAITING_MERGE → LEARNING → COMPLETED\n                              ↓\n                   NEEDS_CLARIFICATION (parked, waiting for comment)\n                              ↓\n                           FAILED\n```\n\nOne critical invariant: **one active session per issue**. If a ticket is moved to Todo\nwhile a session is already running, the new event is skipped. If the ticket is moved to\nTodo after a previous session completed, the old session is reset to INITIALIZING and\nre-used (preserving branch name and PR metadata). This prevents the branch proliferation\nthat happens when agents keep creating new branches for the same issue.\n\n```go\n// Check for an active session first\nif existing, err := sm.queries.GetActiveSessionByIssue(ctx, issue.ID); err == nil {\n    log.Printf(\"[orchestrator] active session %s already exists — skipping\", existing.ID)\n    return nil\n}\n// Reuse completed session or create new one\nsession, err := sm.queries.GetLatestSessionByIssue(ctx, issue.ID)\nif err == nil {\n    session, err = sm.queries.ResetSessionForRetry(ctx, session.ID)\n} else {\n    session, err = sm.queries.CreateAgentSession(ctx, ...)\n}\n```\n\n### Phase 3: Agent tool-use loop\n\nEach session runs a `GeneralAgent` that:\n\n1. Fetches the issue and comments from Linear (filtering out its own progress comments)\n2. Runs a pgvector similarity search for relevant prior context\n3. Fetches the Linear project description to use as project-specific agent rules\n4. Builds a system prompt combining issue, context, and rules\n5. Enters the tool-use loop\n\nThe 11 tools available to the agent:\n\n| Tool                | Purpose                                               |\n| ------------------- | ----------------------------------------------------- |\n| `read_file`         | Read a file from GitHub at a given ref                |\n| `write_file`        | Create or update a file via GitHub Contents API       |\n| `delete_file`       | Delete a file                                         |\n| `list_files`        | List directory contents                               |\n| `create_branch`     | Branch from default branch                            |\n| `create_pr`         | Open a pull request                                   |\n| `get_ci_status`     | Check GitHub Actions results for a ref                |\n| `post_comment`      | Post a comment on the Linear issue                    |\n| `ask_clarification` | Park the session and post a question to Linear        |\n| `finish_learning`   | Submit structured learnings and end the session       |\n| `create_subtask`    | Create a child issue in Linear (used by PlannerAgent) |\n\nThe `ask_clarification` and `finish_learning` tools work via a sentinel return value:\nwhen the executor sees `CLARIFICATION_REQUESTED:` or `LEARNING_SUBMITTED:` in the tool\nresult, it breaks out of the loop and handles the termination condition accordingly.\n\n```go\n// tools.go — ask_clarification returns a sentinel, not a real result\nfunc (t *AskClarificationTool) Execute(ctx context.Context, input json.RawMessage) (string, error) {\n    // ...\n    return fmt.Sprintf(\"CLARIFICATION_REQUESTED:%s\", p.Question), nil\n}\n```\n\n```go\n// executor.go — executor intercepts the sentinel\nif strings.HasPrefix(result, \"CLARIFICATION_REQUESTED:\") || strings.HasPrefix(result, \"LEARNING_SUBMITTED:\") {\n    return result, nil\n}\n```\n\n### Phase 4: Agent identities\n\nWorkspaces can assign different personas to different issues via Linear labels. A label\nof `agent_backend` routes the issue to the \"backend\" identity, which overrides the\ngeneric system prompt with a backend-specialist persona.\n\nThree built-in identities are seeded on every workspace registration: `backend`,\n`frontend`, and `infrastructure`. Each has a tailored system prompt focused on the\nrelevant concerns (e.g., the backend identity emphasises security, auth, and\nobservability; the infrastructure identity emphasises Terraform, least-privilege, and\ncost).\n\n### Phase 5: Telegram control plane\n\nTelegram serves as an out-of-band control and notification channel. The agent can\nnotify the user when blocked on clarification, and the user can query agent status via\nbot commands:\n\n- `/status` — list all active sessions and their states\n- `/repo <name>` — show active sessions for a repo\n- `/issue <identifier>` — show the session for a specific Linear issue (e.g. `/issue ENG-42`)\n\nIncoming Telegram messages arrive via webhook (auto-registered at onboarding), are\npublished to `gigaboy:stream:telegram`, consumed by the orchestrator, and dispatched\nvia `tgControl.dispatch()`.\n\n---\n\n## How It All Fits Together\n\n```\nLinear           GitHub            Telegram\n   │                │                  │\n   └──── HMAC ──────┴─── HMAC ─────────┘\n              │\n         Gin Gateway (port 8082)\n              │\n    Redis Streams (fan-out)\n    ┌──────────────────────────┐\n    │ gigaboy:stream:linear    │\n    │ gigaboy:stream:github    │\n    │ gigaboy:stream:telegram  │\n    └──────────────────────────┘\n              │\n       Orchestrator FSM\n              │\n    ┌─────────┴──────────┐\n    │  asynq task queue  │\n    └─────────┬──────────┘\n              │\n         Worker pool\n              │\n       GeneralAgent\n        ┌─────┴─────────────────────────────────┐\n        │  Claude claude-opus-4-6 tool-use loop  │\n        │  11 tools (GitHub API + Linear API)    │\n        │  pgvector context retrieval            │\n        └─────────────────────────────────────────┘\n              │\n         PostgreSQL (sessions, issues, learning chunks)\n         pgvector (context embeddings)\n```\n\nA ticket enters the system as a Linear webhook. It leaves as a merged PR with a\nsession summary stored in the vector database for the next agent to benefit from.\n\n---\n\n## Lessons Learned\n\n**The one-session-per-issue invariant is load-bearing.** Early versions created a new\nsession every time a ticket moved to Todo. This produced duplicate PRs, duplicate\nbranches, and confusing Linear comment threads. The fix — reset and reuse — was simple\nbut required adding `GetLatestSessionByIssue` and `ResetSessionForRetry` to the query\nlayer.\n\n**System comments must be explicitly filtered.** The agent posts progress updates to\nLinear. Without filtering, those comments were being picked up by the orchestrator as\nhuman instructions, causing the agent to resume itself in infinite loops. The solution\nis a `isSystemComment()` check on both the orchestrator side (before re-enqueuing) and\nthe context builder side (before injecting comments into the system prompt).\n\n**HMAC validation should be middleware, not inline.** The original implementation read\nthe body inside the handler, validated the signature, then called `ShouldBindJSON`\nwhich also reads the body — and got EOF because the body was already consumed.\nBuffering the body in middleware and replacing `c.Request.Body` with a re-readable\nwrapper is the right pattern.\n\n**The pgvector context retrieval is non-fatal.** Embedding API calls fail. The context\nretrieval is wrapped in a non-fatal path: if it errors, the agent runs without prior\ncontext. A session should not fail because a vector search timed out.\n\n**Stale session recovery is necessary.** Workers can crash mid-execution. Sessions\nthat are stuck in `CODING` or `INITIALIZING` with a stale heartbeat need to be\nre-enqueued. A startup goroutine (`recoverStaleSessions`) handles this. The one\nexception: `PR_OPEN` sessions are intentionally not recovered — they are waiting for a\nhuman `/merge` comment, not for the agent to do more work.\n\n---\n\n## What's Next\n\nThe GitHub CI integration is partially wired: the orchestrator can receive `check_run`\nevents but doesn't yet resolve which session owns a given check run. Closing that loop\nwould let the agent self-correct when CI fails without any human intervention.\n\nThe `PlannerAgent` (which decomposes issues into Linear subtasks) exists but is not\nyet triggered by the orchestrator. Connecting it for large or ambiguous issues is a\nnatural next step.\n\nConflict resolution on merge currently escalates to the user with instructions to\nresolve manually. A local git merge context (checked-out worktree) would let the agent\nhandle conflicts programmatically.\n\n---\n\n## References\n\n- [Anthropic tool use documentation](https://docs.anthropic.com/en/docs/tool-use)\n- [Redis Streams — consumer groups and at-least-once delivery](https://redis.io/docs/data-types/streams/)\n- [asynq — Go background job library](https://github.com/hibiken/asynq)\n- [pgvector — vector similarity search in PostgreSQL](https://github.com/pgvector/pgvector)\n- [Linear API documentation](https://developers.linear.app/docs/graphql/working-with-the-graphql-api)\n- [go-github — GitHub API client for Go](https://github.com/google/go-github)\n",
            "url": "https://gauravsarma.com/posts/2026-03-02_building-gigaboy-autonomous-software-engineering-agent",
            "title": "Building Gigaboy, An Autonomous Software Engineering Agent Orchestrator",
            "summary": ". [Building Gigaboy: An Autonomous Agent Orchestrator](building-an-ai-agent-orchestrator-cover...",
            "date_modified": "2026-03-02T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-02-25_building-ai-tennis-coach-mediapipe-claude",
            "content_html": "\n![Building an AI Tennis Coach with MediaPipe and Claude](building-an-ai-tennis-coach-mediapipe-claude-cover.png)\n\nA while back I started playing tennis again after a long break. The frustrating thing about tennis is that it's almost impossible to self-correct without video — you feel like you're doing something right, watch the footage, and realise your elbow is at a completely wrong angle at contact. Hiring a coach is one option but not always practical for a casual player. So I did what any engineer would do: I decided to build one.\n\nThe result is a Streamlit app that takes an uploaded tennis video, runs pose detection on every frame, computes joint angles and swing timing, renders an annotated video with a skeleton overlay, and calls Claude to produce four categories of coaching feedback grounded in the actual numbers. This post walks through the pipeline stage by stage.\n\n![Demo: skeleton overlay and coaching feedback output](tennis-coach-demo.mp4)\n\n## The Pipeline\n\nAt a high level the system is a linear chain of transformations:\n\n```\nUpload (MP4 / MOV / AVI)\n  → extract_frames()           [video_io.py]\n  → PoseDetector.detect_batch() [pose_detector.py]  → LandmarkResult | None per frame\n  → compute_frame_metrics()    [metrics.py]          → FrameMetrics per frame\n  → aggregate_metrics()        [metrics.py]          → AggregatedMetrics\n  → annotate_frame() × N       [annotator.py]        → annotated BGR frames\n  → frames_to_video()          [video_io.py]         → H.264 MP4\n  → get_coaching_feedback()    [coach.py]            → CoachingReport\n  → Streamlit display\n```\n\nEach stage is a pure function (or close to it) that takes its inputs and returns its outputs without side effects. This made it easy to develop and debug each stage independently before wiring them together in `app.py`.\n\n## The Stack\n\n| Component | Choice | Reason |\n| --- | --- | --- |\n| UI | Streamlit | All-Python, zero front-end work, native video player |\n| Pose detection | MediaPipe PoseLandmarker | 33 body landmarks, runs on CPU, Tasks API is well-maintained |\n| Video I/O | OpenCV (`opencv-python-headless`) | Headless variant avoids display dependencies on servers |\n| AI coaching | Anthropic Claude (`claude-sonnet-4-6`) | Strong instruction-following, reliable JSON output |\n| Math | NumPy | Angle calculation, peak detection, statistics |\n\n## Stage 1: Configuration and Constants\n\nBefore writing any pipeline code I put all magic numbers and index mappings into `config.py`. MediaPipe's pose model produces 33 landmarks, each identified by a zero-based integer index. Scattering those integers across the codebase would make things unmaintainable:\n\n```python\nclass Landmarks:\n    NOSE = 0\n    LEFT_SHOULDER = 11\n    RIGHT_SHOULDER = 12\n    LEFT_ELBOW = 13\n    RIGHT_ELBOW = 14\n    LEFT_WRIST = 15\n    RIGHT_WRIST = 16\n    LEFT_HIP = 23\n    RIGHT_HIP = 24\n    LEFT_KNEE = 25\n    RIGHT_KNEE = 26\n    LEFT_ANKLE = 27\n    RIGHT_ANKLE = 28\n```\n\nThe skeleton connection list pairs landmark indices for drawing bones between joints:\n\n```python\nPOSE_CONNECTIONS = [\n    (Landmarks.LEFT_SHOULDER, Landmarks.RIGHT_SHOULDER),\n    (Landmarks.LEFT_SHOULDER, Landmarks.LEFT_ELBOW),\n    (Landmarks.LEFT_ELBOW, Landmarks.LEFT_WRIST),\n    ...\n]\n```\n\nTwo thresholds matter a lot operationally. `VISIBILITY_THRESHOLD = 0.5` determines when a landmark is considered reliable enough to use — MediaPipe emits a confidence score alongside each (x, y, z) coordinate, and anything below 0.5 gets treated as missing. `MAX_FRAMES = 300` caps the analysis at roughly ten seconds of footage at 30 fps, which keeps processing time under a minute even on a laptop CPU.\n\n## Stage 2: Math Utilities\n\nAll pure math lives in `utils/math_helpers.py` with no project-level imports. The most important function is the joint angle calculator:\n\n```python\ndef angle_between_three_points(a, b, c):\n    ba = np.array(a) - np.array(b)\n    bc = np.array(c) - np.array(b)\n    cos_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))\n    cos_angle = np.clip(cos_angle, -1.0, 1.0)\n    return float(np.degrees(np.arccos(cos_angle)))\n```\n\nThis computes the angle at vertex *b* formed by rays towards *a* and *c*. For the elbow, *b* is the elbow landmark, *a* is the shoulder, and *c* is the wrist. The `np.clip` is essential — floating-point rounding can push the dot product just outside [-1, 1], causing `arccos` to return `nan`.\n\nSwing detection relies on `find_peaks`, which scans wrist speed values for local maxima above a threshold with a minimum distance between peaks:\n\n```python\ndef find_peaks(values, threshold, min_distance=10):\n    filled = [v if v is not None else 0.0 for v in values]\n    peaks = []\n    last_peak = -min_distance - 1\n    for i in range(1, len(filled) - 1):\n        if (filled[i] > threshold\n                and filled[i] >= filled[i-1]\n                and filled[i] >= filled[i+1]\n                and (i - last_peak) >= min_distance):\n            peaks.append(i)\n            last_peak = i\n    return peaks\n```\n\nThe `min_distance` guard prevents two adjacent frames at the peak of a swing from both registering as separate events.\n\n## Stage 3: Frame Extraction\n\n`video_io.py` extracts frames from the uploaded file using OpenCV:\n\n```python\ndef extract_frames(video_path, max_frames=MAX_FRAMES, stride=1):\n    cap = cv2.VideoCapture(video_path)\n    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0\n    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n\n    if total * stride > max_frames:\n        indices = set(np.linspace(0, total - 1, max_frames, dtype=int).tolist())\n    else:\n        indices = set(range(0, total, stride))\n    ...\n```\n\nThe evenly-spaced subsampling with `np.linspace` matters for longer videos. Naively skipping every *N*-th frame can create a biased sample if the stroke happens to fall in a skipped region. `linspace` distributes the sample budget uniformly across the full duration.\n\n### H.264 Re-encoding\n\nAfter annotation, the frames need to go back into a video the browser can play. OpenCV's `VideoWriter` defaults to the `mp4v` codec, but Streamlit's `st.video()` component requires H.264 (`libx264`) for browser compatibility. The solution is to write an intermediate file with `mp4v` and then re-encode it using a subprocess call to `ffmpeg`:\n\n```python\nsubprocess.run([\n    \"ffmpeg\", \"-y\",\n    \"-i\", raw_path,\n    \"-vcodec\", \"libx264\",\n    \"-pix_fmt\", \"yuv420p\",\n    \"-preset\", \"fast\",\n    \"-crf\", \"23\",\n    output_path,\n], check=True, capture_output=True)\n```\n\n`yuv420p` is the pixel format most widely supported by browsers. If `ffmpeg` isn't installed, the pipeline falls back to serving the raw `mp4v` file as a download rather than an inline player.\n\n## Stage 4: Pose Detection\n\nMediaPipe's `0.10.x` release replaced the older `mp.solutions.pose` API with a new Tasks API. The new API takes a `.task` model file rather than downloading weights implicitly, so the detector manages model files itself:\n\n```python\n_MODEL_URLS = {\n    0: \"https://storage.googleapis.com/.../pose_landmarker_lite.task\",\n    1: \"https://storage.googleapis.com/.../pose_landmarker_full.task\",\n    2: \"https://storage.googleapis.com/.../pose_landmarker_heavy.task\",\n}\n\ndef _ensure_model(complexity):\n    path = os.path.join(_MODELS_DIR, _MODEL_NAMES[complexity])\n    if not os.path.exists(path):\n        urllib.request.urlretrieve(_MODEL_URLS[complexity], path)\n    return path\n```\n\nThe model is downloaded once into a `models/` directory and reused on subsequent runs. For a batch of frames processed sequentially, the detector runs in `VIDEO` mode rather than `IMAGE` mode — this enables temporal tracking across frames which significantly improves landmark stability:\n\n```python\noptions = PoseLandmarkerOptions(\n    base_options=mp.tasks.BaseOptions(model_asset_path=model_path),\n    running_mode=RunningMode.VIDEO,\n    num_poses=1,\n    min_pose_detection_confidence=0.5,\n    min_tracking_confidence=0.5,\n)\n```\n\nIn `VIDEO` mode the landmarker requires monotonically increasing timestamps. Since the frames are extracted from a fixed-fps video, a 33ms increment per frame (approximating 30fps) works reliably.\n\nThe result of detection is a `LandmarkResult` dataclass that wraps the raw landmark list and provides two convenience methods: `get_point` returns normalized (x, y) coordinates or `None` if visibility is below threshold, and `get_pixel` converts normalized coordinates to integer pixel coordinates:\n\n```python\ndef get_pixel(self, idx, width, height):\n    pt = self.get_point(idx)\n    if pt is None:\n        return None\n    return (int(pt[0] * width), int(pt[1] * height))\n```\n\nAny frame where MediaPipe returns no landmarks gets `None` in the results list. The downstream stages handle this gracefully — `None` results simply produce `None` metric values, which are excluded from aggregated statistics.\n\n## Stage 5: Metrics Computation\n\n`metrics.py` is where the actual analysis happens. For each frame, `compute_frame_metrics` extracts six joint angles, torso rotation, stance width, centre of mass, and wrist speed.\n\n### Joint Angles\n\nEach angle follows the same pattern — three landmark indices, one of which is the vertex:\n\n```python\n# Right elbow angle: shoulder → elbow → wrist\nrs = px(Landmarks.RIGHT_SHOULDER)\nre = px(Landmarks.RIGHT_ELBOW)\nrw = px(Landmarks.RIGHT_WRIST)\nif rs and re and rw:\n    fm.right_elbow_angle = angle_between_three_points(rs, re, rw)\n```\n\nAll six angles (both elbows, both shoulders, both knees) are computed only when all three required landmarks are visible at or above the threshold. This means a frame where the player's left side is occluded still produces valid right-side metrics.\n\n### Torso Rotation\n\nTorso rotation captures how much the upper body turns during the swing — a key metric in tennis since proper shoulder rotation drives power:\n\n```python\nshoulder_vec = np.array(right_shoulder) - np.array(left_shoulder)\nhip_vec = np.array(right_hip) - np.array(left_hip)\ntorso_rotation = angle_between_vectors(shoulder_vec, hip_vec)\n```\n\nWhen the shoulders and hips are parallel (no rotation), this angle is near 0°. A full shoulder turn produces angles in the 30–60° range depending on the shot type.\n\n### Swing Event Detection\n\nSwing events are detected by finding peaks in wrist speed. Wrist speed is computed frame-over-frame as a Euclidean distance normalized by the frame diagonal:\n\n```python\ndiag = np.sqrt(frame_width**2 + frame_height**2)\nfm.right_wrist_speed = euclidean_distance(current_rw, prev_rw) / diag\n```\n\nNormalizing by the diagonal makes the threshold (`WRIST_SPEED_THRESHOLD = 0.02`) independent of video resolution. The combined speed — taking the max of left and right wrist at each frame — is passed through `find_peaks`. Each peak corresponds to a swing contact event.\n\n### Aggregation\n\n`aggregate_metrics` collapses the per-frame lists into statistics:\n\n```python\n@dataclass\nclass AngleStat:\n    mean: Optional[float]\n    min: Optional[float]\n    max: Optional[float]\n    std: Optional[float]\n```\n\nThe standard deviation is particularly useful for coaching — a high std on elbow angle means the player's technique varies significantly across swings, which is worth flagging even if the mean looks reasonable.\n\n## Stage 6: Annotation\n\nThe annotator draws onto a copy of each frame, maintaining state across frames for the wrist trail:\n\n```python\nclass Annotator:\n    def __init__(self, show_angles, show_trail):\n        self._right_trail = deque(maxlen=TRAIL_LENGTH)\n        self._left_trail = deque(maxlen=TRAIL_LENGTH)\n```\n\nUsing a `deque` with a fixed `maxlen` is a clean way to maintain a sliding window of the last 15 wrist positions without manually managing list slicing.\n\nThree layers are drawn in order:\n\n**Skeleton** — lines connecting landmark pairs from `POSE_CONNECTIONS`, drawn only when both endpoints are visible:\n\n```python\nfor start_idx, end_idx in POSE_CONNECTIONS:\n    if start_idx in pixels and end_idx in pixels:\n        cv2.line(out, pixels[start_idx], pixels[end_idx], SKELETON_COLOR, 2, cv2.LINE_AA)\n```\n\n**Angle labels** — text printed offset from each joint. Each label is prefixed with an abbreviation (`RE` for right elbow, `LK` for left knee etc.) so the viewer doesn't need to guess which angle they're reading:\n\n```python\ntext = f\"{label}:{angle:.0f}°\"\ncv2.putText(frame, text, (px+6, py-6),\n            cv2.FONT_HERSHEY_SIMPLEX, FONT_SCALE,\n            ANGLE_TEXT_COLOR, FONT_THICKNESS, cv2.LINE_AA)\n```\n\n**Wrist trail** — older positions are drawn darker by scaling the color by `i / len(pts)`. This creates a fade effect that makes the direction and speed of the wrist path visually obvious.\n\nFrames identified as swing events get an orange border and a `SWING` label in the top-right corner, making it easy to scrub to the key moments in the annotated video.\n\n## Stage 7: Claude Coaching Feedback\n\nThe coaching call is the most interesting stage to get right. The goal is feedback that references specific numbers — not \"bend your knees more\" but \"your right knee averages 162° at impact; recreational players typically aim for 130–145° for a stable base.\"\n\n### Prompt Design\n\nThe system prompt establishes the persona and constraints:\n\n```\nYou are an expert tennis coach with 20+ years of experience coaching players\nat all levels. You analyze video-based biomechanical data and deliver precise,\nactionable coaching feedback.\n\nRULES:\n- Always reference specific numbers from the provided metrics.\n- Be direct and avoid generic advice like \"bend your knees more\" without a target angle.\n- Respond ONLY with valid JSON matching the requested schema — no prose outside the JSON.\n```\n\nThe user prompt is a structured markdown document that compresses all the computed metrics into a compact table:\n\n```\n## Joint Angle Statistics (mean / min / max / std)\n- Right elbow:    142.3° / 98.1° / 175.2° / 18.4°\n- Left elbow:     156.7° / 121.0° / 179.8° / 14.2°\n- Right shoulder: 67.4° / 34.2° / 98.1° / 22.1°\n...\n\n## Body Mechanics\n- Torso rotation (mean/max): 24.3° / 47.8°\n- Stance width (normalized):  1.43\n- CoM lateral range: 0.12 (normalized 0-1)\n\n## Swing Events\n- Wrist speeds at peaks: 0.031, 0.028, 0.035\n```\n\nAsking the model to respond only in JSON is effective for structured output. The response is parsed with a three-step fallback: try `json.loads` directly, then extract from a markdown code fence, then search for the first `{...}` block. If all three fail, the raw text goes into the `swing_mechanics` field so the user at least sees the response rather than a silent failure.\n\n### Error Handling\n\nDifferent API errors need different messages. A `RateLimitError` is a transient condition the user can retry; an `AuthenticationError` means the key is wrong:\n\n```python\nexcept anthropic.AuthenticationError:\n    report.swing_mechanics = \"❌ Authentication failed — check your Anthropic API key.\"\nexcept anthropic.RateLimitError:\n    report.swing_mechanics = \"❌ Rate limit exceeded — please wait and retry.\"\nexcept anthropic.APIConnectionError:\n    report.swing_mechanics = \"❌ Network error — check your internet connection.\"\n```\n\nImportantly, a failed Claude call does not abort the pipeline. The annotated video is always returned regardless of whether the coaching call succeeds.\n\n## Stage 8: Streamlit UI\n\nThe UI is straightforward. The sidebar holds the API key input, display toggles, and a stride slider. The main area has a file uploader and a single Analyze button.\n\nProgress is communicated through Streamlit's native `progress` bar updated at each stage:\n\n```python\nprogress = st.progress(0, text=\"Initializing…\")\n\n# Step 1\nprogress.progress(10, text=\"Extracting frames…\")\nframes, fps, total_frames = extract_frames(input_path, stride=stride)\n\n# Step 2\nprogress.progress(30, text=\"Running MediaPipe pose detection…\")\npose_results = detector.detect_batch(frames)\n...\nprogress.progress(100, text=\"Done!\")\n```\n\nResults are split into two columns. The left column shows the annotated video with a download button. The right column uses Streamlit's `st.tabs` for the five coaching categories:\n\n```python\ntab_swing, tab_foot, tab_stance, tab_tactics, tab_prio = st.tabs(\n    [\"Swing\", \"Footwork\", \"Stance\", \"Tactics\", \"Priorities\"]\n)\n```\n\nBelow both columns, an expandable raw metrics table shows the joint angle statistics as a pandas DataFrame, useful for players who want to track numbers across multiple sessions.\n\n## What I Learned\n\nA few things that weren't obvious upfront:\n\n**MediaPipe's Tasks API is a breaking change.** The old `mp.solutions.pose.Pose` class still exists but is no longer the recommended path. The new Tasks API requires an explicit `.task` model file and a monotonically increasing timestamp in video mode. Missing either of these silently produces zero detections.\n\n**OpenCV and browser video compatibility don't mix out of the box.** The `mp4v` codec works for local playback but most browsers refuse to play it inline. Running a `ffmpeg` subprocess to re-encode to `libx264 + yuv420p` is the reliable solution.\n\n**Swing detection without a ground truth is hard.** Wrist speed peaks work well for forehand and backhand groundstrokes but can miss serves (where the wrist speed profile is different) or produce false positives on defensive scrambles. A more robust approach would fine-tune a classifier on labeled swing data.\n\n**Claude's JSON reliability depends heavily on the system prompt.** Adding `\"no prose outside the JSON\"` to the system prompt and providing the exact schema in the user prompt eliminated nearly all cases where the model wrapped its output in explanatory sentences.\n\n## Running It\n\n```bash\npip install -r requirements.txt\ncp .env.example .env   # add ANTHROPIC_API_KEY\nstreamlit run app.py\n```\n\nUpload any tennis video under 10 seconds, click Analyze, and the five-step pipeline completes in roughly 30–90 seconds depending on CPU speed and video length. The annotated video shows the skeleton and joint angles on every frame, with orange flashes marking detected swing events.\n\n## Conclusion\n\nThe project chains four off-the-shelf tools — MediaPipe, OpenCV, ffmpeg, and Claude — each used for exactly what it's designed for. MediaPipe handles the hard computer vision problem. OpenCV handles frame I/O. ffmpeg handles codec compatibility. Claude handles the reasoning over numbers. None of them are stretched beyond their core purpose.\n\nThe interesting engineering is in the glue: the visibility threshold logic that keeps partial occlusions from poisoning the stats, the normalized wrist speed metric that makes peak detection resolution-independent, and the prompt design that reliably produces structured JSON with metric-referenced feedback rather than generic coaching clichés.\n\nThe code is at [github.com/gsarmaonline/tennis-coach](https://github.com/gsarmaonline/tennis-coach).\n\nHappy learning!\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2026-02-25_building-ai-tennis-coach-mediapipe-claude",
            "title": "Building an AI Tennis Coach with MediaPipe and Claude",
            "summary": ". [Building an AI Tennis Coach with MediaPipe and Claude](building-an-ai-tennis-coach-mediapipe-claude-cover...",
            "date_modified": "2026-02-25T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-02-24_auto-generating-synced-diagram-overlays",
            "content_html": "\n![Auto-generating Synced Diagram Overlays](auto-generating-synced-diagram-overlays-cover.png)\n\nTechnical video content has a persistent problem: a speaker verbally describes a system architecture, an algorithm, or a database schema while the screen shows nothing that helps the viewer build a mental model. Adding visual overlays manually requires authoring diagrams, timing them to the transcript, and re-editing the video — work that rarely gets done.\n\ntechnify-motions automates this end-to-end. Given a video file, it produces an output where animated diagram slides — flowcharts, bullet-point summaries, code snippets — appear in sync with the exact moment each technical concept is explained. The entire pipeline runs locally except for two LLM calls, costing roughly $0.20 per 30-minute video.\n\n![Demo: auto-generated synced diagram overlays on a technical video](openai_technify.mp4)\n\n## The Pipeline\n\nThe system is a six-stage Python pipeline:\n\n```\nVideo/Audio → Audio Extraction → Transcription → Scene Classification\n    → Slide Generation → Rendering → Video Composition\n```\n\nEach stage writes its output to a `./work/` directory. Stages can be individually cached and re-run, which matters during development when only one stage is changing.\n\n### Stage 1: Audio Extraction\n\nFFmpeg extracts and normalizes the audio track to 16 kHz mono PCM WAV:\n\n```bash\nffmpeg -i lecture.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 work/audio.wav\n```\n\nWhisper's models expect this format. Normalizing upfront avoids silent failures where a model receives a stereo 44.1 kHz stream and produces degraded results.\n\n### Stage 2: Transcription with faster-whisper\n\nfaster-whisper is a CTranslate2-based reimplementation of OpenAI Whisper that runs entirely locally. The transcription call enables two key options:\n\n- **word_timestamps=True** — produces segment-level `start`/`end` times in seconds, used later to anchor diagram clips to the timeline\n- **vad_filter=True** — applies Silero VAD with a 500 ms minimum silence threshold, stripping silence before segments reach the model\n\nOn CPU, the model runs with `int8` quantization; `float16` is used when a CUDA device is available. The `small` model balances accuracy and speed for CPU-only environments; `large-v3` is recommended when a GPU is available.\n\nEach segment is serialized as `TranscriptSegment(start, end, text)` and written to `work/transcript.json`. On subsequent runs with `--use-cache`, this file is deserialized directly, avoiding a re-transcription pass.\n\n### Stage 3: Technical Scene Classification\n\nThe full transcript — with indices, timestamps, and text — is sent to Claude in a single API call. The prompt instructs the model to identify consecutive segments that discuss something worth visualizing: systems, algorithms, data models, architecture decisions, workflows. The scope is deliberately wide; a speaker critiquing a flawed database schema is as worth visualizing as one teaching a new one.\n\nClaude returns a JSON array of scenes:\n\n```json\n[\n  {\n    \"start\": 142.3,\n    \"end\": 198.7,\n    \"segment_indices\": [23, 24, 25, 26],\n    \"content_type\": \"architecture\",\n    \"description\": \"Explaining how requests flow through the service mesh\"\n  }\n]\n```\n\nThe response is parsed with three fallback strategies: direct JSON parse, extraction from a markdown code fence, and a regex search for the first `[...]` array in the response. This tolerates outputs where the model wraps its JSON in explanatory prose.\n\n### Stage 4: Slide Generation\n\nEach scene is sent to the LLM with its transcript text and a request for 1–3 typed slides. Three slide types are supported:\n\n- **Graph** — nodes and directed edges for flowcharts, architectures, and system relationships\n- **Bullets** — key points, trade-offs, comparisons, or summaries\n- **Code** — concrete syntax, SQL, CLI commands, YAML config\n\nThe model outputs a JSON array. Each slide object is validated against a strict schema before being accepted. A graph slide, for example, requires every edge's `from` and `to` fields to reference a valid node `id`. If validation fails, the error is appended to the prompt — `\"Your previous attempt was invalid: slides[0] edges[1].to 'cache' is not a known node id. Please fix it.\"` — and the model retries. Up to three attempts are made per scene.\n\nOnce a valid response is received, the scene's time window is divided evenly across the number of slides. Each slide receives a `slide_start` and `slide_end` that override the parent scene timestamps, allowing a single 60-second scene to be split into, say, a 20-second graph, a 20-second bullets slide, and a 20-second code slide.\n\n### Stage 5: Rendering with Remotion\n\nRemotion renders each slide to a duration-matched MP4 by driving a headless Chromium instance through React. Three TypeScript compositions handle the three types: `FlowchartAnimation`, `BulletsSlide`, and `CodeSlide`. Each composition receives the slide payload as props and drives its own animation using Remotion's spring and interpolate primitives.\n\nNode layout for graph slides is computed by dagre, which handles node positioning and edge routing given only the graph topology.\n\nProps are written to a temp file rather than passed as a shell argument, avoiding length limits on large graph payloads:\n\n```bash\nremotion render src/index.ts FlowchartAnimation output.mp4 \\\n  --props=props.json \\\n  --concurrency=4 \\\n  --log=error\n```\n\nRendering is parallelized across slides using a `ThreadPoolExecutor` with four workers. Each Remotion call spawns its own Chromium process, so parallelism is kept low to avoid exhausting memory.\n\nnpm dependencies for the Remotion project are installed automatically on the first run via `npm install --prefer-offline`, gated behind a thread lock so concurrent renders do not trigger multiple installs.\n\n### Stage 6: Video Composition\n\nThe final stage overlays each rendered MP4 clip onto the source video at its exact time window. Three modes are supported.\n\n**PIP (picture-in-picture)** — the default. The diagram is scaled to 40% of the source width and positioned 20 pixels from the bottom-right corner. All overlays are expressed in a single `filter_complex` chain so ffmpeg makes one pass over the entire video:\n\n```\n[0:v][1:v]scale=<pip_w>:-2[pip0];\n[0:v][pip0]overlay=W-w-20:H-h-20:enable='between(t,142,199)'[v0];\n[v0][2:v]scale=<pip_w>:-2[pip1];\n[v0][pip1]overlay=W-w-20:H-h-20:enable='between(t,310,365)'[vout]\n```\n\nThe `between(t,start,end)` expression in the overlay filter controls visibility — the diagram clip loops if shorter than the window, and the original video shows through outside the window.\n\n**Side-by-side** — the video is segmented at diagram boundaries. Non-diagram segments are re-encoded at source resolution. Diagram segments composite the source video on the left half and the diagram on the right half, both padded to `half_w × src_h`. All segments are concatenated with `ffmpeg -f concat`. Segments are re-encoded rather than stream-copied because stream copy seeks to the nearest keyframe, which shifts segment boundaries and breaks alignment at the concat stage.\n\n**Replace** — the source is spliced out entirely during technical scenes. The timeline is built as a list of `(timestamp, source_or_diagram_path)` events, sorted and deduplicated into non-overlapping segments, each trimmed and re-encoded, then concatenated.\n\n### Caching and Iteration\n\nStages serialize their outputs to `./work/`:\n\n| File                     | Contents                                |\n| ------------------------ | --------------------------------------- |\n| `transcript.json`        | Array of `TranscriptSegment` objects    |\n| `scenes.json`            | Array of `TechnicalScene` objects       |\n| `diagrams.json`          | Array of slide payloads with timestamps |\n| `diagrams/diagram_*.mp4` | Rendered Remotion clips                 |\n\nRunning with `--use-cache` skips any stage whose output file already exists. Classification and generation are the only stages that cost money, so caching them is important when iterating on rendering or composition.\n\n### Cost Profile\n\n| Stage            | Tool                        | Cost        |\n| ---------------- | --------------------------- | ----------- |\n| Transcription    | faster-whisper (local)      | Free        |\n| Classification   | Claude API                  | ~$0.01–0.05 |\n| Slide generation | Claude API                  | ~$0.10–0.15 |\n| Rendering        | Remotion + Chromium (local) | Free        |\n| Composition      | ffmpeg (local)              | Free        |\n\nTotal is approximately $0.20 per 30-minute video. The LLM is used narrowly: one call to identify scene boundaries, one call per scene to generate structured slide data. Everything else is deterministic local computation.\n\n## Conclusion\n\nThe system chains five off-the-shelf tools — faster-whisper, Claude, Remotion, dagre, and ffmpeg — each handling exactly the problem it was designed for. None of them are used beyond their core purpose. The LLM is constrained to producing structured JSON rather than free-form content, and its outputs are validated and retried programmatically. The result is a pipeline that converts a raw lecture recording into a diagram-annotated video with no manual authoring.\n\nHappy learning!\n",
            "url": "https://gauravsarma.com/posts/2026-02-24_auto-generating-synced-diagram-overlays",
            "title": "Auto-generating Synced Diagram Overlays for Technical Videos",
            "summary": ". [Auto-generating Synced Diagram Overlays](auto-generating-synced-diagram-overlays-cover...",
            "date_modified": "2026-02-24T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-01-09_how-lucene-segments-affect-elasticsearch",
            "content_html": "\n![How Lucene Segments Affect Elasticsearch](how-lucene-segments-affect-elasticsearch-cover.png)\n\nBack in 2016 or 2017, I first encountered a problem where our system just wasn’t handling the load we were throwing at it. That was my first real \"deep dive\" into Elasticsearch (ES). I was looking for a way to distribute load across clusters, and while ES handled the load beautifully, it also opened my eyes to the beautiful complexity of how data is actually laid out under the hood.\nIf you’ve ever wondered why your write patterns affect performance or how ES manages to be so fast, you have to look at Lucene segments. Let’s break down the internals.\n\n### The Anatomy: Shards, Indices, and Segments\n\nTo understand ES, you first have to understand its relationship with Lucene. In the ES world, an index is like a table in Postgres. But physically, an index is made up of shards.\nA shard is the functional, scalable unit of your data—it’s a physical container that, internally, is actually a complete Lucene index. Because it’s a full Lucene index, it can conduct searches independently without needing extra metadata from other sources.\nIf we go one level deeper, a Lucene index is made of multiple segments. For Lucene, the segment is the most atomic and granular unit of the data store.\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ ELASTICSEARCH SHARD ARCHITECTURE │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 1. The Shard Unit (Physical Container) │\n│ [Storage] ──► A directory on disk containing Lucene files │\n│ [Immutability] ──► Composed of multiple immutable Segments (.seg) │\n│ [Scaling] ──► Smallest unit moved during Cluster Rebalancing │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 2. Primary Shard (The Writer) │\n│ [Write Path] ──► Validates request ──► Buffers ──► Syncs Replicas │\n│ [Sequencing] ──► Assigns sequence numbers for consistency │\n│ [Status] ──► One per shard group; Must be active for writes │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 3. Replica Shards (The Readers) │\n│ [Redundancy] ──► Exact copies of the Primary on different nodes │\n│ [Read Throughput]──► Parallelizes search queries across multiple nodes │\n│ [Failover] ──► Promoted to Primary if the original node fails │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 4. Internal Shard Components │\n│ [Inverted Index] ──► Text search engine (Terms ──► Docs) │\n│ [BKD Tree] ──► Numeric/Geo spatial index │\n│ [Global Checkpt] ──► Tracks synchronization state between replicas │\n│ [Translog] ──► Local WAL for recovering uncommitted segments │\n└─────────────────────────────────────────────────────────────────────────────┘\n```\n\n### Multiple Representations of Data\n\nOne of the coolest things about Lucene is that it doesn’t just store your data once. It stores it in multiple formats within a segment to support different query types:\n• Inverted Index: The classic search engine structure (popularised by Google) used to find terms across documents.\n• Doc Values: A columnar store used for aggregations (like calculating totals or bucketing data).\n• BKD Trees: K-dimensional trees used for complex geospatial or multidimensional searches.\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ LUCENE SEGMENT (Immutable) │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 1. Inverted Index (Search Core) │\n│ [Term Dictionary] ──► [Term Index (FST)] │\n│ [Postings Lists] ──► {DocID, TermFreq, Positions, Offsets, Payloads} │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 2. Stored Fields (Document Storage) │\n│ [Field Index (.fdx)] ──► Pointer to Document Row │\n│ [Field Data (.fdt)] ──► {Field1, Field2, ...} (Row-based storage) │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 3. DocValues (Columnar Storage) │\n│ [Field A] ──► [Val 1, Val 2, Val 3, ...] (Optimized for Sorting/Aggr) │\n│ [Field B] ──► [Val 1, Val 2, Val 3, ...] │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 4. Metadata & Auxiliary Structures │\n│ [Term Vectors] ──► Per-document Inverted Index │\n│ [Norms] ──► Normalization factors for Scoring │\n│ [Live Documents] ──► Bitset for Deletions (.del file) │\n│ [Points/Dimensions] ──► BKD Tree for Numeric/Geo spatial data │\n└─────────────────────────────────────────────────────────────────────────────┘\n```\n\n### The Power of Immutability\n\nIn most databases, if you update a row, the engine modifies it in place. Lucene does things differently: segments are immutable.\nWhen you update a document, Lucene doesn't change the old one. Instead, it performs an append-only operation. It marks the old document as deleted using a bitset operation and then inserts the updated version into a new segment.\nBecause we are constantly creating new segments, Lucene performs segment merging in the background. This prevents the number of segments from exploding, which would otherwise make the parallelism required for searching too resource-intensive.\n\nYou might wonder: If Lucene is reshuffling data in the background during a merge, how do searches stay consistent?\nLucene uses reference counters. When a query starts, it identifies exactly which segments it needs to touch. If a merge happens mid-query, Lucene maintains a shadow representation of the old segments on disk. The active query finishes using the old \"shadow\" segments, while any new queries are redirected to the newly merged segment. Once the reference counter for the old segment hits zero, it’s finally deleted.\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ LUCENE SEGMENT MERGING (Maintenance) │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 1. The Trigger (Merge Policy) │\n│ [Tiered Policy] ──► Monitors segment count, sizes, and delete % │\n│ [Threshold] ──► Triggered when too many small segments accumulate │\n│ [Goal] ──► Maintain a logarithmic number of segments │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 2. The Selection Phase │\n│ [Candidates] ──► Picks N small segments (often similar in size) │\n│ [Exclusions] ──► Extremely large segments are often left alone │\n│ [Deletions] ──► Prioritizes segments with many \"marked\" deletes │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 3. The Execution (Compact & Purge) │\n│ [New Segment] ──► A fresh, larger segment is built from candidates │\n│ [Data Transfer] ──► Re-indexes Inverted Index, BKD Trees, & DocValues │\n│ [Purge] ──► Documents marked in .del files are NOT copied │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 4. The Switchover (Atomic Commit) │\n│ [Warm-up] ──► New large segment is fsync'd and opened │\n│ [Atomic Swap] ──► Shard metadata updates to point to the new segment │\n│ [Cleanup] ──► Old small segment files are deleted from disk │\n└─────────────────────────────────────────────────────────────────────────────┘\n```\n\n### The Write Path: Translog vs. Memory Buffer\n\nWhen you write to ES, it takes two parallel paths to ensure both speed and durability:\n\n1. Translog (Write Ahead Log): An immutable, append-only record written directly to disk. This ensures that even if the system crashes, your data is persisted.\n2. Internal Memory Buffer: Simultaneously, data is written to a buffer so it can be searched almost immediately, even before it is officially persisted into a disk-based Lucene segment.\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ ELASTICSEARCH WRITE PATH (Logical Flow) │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 1. Ingestion Point (Primary Shard) │\n│ [Document Input] ──┬──► [In-Memory Indexing Buffer] │\n│ └──► [Translog (Transaction Log / WAL)] │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 2. The Refresh (Searchability - Default: 1s) │\n│ [Indexing Buffer] ──► [New Lucene Segment Creation] │\n│ [Segment Files] ──► [OS Filesystem Cache (RAM)] │\n│ [Status] ──► DATA BECOMES SEARCHABLE │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 3. The Flush (Durability - Default: 30m or 512MB) │\n│ [FSCache Segments]──► [Lucene Commit (fsync to Physical Disk)] │\n│ [Translog Status] ──► [Purge/Trim Old Log Entries] │\n│ [Status] ──► DATA IS HARD-PERSISTED │\n│ │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 4. Background Maintenance │\n│ [Merge Policy] ──► Combine Small Segments (.seg) into Large Ones │\n│ [Cleanup] ──► Reclaim space from Deleted Docs (.del markers) │\n│ [Structure] ──► Build/Update BKD Trees & Inverted Index │\n└─────────────────────────────────────────────────────────────────────────────┘\n```\n\n### Practical Takeaways: Denormalisation vs. Normalisation\n\nIn a recent project, we were storing data in a denormalised format—everything in one document. This is fantastic for read performance because the entire block is fetched at once.\nHowever, if you have large documents (e.g., 1MB) that update frequently, you’ll put massive pressure on the JVM memory and disk I/O because you’re constantly creating new 1MB segments for every small update. In those cases, using a normalised format or a parent-child relationship might save your memory, though you’ll pay a 5x to 10x cost in query performance because you'll have to fire multiple queries and correlate them.\n\n```\n┌───────────────────────────────────────────────────────────────────────────────────┐\n│ ELASTICSEARCH RELATIONSHIP PERFORMANCE BENCHMARK (2026) │\n├───────────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ READ SPEED (Query Throughput) │\n│ FAST ◄──────────────────────────────────────────────────────────────► SLOW │\n│ ┌──────────────────────┐ ┌──────────────────────┐ ┌─────────────────┐ │\n│ │ DENORMALIZED │ │ NESTED FIELDS │ │ JOIN RELATION │ │\n│ │ (Flat Documents) │ │ (Hidden Sub-Docs) │ │ (Parent-Child) │ │\n│ └──────────────────────┘ └──────────────────────┘ └─────────────────┘ │\n│ [1x] [2x - 5x] [5x - 10x+] │\n│ │\n├───────────────────────────────────────────────────────────────────────────────────┤\n│ WRITE SPEED (Indexing Latency) │\n│ FAST ◄──────────────────────────────────────────────────────────────► SLOW │\n│ ┌──────────────────────┐ ┌──────────────────────┐ ┌─────────────────┐ │\n│ │ JOIN RELATION │ │ DENORMALIZED │ │ NESTED FIELDS │ │\n│ │ (Independent Docs) │ │ (Full Doc Update) │ │ (Mapping Bloat) │ │\n│ └──────────────────────┘ └──────────────────────┘ └─────────────────┘ │\n│ │\n└───────────────────────────────────────────────────────────────────────────────────┘\n```\n\n### Conclusion\n\nWhether it’s the Master node managing cluster state, the Data node holding your segments, or the Coordinating node acting as a router for \"scatter and gather\" operations, every part of ES is designed for scale.\nUnderstanding these internals isn't just academic—it directly impacts how you should design your index structures for your next project.\nHappy learning!\n",
            "url": "https://gauravsarma.com/posts/2026-01-09_how-lucene-segments-affect-elasticsearch",
            "title": "Effect of Lucene Segments on Elasticsearch",
            "summary": ". [How Lucene Segments Affect Elasticsearch](how-lucene-segments-affect-elasticsearch-cover...",
            "date_modified": "2026-01-09T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2026-01-02_disabling-allow-partial-results-in-elasticsearch",
            "content_html": "\n![Disabling allow_partial_search_results in Elasticsearch](disabling-allow-partial-results-in-elasticsearch-cover.png)\n\nJust heard of an incident in a company where a dev turned off the config `allow_partial_search_results` in their Elasticsearch queries. This config is turned on by default.\nWhen they explicitly turned it off, the queries started failing more often and long running queries took double the time to complete.\nI have been bit by the same issue previously. It sounds nice when you can say that you don't want partial search results.\nBut in distributed systems, things aren't black and white.\nThe config says that if it's disabled, then if it's unable to fetch all the results from a specific, then the entire query for that shard would be failed.\nSo if you are trying to fetch 1000 documents, and only 900 are fetched, it would fail the query. It sounds all good logically.\nBut in the real workloads, you aren't the only person running queries on the system. So there will be cases when a particular shard is under load. When a shard is under load, it may not have the required resources to run a large query. Since it wasn't able to fetch the required resources multiple times, there were multiple query failures on multiple shards.\n\nSolution: Instead of using the config to ensure that there is no partial update, monitor the ES server's response where it mentions the number of rows updated and shard failures as well. This means that even if there are partial updates, there is still significant progress which happened and a retry for the same query will result in reduced results and will finish subsequently without any errors.\n\nThis is a common approach called \"Write and Verify\" which is prevalent in distributed systems.\n\nThe config should be turned off when you cannot afford data integrity issues at all and you don't want to handle it yourself.\n",
            "url": "https://gauravsarma.com/posts/2026-01-02_disabling-allow-partial-results-in-elasticsearch",
            "title": "Disabling partial results in Elasticsearch",
            "summary": ". [Disabling allow_partial_search_results in Elasticsearch](disabling-allow-partial-results-in-elasticsearch-cover...",
            "date_modified": "2026-01-02T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-12-11_how-iouring-improves-database-performance",
            "content_html": "\n![How io_uring Improves Database Performance](how-iouring-improves-database-performance-cover.png)\n\n## What is io_uring?\n\nio_uring is a high-performance asynchronous I/O (Input/Output) interface introduced in the Linux kernel (version 5.1).\n\nIts primary goal is to overcome the performance bottlenecks of older I/O methods (like read()/write() and the older Linux AIO) by drastically reducing the overhead caused by frequent system calls and memory copying.\n\n### How it Works\n\nThe core of io_uring is built around two lockless ring buffers that are mapped into shared memory between the user application and the kernel: the Submission Queue (SQ), where the application writes I/O requests (SQEs), and the Completion Queue (CQ), where the kernel writes the results of completed I/O operations (CQEs) asynchronously. This design allows applications to queue multiple requests in a single batch, amortizing the cost of a single system call (io_uring_enter) over many operations.\n\n### Key Benefits\n\nThe system achieves its high speed by offering features that result in minimal system calls and zero-copy capability. By using shared rings, applications can queue I/O requests and reap completions with far fewer (or even zero, in polling mode) jumps between user space and kernel space. Furthermore, it supports features like Registered Buffers, which allows the storage hardware to use Direct Memory Access (DMA) to place data directly into the application's memory. This eliminates costly data copying via the kernel. It also provides a unified API for a wide range of operations, including file I/O, network I/O (sockets), and various control operations.\n\nIn essence, io_uring provides a complete, modern, and highly efficient way for high-throughput applications (like databases and web servers) to maximize the speed of modern hardware.\nThe key to io_uring's performance boost, especially for databases talking to lightning-fast storage like NVMe SSDs, is simple: We eliminate the major friction points created by the traditional Linux I/O stack.\n\n### Submitting Tasks (SQPoll)\n\nNormally, when your database wants to start reading or writing data, it has to execute a system call. That's a performance penalty because the CPU has to jump from your application's user space into the operating system's kernel space just to hand over the job.\n\nio_uring avoids this with SQPoll (Submission Queue Polling). We set up a dedicated helper thread running inside the kernel that does nothing but constantly check the submission ring buffer. Your database simply drops its I/O request into this shared memory queue. Because the kernel thread is always looking, it picks up the request instantly, and your application never has to waste time on a system call just to start the I/O.\n\n### Completing Tasks (IOPoll)\n\nWhen the NVMe drive finishes a data transfer, the standard way it communicates is by sending an interrupt to the CPU. The CPU has to stop what it's doing, save its state, handle the interrupt, and then resume. This interrupt overhead adds noticeable latency, particularly under high load.\n\nWith IOPoll (I/O Polling), this goes away. Instead of waiting for the NVMe device to interrupt the CPU, the system (or a special kernel thread) actively and continuously checks the hardware's completion queue. This constant polling bypasses the interrupt mechanism entirely. While this is often used in specialized scenarios where the application is talking close to the hardware, it's a huge win for cutting down latency on I/O completion.\n\n### Eliminating Data Copying (Registered Buffers and DMA)\n\nThis is the big game-changer for moving massive amounts of data. When your database needs a chunk of data, the old process required two copies: first, the data moved from the NVMe device into the kernel's memory space, and second, it was copied again from the kernel's space into your database application's memory space.\n\nio_uring solves this with Registered Buffers and Direct Memory Access (DMA). Your application tells the kernel, right at the start, exactly which specific memory regions it will use for I/O. Since the kernel has this map, it can instruct the NVMe controller to use DMA, allowing the hardware to pump the data directly from the drive into the application's pre-registered memory locations. This completely eliminates the costly intermediate copy and the overhead associated with memory page management, resulting in maximum throughput.\n\n#### References\n\n- https://arxiv.org/html/2512.04859v1\n",
            "url": "https://gauravsarma.com/posts/2025-12-11_how-iouring-improves-database-performance",
            "title": "How io_uring improves database performance",
            "summary": ". [How io_uring Improves Database Performance](how-iouring-improves-database-performance-cover...",
            "date_modified": "2025-12-11T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-09-20_designing-a-hiearchical-authorisation-system",
            "content_html": "\n![Designing a Hierarchical Authorisation System](designing-a-hierarchical-authorisation-system-cover.png)\n\n# Authorisation\n\nIn this section, we will cover how Authorisation works in Goiter.\nEvery authorisation service has to deal with the following elements:\n\n- Accessor/Actor/User\n- Resource/Object\n- Action\n\nThe underlying statement for an authorisation service is if an accessor should\nbe allowed to perform an action on the resource.\n\nThere are hierarchical concepts which also apply to all the elements in the system.\nHowever, we will try to define a flat structure for now and talk about hierarchical\nelements or groups in the future.\n\n## Flat map representation\n\nThe easiest way to do this is to have a flat map of all accessors, objects and actions.\nTo store the mapping, the `RoleAccess` model will be used.\n\nSo if we have a mapping with the following\n\n```bash\naccessor_id,object_type,object_id,action_type\n```\n\nthen we can define all possible rules with this structure.\n\nHowever, the number of rows in the `RoleAccess` model would be tremendously high in this case\nand since all the columns are supposed to be indexable, any kind of scan would result in high\nresource and time consumption.\n\nFor example, if there are 1000 users trying to access 1000 objects, the number of rows would be\na million rows. 1000 users is not that big of a number and anything with more numbers would be\ndisastrous. On top of it, this model would be used everytime an user tries to access an object.\nThis means any kind of bottleneck on this model would affect all the APIs.\n\n## Hierarchical representation\n\nTo mimic real life scenarios and also to prevent the bloat of the number of rows as mentioned in the\n`Flat map` representation, the concept of hierarchies can be brought in.\n\nThis representation signifies that every element can be present as a hierarchical `Group` entity.\nAny `RoleAccess` rule which matches a group that the accessor belongs to means that they are eligible\nto access the object. A group can also belong to another group, which effectively allows it to form a\ntree of rules.\n\nEach group can have multiple parents. Each group inherits the properties/rules of the parent groups.\nFor example, if you don't find the exact match for a specific group, then you can search in the `RoleAccess`\nmodel for the parents of the specific group recursively.\n\nAn important assumption is that the depth of recursion to unravel to a matching group is not more than 10.\n\nHow does this help? Let's look at an example.\n\n### Example 1\n\nLet's take an example of a case where you want only the finance team to be able to access the billing\nsection of your app. If your team has 20 people and the number of resources you want to control is\nmore than 50, then the overall number of rules based on the flat map representation would be 1000.\n\nNow let's look at the example using the grouping or hierarchical representation.\n\n- Create an Object group called `billing` where all the billing objects are placed inside it.\n- Create an User group called `finance` where all the finance team members are allocated to.\n- Create one rule which allows the `finance` user group to access `billing` objects.\n\nLet's put another restriction where only the executives in the finance team can change the records.\nEverybody else in the finance team can only read the records.\n\n- Create an User group called `finance_execs` with `finance` as the parent group.\n- Alter the previous rule to allow the `finance` user group to only be able to read the `billing` objects.\n- Create one rule which allows the `finance_execs` to perform all operations on the `billing` objects.\n\n### Cons\n\nThis system is a whitelisting system. This means that if you don't have any rule which mentions that you\ncan access the resource, then you can't access the resource.\nIn future, if we have to support black listing as well, then there may be conflicts between whitelisted\nand blacklisted groups and may have to bring on the concept of priorities to the rules.\n\nAnother problem is that since an element can belong to groups, for every API, we have to fetch the groups\nassociated with the elements in a recursive manner till we reach the root or a matching rule. This can lead\nto multiple calls, but the scale required for groups would be far lesser than the scale required for the\nflat map representation.\n\n## Comparing per element groups or a single group model\n\nIn this section, we compare how having a single group model for all element differs from having different\ngroup models for every element.\n\nThe first type is having one single model to store Groups of all different elements.\nThe second type is to have different Group models for every element.\n\nHaving the Group elements in different models means that there may be drastically different number of groups\nfor every element. However, having the right indexes in a single model will result in a similar experience.\n\nHaving different Group elements can allow us to store different metadata per element.\n\n## Ownership\n\nThe Ownership of an object can be defined as an user which can perform all operations on the object.\nHow is this different then the above mentioned rules?\n\nIn most apps, there are implicit rules that regard the creator of an object to be the owner of the object\nand should have all possible actions on the object.\n\nBased on the above representation, if we can define specific `RoleAccess` rules for the user who created the\nobject. However, this means that we need to have 1 additional rule for every object.\n\nThis brings us to the topic of having explicit rules vs implicit rules.\nSince this is an app which tries to leverage convention over configuration, implicit rules should be available\nto the users as well.\n\n### Implicit rules\n\nImplicit rules should be checked in the `canAccess` method which is the entrypoint for the authorisation checks.\nImplicit rules can also be defined as more of a configuration than a rule.\n\nList of implicit rules\n\n- Enable ownership access\n\n## Scope\n\nIn every app, there are different kinds of scopes like projects, accounts, etc which provides the encapsulation required\nfor that level.\nFor example, if we take `Account` into instance, it's similar to a tenant and the border of this area shouldn't be\ncrossed by the user accounts.\nThen we also have more subscoping in an account using projects. The access rules of a project can be different.\nTaking the example of the billing table again, a developer may be able to query some tables but they shouldn't be able\nto query the `billing` model of the project.\n\nThere are some rules which apply to the entire system as a whole. For example, if I want to create a super user which\ncan mimic the login of any of the accounts, then there should be a a `RoleAccess` rule applicable for all accounts.\n\nLet's add `scope_type` and `scope_id` to the `RoleAccess` model.\nTo define rules at the `Account` level for a specific, we add the `scope` as `Account` and `scope_id` of the account.\n\nNow, let's try to define rules on who can create a `Project` and who can add members to the project.\nThe `canAccess` method receives the user and the object.\nFrom the object, the `scope` of the object can be fetched. For `Project`, the scope has to be defined as `Account` and the\nproject should also return the account it is linked to via the `scope_id`.\nThe `canAccess` method searches for the matching rule with `scope_type` of `Account` and the account's `scope_id` as an additional filter.\n\nIf we are adding `scope` to the `RoleAccess` model, we also need to add it to the `Group` model. This allows us to\ncreate sub groups for different scopes as well.\n\nRoot scopes are the scopes which applies to all objects, irrespective of the tenancy.\n\n### Identifying the Scope of an object\n\nThis section explains the process that can be used in identifying the scope of an object.\n\nThere are 2 possible options here:\n\n- Every object refers to the actual scope directly\n- They refer to the parent object as the scope and the actual scope is recursively\n  discovered by going up the stack of the ancestors\n\n## Using Groups to define Ownership and scope\n\nHow do you fetch a list of objects specific to the user in a multi-tenant system?\nMost apps add an `owner_id` or `account_id` in every resource and then add it in the `WHERE` clause whenever the list of\nobjects have to be fetched.\n\nIt works for most cases, but few cons are:\n\n- The underlying call to fetch the resources isn't aware of the tenancy. If you forget to omit the inclusion anywhere,\n  then there are chances of the entire list of all tenants being fetched.\n- The scopes are too restrictive. There are multiple scopes that can access the resource. In some cases, the scope is\n  per account and in some cases, you want a lower ownership.\n- Users/Accessors need to have well defined rules for all kinds of memberships.\n- Ownership is too limited and cannot be shared or changed on the fly.\n\nIn this section, we will try to use the concept of `Group`s to formulate a strategy to fetch a list of objects belonging\nto a specific scope.\n\nLet's talk about the different types of scope first. Above, we talk about scopes like `Account` and `Project` which are the\nmost commonly used across most apps.\nThere are mini scopes as well which require them to store the scope of the parent.\n\nProposing a different way of looking at ownership.\n\nWhat if `Account` or `Project` also leveraged the `Group` model for its memberships instead of maintaining their own memberships?\nThis would result in an automatic grouping which can be used for authorisation naturally.\n\nFurthering this thought, what if all ownerships or belonging to another object is also done via groups?\nWould this result in a more natural and implicit way of deciding authorisation?\n\n```\nAny kind of membership or belonging to another object should happen via Groups.\n```\n\n### Groups for Accounts and Projects\n\nIf the memberships for accounts and projects are stored in Groups, how would it change our access pattern?\n\nWhen an user tries to access any object belonging to an account, the user first gets added to the account's group.\nOnce it's added, it should try to unravel the groups it belongs to from the leaf group to the root node.\n\nSimilar to the authorisation logic, if it finds any `RoleAccess` rule which allows it to access the object, then\nthe traversal can stop and the user can have access to it.\n\nWe need to define the membership of the object to the account or some other sub account as well.\n\n- Explicit definition of which group the object belongs to.\n- Use the user's groups to define the group it belongs to.\n\nThis would allow the user to access the object without having any kind of logic on the object models.\n\nFor projects, the process is similar.\nLet's assume that we want to scope an user only to a specific project.\nWe add the user to the project's group and the user shouldn't be in the account's group.\nFrom the perspective of groups, the project is also a member of the account's group, i.e they are generic objects\nwith custom attributes, which we will come to later.\n\nOne problem with this approach is to design a system where an user has access to all projects except one. This can be\nsolved by\n\n- Adding `allow/deny` as an attribute to the `RoleAccess` model.\n- Removing the user from the account and instead, adding them to all the individual projects.\n\nLet's not add `deny` to the PRD for now and continue with the 2nd option.\n\n### Groups for child and parent objects\n\nNow let's take an example of a parent object called `Parent` and few children objects named `ChildA`, `ChildB` and `ChildC`.\nThe children objects exist only in the scope of the parent.\n\nInstead of having a `parent_id` in all the child objects, they can be stored as group members of the `Parent` group\nthat we can form when the parent is created.\n\nIf a parent needs to find the children associated with it, it can perform a query for the `group_type` which will be `Parent`\nalong with its `id`. It can also have a filter based on the member type if a particular parent owns multiple types of objects.\nThis will be particularly useful for any of the larger scope groupings.\n\n### Adding an object to a Group\n\nIn this section, we will cover how an object can be referred to in a Group, both as a scope and as a member.\nAnytime an object needs to be added to a group, it can be done on the fly by directly invoking the Group API or DB write.\n\nSince we are discussing this in the context of `Goiter`, we may need to define more implicit ways of referring to a group.\n\n### Identifying the Scope\n\nAnother problem to solve is the ability to figure out or define the scope of an API while fetching an object.\nThe possible data available to figure out the scope:\n\n- The User/Accessor\n- Groups the user belongs to\n- Using the URL's params\n\nA no-brainer is compelling the developer to pass in the explicit scope everytime.\nCan the scope be implicitly discovered?\n\nOnce the scope is defined, the system should check if the user can access the scope.\nIf it can access the scope, then from the group, fetch the members of the group recursively based on the type of the\nobject required.\n\n### Cons of Grouping everything\n\nFew cons of this approach are as follows:\n\n- Ownership is more fluid here. Since there is no direct owner of an object, multiple can own it and that may not be desirable in\n  some cases.\n- If groups are the sole decider of ownership, it will get bloated quite early on. There are different possible optimisations which\n  we will discuss later on.\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-09-20_designing-a-hiearchical-authorisation-system",
            "title": "Designing a hierarchical Authorisation system",
            "summary": ". [Designing a Hierarchical Authorisation System](designing-a-hierarchical-authorisation-system-cover...",
            "date_modified": "2025-09-20T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-08-15_streaming-with-webrtc-and-mediasoup",
            "content_html": "\n![Streaming with WebRTC and Mediasoup](streaming-with-webrtc-and-mediasoup-cover.png)\n\nSomeone posted on Twitter about the exorbitant price that Zoom charges if one wants to\norganise a single session with more than 10,000 attendees.\n\nIt's around $6000.\n\nThat kind of led me to wonder what goes behind building something of this scale and why\nwould it be charged so high. And that's where I encountered Mediasoup as a SFU.\n\n## What is a SFU?\n\nFirst, of all, let's cover what an SFU is and what are the alternatives to it.\nSFU stands for Selective Forwarding Unit.\nSFU is a centralised media server which receives streams from multiple entities and forwards\nit to one or more receivers. The SFU controls how and what data should be sent to the receivers.\n\nFor example, if you want to build a multi-producer and multi-consumer app (which we will be building\nlater below), then you can send video and audio streams from the producers to only the consumers. You\ncould degrade or enhance the experience of a candidate based on their plan.\n\nIs SFU the only option?\nThere are other options such as MCU or p2p.\n\nMCU or Multipoint Control Unit servers receives multiple streams from multiple sources, merges them to\na single stream and sends them to all destinations. This means that the degree of control over what data\nto send is lesser in this scenario.\n\n## Introduction to Mediasoup\n\nMediasoup is an open-source, server-side WebRTC (Web Real-Time Communication) library designed for building\nscalable real-time communication applications. It functions as a SFU.\nMediasoup can be extended and used in multiple client libraries of NodeJS and Rust.\n\nIn this blog, we will demonstrate a basic producer consumer setup using Mediasoup in NodeJS.\n\nAs mentioned above, SFU is a centralised entity which can receive and send streams to other clients.\nSo we need to have one server, acting as a SFU and at least one client in our example.\nIn our case, the client will act as a producer and it will also consume it's own stream.\n\nWe will use the `SocketIO` NodeJS module for the client and server to interact with each other.\nWe will not go deep into how `SocketIO` works but it's mainly a module which internally uses websockets or http\npolling to send and fetch data. We need websockets to establish a bi-directional channel as the server\nalso has to send information to the client.\n\n### Basic client server connection\n\n```js\n// On the server, to start the SocketIO server\nio = new Server(appServer, {\n  cors: {\n    origin: \"http://localhost:3000\", // Replace with your Next.js client's origin\n    methods: [\"GET\", \"POST\"], // Allow necessary HTTP methods\n    credentials: true, // Allow sending cookies, if needed\n  },\n});\n\n// To receive the connection\nio.on(\"connection\", (socket: Socket) => {\n  console.log(\"New client connected\", socket.id);\n\n  socket.on(\"disconnect\", () => {\n    console.log(\"Client disconnected\", socket.id);\n    removeNode(socket.id);\n  });\n});\n\n// On the client\nsocket = io(\"http://127.0.0.1:8080\");\nsocket.on(\"connect\", async () => {\n  console.log(\"Connected to signaling server\");\n});\n\n// To send a message on the connection,\nsocket.emitWithAck(\"consumer-resume\", { consumerId: consumer.id });\n```\n\n### Using your webcam and audio\n\nTo enable both video and audio on your machine, you can set `vide: true`\nand `audio: true` accordingly.\nOnce the streams are enable, then you can assign the stream to `localVideoRef`\nwhich is a reference to `HTMLVideoElement`.\n\n```js\nlet localVideoRef = useRef < HTMLVideoElement > null;\n\nconst getMedia = async () => {\n  try {\n    const localStream = await navigator.mediaDevices.getUserMedia({\n      video: true,\n      audio: true,\n    });\n\n    if (localVideoRef.current) {\n      localVideoRef.current.srcObject = localStream;\n    }\n    if (!localVideoRef.current) {\n      console.error(\"Local video element is not available\");\n      return;\n    }\n\n    localVideoRef.current.srcObject = stream;\n    const track = stream.getVideoTracks()[0];\n\n    console.log(\"Got MediaStream:\", localStream);\n  } catch (error) {\n    console.error(\"Error accessing media devices.\", error);\n  }\n};\n```\n\nThe `localVideoRef` can then be displayed in the DOM in this way:\n\n```bash\n<video\n  className=\"mx-5\"\n  width=\"40%\"\n  ref={localVideoRef}\n  autoPlay\n  muted\n></video>\n```\n\n### Setting up the Mediasoup server\nIn this section, we cover the server initialisation steps that it does to be ready\nto receive any kind of streaming traffic.\nBefore listening to any kind of incoming traffic, the server sets up the Mediasoup\nentities which start the required background processes and workers to process these\nconnections.\n\nSince the Mediasoup server is a standalone NodeJS server, we need to start a Mediasoup\n`Worker`. The worker represents a C++ subprocess that handles the heavy lifting of media processing.\nIt is the core component responsible for managing and manipulating audio and video streams.\n\n\n```js\ncreateWorker = async () => {\n  mediasoup.observer.on(\"newworker\", (worker: types.Worker) => {\n    console.log(\"new worker created [pid:%d]\", worker.pid, worker.appData);\n  });\n  const worker = await mediasoup.createWorker({\n    logLevel: \"debug\", // Set the general log level to debug\n    //logTags: [\"ice\", \"dtls\"],\n    appData: { foo: 123 },\n    //dtlsCertificateFile: \"./keys/cert.pem\",\n    //dtlsPrivateKeyFile: \"./keys/key.pem\",\n  });\n  return worker;\n};\n```\n\nThe Mediasoup `Router` is a core component that acts as the SFU for real-time media streams.\nIts primary function is to manage and route audio and video RTP packets between different\nparticipants (producers and consumers) within a given media session, often analogous to a \"multi-party conference room.\"\n\n```js\ncreateRouter = async () => {\n    this.worker.observer.on(\"newrouter\", (router) => {\n      console.log(\"new router created [id:%s]\", router.id);\n    });\n    const router = await this.worker.createRouter({ mediaCodecs });\n    return router;\n  };\n```\n\nIn the above snippet, the `createRouter` method is being passed a `mediaCodecs` hash.\nWe will cover that in the later sections.\n\n### Sending the RTP capabilities\n\nIn a previous snippet, we saw that the `router` was being passed a `mediacodecs` hash.\n\nRTP capabilities define the media formats and features that a WebRTC endpoint, like a mediasoup router,\ncan handle. They are essential for a server and client to negotiate and agree upon a common set of options\nfor transmitting real-time audio and video.  The capabilities describe what an endpoint is able to receive,\nwhile the RTP parameters specify what a producer endpoint is actually sending. The receiver's capabilities constrain the sender's parameters.\n\nThe process is:\n- The mediasoup router exposes its RTP capabilities via the router.rtpCapabilities property.\n- Client requests for the capabilities and the server sends the router's RTP capabilities to the client, which is running mediasoup-client.\n- The client-side mediasoup-client device is loaded with the server's capabilities. It then uses its own browser capabilities and the\nrouter's capabilities to determine the final, negotiated capabilities for the session.\n\n```js\n// On the client\nconst initConnectionWithServer = async (socket: Socket) => {\n  routerRtpCapabilities = await socket.emitWithAck(\"getRouterRtpCapabilities\");\n  deviceRef.current = new Device();\n  await deviceRef.current.load({ routerRtpCapabilities });\n};\n```\n\nThe `Device` is the central client-side object that represents a users local endpoint to connect \nto a mediasoup Router. It acts as the bridge between your client application and the mediasoup\nserver, handling the browser-specific WebRTC details for you.\n\nThe media codecs defined on the server:\n```js\nconst mediaCodecs: types.RouterOptions[\"mediaCodecs\"] = [\n  {\n    kind: \"video\",\n    mimeType: \"video/vp8\",\n    preferredPayloadType: 100, // Example payload type\n    clockRate: 90000,\n    parameters: {},\n    rtcpFeedback: [\n      { type: \"nack\" },\n      { type: \"nack\", parameter: \"pli\" },\n      { type: \"ccm\", parameter: \"fir\" },\n      { type: \"goog-remb\" },\n    ],\n  },\n  {\n    kind: \"video\",\n    mimeType: \"video/H264\",\n    clockRate: 90000,\n    parameters: {\n      \"packetization-mode\": 1,\n      \"profile-level-id\": \"42e01f\",\n      \"level-asymmetry-allowed\": 1,\n    },\n  },\n];\n```\n\nIn the above example, we have 2 blocks for video codecs. The negotiation will try to check which codec capability\nis available in both the client and the server and is chosen.\n\n### Setup the `sendTransport` and `recvTransport` methods\n\nWe are now getting to the fun stuff.\nPrior to this, we have setup the server side entities like the router and the worker. On the client side, we have\nsetup the mediasoup device which takes care of incoming connections.\nThese entities handle the connections that we will be making hereforth.\n\nThe communication between a Mediasoup client and server and vice-versa is unidirectional. This means, that for every stream\nthat the client sends to the server, there needs to be a specific transport and if the client wants to receive a stream\nfrom the server, the server needs to open up another transport.\n\nAnother point of note is that every transport has a sender and a receiver.\nTo send a stream, the specific entity has to call `sendTransport` and to receive a stream, it has to call `recvTransport`.\n\nOne point of differentiation from other networking libraries that I have worked with, is the Mediasoup requires both the server\nand the client to have it's own version of `sendTransport` and `recvTransport` for every stream it works on.\n\nThe below code snippet in the client emits a `createWebRtcTransport` call to the server.\nThe server creates a `WebRtcTransport` and passes back the `transport.id` to the client.\nIt also sends back `ice` and `dtls` parameters back to the client.\n\nUsing these information, the client's `Device` object also creates a corresponding transport entity using the\n`createSendTransport` method.\n\nOnce the transport is created on both the client and the server, the client calls `sendTransport.produce` with the track information. \nCalling `produce` on the `sendTransport` on the client emits the `connect` and the `produce` messages on which the server also\ncreates the required transport connections.\n\n```js\n// In the client\n\nconst createSendTransport = (socket: Socket) => {\n    socket.emit(\n      \"createWebRtcTransport\",\n      { sender: true },\n      ({ params }: { params: any }) => {\n        if (params.error) {\n          console.log(params.error);\n          return;\n        }\n        if (deviceRef.current == null) {\n          console.error(\"Device is not initialized yet in createSendTransport\");\n          return;\n        }\n        sendTransport = deviceRef.current.createSendTransport(params);\n\n        sendTransport.on(\n          \"connect\",\n          async ({ dtlsParameters }, callback, errback) => {\n            try {\n              await socket.emit(\"transport-connect\", {\n                dtlsParameters,\n                transportId: sendTransport.id,\n              });\n\n              // Tell the transport that parameters were transmitted.\n              callback();\n            } catch (error: any) {\n              errback(error);\n            }\n          }\n        );\n\n        sendTransport.on(\"produce\", async (parameters, callback, errback) => {\n          try {\n            await socket.emit(\n              \"transport-produce\",\n              {\n                kind: parameters.kind,\n                rtpParameters: parameters.rtpParameters,\n                appData: parameters.appData,\n                transportId: sendTransport.id,\n              },\n              ({ id }: { id: any }) => {\n                callback({ id });\n                // Uncomment if you want to create a client receiver\n                //createClientReceiver(socket, id);\n              }\n            );\n          } catch (error: any) {\n            errback(error);\n          }\n        });\n        connectSendTransport();\n      }\n  )}\n\nconst connectSendTransport = async () => {\n  producer = await sendTransport.produce(params);\n  console.log(\"Producer created:\", producer.id, producer.kind);\n\n  producer.on(\"trackended\", () => {\n    console.log(\"track ended\");\n  });\n\n  producer.on(\"transportclose\", () => {\n    console.log(\"transport ended\");\n  });\n};\n```\n\n\n```js\n// In the server\nsocket.on(\"createWebRtcTransport\", async (data, callback) => {\n  console.log(\"Received createWebRtcTransport\", data, callback);\n  const transport: mediasoupTypes.WebRtcTransport = await rtc.createWebRtcTransport();\n  callback({\n    params: {\n      id: transport.id,\n      iceParameters: transport.iceParameters,\n      iceCandidates: transport.iceCandidates,\n      dtlsParameters: transport.dtlsParameters,\n    },\n  });\n});\n\nsocket.on(\n  \"transport-connect\",\n  async ({ dtlsParameters, transportId }) => {\n    console.log(\"Received transport-connect\");\n    await sendTransports[transportId].connect({ dtlsParameters });\n  }\n);\n\nsocket.on(\n  \"transport-produce\",\n  async ({ kind, rtpParameters, appData, transportId }, callback) => {\n    console.log(\"Received transport-produce\");\n    const producer: mediasoupTypes.Producer = await sendTransports[\n      transportId\n    ].produce({\n      kind,\n      rtpParameters,\n    });\n\n    console.log(\"Producer ID: \", producer.id, producer.kind);\n\n    producer.on(\"transportclose\", () => {\n      console.log(\"transport for this producer closed \");\n      producer.close();\n    });\n\n    // Send back to the client the Producer's id\n    callback({\n      id: producer.id,\n    });\n    registerNewProducer(producer);\n  }\n);\n```\n\nThis sets up the stream connection from the client to the server.\nAs mentioned previously, we also want the client to receive the video stream from the server.\n\nThis requires the client and server to create receiver transports respectively.\nTo create a receiver transport, both the client and the server have to call `createRecvTransport`\non their respective connections.\n\nOne major difference in the sender and the receiver flow is the receiver also needs the `producer`'s\ninformation to start receiving traffic. The producer information is the same information that the server\ncaptured in the first flow when the client was sending streams to the server.\n\nRemaining of the receiving traffic is pretty similar to the sending traffic flow.\n\n\n### Dump of the entire Flow for easier visualisation\n\n```bash\n// Initialisation\nclient -> connect websocket -> server\nclient -> getRtpCapabilities -> server\nclient -> createDevice\nclient -> initiate mediastream and reference the stream in the video tag\n\n// Sending data from client to server\nclient -> createWebRtcTransport -> server\nserver -> createSendTransport\nclient -> createSendTransport\nclient -> sendTransport.produce\nclient -> transport-connect -> server\nclient -> transport-produce -> server\nserver -> sendTransport.connect\nserver -> sendTransport.produce\n\n// Receiving data from server to client\nclient -> createWebRtcTransport -> server\nserver -> createRecvTransport\nclient -> createRecvTransport\nclient -> consume -> server\nserver -> recvTransport.consume\nclient -> recvTransport.consume\nclient -> transport-recv-connect -> server\nserver -> recvTransport.connect\nclient -> consumer-resume -> server\nserver -> consumer.resume\n\n```\n\n## Conclusion\n\nMediasoup is a pretty nifty module to setup as a SFU.\nThe code snippets in the post can be found [here](https://github.com/gsarmaonline/mediasoup-basic).\n\nAnother interesting point that this post doesn't cover is how to debug when your webrtc streams don't work\nas expected. I plan to create another post where I capture the best way to debug if a webrtc stream has been\nsetup the right way.\nFor example, I did face multiple issues while setting up the `dtls` parameter because of a server configuration.\nIdentifying the actual problem is important to solve the issue for which webrtc has great tooling support.\n\n## References\n\n- https://github.com/gsarmaonline/mediasoup-basic\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.",
            "url": "https://gauravsarma.com/posts/2025-08-15_streaming-with-webrtc-and-mediasoup",
            "title": "Streaming with WebRTC and Mediasoup",
            "summary": ". [Streaming with WebRTC and Mediasoup](streaming-with-webrtc-and-mediasoup-cover...",
            "date_modified": "2025-08-15T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-07-01_experimenting-with-the-swap-space",
            "content_html": "\n![Experimenting with Swap Space](experimenting-with-the-swap-space-cover.png)\n\n## What exactly is the Swap Space?\n\nLinux system uses the Virtual File System (VFS) to abstract the physical memory available from the memory\nvisible to the linux processes. The VFS is responsible for allocating, freeing and managing memory for the\nmultiple processes running on linux.\n\nToday's machines have RAMs of more than 2GB in almost all devices. However, memory was pretty constrained a few\ndecades ago and it is still the case in embedded systems. In these devices, one has to deal with memory of less than\na few MBs. \n\nThe VFS ensures that memory is efficiently allocated to the processes. However, in cases when there is no more memory\navailable and the system is under intense memory pressure, the entire system may crash or the kernel may throw an Out of Memory panic.\n\nTo guard against this and to provide a little bit of a breating space to the VFS to react accordingly, the Swap space was created.\n\nThe swap space is to temporarily use the physical disk as well for memory usage.\nEverything happens transparently to the process. The process requests for memory as usual, and the VFS utilises the Swap space for memory,\nwhich is actually stored on the disk.\n\nYou can look at the swap space by running\n```bash\nfree -m\n```\nor by looking at the `swapfile`.\n```bash\ncat /proc/swaps\n```\n\n`vmstat` is another command to track your swap space activity.\n\n\n## Anonymous memory\n\nWhen a process reads or writes to a file, the kernel loads the content from the disk and stores it in the page cache memory.\nIf the page isn't accessed for sometime, the kernel usually flushes the changes to the disk and removes it from the main memory.\n\nNow let's assume that your process allocates some memory on the heap and it doesn't get used very frequently. In this case, the kernel\ncannot flush it to any file since there is no backing file. \nThis is another area where the swap space is of use. The swap space can be used to back up less frequently used anonymous memory to the swap\nspace as well.\n\nThis frees up the actual main memory for more important use cases.\n\nTo check the anonymous memory regions of a process, use this:\n```bash\nless /proc/<PID>/smaps\n```\nand grep by the `anon` section.\n\n## Swapiness configuration\n\nYou can also configure how aggressively the data is stored in swap space via the `swappiness` parameter.\nJust run the below command to do so:\n```bash\nsudo bash -c \"echo 'vm.swappiness = 15' >> /etc/sysctl.conf\"\n```\n\n## Performance impact of using the Swap\n\nThe performance of storing anything on a swap space is as bad as disk performance compared to memory. \nIt becomes especially bad if the memory is in constant churn, i.e you are continously allocating and deallocating memory on the swap space.\n\nOne important thing to note that we observed while allocating on swap space, is that once something gets allocated on the swap, it is not removed\nfrom the swap space even if there is free memory lying around. \nWhat this means is that if you have spiky memory traffic where you routinely consume the entire memory, you may tend to store things on the swap\nmore often than you would expect.\n\n## Do we actually need swap?\n\nComing from the database world, it is advised that the swap space should be disabled for all nodes, especially since the memory available to the\nproduction machines are at least more than 16GB.\nMost data oriented processes usually tend to store as much data in memory as possible for faster access.\nThey also keep track of the memory usage of the system to not allocate above a certain percentage of available memory.\n\nFor example, Mongo doesn't use more than 80% of the available RAM.\n\nAlso, if data does get into the swap space, the variability of the performance across memory and disk would be difficult to debug and may result\nin edge cases more easily.\n\nTo turn the swap on or off, you can run\n```bash\n# To turn swap on\nswapon -a\n\n# To turn swap off\nswapoff -a\n```\n\n## References\n\n- [Mmap in database](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf)\n- [Linux Address space](https://www.gauravsarma.com/posts/2018-03-02_Linux-Address-Space-45e1d0aa8c86)\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-07-01_experimenting-with-the-swap-space",
            "title": "Experimenting with the Swap Space",
            "summary": ". [Experimenting with Swap Space](experimenting-with-the-swap-space-cover...",
            "date_modified": "2025-07-01T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-04-22_mmap-effects-in-databases",
            "content_html": "\n![mmap Effects in Databases](mmap-effects-in-databases-cover.png)\n\nOnce upon a time, I was working on an in-memory datastore and creating the memory pool for the datastore.\nAs usual, there were multiple design ideas that people put forth to solve it.\n\nSomebody mentioned about using `mmap` to keep data in memory backed by a file on the disk. Having worked on a\nnetworking system using `mmap` for its ring buffers, I was enthusiastic about using it.\nHowever, folks mentioned that `mmap` shouldn't be used for databases. I remembered that DBs like `MongoDB` and\n`LevelDB` had moved away from using `mmap` based storage engines to managing the memory themselves.\n\nThis blog will cover how most databases store their data on the operating system and their memory usage and\nhow `mmap` would interact with the database operations.\n\n## Data layers for a write operation in the database\n\nEvery system needs a storage device like a `SSD`, `HDD`, `NAS`, etc to actually store data.\nGiven the plethora of operating systems and different storage devices, database systems don't interact with\nthe storage device directly.\nInstead, the database system leverages the `Virtual File System (VFS)` provided by the operating system\nto interact with the actual storage layer.\nEvery layer has their own optimisations to provide better performance to the user.\nFor example, when you write to a file in normal mode, the data is not written to the storage device synchronously.\nInstead, the operating system has data buffers which is flushed to the storage device periodically.\nThis means that if you emit a `write` system call and there is a system crash, there are chances of data failure.\n\nThat's one of the main reasons why developers call `fsync` after every major write operation when they want to\nensure the data resiliency of the system. A great note I received on Twitter was that one should also call `fsync`\nwhen they read the data as there may be a discrepancy in the data buffer and the actual storage file.\n\nThe operating system writes data to the storage device in chunks or blocks called `pages`. The common block size for\nmost operating systems is 4kB. \n\nThe database layer also has its own blocks of storage called `pages` and they are different from the operating system's pages.\nThe data belonging to different tables are usually stored in different `pages`. Most pages follow the data layout of\n`slotted pages` to support variable length tuples with different compaction strategies.\n\n## Introducing `mmap` to the mix\n\n`mmap` is used to map contents from the secondary storage directly to main memory, thus leading to faster access and better\nperformance since the contents are accessed from the main memory if there is sufficient space in the operating system\npage cache.\n\nThe operating system removes or adds the data from the main memory based on the usage. This means that mmaped data\ncan be removed transparently in case the memory is needed for other use cases.\n\nDatabase systems try to optimise query performance by various methods like data prefetching of sequential data,\nscan sharing, etc. To do the same thing in `mmap`, the `madvise` system call can be used for different kinds of data\naccess like sequential or random access using `MADV_SEQUENTIAL` or `MADV_RANDOM` flags. However, it is not sufficient for\ncommon SQL queries operating on ranges and orders since it doesn't have context of where the data is stored in the pages.\n\nAnother disadvantage is that `mmap` is not designed to have asynchronous reads which means that the data access blocks\nthe fetching which can be easily circumvented in a traditional buffer pool using `io_uring` or `libaio`.\n\nIn certain cases, the database may want to keep certain pages in memory for a certain duration. To do this with `mmap`,\nthere is a system call `mlock` which attempts to keep the page locked in memory. However, in cases of operation system\nmemory overload, the pages can still be removed which leads to page faults, thus leading to slower IO.\n\nThe database system usually uses the buffer pool to store data temporarily till it decides to commit the data to\ndisk. Using `mmap`, the control over this process is minimal since the operating system can decide to flush the dirty\npage to the secondary system anytime. In cases of concurrent transactions, the database system using `mmap` would be\nunable to detect data conflicts to either commit or rollback the data.\n\nThe [paper](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf) does a performance comparison between random\nand sequential reads for `mmap` and buffer pools implemented by the database. In both cases, it was seen that the\nperformance is comparable in the beginning of the load test. However, when the operating system cache becomes full\nand the pages are removed from the cache, the performance of `mmap` becomes quite bad compared to a traditional\nbuffer pool, especially when multiple storage devices are involved.\n\n## Conclusion\n\nMost popular databases like MongoDB and LevelDB have moved away from `mmap` based memory management as it remains\nunpredictable on how the operating system is going to behave in different scenarios.\nThere are certain smaller use cases within database systems to use `mmap` for small data transfers which are not\nperformance and transaction dependent. \nBut if you are implementing a serious traditional database, it's better to stay away from `mmap`.\n\n\n## References\n\n- [Mmap in database](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf)\n- [CMU Database Systems playlist](https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_otTrBTrjyohi)\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-04-22_mmap-effects-in-databases",
            "title": "Mmap effects in databases",
            "summary": ". [mmap Effects in Databases](mmap-effects-in-databases-cover...",
            "date_modified": "2025-04-22T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-04-04_understanding-and-scaling-raft",
            "content_html": "\n![Understanding and Scaling Raft](understanding-and-scaling-raft-cover.png)\n\nIf you have worked with distributed systems, you must have come across this famous quote:\n\n```\nIn a distributed system, the only thing two nodes can agree on is that they can't agree on anything.\n```\n\nThis quote has stuck with me because it perfectly captures the complexity of building distributed systems. The challenge of getting multiple machines to agree on a single value is so fundamental that it has spawned entire fields of research. Today, we'll look at one elegant solution to this problem - the Raft consensus algorithm.\n\n## Understanding Raft: A Simplified Approach to Consensus\n\nRaft was designed with understandability as its primary goal. Unlike more complex consensus algorithms, Raft divides the consensus problem into three relatively independent subproblems:\n\n### Leader Election\n\nIn Raft, a single server acts as the leader, managing client requests and serving as the source of truth:\n\n- Each server starts as a Follower\n- If no leader is heard from, a Follower becomes a Candidate\n- The Candidate requests votes from others\n- If it gets majority votes, it becomes the Leader\n- If two candidates split the vote, a new election starts\n\n### Log Replication\n\nOnce a leader is elected, it manages all client requests:\n\n- The leader receives all client requests\n- It adds them to its log\n- It replicates this log to followers\n- Once a majority confirm, the entry is committed\n\nThis approach ensures that all operations happen in the same order across all servers, providing consistency in the distributed system.\n\n### Safety Guarantees\n\nRaft ensures several critical safety properties:\n\n- Election Safety: at most one leader can be elected in a given term\n- Leader Append-Only: a leader never overwrites or deletes entries in its log\n- Log Matching: if two logs contain an entry with the same index and term, they are identical\n- Leader Completeness: if an entry is committed, it will be present in the logs of all future leaders\n\n## The Cost of Consensus\n\nThere's a reason why we don't use Raft for everything. Each write operation requires:\n\n- 1 round-trip to leader\n- 1 round-trip to followers\n- 1 round-trip back to client\n\nIn a 5-node cluster, a single write operation involves at least 3 network round-trips, making it relatively expensive for high-throughput systems.\n\n## Scaling Raft: The Multi-Raft Pattern\n\nTo scale Raft to larger systems, the Multi-Raft pattern has emerged as a key architecture in distributed databases like CockroachDB and TiDB:\n\n- Data is split into ranges/shards\n- Each range has its own independent Raft group\n- Different leaders for different ranges allow parallel operations\n\nThis pattern allows systems to scale horizontally while maintaining strong consistency guarantees.\n\n### Key Optimizations in Multi-Raft Systems\n\n#### Resource Sharing \n\nMultiple Raft groups on the same node can share resources:\n\n- Thread pools for processing\n- Batched disk writes across groups\n- Unified network connections\n- Shared memory pools\n\n#### Message Batching\n\nTo reduce network overhead:\n\n- Multiple heartbeats combined into single packets\n- Log entries from different groups bundled together\n- Responses aggregated for efficient network usage\n\n#### Range Leases\n\nTo optimize read operations:\n\n- Long-term read delegation to a single node\n- Reads don't need full Raft consensus\n- Significantly reduces read latency\n\n#### Dynamic Range Management\n\n- Hot ranges split automatically\n- Cold ranges merge to reduce overhead\n- Load-based splitting for better distribution\n\n## Challenges with Multi-Raft\n\nWhile Multi-Raft solves scaling problems, it introduces complexity:\n\n### Operational Complexity\n\nRunning thousands of Raft groups means:\n\n- More state to track and debug\n- Complex failure scenarios\n- Increased monitoring overhead\n\n### Resource Management\n\nEach Raft group consumes resources, and managing thousands requires careful planning:\n\n- Memory for in-flight operations\n- Disk space for logs\n- Network bandwidth for replication\n- CPU for leader election and log processing\n\n### Cross-Range Transactions\n\nWhen operations span multiple ranges:\n\n- Atomic commits across Raft groups become necessary\n- Coordination protocols add complexity\n- Higher latency for distributed transactions\n- Increased chance of conflicts and retries\n\n## Best Practices for Using Raft\n\n### When to Use Raft\n\nUse Raft when:\n- Strong consistency is required\n- The system can tolerate some latency\n- The dataset can be partitioned effectively\n\n### When Not to Use Raft\n\nConsider alternatives when:\n- Eventual consistency is sufficient\n- Ultra-low latency is critical\n- The system has extremely high write throughput\n\n### State Machine Management\n\nYour Raft implementation needs to consider:\n- Log compaction strategies\n- Snapshot mechanisms for large state machines\n- Efficient recovery processes\n\n### Handling Network Partitions\n\nThe system should be designed to handle:\n- Temporary partitions without data loss\n- Leader isolation scenarios\n- Recovery from network failures\n\n## Conclusion\n\nThe beauty of Raft lies in its simplicity. While other consensus algorithms might be more efficient in specific scenarios, Raft's understandability makes it the go-to choice for many distributed systems.\n\nRemember:\n```\nConsensus is expensive. Use it only when you absolutely need it.\n```\n\nTake the time to understand when you need strong consistency versus when eventual consistency is good enough. Your system's scalability might depend on it.\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-04-04_understanding-and-scaling-raft",
            "title": "Understanding and scaling Raft",
            "summary": ". [Understanding and Scaling Raft](understanding-and-scaling-raft-cover...",
            "date_modified": "2025-04-04T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-02-20_how-safe-is-your-fsync",
            "content_html": "\n![How Safe Is Your fsync?](how-safe-is-your-fsync-cover.png)\n\nHave you ever wondered how durable are your writes? Do you expect calling `fd.write` would persist the data across crashes\nand reboots?\nOh, you use `fsync` after your writes and you don't have any chance of losing your data now?\nThis post is going to break your trust on `fsync`, just like it broke mine and it's going to be fun!\n\n## The journey of a write operation on Linux\nBefore getting into the details, let's understand the components which are involved in your write operation.\nWhen you issue a `write` command on your file descriptor, the data is mainly copied from the user space to the kernel space\ninto the operating system's buffers.\nThe kernel doesn't write the data directly to storage on receiving the `write` operation. It just marks the pages as dirty\nand returns back a success to the user.\nThe kernel periodically detects that there is dirty data in the kernel's page buffers and writes the data lazily in batches trying\nto optimise the write throughput.\nWhile flushing data to the storage, the kernel uses the virtual file system `ext4`, `xfs` to write the data to the actual\nstorage device (`hdd`, `ssd`). \nOnce the VFS flushes the data to the storage, it returns a success response to the kernel. When the kernel receives the\nsuccess response, it marks the dirty buffer as clean.\nThe storage device also maintains a chache\nThis is the journey of a successful write request in the kernel.\n\n## Different types of write configurations\nThere are cases when the application wants to ensure that the data is written to the storage device. \nThere are different options for persisting the data to a certain layer as per the write operation journey. Also, different\nfilesystems and operating systems may have different behaviours as well.\n\nThe following image is taken from [this blog](https://transactional.blog/how-to-learn/disk-io) and it has far more information\nregarding the different configurations.\n\n\n| Operation                         | Application Memory | Page Cache        | Disk Cache        | Disk Storage     |\n|-----------------------------------|-------------------|-------------------|-------------------|------------------|\n| **File Integrity**                |                   |                   |                   |                  |\n| `write()`                         | ● →               | →                 |                   |                  |\n| `O_DIRECT`                        | ● --------------- | ................  | →                 |                  |\n| `+ O_SYNC`                        | ● --------------- | ................  | →                 | →                |\n| `fsync()`                          | ● →               |                   | →                 | →                |\n| **Data Integrity**                |                   |                   |                   |                  |\n| `O_DIRECT + O_DSYNC`              | ● --------------- | ................  | →                 | →                |\n| `fdatasync()`                      | ● →               |                   | →                 | →                |\n| `sync_file_range (btrfs, zfs)`     | ● →               |                   | →                 | →                |\n| `sync_file_range (ext4, xfs)`      | ● →               |                   | →                 | →                |\n\n\nInterpreting the above table, when you perform a `write`, it is written only to the `page cache` layer.\nIf you pass the `O_DIRECT` flag, it skips the page cache layer and directly writes it to the `disk cache` layer.\n`O_SYNC` ensures that all metadata and data contents of a file is synced to the disk. Calling `O_SYNC` ensures the disk\nstorage is also updated.\n`O_DSYNC` ensures only the data content of a file is synced to the disk. The metadata of the file is not synced immediately.\n\nWhen data is appended to a file, the size of the file increases and the page blocks representing the file also increases.\n`sync_file_range` ensures that the additional blocks are also synced to disk.\n\n## When `fsync` fails\n\nWe have seen above what happens when `fsync` is performed successfully.\n```\nThe fsync man pages [9] report that fsync may fail for\nmany reasons: the underlying storage medium has insufficient\nspace (ENOSPC or EDQUOT), the file descriptor is not valid\n(EBADF), or the file descriptor is bound to a file that does not\nsupport synchronization (EINVAL)\n```\n\n\nLet's see what happens when `fsync` doesn't execute successfully.\n\n\n### Enter `fsyncgate` on PostgreSQL\nIn 2018, there was a critical `fsync` bug discovered by the PostgreSQL developers due to mishandling and vague understanding of what\nthe `fsync` command does. You can read the entire thread [here](https://danluu.com/fsyncgate/).\n\nOne PostgreSQL user mentioned that a storage error resulted in data corruption on XFS.\nWhat was observed during investigation was that the PG wrote some data to the kernel where the pages were dirtied which was then written to the\nstorage device. The storage device returned an error which resulted in the writeback page being marked as failed (EIO) by the XFS layer. \nWhen the PG layer tries to call the `fsync` operation, it receives an `EIO` error to mention the previous error had failed.\nOnce the error was received, the kernel clears `AS_EIO` page error. This means that when PG retries the checkpointing process, the `fsync` operation\nreturns back a success response.\nThis results in an error where the checkpointing returns back a success response without actually writing the data to the disk, thus leading to data loss.\n\nThe above problem would have been dealt with differently in `ext4` with `errors=remount-ro` as the filesystem would have forced a remount when a storage\ndevice error is encountered.\n\nPostgreSQL solved this issue subsequently by mimicking the `ext4`'s implementation where any `fsync` error would crash the process, thus forcing it to\nreread from the checkpointed file with fresh memory pages without having to worry about the if the failed pages would still be in memory.\n\nRedis periodically updates the `aof` file to keep track of the updates. When flushing data to the `aof` file, it doesn't check for the fsync status code\nitself, thus allowing corrupt keys to be stored in the page cache and in memory, whereas the `aof` file is not written to successfully. This will result in data corruption\nwhen the server has to restart the process and reads the `aof` file.\n\n\n### fsync Failure Analysis on different filesystems\n\nThis [paper](https://www.usenix.org/system/files/atc20-rebello.pdf) experiments different types of workfloads on different filesystems to uncover how\n`fsync` failures are handled.\n\nWe will cover the experiments on 3 filesystems\n- `ext4`\n- `xfs`\n- `btrfs`\n\nWhen a write operation is done, the data is put in the page cache and `ext4` marks the pages as dirty. On calling `fsync`, the data is written to the storage\nblock and the metadata, which is the inode with the new updated at time, is updated. The pages are then marked as clean and no errors were encountered.\n\n#### How different filesytems behave with a failed `fsync`\n\nFor `ext4`, when the `fsync` call fails, the metadata is not updated but then dirty pages are still marked as clean. Since the pages are marked as clean, the subsequent\n`fsync` is able to update the inode entry with the updated at time as well. If the application reads the data before booting up and the pages are still in the\ncache, it will see the newly updated information even though the `fsync` operation failed. If the application reads the same data after booting up after the pages\nare removed from the cache, the application would see the older data since the actual write was never persisted to the disk.\n\nThe `xfs` filesystem behaves similarly with `ext4` except that when a `fsync` failure happens, it shuts down the filesystem entirely, thereby blocking all read and write \noperations. It also retries the metadata updates when it encounters a checkpointing fault.\n\n`btrfs` which is a copy-on-write filesystem, writes to a log tree to record the `fsync` changes instead of updating the journal in-place. Instead of overwriting on the same\nblock, `btrfs` creates a new block and then updates the block links in the root.\nGiven that it maintains a different copy of the old and new data, `btrfs` is able to revert back to the old state when a `fsync` failure is encountered, unlike `xfs` and `ext4`.\n`btrfs` does not persist metadata after a data-block failure. However, because the process filedescriptor offset is incremented, future writes and fsyncs cause a hole in the middle of the file.\n\nThe FreeBSD VFS layer chooses to re-dirty pages when there is a failure (except when the device is removed) while Linux hands over the failure handling responsibility to the individual file\nsystems below the VFS layer.\n\nAll the filesystems mentioned above were affected by `fsync` failures by either the wrong data being read, or incorrect state, or filesystem unavailability.\nBelow is a tabulation of how the different filesystems were impacted by the `fsync` failures.\n\n| Filesystem | Mode    | Q1 (Which block failure causes fsync failure?) | Q2 (Is metadata persisted on data block failure?) | Q3 (Which block failures are retried?) | Q4 (Is the page dirty or clean after failure?) | Q5 (Does the in-memory content match disk?) | Q6 (Which fsync reports the failure?) | Q7 (Is the failure logged to syslog?) | Q8 (Which block failure causes unavailability?) | Q9 (What type of unavailability?) | Q10 (Holes or block overwrite failures? If yes, where?) | Q11 (Can fsck help detect holes or block overwrite failures?) |\n|-----------|---------|---------------------------------|-------------------------------------|---------------------------------|----------------------------------|----------------------------------|---------------------------------|------------------------------|----------------------------------------|------------------------------|-------------------------------------------------|-----------------------------------------------|\n| ext4      | ordered | data, jrnl                     | yes (A)                            | -                               | clean (B)                        | no (B)                          | immediate                       | yes                          | jrnl                                       | remount-ro                  | NOB, anywhere (A)                             | no                                           |\n| ext4      | data    | data, jrnl                     | yes (A)                            | -                               | clean (B)                        | no (B)                          | next (C)                        | yes                          | jrnl                                       | remount-ro                  | NOB, anywhere (A)                             | no                                           |\n| XFS       | -       | data, jrnl                     | yes (A)                            | meta                           | clean (B)                        | no (B)                          | immediate                       | yes                          | jrnl, meta                                 | shutdown                    | NOB, within (A)                              | no                                           |\n| Btrfs     | -       | data, jrnl                     | no                                 | -                               | clean                            | yes                             | immediate                       | yes                          | jrnl, meta                                 | remount-ro                  | HOLE, within (D)                             | yes                                          |\n\nNotes\n- **(A)** Non-overwritten blocks (Q10) occur because metadata is persisted despite data-block failure (Q2).\n- **(B)** Marking a dirty page clean (Q4) even though the content does not match the disk (Q5) is problematic.\n- **(C)** Delayed reporting (Q6) of fsync failures may confuse application error-handling logic.\n- **(D)** Continuing to write to a file after an fsync failure is similar to writing to an offset greater than file size, causing a hole in the skipped portion (Q10).\n\n\n## Conclusion\nWhile fsync is commonly trusted to ensure data durability, real-world cases like fsyncgate and studies on different filesystems show that its behavior is far from foolproof.\nThe handling of fsync failures varies significantly across filesystems—some may silently lose data, others may shut down entirely, and a few, like btrfs, attempt to mitigate\nfailures through copy-on-write mechanisms. This complexity underscores the need for applications to be aware of how their underlying storage behaves and to implement additional\nsafeguards where necessary. Understanding these intricacies can help prevent unexpected data loss and improve system resilience in the face of storage failures.\n\n\n## References\n- https://www.usenix.org/system/files/atc20-rebello.pdf\n- https://transactional.blog/how-to-learn/disk-io\n- https://danluu.com/file-consistency/\n- https://danluu.com/fsyncgate/\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-02-20_how-safe-is-your-fsync",
            "title": "How safe is your fsync?",
            "summary": ". [How Safe Is Your fsync...",
            "date_modified": "2025-02-20T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-02-16_Integrating-Snapshotter-with-a-memory-datastore-in-Golang",
            "content_html": "\n![Integrating a Snapshotter with a Go Memory Datastore](integrating-snapshotter-with-a-memory-datastore-in-golang-cover.png)\n\nThis is a follow up article to [this](https://gauravsarma.com/posts/snapshotting-in-a-high-throughput-shared-nothing-database/) post that\nI wrote about comparing Copy on Write and Redirect on Write mechanisms.\n\nThis post will cover a more practical example of how we can integrate Copy on Write techniques on a Golang in-memory datastore.\n\nA point in time snapshot refers to the copy of the existing data which is representative of\nthe data in the memory at that specific time. \n\n## Goals\n- Don't affect the throughput performance of the current request processing layer.\n- Ability to take multiple snapshot instances simultaneously.\n- Ability to snapshot and restore on systems with different shards\n- Shouldn't depend on existing data files apart from the in-memory data structures\n\n## Design\n\n### Dummy Store\nAs an example, we will implement a `DummyStore` which is a simple wrapper on top of the `map` store.\n\n### Implementing Copy on write\nThe snapshotting technique would be similar to the copy-on-write mechanism, ie, additional data\nwouldn't have to be stored till the data has to be modified. This means additional memory would\nonly be required if there are changes to the underyling data.\n\n### Impact on current latency benchmarks\n- For reads, there should be minimal latency change since there are no references to the `get`\nmethods even when snapshotting is running. One thing which may impact the read latency is that\nit has to iterate through all the keys, so an implicit lock inside the datastructure may be\nrequired.\n- For writes, if a snapshot is going on, then it has to write in 2 places and an additional read\nto a map.\n\n### Flow\n\nThe initiation flow:\n```bash\nShardThread::CallSnapshotter -> Snapshotter::Start -> Store::StartSnapshot -> SnapshotMap::Buffer\n-> PITFlusher::Flush\n```\n\nWhen the iteration is over\n```bash\nStore::StopSnapshot -> SnapshotMap::FlushAllData -> PITFlusher::FlushAllData -> Snapshotter::Close\n```\n\n### Changes for ShardThread and Store\nThe snapshot would start on every `ShardThread` and fetch the `Store` object. Every `Store` object\nneeds to implement the interface `SnapshotStore` which is contains the `StartSnapshot` and `StopSnapshot`\nmethods.\nThe `StartSnapshot` and `StopSnapshot` methods would be called on the store from the snapshotter object.\n\n#### StartSnapshot\nWhen the `StartSnapshot` method is called, the `Store` should keep note of the `SnapshotID` in a map.\nThere can be multiple instances of snapshots for every store as well.\nFor any read or write operation which is performed, the `Store` object should check if a snapshot is being\nrun at that instance. If no snapshot is being run, then continue as usual.\nIf a snapshot is being run, then for any subsequent write operation, store the previous data in the snapshot's\nobject, maybe a map. Let's call this the `SnapshotMap`. If there are multiple write operations to the same object\nand the data already exists in the `SnapshotMap`, then skip doing anything for the snapshot.\nSimilarly, for reads, if a snapshot is being run, if the incoming request is from a snapshot layer, then check\nif there is anything in the `SnapshotMap` for the key. If no, then return the current value from the `Store`.\n\nIt should fetch the list of keys in its store attribute and iterate through them.\n\n#### StopSnapshot\nWhen the iteration through all the keys by the `Store` object is done, the `StopSnapshot` method is called by the\n`Store`. The `StopSnapshot` lets the `SnapshotMap` know that there are no more updates coming. The `SnapshotMap`\nthen talks to the `PITFLusher` to finish syncing all the chunks to disk and then closes the main snapshot\nprocess.\n\n### Point-in-time Flusher\nThe `PITFlusher` serializes the store updates from the `SnapshotMap` to binary format, currently `gob`.\nIt serializes and appends to a file.\n\n\n## Implementation\nLet's write some code now.\n\nThe main snapshot object is defined as follows:\n```go\ntype (\n\tSnapshotStore interface {\n\t\tStartSnapshot(uint64, Snapshotter) error\n\t\tStopSnapshot(uint64) error\n\t}\n\tPointInTimeSnapshot struct {\n\t\tctx        context.Context\n\t\tcancelFunc context.CancelFunc\n\n\t\tID uint64\n\n\t\tstore SnapshotStore\n\n\t\tSnapshotMap *SnapshotMap\n\n\t\tflusher *PITFlusher\n\n\t\tStartedAt time.Time\n\t\tEndedAt   time.Time\n\n\t\texitCh chan bool\n\t}\n)\n```\n\nEvery snapshot object has an ID which is used as the identity of the snapshot.\nThe `snapshot` object has two underlying tasks:\n- SnapshotMap\n- Flusher\n\nThe `SnapshotMap` is a temporary data store which the actual store object is provided a reference to.\nThe `DummyStore` object adds data to the `SnapshotMap` using `TempAdd` if there are any writes during a snapshotting process.\nWhile taking a snapshot, it checks if the data is present in the `SnapshotMap` because of any writes after the snapshotting\nprocess has been started.\n\n```go\ntype (\n\tSnapshotMap struct {\n\t\ttempRepr  map[string]interface{}\n\t\tbuffer    []StoreMapUpdate\n\t\tflusher   *PITFlusher\n\t\tclosing   bool\n\t\tmLock     *sync.RWMutex\n\t\ttotalKeys uint64\n\t}\n\tStoreMapUpdate struct {\n\t\tKey   string\n\t\tValue interface{}\n\t}\n)\n```\n\nThe `SnapshotMap` batches the writes in the form of array of `StoreMapUpdate` objects and passes it to the `Flusher` when\nthe batch size of `1000` updates is achieved.\n\nThe `Flusher` receives data in batches from the `SnapshotMap`. It serializes the data into a binary format. In the example,\nI am using `encoding/gob` to convert it to a binary encoding. I am planning to move it to Protobuf.\nThe `Flusher` then appends an open snapshot OS file with the serialized updates.\n```go\ntype (\n\tPITFlusher struct {\n\t\tctx        context.Context\n\t\tsnapshotID uint64\n\t\tupdatesCh  chan []StoreMapUpdate\n\t\texitCh     chan bool\n\t\tdlq        [][]StoreMapUpdate\n\n\t\ttotalKeys uint64\n\t\tflushFile *os.File\n\t}\n)\n```\n\nOnce the process is completed, the `Flusher` closes and syncs the file.\n\n### Test cases and benchmarks\n- Snapshot data less than the buffer size without any subsequent writes\n- Snapshot data less than the buffer size with localized subsequent writes\n- Snapshot data less than the buffer size with spread out subsequent writes\n- Snapshot data more than the buffer size without any subsequent writes\n- Snapshot data more than the buffer size with localized subsequent writes\n- Snapshot data more than the buffer size with spread out subsequent writes\n- Ensure current `get` path is not affected\n\n\n## Results\n\n```bash\n=== RUN   TestSnapshotWithoutChangesWithNoRangeAccess\n2025/02/17 09:10:38 Closing snapshot 281711000 . Total time taken 203.7495ms for total keys 1000000\n--- PASS: TestSnapshotWithoutChangesWithNoRangeAccess (0.67s)\n=== RUN   TestSnapshotWithoutChangesWithLowRangeAccess\n2025/02/17 09:10:39 Closing snapshot 896770000 . Total time taken 248.569833ms for total keys 1000000\n--- PASS: TestSnapshotWithoutChangesWithLowRangeAccess (0.66s)\n=== RUN   TestSnapshotWithChangesWithLowRangeAccess\n2025/02/17 09:10:40 Closing snapshot 554695000 . Total time taken 837.278666ms for total keys 1000000\n--- PASS: TestSnapshotWithChangesWithLowRangeAccess (1.25s)\n=== RUN   TestNoSnapshotWithChangesWithLowRangeAccess\n--- PASS: TestNoSnapshotWithChangesWithLowRangeAccess (1.15s)\n```\n\nRuns pretty fast when there are no writes while snapshotting.\nTakes around 190-230ms to snapshot and write 1000000 keys to disk without simultaneous writes.\nDefinitely slows down when there are writes in the system. \nTakes around 800-900ms to snapshot and write 1000000 keys to disk with low range simultaneous writes.\nAssumed as much in the below post since there are 2 additional read and write operations to maintain both the copies.\nIt can be improved as the temporary snapshot store I am using is a little inefficient. But may not matter since it finishes under a second most of the times.\n\n## References\n- The code for the above tests is available [here](https://github.com/gsarmaonline/pitsnapshot)\n- Previous blog in the series available [here](https://gauravsarma.com/posts/snapshotting-in-a-high-throughput-shared-nothing-database/)\n\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n",
            "url": "https://gauravsarma.com/posts/2025-02-16_Integrating-Snapshotter-with-a-memory-datastore-in-Golang",
            "title": "Integrating Snapshotter with a memory datastore in Golang",
            "summary": ". [Integrating a Snapshotter with a Go Memory Datastore](integrating-snapshotter-with-a-memory-datastore-in-golang-cover...",
            "date_modified": "2025-02-16T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-02-03_How-Request-processing-has-changed-over-the-years",
            "content_html": "\n![How Request Processing Has Changed Over the Years](how-request-processing-has-changed-over-the-years-cover.png)\n\nWhile browsing Twitter (it will never be X.com for me), I came across a tweet where it was being discussed\nhow an engineer from a big tech company was unable to explain how `async/await` works after\nworking on a particular language for 3+ years.\nThough I agree with the OP on this being an important topic, most engineers working in larger\ncompanies seldom get to work on the processing layer which may make it an opaque topic for most\nengineers out there.\nBack in 2016, I was working quite a lot with Ruby and Python services in an early age startup\nand I was trying to improve the requests per second metric. Somebody mentioned how NodeJS is \nmagnitudes faster than Ruby and I was both hurt (being a Ruby fanboy) and intrigued on why was\nNodeJS so much faster. \nI came across Ryan Dahl's (NodeJS creator) talk on how the event loop was architected to handle a far higher\nRPS compared other interpreted languages. \n\n\n## Anatomy of a request\nWhat are the tasks that are carried out during processing of a request?\n- Serialization/Deserialization\n- Validation\n- Mathematical calculations\n- High iteration count\n- File operations\n- Database read/write queries\n- Other API calls\n\nBefore we go deeper into the different processing architectures, we need to understand the different\nkinds of tasks and the implications they have.\nThe above tasks can be categorised into Blocking or Non Blocking calls.\nNon blocking calls are those which don't need to wait for or get blocked on other tasks. They carry\non to the end of the computation and don't release the CPU unless the kernel forcefully evicts them.\nBlocking calls are those tasks which needs to wait to get acknowledgement or result from another task.\nSome examples are matrix multiplication, loop iteration, serialisation/deserialisation, etc.\nDatabase calls, network calls, sleep calls, file operations are usually examples of those where the process is put into\na waiting state unless the result is returned back by the other task. Since the process is in a waiting\nstate, it itself gives up control of the CPU and waits for itself to get rerun when the result is returned\nback.\n\n### Cooperative vs Preemptive scheduling\nThis is one of the more important topics in Computer Science whenever concurrency is discussed.\nFrom chatgpt,\n```\nCooperative scheduling relies on tasks voluntarily yielding control back to the system, meaning a\nrunning task must explicitly pause (e.g., using yield or await) for others to run. This approach is\nsimpler but can lead to issues if a task doesn’t yield, causing the system to become unresponsive.\nPreemptive scheduling, on the other hand, allows the system to forcibly interrupt a task and switch\nexecution to another, ensuring fair CPU usage and preventing any single task from monopolizing resources.\n```\nWe will circle back to this concept as we discuss the different request processing architectures.\n\n## Timeline of request processing\n\n### Single process server\nI deployed my first Ruby on Rails app by just running `rails s` on a Ubuntu server. Rails came bundled\nwith a WEBrick server, which should ideally be used only for development purposes. Running `rails s` ideally\nwould spawn up a WEBrick server unless configured otherwise. WEBrick by default, starts a single process to handle\nthe requests received by it. \nDeploying it to production using the same approach showed how slow the application was when multiple people used the app simultaneously.\nOf course there was no asset pipeline and productisation of other parts, but I had just started out with webdev\nand I was unaware of the pitfalls.\nTo go into more details, why did the server slow down when there were multiple users?\nHaving multiple users meant that there were multiple concurrent requests to the server. \nSince WEBrick is a single process server, anytime a request comes to the server, it reads the requests, processes it and sends\nback the response. Once the response is sent back, it can handle the next request. So it can only process one\nrequest at a time.\nThis means that if the server takes 100ms on an average to complete the request processing, then the total time\ntaken by the server to handle 100 requests is 10 seconds. Pretty bad performance.\nI remember I faced a nasty bug where a particular HTTP API was calling another API on the same port. It was mainly because\nthe server was already blocked on answering the first request and the internal request was blocked because the process\nnever got freed up.\n\n### Multi process server\nThen came [Unicorn](https://rubygems.org/gems/unicorn/versions/5.1.0), which used to pre-fork multiple\nprocesses on booting up, depending on the configuration and the number of cores. This meant that there\nwere multiple processes which could process concurrent requests at any time.\nLet's take up the previous example and check how Unicorn improves upon it.\nIf you have 4 cores on your machine, Unicorn would be able to start at least 4 processes to handle the\nrequests which means service 100 requests of 100ms RTT would take around 2.5 seconds.\nThis brings us to an important question.  How many processes can you have on a system?\nThe number of processes that can operate concurrently is the RPS that can be achieved on the system.\n\n#### A quick segue to Context Switching\nMost kernels use preemptive scheduling to decide which processes to run. Usually, each CPU can run only\none process at a time. This is where context scheduling comes into the picture. The kernel depending on the\nnumber of processes queued up forces certain processes to sleep for some time and runs another waiting\nprocess. This is to ensure that no single process can hog the resources for a prolonged duration.\n\nBack to Multi processing. So if you have double the processes than the number of cores, there is a slight chance\nthat there would be an increase in the requests processed per second. This enhancement comes because the request\nmay be waiting on IO operations which means context switching the process in place of another process would help\nserve more requests.\nHowever, if the ratio of processes to cores is more than 4-5x, you would see no further improvement or deterioration\nsince the time spent in context switching would actually be higher compared to the time spent waiting for a single\nprocess, which means more contention for a finite number of cores.\n\nLet's assume the best improvement can be seen when the ratio of processes to cores is 2. This brings our\ncalculation to 1.25 seconds to process 100 requests. Still not as good.\n\n### Multi process multi thread server\nWhat would be the next level of optimisation possible? Adding more processes clearly has its limits as the load\nof context switching increases on the kernel.\nThis is where [Puma](https://github.com/puma/puma) comes into the picture.\nPuma supported Multi Process Multi thread request processing.\nWe already saw how multiprocessing was able to increase the throughput. Multi processing is the ability to spawn\nmultiple threads within the process itself. Like the previous processing architecture, every thread can process\na request. So, threads running within a process allows more concurrency for request processing.\nThis means that for a 4 core machine, we can have 4 processes with 4 threads running, which gives a concurrency\nof 16 requests per second. This means 100 requests of 100ms can be finished within 625ms.\n\nWhy can we increase the number of threads but not the number of processes?\nThreads are more lightweight compared to processes. They consume lesser resources since the threads share\ncertain sections of memory with the parent process itself. Since threads are lighter, the load of context\nswitching is also signficantly lower. Since threads share the few of the same memory sections, the process\nalso have more visibility and control over scheduling of the threads. This means that processes can\ncooperatively schedule threads to oeprate when the other thread is blocked or waiting on a resource.\n\n### Event loop server\nNow we get to the fun part. NodeJS. Whenever you would ask the reason behind NodeJS's speed, everyone would\nsay it uses the `single threaded asynchronous event loop` to process requests.\nWhat does an asynchronous event loop actually mean?\nNodeJS runs a single threaded server whose main responsibility is to handle the input and output of the requests\nand schedule IO operations.\nWhy is it encouraged to use Promises and callbacks in Javascript for `setInterval`, `setTimeout` and `http` methods?\nWhy is the concept of closures more widely seen in Javascript? (bias detected here)\n\nInstead of executing the callback in the same thread where it received the request, NodeJS delegates the task to other\nthreads. Once it delegates the task, it moves on to the other request without waiting for the answer from the delegated task.\nWhen the delegated task is complete, it returns back the response to the NodeJS main thread in the same IO queue.\nWhen the main thread encounters the returned response from the delegated task, it executes the callback defined in the\noriginal call. The main NodeJS thread only handles the IO mechanism of it.\nTo delegate tasks, NodeJS uses `libuv`, which is an asynchronous processing abstraction layer available on major OSes.\n\nSince the NodeJS runtime is single thread and mainly supposed for IO tasks, if you don't use callbacks, it will block\non the IO thread, thus disabling the ability to process further requests till the task has finished.\n\n```\nWhen an I/O operation (like reading a file or querying a database) starts, NodeJS does not block the main thread.\n`libuv` registers the operation with `epoll`, which waits for the OS to signal when the operation is ready.\n`epoll` notifies `libuv` when the operation is complete, and the NodeJS event loop processes the result.\nA callback is executed in the JavaScript runtime, allowing NodeJS to continue handling requests efficiently.\n```\n\n### Goroutine based processing \nWhen Golang released, there were lots of benchmarks done with NodeJS and it was observed that Go beat NodeJS in most\nof the benchmarks. Go was designed with concurrency as a first class object, leading to the inclusion of Goroutines,\nchannels, waitgroups, etc from the very beginning. \nGoroutines are extremely light weight threads with a stack size of 2kB by default with the potential to grow as the\nneed arises. Threads, comparatively, are far heavier with a default stack size of 1MB.\nThis lightweight nature of Goroutines meant that the Go runtime could very easily support thousands of requests per\nsecond by default since it doesn't have to limit its IO operations to a single thread as required by the NodeJS event loop. \n\n```\nFun fact: Goroutines were cooperative in the first few versions of Go. Goroutines would be scheduled out of the CPU\nonly when IO operations, sleep, channel operations, etc were encountered. Go's fully preemptive scheduler was introduced in Go 1.14.\n```\n\n### Bonus reading\nAs discussed in this colorful and wonderful [blog](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/), most\nlanguages with a `async` and `await` keyword defines the need for the result to be returned, ie the stack needs to unwind completely\nfor the function to return. The runtime usually doesn't support running another task till the ongoing task is completed. This is\nalways going to result in a bottleneck sometime or the other as anytime a task overshoots its usage, other tasks are going to be\naffected.\nC# implements `await` by letting the compiler break the function into multiple halves whenever it encounters the `await` keyword.\nLanguages like Python are more cooperative and have a single thread listener. There are also GIL (Global Interpreter Lock) limitations\nwith languages like Python, which prevents true concurrency. \n\n## References\n\n- [Ryan Dahl, Introduction to NodeJS](https://www.youtube.com/watch?v=EeYvFl7li9E)\n- [What color is your function](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/)\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.",
            "url": "https://gauravsarma.com/posts/2025-02-03_How-Request-processing-has-changed-over-the-years",
            "title": "How Request processing has changed over the years",
            "summary": ". [How Request Processing Has Changed Over the Years](how-request-processing-has-changed-over-the-years-cover...",
            "date_modified": "2025-02-03T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-01-27_Snapshotting-in-a-high-throughput-shared-nothing-database",
            "content_html": "\n![Snapshotting in a High-Throughput Shared-Nothing Database](snapshotting-in-a-high-throughput-shared-nothing-database-cover.png)\n\n## Snapshotting in a high throughput shared-nothing database\n\nWhile working on a Golang based in-memory database, I recently had to implement point in time\nsnapshots for the datastore. The in-memory database has a shared nothing architecture allowing\nit to run multiple goroutines, usually based on the number of available cores and the keys are\nallocated accordingly to the shard goroutines.\n\nHere are the requirements and considerations:\n1. Copy data from memory to disk periodically\n2. Ensure minimal disruption to the actual write operation\n3. Take into consideration the availability of multiple shards and the change in\nthe existing number of shards as well\n4. The snapshot doesn't have to keep up with the incoming writes after the snapshot\nprocess has started\n\nRight off the top of my mind, the most simplest approach seems to be having another\nprocess/goroutine reading from the existing memory where the data is being stored.\n\nFew problems with this approach is:\n1. Data has to be copied entirely to another process\n2. Memory can get to a corrupted state since the process is copying data\nwhile writes are still coming in.\n3. High bursts of CPU and memory on both the writer and the snapshotter process\n\n### How does Redis do it?\n\nRedis, being a single threaded server, uses `Copy on Write` (COW) mechanisms between parent\nand child processes to perform snapshots using `BGSAVE` or `SAVE`.\n\n`Copy on write` is a mechanism of sharing data across different actors where the data\nhas to be copied only when there is a change in the original memory and only in the memory\nsections where the change has happened. \nThis means that there is minimal change in the memory as write operations come in\nwhile snapshotting is taking place.\nThis also prevents the need to copy data from one process to another at one go, which prevents\nhigh resource consumption spikes in both the processes.\n\nWhenever a snapshot process has to be started, the Redis server calls a `fork()` and `exec()`\nwhich creates a child process which can then write it to the disk without affecting the\nparent process's resources.\n\n\n#### Testing CoW on a Golang process\nSince the in-memory database I am working on is in Golang, I tested the above hypotheses\nto check if this would work for snapshotting the data for a single thread/shard. We will come\nto the point where this may need to be modified to support multiple shards.\n\nThe test will contain the following steps:\n1. Initiate a Go process which allocates a large chunk of memory, let's say 1GB. Note the memory\nspike using `free -m` or `htop` or any other memory monitoring tool.\n2. Call `sycall.ForkExec()` and operate on the same large chunk of memory, just print it or\nhave some calculation on top of it. The memory should not spike apart from the newly allocated\nmemory.\n3. Start modifying the existing data in the parent/child process. As the size of the modified\ndata starts increasing, the size of memory consumption should also see a similar spike.\n\n#### Problems with Copy on Write operations\n- When an object changes, there are 2 writes instead of 1 write. One write is copied to the\nchild process's memory and the other is written to the parent. This has the ability to be\nsignficant in cases of high throughput systems\n- For every operation, an extra read operation has to be done to find the right block to\nread the updated data from\n- Varying implementations on different platforms like Windows, Unix and Linux.\n\n\n### Redirect on write (RoW)\nIn RoW systems, instead of copying data, there is a layer of pointers which can be replaced\nto point to the snapshot or the underlying data block. The snapshot system keeps track of the locations of\nall blocks that make up a snapshot. In other words, it maintains a list of pointers and knows where each\npointer's corresponding block is stored. When a process requests access to a snapshot, it uses these pointers\nto retrieve the blocks from their original locations.\nChanges to blocks, which result in them being replaced and referenced by new pointers, have no\nimpact on the snapshot process. In a redirect-on-write system, reading a snapshot incurs no computational overhead.\n\nWhen modifying a protected block, the redirect-on-write approach requires only one-third of the I/O operations\ncompared to other methods, and it does not add any extra computational cost when reading a snapshot. As a result,\ncopy-on-write systems can significantly affect the performance of the protected entity. The more snapshots that\nare created and the longer they are retained, the greater the impact on performance. This is why copy-on-write\nsnapshots are typically used as temporary backups—they are created, backed up, and promptly deleted.\n\nIn contrast, redirect-on-write snapshots are often generated every hour or even every few minutes and can be\nretained for days or even months, only being deleted when storage capacity becomes a concern. The longer a\nsnapshot is kept, the more storage is needed to maintain previous versions of modified blocks.\n\n### References\n\n* [Portworx Redirect on write](https://www.youtube.com/watch?v=_Bw9jgULnm8&t=4s)\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**",
            "url": "https://gauravsarma.com/posts/2025-01-27_Snapshotting-in-a-high-throughput-shared-nothing-database",
            "title": "Snapshotting in a high throughput shared nothing database",
            "summary": ". [Snapshotting in a High-Throughput Shared-Nothing Database](snapshotting-in-a-high-throughput-shared-nothing-database-cover...",
            "date_modified": "2025-01-27T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2025-01-06_Why-I-switched-from-Medium",
            "content_html": "\n![Why I Switched from Medium](why-i-switched-from-medium-cover.png)\n\nI have been writing on Medium for more than 5 years on various technology topics.\n\nI have finally decided to switch from Medium to a self hosted setup.\nThough the self hosted setup is not what I was looking for, I did end up hosting everything on a\nsmall Digitalocean droplet costing me less than $5 per month, which I also use for multiple other things.\n\n## Some reasons for moving away from Medium\n\n### Writing in the terminal\n\nAs a developer, I spend a lot of time in the terminal and the IDE and I want to be able to write directly from it.\nSwitching back and forth while I am experimenting with something while writing any post is not very seamless.\nI am fast in the terminal with a lot of keystrokes in my muscle memory.\nI believe writing should be as easy as your thoughts. \nRemoving any restriction or hindrance in writing is something that I am going to optimize for, given the amount of\neffort it does require to write a well crafted post.\nAs you read this, I am comfortably in my terminal, with my own theme, with everything of my own choice.\n\n### Support for markdown\n\nMedium doesn't have native support for markdown. There are multiple libraries or github actions which allow you\nto translate markdown to medium supported format. However, given the pace at which things change or become outdated,\nI don't want to depend on any other plugin.\n\n\n### Version control\n\nAll my blogs are in version control allowing me to revisit or structure them in whatever format I want. This\nallows me to switch to any other platform or format effortlessly, given the vast compatibility with the format.\nI can also effortless change multiple blogs or things at once and have automation on top of version control.\n\n\n### Custom features like series, searches and more\n\nOver the years, I have felt a need to have specific criteria for certain kind of posts which require a Series\nkind of a structure. In certain cases, I have wanted to search something across all my blogs instead of iterating\none blog at a time. All this becomes very easy with owning the pipeline.\n\nThere are many more smaller requirements which will come to light as the pipeline becomes more mature.\n\n## What did I move my blog to?\n\nI was hoping to find a static site generator from markdown files with a very light weight theme.\nI came across Hugo and it seemed to fit the bill perfectly.\n\nYou can select your own themes and modify them using Hugo.\n\nHugo is not the end goal, I have other modifications that I want to make to the site.\nHugo is a good starting point for my requirements at the time. I am getting comfortable\nwith the layout and am planning to make changes to the actual framework for my use cases.\n\nFor example, one thing I noticed was the Hugo's recommended searches are client based and not\nserver based. The fact that it's a static site does have some restrictions. I am planning to figure\nout a way to support it properly on the server.\n\n### Steps to migrate from Medium\n- You can download your content from Medium which exports all the content in HTML and other static assets like images\n- I used [medium2hugo](https://github.com/zuzuleinen/medium2hugo) to convert the HTML files into markdown. I did encounter some issues here which is listed below.\n- Setup Hugo and decide which theme you would want to use. I went with the [terminal-theme](https://github.com/panr/hugo-theme-terminal).\n- Copy the generated markdown content to the `content/posts` folder inside the Hugo folder.\n- Configure the Hugo site accordingly and run the `hugo` command which generates a static site in the `public` folder.\n- Configure nginx to route the requests as per your requirement.\n\n### Few issues with medium2hugo\nThis section lists the issues encountered with the `medium2hugo` converter.\n- Images were not rendered, it doesn't account for the right location for images. \n- All headers were converted to h3, which may have been a medium issue as well, but the converter\nshould have accounted for it, given that it's specific to Medium.\n- URLs automatically added the username as displayed in Medium, which is not something people would\ngenerally want from their personally hosted blog where the URL itself is an identifier.\n- Conversion of github gists and other blobs are not properly done. It shows up in a weird syntax or\nno styling at all in the generated markdown page.\n\nI understand these bugs may have been specific to my use case or I may have skimped through additional\nconfiguration which provides the above features. Nevertheless, I was hesitant to spend more time on\nthis step.\n\n\n### Other parts of the blog\nThe website is self hosted on Digitalocean, and everything is backed by Terraform with Nginx for TLS termination.\nThe CI/CD is currently taken care of by Github Actions.\n\nI have setup analytics for user behaviour using Google analytics and planning to integrate with a mailer service\nas well.\n\nThis means that if my site ever goes down, I can switch to a new machine and have the site up and running in\na matter of seconds.\n\nI want to introduce a few other sanity and liveness checks on the blog as well, but haven't yet decided\non the right checkmarks.\n\nHopefully, you find the new blog faster to load and helpful in your journey.\n\nVisit [gauravsarma.com](https://gauravsarma.com) and enjoy more tidbits in the future.\n\n## Conclusion\nI already feel the decision is worth it as I write this first blog from the terminal. I am finally using my\ndomain name I purchased and parked for more than 6 years.\n\nHope you liked reading the article.\n\nPlease reach out to me [here](https://gauravsarma.com/ping) for more ideas or improvements.\n\n\n\n\n\n",
            "url": "https://gauravsarma.com/posts/2025-01-06_Why-I-switched-from-Medium",
            "title": "Why and how I switched from Medium",
            "summary": ". [Why I Switched from Medium](why-i-switched-from-medium-cover...",
            "date_modified": "2025-01-06T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2024-11-06_MVCC-and-serializability-in-HyPer-564430884c9a",
            "content_html": "\n![MVCC and Serializability in HyPer](mvcc-and-serializability-in-hyper-cover.png)\n\nIn the realm of database management systems, the ability to handle concurrent transactions efficiently is paramount. As applications demand higher performance and seamless user experiences, traditional locking mechanisms often fall short, leading to bottlenecks and reduced throughput. Enter Multi-Version Concurrency Control (MVCC) — a technique that allows multiple transactions to access data concurrently without interference.\n\nBefore going deeper into MVCC, let’s cover Two Phase Locking and why MVCC is a better choice to maintain concurrent transactions.\n\n#### Two phase locking\n\nTwo-Phase Locking (2PL) is a widely used concurrency control protocol in database management systems that ensures conflict-serializability. It operates in two distinct phases: the growing phase and the shrinking phase. Here’s a detailed explanation of how 2PL works, its types, and its implications in transaction processing.\n\n**Growing Phase**\n\n*   In this phase, a transaction can acquire locks but cannot release any locks. The transaction continues to request and obtain the necessary locks on data items until it reaches a point known as the lock point, where it has acquired all the locks it needs for its operations.\n*   The primary goal during this phase is to ensure that the transaction has exclusive access to the data it needs to read or write.\n\n**Shrinking Phase**\n\n*   Once the transaction reaches its lock point and completes its operations, it enters the shrinking phase. In this phase, the transaction can release locks but cannot acquire any new locks.\n*   This ensures that once a transaction starts releasing locks, it cannot interfere with other transactions by acquiring new locks.\n\nSome common issues of 2PL are as follows:\n\n*   Deadlocks — A deadlock occurs when two or more transactions are waiting for each other to release locks, creating a cycle of dependencies that prevents any of them from proceeding.\n*   Cascading Rollbacks — This occurs when one transaction fails and rolls back, causing other dependent transactions to also roll back\n*   Loss in Serializability — While 2PL ensures serializability, it can limit concurrency because transactions must wait for locks to be released before they can proceed. This waiting can lead to lower throughput in high-concurrency environments.\n\nAs we understand the limitations of 2PL, let’s investigate on how MVCC can solve these problems.\n\n### HyPer’s MVCC Implementation\n\nIn this post, we will mainly look at the MVCC patterns and techniques used in `HyPer` , which has inspired different concepts for other popular databases like DuckDB.\n\nMVCC is the process of performing concurrent operations by maintaining different versions of the data in transaction linked buffers.\n\n#### Snapshot Isolation\n\nTo maintain Snapshot Isolation, each transaction operates on a snapshot of the database at a specific point in time. This prevents read-write conflicts and allows concurrent reads without blocking writes. However, while SI improves concurrency, it does not guarantee serializability — the highest level of isolation between transactions.\n\n#### Serializability\n\nSerializability is a critical concept in database management systems (DBMS) that ensures the correctness and consistency of concurrent transactions. It guarantees that the outcome of executing multiple transactions concurrently is equivalent to some serial execution of those transactions, meaning they would produce the same results as if executed one after the other.\n\n#### Undo and Redo buffers\n\nLet’s take the following example where we have 2 tables, Table 1 and Table 2 with their respective attributes in their initial state. The first column in both the tables are unique identifiers. In the diagram, we have defined the list of transactions that need to carry out. The format of the transactions is `<timestamp> | [<table_name>:<attribute_identifier>:<attribute_value>]` . So TS1 | T1:A:32 | T1:B:22 | T2:J:43 stands for changes to attribute `A` and `B` belonging to `Table 1` and attribute `J` belonging to `Table 2` at a timestamp of `TS1` .\n\n![](/img/mvcc_1.png)\n\nAs we start performing the transactions, the attributes are changed in place and the version delta is stored in the **_Undo buffer_** of the transaction along with the timestamp at which the transaction was received. The Undo buffer is the representation of the attribute before the transaction operation was performed on the attribute. There are different formats of timestamps to differentiate between a version data change and a committed timestamp, which we will explain below.\n\nIn case the transaction has to be rolled back, the data in the Undo buffer is replaced in the attribute’s value. This results in a performance gain compared to other techniques as other techniques may result in a larger table lock in cases of transactions and rollbacks, whereas in this scenario, the scope of change is limited to only the changed attributes.\n\nIn cases of a SCAN or read operation during an ongoing transaction, the timestamp of the read operation is noted and the version matching the data before the timestamp is sent back to the user.\n\n#### Transaction Validation\n\nWrite-Write transactions are avoided entirely by looking at the version change of an attribute. In cases of conflicts for the same attribute, the write operation is delayed till the previous transaction has been validated and committed.\n\nOnce the transaction has made the required validations changes, the changes need to be committed. The validations needed to be done are again scoped to the changes in the transaction and doesn’t affect other attributes. In order to ensure a minimal scope of change, it uses Precision locking. A predicate tree is built for the different reads of the transaction. The read operations of a transaction are converted into predicates and is checked against the changed object’s undo buffers. If there are no overlaps of the predicates and the object’s unfo buffers, the transaction can proceed.\n\nIn order to avoid the read operations being affected by other recently committed transactions, which have been committed during the start time and commit time of the current transaction, the predicates of the ongoing transaction are cross checked with the before and after values of the other transactions as well. If there are no conflicts, then the transaction is deemed to be validated.\n\nThe changes are written to to a Redo Log which can then be used to rollback a transaction if required.\n\n#### Garbage Collection\n\nAs we have seen in the example above, the timestamp of the operations are used to determine which version of the change is to be used for a read or write operation. Depending on the latest committed timestamp, any buffers before that timestamp is garbage collected. Since the scope of the transactions are quite small, the garbage collection utilises minimal resources to removed the unreferenced and invalid versions.\n\n#### Precision Locking\n\nTo address the limitations of traditional MVCC implementations, researchers have introduced Precision Locking — an adaptation that enhances serializability validation without compromising performance. This technique focuses on validating that the extensional writes (actual changes) of recently committed transactions do not intersect with the intensional read predicates of a committing transaction.\n\nLet’s consider a simplified example involving a bank database with a table named `Accounts`, which tracks customer account balances.Database Table: `Accounts`\n\n![](/img/mvcc_2.png)\n\n1.  Transaction T1: A customer wants to transfer $200 from Account 1 to Account 2.\n2.  Transaction T2: Another customer wants to check if any accounts have a balance greater than $1200 and apply a bonus if they do.\n\n**Transaction T1 (Transfer)**\n\n*   T1 reads the balance of Account 1 and Account 2.\n*   Before modifying the balances, T1 locks both accounts using precision locking.\n*   It checks that no other transactions are modifying these accounts.\n*   After confirming, it deducts $200 from Account 1 and adds it to Account 2.\n\n**Locked Data**\n\n*   Locks are applied precisely to Accounts 1 and 2, ensuring that only these records are locked for the duration of the transaction.\n\n**Transaction T2 (Bonus Check)**\n\n*   T2 queries for accounts with balances greater than $1200.\n*   Using precision locking, T2 locks only the relevant accounts that meet this criterion (in this case, Accounts 2 and 3).\n*   T2 reads the balances and finds that Account 2 now has $1700 (after T1’s transfer) and Account 3 has $2000.\n*   T2 applies a bonus to both accounts.\n\n**How Precision Locking Works**\n\n*   Locking Mechanism: When T1 locks Accounts 1 and 2, it ensures that no other transaction can modify these records until T1 completes. However, other transactions can still read these records if they are not in conflict with the locked state.\n*   Conflict Detection: If another transaction (say T3) attempts to modify Account 1 while T1 is still executing, precision locking will prevent T3 from proceeding until T1 has completed. Conversely, if T3 is only reading Account 1, it can do so without being blocked.\n*   Reduced Lock Scope: Unlike traditional locking methods that might lock an entire table or broader set of records, precision locking focuses only on the necessary tuples (records) required for maintaining consistency. This allows for higher concurrency as more transactions can operate simultaneously without interference.\n\n#### Synopses-Based Approach\n\nIn addition to Precision Locking, HyPer employs a synopses-based approach using Versioned Positions to enhance scan performance for read-heavy workloads. By maintaining summaries of versioned record positions, HyPer can efficiently determine which data ranges to scan during analytical queries.\n\n**Few benefits of Synopses based approach**\n\n*   High Scan Performance: By leveraging synopses, HyPer retains the high scan performance typical of single-version systems while still supporting concurrent transactions.\n*   Optimized Resource Utilization: The combination of positional delta trees and synopses allows for efficient memory usage and faster query processing times.\n\nTo illustrate how the synopses-based approach works in Multi-Version Concurrency Control (MVCC), let’s consider a simplified example involving a database table that stores information about products in an e-commerce application. We will walk through a scenario involving multiple transactions, focusing on how the synopses-based approach enhances read-heavy operations.\n\nLet’s take a small example here.\n\nDatabase Table: `Products`\n\n![](/img/mvcc_3.png)\n\nAssume we have three products in our database, each with an initial version (Version 1).\n\n**Transaction A: Update Product Price**\n\nTransaction A starts and updates the price of “Widget A” from $10.00 to $12.00.\n\n1.  In-Place Update: The system updates the price in place but also creates a before-image delta in an undo buffer to maintain the previous version.\n2.  New Version Creation: The new version of “Widget A” is created, and the database now looks like this:\n\n![](/img/mvcc_4.png)\n\n**Transaction B: Read Products**\n\nWhile Transaction A is still ongoing, Transaction B starts and wants to read the prices of all products.\n\n1.  Snapshot Isolation: Transaction B sees the database as it was at the start of its execution, which means it reads the original version of “Widget A” (Version 1) priced at $10.00, along with the current versions of “Widget B” and “Widget C”.\n2.  Using Synopses-Based Approach:\n\n*   The synopses-based approach maintains metadata (synopses) about the positions of versions. For example, it might keep track of which versions are valid for each product based on their update timestamps.\n*   When Transaction B queries for all products, it uses this metadata to quickly access the relevant records without scanning through all versions unnecessarily.\n\n**Result of Transaction B**\n\nTransaction B retrieves:\n\n*   Widget A: $10.00 (Version 1)\n*   Widget B: $15.00 (Version 1)\n*   Widget C: $20.00 (Version 1)\n\n**Transaction A Commits**\n\nAfter completing its operations, Transaction A commits its changes, making Version 2 of “Widget A” permanent.\n\n**Garbage Collection**\n\nThe system now performs garbage collection:\n\n*   It retains Version 2 of “Widget A” since it is the latest committed version.\n*   It can mark Version 1 for removal if no other transactions are accessing it.\n\n#### References\n\n*   [https://15721.courses.cs.cmu.edu/spring2019/papers/04-mvcc2/p677-neumann.pdf](https://15721.courses.cs.cmu.edu/spring2019/papers/04-mvcc2/p677-neumann.pdf)\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n",
            "url": "https://gauravsarma.com/posts/2024-11-06_MVCC-and-serializability-in-HyPer-564430884c9a",
            "title": "MVCC and serializability in HyPer",
            "summary": ". [MVCC and Serializability in HyPer](mvcc-and-serializability-in-hyper-cover...",
            "date_modified": "2024-11-06T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2024-08-27_Comparison-between-Redis-and-DragonflyDB-s-data-stores-e9ecba1ef84c",
            "content_html": "\n![Redis vs DragonflyDB Data Store Comparison](redis-vs-dragonflydbs-data-stores-cover.png)\n\nIn this post, we will understand how Redis and DragonflyDB store the data in memory. Both of them are in-memory datastores. Redis uses a single threaded architecture which means there are no locks or synchronisation techniques required. DragonflyDB uses a multi-threaded architecture with a shared-nothing data architecture. This allows DragonflyDB to use all the cores of a system thus, allowing it to be more performant than Redis in real life use cases.\n\n### A few Primers\n\nA quick primer on some core concepts used in both Redis and DragonflyDB.\n\n#### Separate Chaining Scheme\n\nSeparate chaining is a collision technique where each key points to a linked list represented by the same hash value. So in cases where there are multiple keys pointing to the same hash value, the keys are inserted in the linked list.\n\nWhile fetching data from the hashtable with a separate chaining scheme, the key is hashed to find the linked list where the key is stored. Once the linked list is identified, the linked list has to be traversed till the key is found.\n\nThe worst complexity to fetch a key is O(n) from a store of n elements when all the keys point to a single hashed value.\n\n#### Open Addressing Scheme\n\nOpen addressing is a collision technique where the data is stored in the next free memory space based on an offset function. The offset functions are of different types, for eg: linear probing, quadratic probing, etc.\n\nIn the case of linear probing, when a collision occurs, the data is entered into the next address slot which is free. It keeps trying to insert the data in the next slot till the table is full. In the case of quadratic probing, when a collision occurs, the next address space is defined based on the `i^2th` slot in the `i th` iteration if the given hash value x collides in the hash table.\n\nIn few examples of linear probing, the worst time complexity to fetch a key can again amount to O(n) and there are different clustering problems faced by an Open addressing scheme as well.\n\n#### Extendible Hashing\n\nExtendible hashing is the process of using trie based key lookups to bucket pointers from the directory. This directory is flexible in size, allowing it to expand or contract based on the needs of the dataset. Each entry in the directory points to a bucket, which contains a fixed number of data entries. These buckets are where the actual data is stored. Extendible hashing uses a hash function that produces a bit string of sufficient length. However, only a portion of these bits is initially used, allowing for future expansion.\n\nThe directory is indexed by the first ‘i’ bits of the hash value, where ‘i’ is called the global depth. Each bucket has a local depth ‘d’, where d ≤ i. All entries in a bucket share the same ‘d’ bit prefix. The global depth refers to the size of the directory and the local depth refers to the key size of the bucket that it was previously mapped out to.\n\nWhen inserting data, if a bucket overflows, a new bucket can be introduced, and the data from the overflowed bucket is rehashed between the 2 buckets and the rehashed buckets are added to the directory list. Addressing the bucket overflow problem is scoped to the only the bucket that is full and the other buckets are not affected, thus allowing the data structure to grow efficiently and rapidly.\n\n### Redis\n\nEvery Redis DB contains a `dict` struct which holds the hash tables and other size related and hashing related information.\n\n```c\ntypedef struct dict {  \n    dictType *type;  \n    void *privdata;  \n    dictht ht[2];  \n    long rehashidx;  \n    unsigned long iterators;  \n} dict;  \n  \ntypedef struct dictht {  \n    dictEntry **table;  \n    unsigned long size;  \n    unsigned long sizemask;  \n    unsigned long used;  \n} dictht;  \n  \ntypedef struct dictEntry {  \n    void *key;  \n    union {  \n        void *val;  \n        uint64_t u64;  \n        int64_t s64;  \n        double d;  \n    } v;  \n    struct dictEntry *next;  \n} dictEntry;\n```\n\nThe above data structures can be represented diagrammatically in the following manner:\n\n![](/img/dragonfly_1.png)\n\nIn the `dict` struct, we see that there are 2 `dictht` entries. `dictht` represents the hashtable with separate chaining, where the data is actually stored. `dictht` contains a list of pointers to the `dictEntry` , which is equivalent to a bucket. When a request is received with a particular key, the key is hashed to point to a specific bucket. `dictEntry` is a linked list which contains a key-value pair and points to the next entry in the list.\n\nWhenever the hashtable `dictht` becomes full, Redis has to allocate more space to the hashtable and rehash the entire data set again. The `dict` struct has 2 `dictht` entries so that size of the `dictht` , which is not being used can be expanded, data can be copied over to the new `dictht` from the previous `dictht` and the data can be rehashed while bringing this up. This is an expensive operation which is conducted in a different manner in DragonflyDB’s dashtables that we will be covering later in this post.\n\n#### DragonflyDB\n\nDragonflyDB has a similar entry point as that of Redis where it stores an array of pointers for the request to be routed to for the hashed key. This layer is called the `Directory`.\n\n![](/img/dragonfly_2.png)\n\nHowever, instead of pointing to hashtables, the Directory points to `Segments` , which are fixed size hashtables, i.e they grow only to a fixed size. Whenever a Segment becomes full and there is no space in the immediate next Segment, a new Segment is added to the directory. The data from the existing Segment is rehashed over the existing Segment and the newly created Segment. Extendible hashing uses a trie based approach to hash the key to different Segments. Since it uses a trie, the blast radius of rehashing a Segment only falls to the existing Segment and the new Segment since other hashes won’t be affected. In such cases, Redis usually has to do full rehash of the existing data, which is an expensive operation.\n\nEach Segment consists of 60 buckets and each bucket contains 14 slots. This means that each Segment can hold a total of 840 items. Out of the 60 buckets, 56 buckets are regular buckets and 4 of them are stash buckets. Whenever there is a spillover from the regular buckets, the data is stored in the stash bucket.\n\n### Conclusion\n\nRedis has been the front runner for in-memory data stores for a long time. It’s great to see other competitors like DragonflyDB innovating on multiple aspects. I will try to bring more blogs on it’s shared nothing architecture, forkless saves, locking strategies, etc.\n\n#### References:\n\n*   [https://kousiknath.medium.com/a-little-internal-on-redis-key-value-storage-implementation-fdf96bac7453](https://kousiknath.medium.com/a-little-internal-on-redis-key-value-storage-implementation-fdf96bac7453)\n*   [https://github.com/dragonflydb/dragonfly/blob/main/docs/dashtable.md](https://github.com/dragonflydb/dragonfly/blob/main/docs/dashtable.md)\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n",
            "url": "https://gauravsarma.com/posts/2024-08-27_Comparison-between-Redis-and-DragonflyDB-s-data-stores-e9ecba1ef84c",
            "title": "Comparison between Redis and DragonflyDB’s data stores",
            "summary": ". [Redis vs DragonflyDB Data Store Comparison](redis-vs-dragonflydbs-data-stores-cover...",
            "date_modified": "2024-08-27T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2024-08-14_Optimising-Stripped-Locks-using-Golang-arrays-34b45ef4e975",
            "content_html": "\n![Optimising Striped Locks with Go Arrays](optimising-stripped-locks-using-golang-arrays-cover.png)\n\n### Problems with a global lock space\n\nI was recently working on an in-memory datastore which had a flat layout of the underlying data structure. The datastore contained key value pairs, similar to a hash. In the MVP phase of the implementation, to control concurrent access to the datastore operations, it took a global lock on the entire datastore.\n\nHaving a global lock on the datastore meant that every datastore operation had to wait for the read/write lock to be available. This slowed down all the operations because of the additional wait time. This also led to expensive queries affecting other queries since the inexpensive queries also had to wait for the lock to be freed.\n\nThis required us to divide up the lock space into smaller different buckets/stores so that the blast radius of queries are scoped to the buckets alone. Having a higher number of buckets will allow a higher number of concurrent requests. However, a very high number of buckets will also lead to a large lock space, which means that there would be more CPU cycles being spent on the finding the relevant lock for the bucket.\n\nA high level approach that we will use here:\n\n1.  Define the number of buckets, let’s call it N.\n2.  For every key access to the datastore, hash the key.\n3.  Using the hash, find the allocated bucket by calculating the modulo with N.\n4.  Find the lock object (sync.RWMutex{}) in the bucket\n5.  Invoke the appropriate Lock method.\n\nFrom first glance, it’s a simple approach where we can use a hashing function, for eg: FNV (Fowler-Noll-Vo) hashing algorithm, to convert the provided string to a hash key.\n\nOnce the hash is calculated, check in the Golang map if there is a lock object already defined in the value of the map. If there is, then return the lock object.\n\n### Arrays or Maps?\n\nAs we can see from the above example, using a Golang map would result in a very simple implementation. However, let’s discuss on if a map is the right datastructure to use. Map is a generic dynamically allocated datastructure without much control on how the data in the map is laid out in memory.\n\nLet’s run a simple benchmark using Maps and arrays to see how much time does it take to fetch the keys respectively. We will define different number of keys to benchmark the below code.\n\n*   Small — 10 keys\n*   Medium — 1000 keys\n*   Large — 1000000 keys\n\n// Benchmarking Map key fetch  \nfunc BenchmarkSmallMapFetch(b \\*testing.B) {  \n var (  \n  smallMap = make(map\\[int\\]uint32, 10)  \n )  \n  \n for i := 0; i < 10; i++ {  \n  smallMap\\[i\\] = uint32(i)  \n }  \n  \n for i := 0; i < b.N; i++ {  \n  for innerIdx := 0; innerIdx < 10; innerIdx++ {  \n   \\_ = smallMap\\[innerIdx\\]  \n  }  \n }  \n}  \n  \n// Benchmarking Array key fetch  \nfunc BenchmarkSmallArrayFetch(b \\*testing.B) {  \n var (  \n  smallArray \\[10\\]uint32  \n )  \n  \n for i := 0; i < 10; i++ {  \n  smallArray\\[i\\] = uint32(i)  \n }  \n  \n for i := 0; i < b.N; i++ {  \n  for innerIdx := 0; innerIdx < 10; innerIdx++ {  \n   \\_ = smallArray\\[innerIdx\\]  \n  }  \n }  \n}\n\nThe results of the benchmarks are as follows:\n\nBenchmarkSmallArrayFetch-10             264653072                4.508 ns/op  \nBenchmarkMediumArrayFetch-10             3629518               335.8 ns/op  \nBenchmarkLargeArrayFetch-10                 3643            324678 ns/op  \n  \nBenchmarkSmallMapFetch-10               23385183                49.75 ns/op  \nBenchmarkMediumMapFetch-10                241534              5011 ns/op  \nBenchmarkLargeMapFetch-10                     27          39847242 ns/op324678\n\nIf you compare the results, fetching data from maps is **at least 10 times** slower than arrays. As the number of data points increase, maps become more slower compared to arrays, in some cases, **even 100x slower**.\n\nThere are various reasons to why arrays would be faster than maps in common use cases. Arrays are contiguous blocks of memory in the stack whereas Maps undergo uneven memory allocation, often resulting in **NUMA** (non uniform memory access), usually on the heap. Given the sequential layout in an array, it’s easier to align with the **cache line**, thus resulting in faster reads. Maps are dynamically sized, which also leads to an overhead on Golang’s runtime for memory allocation and garbage collection.\n\n### Implementing stripped locks using Arrays\n\nWe introduce the following models to the implementation:\n\n*   **Lock**\n*   **LockStore**\n*   **LockHasher**\n\n Lock struct {  \n  mutex    \\*sync.RWMutex  \n  name     LockName  \n  refCount uint32  \n }  \n  \n LockStore struct {  \n  hash      \\[32\\]\\*Lock  \n  lockCount uint8  \n }  \n  \n LockHasher struct {  \n  ctx         context.Context  \n  concurrency uint32  \n  lockStores  \\[DefaultLockConcurrency\\]\\*LockStore  \n }\n\n\\`Lock\\` refers to the encapsulation of the actual mutex. It also stores the name of the lock and the \\`refCount\\` refers to the active references for the lock.\n\nLockStore stores the locks in an array with the lock name indexed on an array.\n\nLockHasher is the object which controls and stores the number of buckets/slots. LockHasher uses the keys to point the request to a specific LockStore.\n\nAfter that, we expose simple methods for operations on all the above objects.\n\n// ==================   LockHasher. ==================  \nfunc (lockHsh \\*LockHasher) GetHash(strKey string) (hashSlot uint32, err error) {  \n var (  \n  hashFn hash.Hash32  \n )  \n if strKey == \"\" {  \n  strKey = DefaultLockIdentifier  \n }  \n hashFn = fnv.New32a()  \n if \\_, err = hashFn.Write(\\[\\]byte(strKey)); err != nil {  \n  return  \n }  \n hashSlot = hashFn.Sum32() % lockHsh.concurrency  \n return  \n}  \n  \nfunc (lockHsh \\*LockHasher) GetLockStore(strKey string) (lockH \\*LockStore, err error) {  \n var (  \n  hashSlot uint32  \n )  \n if hashSlot, err = lockHsh.GetHash(strKey); err != nil {  \n  return  \n }  \n if lockH = lockHsh.lockStores\\[hashSlot\\]; lockH == nil {  \n  err = fmt.Errorf(\"lock store not found for %s\", strKey)  \n  return  \n }  \n return  \n}\n\nfunc (lockSt \\*LockStore) setup() (err error) {  \n availableLocks := \\[\\]LockName{  \n  LockOne,  \n  LockTwo,  \n }  \n for \\_, lockName := range availableLocks {  \n  if \\_, err = lockSt.AddLock(lockName); err != nil {  \n   return  \n  }  \n }  \n return  \n}  \n  \nfunc (lockSt \\*LockStore) AddLock(name LockName) (lock \\*Lock, err error) {  \n lock = &Lock{  \n  mutex:    &sync.RWMutex{},  \n  name:     name,  \n  refCount: 0,  \n }  \n if lockSt.hash\\[uint8(lock.name)\\] != nil {  \n  err = fmt.Errorf(\"slot already filled for %d\", lock.name)  \n  return  \n }  \n lockSt.hash\\[uint8(lock.name)\\] = lock  \n lockSt.lockCount++  \n return  \n}  \n  \nfunc (lockSt \\*LockStore) GetLock(name LockName) (lock \\*Lock, err error) {  \n if lock = lockSt.hash\\[uint8(name)\\]; lock == nil {  \n  err = fmt.Errorf(\"lock not found for %d\", name)  \n  return  \n }  \n return  \n}  \n  \nfunc (lockSt \\*LockStore) RemoveLock(name LockName) (err error) {  \n if lockSt.hash\\[uint8(name)\\] == nil {  \n  err = fmt.Errorf(\"lock not found for %d\", name)  \n  return  \n }  \n lockSt.hash\\[uint8(name)\\] = nil  \n lockSt.lockCount--  \n return  \n}  \n  \nfunc (lock \\*Lock) WLock() {  \n lock.mutex.Lock()  \n lock.refCount++  \n}  \n  \nfunc (lock \\*Lock) WUnlock() {  \n lock.mutex.Unlock()  \n lock.refCount--  \n}  \n  \nfunc (lock \\*Lock) RLock() {  \n lock.mutex.RLock()  \n lock.refCount++  \n}  \n  \nfunc (lock \\*Lock) RUnlock() {  \n lock.mutex.RLock()  \n lock.refCount--  \n}\n\nThe above example is a naive implementation of the first phase of trying to have stripped locks. We will go into more complex implementations in the subsequent articles.\n\n### Next steps\n\nThis is the first blog in a series of articles where we will cover the process of building an in-memory datastore starting from basics to complex memory optimised datastores. Follow along!!",
            "url": "https://gauravsarma.com/posts/2024-08-14_Optimising-Stripped-Locks-using-Golang-arrays-34b45ef4e975",
            "title": "Optimising Stripped Locks using Golang arrays",
            "summary": ". [Optimising Striped Locks with Go Arrays](optimising-stripped-locks-using-golang-arrays-cover...",
            "date_modified": "2024-08-14T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2024-03-16_Test-Images",
            "content_html": "\n# Testing Different Types of Images\n\n## Remote Image\nHere's a remote image from a URL:\n![Remote test image](https://picsum.photos/800/400)\n\n## Local Image\nFirst, let's create a local image directory and add a test image:\n![Local test image](test-image.jpg)\n\n## Testing Image Sizes\nAnother remote image with different dimensions:\n![Another remote image](https://picsum.photos/400/300) ",
            "url": "https://gauravsarma.com/posts/2024-03-16_Test-Images",
            "title": "Testing Images in Blog Posts",
            "summary": "Testing Different Types of Images Remote Image Here's a remote image from a URL: . [Remote test image](https://picsum...",
            "date_modified": "2024-03-16T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2024-02-18_Demystifying-MongoDB-write-operations-dbac459c9d26",
            "content_html": "\nIn this post, we will try to understand the different factors which control the write operations in MongoDB. We will try to tie in the common concepts like checkpointing, journaling, replication that we hear so often in the context of write operations. There are different default configurations across different Mongo versions, so it’s important to check the default configurations before you modify the configurations.\n\n### A brief introduction to WiredTiger(WT)\n\nWiredTiger(WT) has been the default storage engine for MongoDB since 3.2. When WT is started on a node, it takes up 50% of the total memory. So in a system with 16GB memory, WT will take up around 7GB of memory. WT utilises the memory for both read and write operations. WT takes up 50% of the memory as it offloads the optimization operations to the OS. The data is stored as uncompressed in WT cache whereas on the OS, it is highly compressed, often to a 1:10 ration compared to the WT cache data size.\n\nTo understand more about the WT cache status, you can run the below command in the Mongo shell\n\n```bash\ndb.serverStatus().wiredTiger.cache\n```\n\nThis gives the following response:\n\n```bash\ndb.serverStatus().wiredTiger.cache  \n{  \n\"application threads page read from disk to cache count\" : 9,  \n\"application threads page read from disk to cache time (usecs)\" : 17555,  \n\"application threads page write from cache to disk count\" : 1820,  \n\"application threads page write from cache to disk time (usecs)\" : 1052322,  \n\"bytes allocated for updates\" : 20043,  \n\"bytes belonging to page images in the cache\" : 46742,  \n\"bytes belonging to the history store table in the cache\" : 173,  \n\"bytes currently in the cache\" : 73044,  \n\"bytes dirty in the cache cumulative\" : 38638327,  \n\"bytes not belonging to page images in the cache\" : 26302,  \n\"bytes read into cache\" : 43280,  \n\"bytes written from cache\" : 20517382,  \n\"cache overflow score\" : 0,  \n\"checkpoint blocked page eviction\" : 0,  \n\"eviction calls to get a page\" : 5973,  \n\"eviction calls to get a page found queue empty\" : 4973,  \n\"eviction calls to get a page found queue empty after locking\" : 20,  \n\"eviction currently operating in aggressive mode\" : 0,  \n\"eviction empty score\" : 0,  \n\"internal pages split during eviction\" : 0,  \n\"leaf pages split during eviction\" : 0,  \n\"maximum bytes configured\" : 8053063680,  \n\"maximum page size at eviction\" : 376,  \n\"modified pages evicted\" : 902,  \n\"modified pages evicted by application threads\" : 0,  \n\"operations timed out waiting for space in cache\" : 0,  \n\"overflow pages read into cache\" : 0,  \n\"page split during eviction deepened the tree\" : 0,  \n\"page written requiring history store records\" : 0,  \n\"pages currently held in the cache\" : 24,  \n\"pages queued for eviction post lru sorting\" : 0,  \n\"pages queued for urgent eviction\" : 902,  \n\"pages queued for urgent eviction during walk\" : 0,  \n\"pages read into cache\" : 20,  \n\"pages read into cache after truncate\" : 902,  \n\"pages read into cache after truncate in prepare state\" : 0,  \n\"pages requested from the cache\" : 33134,  \n\"pages seen by eviction walk\" : 0,  \n\"pages seen by eviction walk that are already queued\" : 0,  \n\"pages walked for eviction\" : 0,  \n\"pages written from cache\" : 1822,  \n\"pages written requiring in-memory restoration\" : 0,  \n\"percentage overhead\" : 8,  \n\"tracked bytes belonging to internal pages in the cache\" : 5136,  \n\"tracked bytes belonging to leaf pages in the cache\" : 67908,  \n\"tracked dirty bytes in the cache\" : 493,  \n\"tracked dirty pages in the cache\" : 1,  \n\"unmodified pages evicted\" : 0  \n}\n```\n\n### Different stages of a Write operation\n\nWhen a write operation is received, it is written to the WT’s dirty page cache and the journal’s in-memory buffer.\n\n#### Journaling\n\nJournaling refers to the process of appending every write operation to a [write ahead log](https://medium.com/@hnasr/what-is-wal-write-ahead-log-a-deep-dive-a2bc4dc91170) to ensure data recovery if the process fails. Whenever the Mongo process restarts, it checks the last checkpoint and the journal. If there are items in the journal which have not been checkpointed, Mongo creates a new checkpoint and proceeds with the initialisation.\n\nWhen a write operation is received, it is stored in the in-memory buffer of the WAL. Flushing to the on-disk WAL happens every 100ms. The advantage of using journaling is that since it’s an append only log, the writes are faster to complete.\n\nBy default, journaling happens every 100 ms.\n\n### Checkpointing\n\nCheckpointing is the process of flushing the data from in-memory buffers to the disk. When Mongo starts, it can pick off from the latest checkpoint reliably and start its operations. Checkpointing results in addition of the data to Mongo’s internal B+ tree.\n\nBy default, the checkpointing is done every 60 seconds.\n\nThe checkpointing process ensures that the dirty data in the WT cache is flushed to disk every 60 seconds or whenever the dirty cache ratio goes over a certain threshold.\n\n#### Configuring the Checkpointing process\n\nThere are different types of threads, called `eviction_threads` which are solely responsible for flushing the dirty pages to the data files. These threads usually run in the background, without affecting the application workflow. Having a higher number of eviction threads will flush the data faster, though it will also utilize more resources, thus making the application threads slower.\n\nWhen the dirty cache goes over a certain percentage of WT cache, the checkpointing process starts flushing the data to the data files. This is controlled via the `eviction_dirty_target` which is set to 5% by default. The checkpointing continues till the ratio is lower than 5% of the WT cache.\n\nWhen the dirty cache ratio is above 20%, the workload is also distributed amongst the application threads to clear the backlog, which in turn slows the application. This is configurable via `eviction_dirty_trigger` .\n\n#### Configuring block sizes in WiredTiger\n\nThe WT on-disk files are made up of blocks known as pages, which are then written to the disk. These blocks are often compressed before being written to the disk.\n\nWhen writing to Mongo, the pages are evicted from memory when the size of the pages crosses the maximum configured size. The pages are evicted, which then goes through a reconciliation process to convert the in-memory representation of the data to an on-disk representation.\n\nHigher the number of page evictions, more the cost of updating the block manager, initialising the reconciliation process, compressing the data and then writing it to the on-disk files. Depending on the size of the data being dealt with, this can be configured accordingly.\n\nThe size of the blocks are defined by `allocation_size` which can be edited by configuring the `storage.wiredTiger.collectionConfig.configString` before creating the collection.\n\n#### Changing WiredTiger’s cache size\n\nBy default, WT takes up 50% of the available memory on the system. The reason it takes up 50% is because it offloads multiple tasks to the kernel as well. The size of the cache can be increased or decreased depending on the type of load running on the system.\n\n### Conclusion\n\nBefore starting on the process to tune the database, it’s important to understand the patterns of the load the database is experiencing and to gather more performance metrics.\n\nEach configuration has to be changed with the understanding that there is a tradeoff in every configuration change, which must be predicted beforehand.\n\nThe above configurations can be applicable to most databases out there today. I wanted to take MongoDB as an example to provide a concrete example.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\n### References\n\n*   [https://source.wiredtiger.com/11.1.0/tune\\_page\\_size\\_and\\_comp.html](https://source.wiredtiger.com/11.1.0/tune_page_size_and_comp.html)\n",
            "url": "https://gauravsarma.com/posts/2024-02-18_Demystifying-MongoDB-write-operations-dbac459c9d26",
            "title": "Demystifying MongoDB write operations",
            "summary": "In this post, we will try to understand the different factors which control the write operations in MongoDB.  We will try to tie in the common concepts like checkpointing, journaling, replication that we hear so often in the context of write operations...",
            "date_modified": "2024-02-18T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2023-09-17_Measuring-cost-of-spawning-Goroutines-4b0dab6f5bf7",
            "content_html": "\n> **2026 update.** The original version of this post compared three concurrency shapes by timing fixed batches of 1M / 10M / 100M / 1B operations and concluded \"always pre-allocate.\" After re-running the experiment with Go's `testing.B` framework — and adding a variant that doesn't use a channel — the conclusion I drew was wrong, or at least imprecise. The code and numbers below are the corrected version. Original tables left at the bottom for reference. Source: [github.com/gsarmaonline/golang-measurements](https://github.com/gsarmaonline/golang-measurements).\n\nDevelopers who learn Go are taught that goroutines are a very cheap version of threads. The minimum stack is 2 kB as of Go 1.19, and the standard library leans on this — `net/http` spawns a goroutine per accepted connection without apology. So is there ever a reason to *not* spawn one per request?\n\n![](/img/goroutines_spawn_1.png)\n\nThe original version of this post said yes: pre-allocate a pool of goroutines and reuse them, because it's \"more than 2× faster.\" That number is real, but it's the answer to a different question than the one I was asking.\n\n### Why the original setup was misleading\n\nThe work unit looked like this:\n\n```go\nfunc calculateSum(outputCh chan int, a, b int) {\n    outputCh <- a + b\n}\n```\n\nThe function does an integer add and writes the result to a channel. The benchmark was: spawn N of these, time the wall clock. The problem: the channel send is a synchronization operation. Every iteration was paying for goroutine creation *and* a channel send + receive. When I compared \"spawn per request\" against \"pool of 10 workers,\" I was comparing two shapes that both used channels — so I never saw how much of the gap was channels and how much was goroutines.\n\nThe second problem was using fixed N for timing. `testing.B` auto-tunes the iteration count until the run is long enough to measure reliably, and reports `ns/op` directly — no spreadsheet arithmetic, no \"how do I divide 435,440ms by 1,000,000,000 again.\" It also lets you compare against a serial baseline trivially, which turns out to matter a lot here.\n\n### The corrected setup\n\nFive shapes, same `a + b` work unit, all driven by `testing.B`:\n\n| Shape | What it measures |\n|---|---|\n| `Serial` | Baseline. No goroutines. |\n| `SpawnPerRequestChan` | One goroutine per call, result returned via channel. *Original post's \"approach 1.\"* |\n| `SpawnPerRequestWG` | One goroutine per call, sync via `sync.WaitGroup` + `atomic.AddUint64`. *Channel bottleneck removed.* |\n| `PoolSharedChan_N` | Pre-allocated pool of N workers reading from one channel. *Original post's \"approach 2.\"* |\n| `PoolPerWorkerChan_N` | Pre-allocated pool, one channel per worker. *Original post's \"approach 3.\"* |\n\nFull code at [github.com/gsarmaonline/golang-measurements](https://github.com/gsarmaonline/golang-measurements). Run with `go test -bench=. -benchmem -benchtime=3s`.\n\n### The numbers\n\nApple M4, Go 1.24, `benchtime=3s`, median of three runs.\n\n| Shape | Pool size | ns/op | B/op | allocs/op |\n|---|---|---|---|---|\n| `Serial` | — | **0.23** | 0 | 0 |\n| `SpawnPerRequestChan` | — | 280 | 160 | 3 |\n| `SpawnPerRequestWG` | — | **131** | 56 | 2 |\n| `PoolSharedChan` | 10 (= GOMAXPROCS) | **96** | 0 | 0 |\n| `PoolSharedChan` | 100 | 193 | 0 | 0 |\n| `PoolSharedChan` | 1000 | 244 | 0 | 0 |\n| `PoolPerWorkerChan` | 100 | 74 | 0 | 0 |\n\nA few things jump out:\n\n1. **The actual work is 0.23 ns.** Everything above that is overhead. We are measuring scheduler and synchronization costs, not arithmetic.\n2. **Removing the channel from the no-pool case nearly halves the cost.** `SpawnPerRequestChan` is 280 ns; `SpawnPerRequestWG` is 131 ns. So in the original post, *more than half* of what I was calling \"the cost of spawning a goroutine\" was actually the cost of the channel send.\n3. **A 1000-worker pool (244 ns) is slower than no pool at all + atomics (131 ns).** Pre-allocation isn't a free win.\n4. **The right-sized pool wins on allocations.** `PoolSharedChan_10` is 0 B/op, 0 allocs/op — the workers and channel exist for the lifetime of the benchmark. The spawn-per-request shapes allocate 56–160 B *per iteration*. For GC-sensitive workloads that's often a bigger deal than the ns/op.\n\n### What's actually going on\n\nPre-allocation isn't one thing — it's two things bundled together:\n\n- It amortizes goroutine creation across many work items.\n- It *bounds* the number of goroutines that exist concurrently.\n\nThe second one is doing most of the work. A 1000-goroutine pool gives you the first benefit (no spawn per request) without the second, and you can see it in the table: at 1000 workers the pool is barely better than spawning per request, and worse than the no-pool WaitGroup variant. The Go scheduler is multiplexing 1000 runnable goroutines onto ~10 OS threads, and that bookkeeping costs more than just doing the work serially across 10 long-lived workers.\n\nThe real rule, then, is: **keep the concurrent goroutine count near `GOMAXPROCS` for CPU-bound work.** Pre-allocation is one way to do that. So is a bounded semaphore over spawn-per-request. So is dispatching N requests across N workers manually. The strategy doesn't matter much; the count does.\n\n### So why is fasthttp faster than net/http?\n\nThis is the question worth asking, because it's where the microbenchmark becomes practical. `net/http` does `go c.serve(...)` per accepted connection — exactly the pattern this post was nominally about. `fasthttp` uses a worker pool. So fasthttp must be winning because it pre-allocates goroutines, right?\n\nA little, but not mostly. In rough order of impact:\n\n1. **`fasthttp.RequestCtx` is `sync.Pool`-recycled.** Every request in `net/http` allocates a fresh `*http.Request`, `*http.Response`, headers map, parsed URL parts — 20+ heap objects. fasthttp hands one `*RequestCtx` back to a pool when the handler returns. This is the same effect you see in the benchmark above when `PoolSharedChan_10` reports 0 B/op while `SpawnPerRequestChan` reports 160 B/op — just scaled up to a full request lifecycle.\n2. **`[]byte` instead of `string` in the API.** `ctx.Path()`, `ctx.Method()`, header values all return slices that point into the read buffer. `net/http`'s string-returning API forces a copy on every access, because strings are immutable.\n3. **Hand-rolled header parser**, no `textproto.MIMEHeader` map allocation per request.\n4. **Worker pool for connections.** This is the part the benchmark in this post is actually about. It's worth something — call it ~150 ns + 2 kB stack per request you don't spawn — but it's the smallest of the four.\n\nIf you took `net/http` and swapped only the connection-handling for a worker pool, you'd recover maybe 10–15% of the gap to fasthttp. The other 85% is `sync.Pool` on everything reusable plus an API that doesn't force copies. The goroutine pool gets the headlines because it's visually obvious; the object pooling does the work.\n\n### What I'd say now\n\nThree things:\n\n1. **Measure with `testing.B`, not wall clocks on fixed N.** It auto-scales and gives you per-op numbers that are directly comparable across shapes. Five minutes saved per experiment, plus you can't accidentally publish a result that depended on what else your laptop was doing that afternoon.\n2. **When you compare two concurrency shapes, hold the synchronization primitive constant.** Otherwise you're measuring two things at once and you won't know which one moved.\n3. **Pre-allocation is the wrong frame; bounded concurrency is the right frame.** And once you've bounded the concurrency, the next-biggest lever is almost always pooling the *objects* the goroutines touch, not the goroutines themselves.\n\n---\n\n### Appendix: original numbers (Apple M1 Max Pro, 2023)\n\nKept for reference; the takeaways from these are partially correct (more workers eventually hurts) and partially misleading (pre-allocation framed as the cause of the win).\n\n**Spawn per request:** 1B → 435,440 ms · 100M → 43,319 ms · 10M → 4,232 ms · 1M → 444 ms\n\n**Pool of 10 (shared channel):** 1B → 168,123 ms · 100M → 16,644 ms · 10M → 1,738 ms · 1M → 167 ms\n\n**Pool of 100 (shared channel):** 1B → 253,336 ms · 100M → 25,689 ms · 10M → 2,545 ms · 1M → 273 ms\n\n**Pool of 1000 (shared channel):** 1B → 329,467 ms · 100M → 32,022 ms · 10M → 3,165 ms · 1M → 386 ms\n\n**Pool of 100 (per-worker channels):** 100M → 23,016 ms · 10M → 2,138 ms · 1M → 228 ms\n\n**Pool of 1000 (per-worker channels):** 100M → 27,504 ms · 10M → 2,920 ms · 1M → 310 ms\n",
            "url": "https://gauravsarma.com/posts/2023-09-17_Measuring-cost-of-spawning-Goroutines-4b0dab6f5bf7",
            "title": "Measuring cost of spawning Goroutines",
            "summary": "> 2026 update.  The original version of this post compared three concurrency shapes by timing fixed batches of 1M / 10M / 100M / 1B operations and concluded \"always pre-allocate...",
            "date_modified": "2023-09-17T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2023-06-30_Using-Python-AST-to-resolve-dependencies-c849bd184020",
            "content_html": "\nThis article covers how to resolve python dependencies using Python’s Abstract Syntax Trees (AST). There are different and maybe better ways to understand the scope of your python dependencies. This article tries to display how AST can be used for different types of scenarios.\n\n#### The use-case\n\nIn a complex python repository, there are different modules which all import objects from each other.\n\nAn increase in the number of cross module imports leads to multiple issues in a large codebase:\n\n*   Increase in time taken to run unit tests\n*   Increase in app startup time\n*   Cascading effects on upstream changes\n\nIn order to decouple these cross module imports, I am planning to pick objects of high contention and then start an effort to decouple those to another library of its own.\n\nThis functionality already comes built-in with most build tools like [https://www.pantsbuild.org/](https://www.pantsbuild.org/). However, this article tries to leverage ASTs to achieve the same goal.\n\n#### Solution\n\nThe solution proposed here for the cross module imports is to pick a high contention object/function, and then copy over all the dependencies of the module to a destination folder.\n\nThe input should be as follows:\n\n*   Provide the file path and the object name. The object can be of any format, i.e a class, function, variable, etc.\n*   Provide the folder path where the result of dependency resolver should construct the object and its dependencies, i.e the output folder\n\nThe solution operates based on the following steps:\n\n*   Read the file to check the imports already present in the file\n*   Copy the object text element to the provided input folder. The path inside the folder should be created dynamically. Ensure that the text element is added to the existing path and not overridden\n*   Get the text element of the object and convert it into tokens\n*   Match the tokens in the imported list and the tokens of the object. If there are cross module imports, then convert the imported object to a file path and object combination similar to the input\n*   Repeat the Processing steps again\n*   This should ensure all the dependencies are picked up\n\nKeeping in mind the above steps, we define the following elements with their required goals\n\n*   DependencyResolverManager\n*   FileVisitor\n*   ImportAnalyzer\n*   DependencyResolverIO\n\nBased on the above proposed steps, we define the `DependencyResolverManager` class which takes the object information, tokenizes the object code and figures out if any of the objects has been imported from another module. If it finds any matching imports, it recurses to that object and finds its imports. It continues the recursive loop till there are no more cross imports found.\n\n```python\nclass DependencyResolverManager:\n  def _resolve_matching_objects(self, object_code: str, imported_objects: dict[str, str]):\n    imported_objects_set: set[str] = set(imported_objects.keys())\n    tokenizer: \"Tokenizer\" = Tokenizer(code=object_code)\n    matching_objects: list[\"DependencyResolverManager\"] = []                                                                                                                                                                                   \n    for matched_object in imported_objects_set.intersection(tokenizer.tokenize()):\n        matched_module: str = imported_objects[matched_object]                                                                                                                                                                                 \n        file_path: str                                                                                                                                                                                                                         \n        try:                                                                                                                                                                                                                                   \n            file_path = ModuleHelper.get_module_file_from_name(matched_object, matched_module)                                                                                                                                                 \n            resolve_for_obj: str = f\"{file_path}::{matched_object}\"                                                                                                                                                                            \n            resolver: \"DependencyResolverManager\" = DependencyResolverManager(                                                                                                                                                                 \n                resolve_for_obj=resolve_for_obj, resolve_count=self.resolve_count                                                                                                                                                              \n            )                                                                                                                                                                                                                                  \n            resolver.process_obj()                                                                                                                                                                                                             \n        except Exception as e:                                                                                                                                                                                                                 \n            print(e, matched_object, matched_module, traceback.print_exc())                                                                                                                                                                    \n            continue\n```\n\nThe `Tokenizer` is a wrapper class on the `tokenize` module\n\n```python\nclass Tokenizer:\n    def __init__(self, *args, **kwargs):\n\n         self.code: str = kwargs[\"code\"]                                                                                                                                                                                                                              \n    def tokenize(self) -> set[str]:                                                                                                                                                                                                                                  \n        tokens = tokenize(BytesIO(self.code.encode('utf-8')).readline)                                                                                                                                                                                               \n        token_set = set()                                                                                                                                                                                                                                            \n        for toknum, tokval, _, _, _ in tokens:                                                                                                                                                                                                                       \n            if toknum == NAME:                                                                                                                                                                                                                                       \n                token_set.add(tokval)                                                                                                                                                                                                                                \n        return token_setReferences\n```\n\nNow we enter the most interesting part of the article. The AST module.\n\nWe use the AST module here to find the imported objects, and to find definitions of classes, functions, variables, etc anything which can be imported.\n\nHere is the `ImportAnalyzer` class which inherits from the `ast.nodeVisitor` class.\n\n```python\nclass ImportAnalyzer(ast.NodeVisitor):\n        def __init__(self, *args, **kwargs):\n          self.imported_modules: set[str] = set()\n          self.module_name: str = kwargs[\"module_name\"]\n          self.object_name: str = kwargs.get(\"object_name\", \"\")                                                                                                                                                                                              \n          self.defined_objects: dict[str, str] = {}                                                                                                                                                                                                          \n                                                                                                                                                                                                                                                             \n      def _filter_modules(self, node_module: str) -> bool:                                                                                                                                                                                                   \n          return False                                                                                                                                                                                                                                       \n                                                                                                                                                                                                                                                             \n      def visit_ImportFrom(self, node) -> None:                                                                                                                                                                                                              \n          if node == None or node.module == None:\n              return                                                                                                                                                                                                                                         \n          if self._filter_modules(node.module):                                                                                                                                                                                                              \n              return                                                                                                                                                                                                                                         \n                                                                                                                                                                                                                                                             \n          self.imported_modules.add(node.module)                                                                                                                                                                                                             \n          for node_name in node.names:                                                                                                                                                                                                                       \n              self.object_module_mapping[node_name.name] = node.module                                                                                                                                                                                       \n                                                                                                                                                                                                                                                             \n      def visit_FunctionDef(self, node) -> None:                                                                                                                                                                                                             \n          self.defined_objects[node.name] = node                                                                                                                                                                                                             \n                                                                                                                                                                                                                                                             \n      def visit_ClassDef(self, node) -> None:                                                                                                                                                                                                                \n          self.defined_objects[node.name] = node                                                                                                                                                                                                             \n                                                                                                                                                                                                                                                             \n      def visit_Assign(self, node) -> None:                                                                                                                                                                                                                  \n          for assigned_obj in node.targets:                                                                                                                                                                                                                  \n              curr_value: str = \"\"                                                                                                                                                                                                                           \n              if \"id\" not in assigned_obj.__dict__.keys():                                                                                                                                                                                                   \n                  curr_value = assigned_obj.value.id                                                                                                                                                                                                         \n              else:                                                                                                                                                                                                                                          \n                  curr_value = assigned_obj.id                                                                                                                                                                                                               \n                                                                                                                                                                                                                                                             \n              self.defined_objects[curr_value] = node                                                                                                                                                                                                        \n                                                                                                                                                                                                                                                             \n      def visit_AnnAssign(self, node) -> None:                                                                                                                                                                                                               \n          curr_value: str = \"\"                                                                                                                                                                                                                               \n          if \"id\" not in node.target.__dict__.keys():                                                                                                                                                                                                        \n              curr_value = node.target.value.id                                                                                                                                                                                                              \n          else:                                                                                                                                                                                                                                              \n              curr_value = node.target.id                                                                                                                                                                                                                    \n                                                                                                                                                                                                                                                             \n          self.defined_objects[curr_value] = node\n\n```\n\nThe main methods to look at there are the methods prefixed with `visit_*` string. The AST module calls has these methods as callbacks whenever the specified type of object is encountered. The list of different object types can be found [here](https://docs.python.org/3/library/ast.html).\n\nThe last step is taken care of by the `DependencyResolverIO` class which creates a similar folder structure as in the original repository.\n\nThe code for the above solution can be found [here](https://github.com/gsarmaonline/py-dep-resolver).\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\n#### References\n\n*   [https://docs.python.org/3/library/ast.html](https://docs.python.org/3/library/ast.html)\n*   [https://github.com/gsarmaonline/py-dep-resolver](https://github.com/gsarmaonline/py-dep-resolver)\n",
            "url": "https://gauravsarma.com/posts/2023-06-30_Using-Python-AST-to-resolve-dependencies-c849bd184020",
            "title": "Using Python AST to resolve dependencies",
            "summary": "This article covers how to resolve python dependencies using Python’s Abstract Syntax Trees (AST).  There are different and maybe better ways to understand the scope of your python dependencies...",
            "date_modified": "2023-06-30T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2023-06-02_Building-your-own-Kubernetes-webhook-575bf9712654",
            "content_html": "\nThis blog is the 2nd part of a blog post on how to write custom logic for your kubernetes objects. The first post can be found here [https://gsarmaonline.medium.com/kubernetes-operators-using-kubebuilder-7db99559120c](https://gsarmaonline.medium.com/kubernetes-operators-using-kubebuilder-7db99559120c) which covers the approach to building your own kubernetes controller using Golang.\n\nIn this post, we will cover the best approach to write your own kubernetes webhooks. Webhooks are usually interceptors which can be used in 2 ways here:\n\n*   Mutating — Changing the payload before the custom object is created\n*   Validating — Ensuring the payload is proper, if it’s not, the CRD can be prevent from being created\n\nIn this post, we will be covering only Validating Webhooks.\n\nThe logic for the webhook will be very simple. If the `task` is set to True by default, we will reject the request.\n\nUse kubebuilder, let’s scaffold the webhook logic.\n\nkubebuilder create webhook --group todo --version v1 --kind TodoList  --programmatic-validation\n\nModify the required files by uncommenting the parts for enabling the validating webhooks.\n\n**_config/crd/kustomization.yaml_**\n\n```yaml\nresources:  \n\\- bases/todo.sarmag.co\\_todolists.yaml  \n#+kubebuilder:scaffold:crdkustomizeresource  \n  \npatchesStrategicMerge:  \n\\# \\[WEBHOOK\\] To enable webhook, uncomment all the sections with \\[WEBHOOK\\] prefix.  \n\\# patches here are for enabling the conversion webhook for each CRD  \n\\- patches/webhook\\_in\\_todolists.yaml  \n#+kubebuilder:scaffold:crdkustomizewebhookpatch  \n  \n\\# \\[CERTMANAGER\\] To enable cert-manager, uncomment all the sections with \\[CERTMANAGER\\] prefix.  \n\\# patches here are for enabling the CA injection for each CRD  \n\\- patches/cainjection\\_in\\_todolists.yaml  \n#+kubebuilder:scaffold:crdkustomizecainjectionpatch  \n  \n\\# the following config is for teaching kustomize how to do kustomization for CRDs.  \nconfigurations:  \n\\- kustomizeconfig.yaml\n\n**_config/default/kustomization.yaml_**\n\nnamespace: custom-k8-controller-system  \n  \nnamePrefix: custom-k8-controller-  \n  \nbases:  \n\\- ../crd  \n\\- ../rbac  \n\\- ../manager  \n\\# \\[WEBHOOK\\] To enable webhook, uncomment all the sections with \\[WEBHOOK\\] prefix including the one in  \n\\# crd/kustomization.yaml  \n\\- ../webhook  \n\\# \\[CERTMANAGER\\] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. 'WEBHOOK' components are required.  \n\\- ../certmanager  \n\\# \\[PROMETHEUS\\] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'.  \n#- ../prometheus  \n  \npatchesStrategicMerge:  \n\\# Protect the /metrics endpoint by putting it behind auth.  \n\\# If you want your controller-manager to expose the /metrics  \n\\# endpoint w/o any authn/z, please comment the following line.  \n\\- manager\\_auth\\_proxy\\_patch.yaml  \n  \n\\# Mount the controller config file for loading manager configurations  \n\\# through a ComponentConfig type  \n#- manager\\_config\\_patch.yaml  \n  \n\\# \\[WEBHOOK\\] To enable webhook, uncomment all the sections with \\[WEBHOOK\\] prefix including the one in  \n\\# crd/kustomization.yaml  \n\\- manager\\_webhook\\_patch.yaml  \n  \n\\# \\[CERTMANAGER\\] To enable cert-manager, uncomment all sections with 'CERTMANAGER'.  \n\\# Uncomment 'CERTMANAGER' sections in crd/kustomization.yaml to enable the CA injection in the admission webhooks.  \n\\# 'CERTMANAGER' needs to be enabled to use ca injection  \n\\- webhookcainjection\\_patch.yaml  \n  \n\\# the following config is for teaching kustomize how to do var substitution  \nvars:  \n\\# \\[CERTMANAGER\\] To enable cert-manager, uncomment all sections with 'CERTMANAGER' prefix.  \n\\- name: CERTIFICATE\\_NAMESPACE \\# namespace of the certificate CR  \n  objref:  \n    kind: Certificate  \n    group: cert-manager.io  \n    version: v1  \n    name: serving-cert \\# this name should match the one in certificate.yaml  \n  fieldref:  \n    fieldpath: metadata.namespace  \n\\- name: CERTIFICATE\\_NAME  \n  objref:  \n    kind: Certificate  \n    group: cert-manager.io  \n    version: v1  \n    name: serving-cert \\# this name should match the one in certificate.yaml  \n\\- name: SERVICE\\_NAMESPACE \\# namespace of the service  \n  objref:  \n    kind: Service  \n    version: v1  \n    name: webhook-service  \n  fieldref:  \n    fieldpath: metadata.namespace  \n\\- name: SERVICE\\_NAME  \n  objref:  \n    kind: Service  \n    version: v1  \n    name: webhook-service\n```\n\nModify the `ValidateCreate` method in `todolist_webhook.go` code\n\n```go\n// ValidateCreate implements webhook.Validator so a webhook will be registered for the type  \nfunc (r \\*TodoList) ValidateCreate() error {  \n todolistlog.Info(\"validate create\", \"name\", r.Name)  \n if r.Spec.Task == \"\" {  \n  err := errors.New(\"task cannot be empty\")  \n  return err  \n }  \n return nil  \n}  \n```\n  \n\nOnce this is done, build the docker image, load it on the k8s cluster and then deploy it.\n\n```bash\nmake docker-build IMG=gsarma/k8s-operators:v1  \nkind load docker-image gsarma/k8s-operators:v1 --name k8s-operators  \nmake deploy IMG=gsarma/k8s-operators:v1\n```\n\nAll that’s left is to create a todolist object with an empty `task` and it should fail the request.\n\napiVersion: todo.sarmag.co/v1  \nkind: TodoList  \nmetadata:  \n  name: jack   \n  namespace: operator\\-namespace  \nspec:  \n  task: \"\"\n\nEnjoy reading and implementing your own custom kubernetes webhook!!\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\n#### References\n\n*   [https://github.com/gsarmaonline/todolist-k8s-operator](https://github.com/gsarmaonline/todolist-k8s-operator)\n*   [https://gsarmaonline.medium.com/kubernetes-operators-using-kubebuilder-7db99559120c](https://gsarmaonline.medium.com/kubernetes-operators-using-kubebuilder-7db99559120c)\n*   [https://book.kubebuilder.io/reference/admission-webhook.html](https://book.kubebuilder.io/reference/admission-webhook.html)\n*   [https://betterprogramming.pub/writing-custom-kubernetes-controller-and-webhooks-141230820e9](https://betterprogramming.pub/writing-custom-kubernetes-controller-and-webhooks-141230820e9)\n",
            "url": "https://gauravsarma.com/posts/2023-06-02_Building-your-own-Kubernetes-webhook-575bf9712654",
            "title": "Building your own Kubernetes webhook",
            "summary": "This blog is the 2nd part of a blog post on how to write custom logic for your kubernetes objects.  The first post can be found here [https://gsarmaonline...",
            "date_modified": "2023-06-02T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2023-05-27_Kubernetes-operators-using-Kubebuilder-7db99559120c",
            "content_html": "\nIn this post, we will be going over the fastest no-frills approach to getting your operator off the ground using `kubebuilder`. The post assumes knowledge of the following:\n\n*   Kubernetes and how it works\n*   Kubernetes custom resource definitions\n*   Kubernetes Operators and reconciliation loops\n*   Setting up a local cluster, I use `kind` for my k8s orchestration needs\n*   Golang\n\nThe task is to create an operator that operates on a Kubernetes CRD `TodoList`. It listens on the pods available in the system. If there are any pods with the same name as the `TodoList`, it marks the status as True.\n\nThis operator only operates on the `operator-namespace` namespace.\n\nThe first step is to install `kubebuilder` using the following command:\n\ncurl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH) && chmod +x kubebuilder && mv kubebuilder /usr/local/bin/\n\nCheck if it’s installed properly by running `kubebuilder version` .\n\nInstall the kubernetes cluster using `kind`\n\nkind create cluster --name operators\n\nSetup the initial project, APIs, groups and kinds\n\nkubebuilder init --domain sarmag.co --repo sarmag.co/todo  \nkubebuilder create api --group todo --version v1 --kind TodoList\n\nThis will create the required scaffolded project, group, API and kind.\n\nOnce that’s done, there are 2 main files that we have to update\n\n*   api/v1/todolist\\_types.go\n*   internal/controller/todolist\\_controller.go\n\nIn the `todolist_types.go` file, update the required specification and status of the CRD.\n\n```go\ntype TodoListSpec struct {  \n Task string \\`json:\"task,omitempty\"\\`  \n}  \n  \ntype TodoListStatus struct {  \n IsCompleted bool \\`json:\"status,omitempty\"\\`  \n}\n```\n\nYou can refer to the file here [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/api/v1/todolist\\_types.go#L24-L30](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/api/v1/todolist_types.go#L24-L30).\n\nIn the `todolist_controller.go` file, update the reconcilation logic with the following:\n\n```go\nfunc (r \\*TodoListReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, err error) {  \n var (  \n  todoList todov1.TodoList  \n  podList  corev1.PodList  \n  logger   logr.Logger  \n  \n  isCompleted bool  \n )  \n  \n logger = log.FromContext(ctx)  \n logger.Info(\"Reconciling TodoList\")  \n  \n if err = r.Get(ctx, req.NamespacedName, &todoList); err != nil {  \n  logger.Error(err, \"Error in fetching Todolist\")  \n  err = client.IgnoreNotFound((err))  \n  return  \n }  \n  \n if err = r.List(ctx, &podList); err != nil {  \n  logger.Error(err, \"Error in fetching pods list\")  \n  return  \n }  \n  \n for \\_, item := range podList.Items {  \n  if item.GetName() != todoList.Spec.Task {  \n   continue  \n  }  \n  logger.Info(\"Pod just became available with\", \"name\", item.GetName())  \n  isCompleted = true  \n }  \n  \n todoList.Status.IsCompleted = isCompleted  \n if err = r.Status().Update(ctx, &todoList); err != nil {  \n  logger.Error(err, \"Error in updating TodoList\", \"status\", isCompleted)  \n  return  \n }  \n  \n if todoList.Status.IsCompleted == true {  \n  result.RequeueAfter = time.Minute \\* 2  \n }  \n return  \n}\n```\n\nYou can refer to the complete file here [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist\\_controller.go#L48-L89](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist_controller.go#L48-L89).\n\nOnce the controller is done, you can create and deploy your code to the kubernetes cluster already created.\n\n```bash\nmake manifests  \nmake install  \nmake run\n```\n\nThis will run the manager with the required reconciliation logic hooked in to the k8s cluster.\n\n#### Testing time!!\n\nCreate the todolist object\n\n```yaml\napiVersion: todo.sarmag.co/v1  \nkind: TodoList  \nmetadata:  \n  name: jack   \n  namespace: operator-namespace  \nspec:  \n  task: jack\n```\n\nThis creates a todolist object called `jack` in the k8s namespace named`operator-namespace` .\n\nYou can refer to the full file here [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/samples/todo.yml#L1](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/samples/todo.yml#L1).\n\n```yaml\napiVersion: v1  \nkind: Pod  \nmetadata:  \n  name: jack  \n  namespace: operator-namespace  \nspec:  \n  containers:  \n  \\- name: ubuntu  \n    image: ubuntu:latest  \n    \\# Just sleep forever  \n    command: \\[ \"sleep\" \\]  \n    args: \\[ \"infinity\" \\]\n```\n\nYou can refer to the full file here [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/samples/pod.yml#L1-L12](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/samples/pod.yml#L1-L12).\n\nOnce you create the pod, you will see this specific log line [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist\\_controller.go#L48-L89](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist_controller.go#L48-L89). This means that the operator was able to find a pod with the same name as the task.\n\n#### Watching on pod events\n\nOne problem with the current approach is the operator only listens on the events of the `TodoList` type, whereas it should also monitor the pod events so that it can update the state accordingly. In order to ensure the reconcilation loops when the pod events change, chain the following method on the manager.\n\n```go\nfunc (r \\*MyController) SetupWithManager(mgr ctrl.Manager) (err error) {  \n  err = ctrl.NewControllerManagedBy(mgr).  \n    For(&todov1.TodoList{}).  \n    Watches(&source.Kind{Type: &corev1.Pod{}}, &handler.EnqueueRequestForObject{}).  \n    Complete(r)  \n  return  \n}\n```\n\nIf you want to ensure that the operator only watch on the pods it has created, you can create the pod and set the OwnerReferences by calling `SetControllerReference` on the pod.\n\nYou can then create the manager by using the `Owns` method\n\n```go\nfunc (r \\*MyController) SetupWithManager(mgr ctrl.Manager) (err error) {  \n err = ctrl.NewControllerManagedBy(mgr).  \n  For(&todov1.TodoList{}).  \n  Owns(&corev1.Pod{}).  \n  Complete(r)  \n  return  \n}\n```\n\n#### Watching on external events\n\nWhat if we want the reconcilation loop to run on external events as well?\n\nYou can create a goroutine which sends an event to the reconciliation loop every 5 seconds\n\n```go\nfunc (r \\*TodoListReconciler) startTickerLoop(periodicReconcileCh chan event.GenericEvent) {  \n var (  \n  ticker \\*time.Ticker  \n  count  int  \n )  \n ticker = time.NewTicker(time.Second \\* 5)  \n defer ticker.Stop()  \n  \n for {  \n  select {  \n  case <-ticker.C:  \n   periodicReconcileCh <- event.GenericEvent{Object: &todov1.TodoList{ObjectMeta: metav1.ObjectMeta{Name: \"jack\", Namespace: \"operator-namespace\"}}}  \n  \n   count += 1  \n   if count > 100 {  \n    return  \n   }  \n  }  \n }  \n}\n```\n\nYou can then change the manager setup to also watch on the `periodReconcileCh` channel\n\n```go\nfunc (r \\*TodoListReconciler) SetupWithManager(mgr ctrl.Manager) (err error) {  \n var (  \n  periodicReconcileCh chan event.GenericEvent  \n )  \n periodicReconcileCh = make(chan event.GenericEvent)  \n go r.startTickerLoop(periodicReconcileCh)  \n  \n err = ctrl.NewControllerManagedBy(mgr).  \n  For(&todov1.TodoList{}).  \n  Watches(&source.Kind{Type: &corev1.Pod{}}, &handler.EnqueueRequestForObject{}).  \n  Watches(&source.Channel{Source: periodicReconcileCh}, &handler.EnqueueRequestForObject{}).  \n  Complete(r)  \n return  \n}\n```\n\nYou can hook in the above channel with external events, expose an API which can trigger the loop, etc.\n\nIf you want the reconciliation loop to requeue itself after some duration even when it’s successful, you can use `RequeueAfter` as shown here [https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist\\_controller.go#L112-L125](https://github.com/gsarmaonline/todolist-k8s-operator/blob/main/internal/controller/todolist_controller.go#L112-L125).\n\n#### References\n\n*   Github link — [https://github.com/gsarmaonline/todolist-k8s-operator](https://github.com/gsarmaonline/todolist-k8s-operator)\n*   [https://yash-kukreja-98.medium.com/develop-on-kubernetes-series-demystifying-the-for-vs-owns-vs-watches-controller-builders-in-c11ab32a046e](https://yash-kukreja-98.medium.com/develop-on-kubernetes-series-demystifying-the-for-vs-owns-vs-watches-controller-builders-in-c11ab32a046e)\n",
            "url": "https://gauravsarma.com/posts/2023-05-27_Kubernetes-operators-using-Kubebuilder-7db99559120c",
            "title": "Kubernetes operators using Kubebuilder",
            "summary": "In this post, we will be going over the fastest no-frills approach to getting your operator off the ground using kubebuilder.  The post assumes knowledge of the following: Kubernetes and how it works Kubernetes custom resource definitions Kubernetes Operators and reconciliation loops Setting up a local cluster, I use kind for my k8s orchestration needs Golang The task is to create an operator that operates on a Kubernetes CRD TodoList...",
            "date_modified": "2023-05-27T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2022-09-18_Implement-your-own-CDC-using-Kafka-5ca716634126",
            "content_html": "\nMost of the problems that people mention with their Kafka implementations is that they don’t have the complete visibility required over the configuration and the API usage. Having more visibility into the commonly required tweaks can allow admins and developers to use Kafka as comfortably as a MySQL or PostgreSQL cluster.\n\nI was recently working on implementing a custom [CDC for Mongo](https://www.mongodb.com/docs/kafka-connector/current/sink-connector/fundamentals/change-data-capture/). Using Kafka connect, we have out of the box solutions available to use Mongo CDC. However, there was a requirement which needed us to gain complete control over the CDC process.\n\nIn the upcoming sections, we will discuss the overall process in brief and then explain the individual steps in detail.\n\n#### What is CDC?\n\nQuoting the mongodb site,\n\n> CDC is a software architecture that converts changes in a datastore into a stream of **CDC events**. A CDC event is a message containing a reproducible representation of a change performed on a datastore\n\nCDC can be used to connect databases like mongodb to other sinks, for eg, elasticsearch, spark, s3, etc. Another major advantage is that it can also be used as a reliable log of all the mongo events in sequence. The CDC events are generated by using the mongo oplog which is also used to maintain replication among the mongo replica sets.\n\n#### Prerequisites\n\nDelivery Semantics\n\n*   At-least once delivery\n*   Exactly once delivery\n\n#### CDC Approach\n\nThere can be at the minimum, 2 types of workers we can foresee:\n\n*   Poller\n*   Publisher\n\n#### Sequence of actions\n\n*   Poller fetches the mongo oplog’s **_resume token_** if present\n*   The resume token allows the poller to start from a specific offset at which an event was received.\n*   Fetch a fixed number of events from the mongo oplog starting from the resume token\n*   Send the batched events to the publisher\n*   The publisher sends the batched events to Kafka\n*   Once the publisher is able to send the events to Kafka, the poller should update the resume token as well\n*   Repeat the process again\n\n#### Rules of Resiliency\n\nThe sequence is pretty simple to understand. However, there are also multiple points of failure that we need to guard against here.\n\n*   Resume token should be updated only if the event batch has been published\n*   If the event publish has failed, then the resume token shouldn’t be updated\n*   If the resume token is not updated, the event publish should also fail\n*   A publishing will be termed successful if all the partition replicas have received the message\n*   The poller should not update the consumer’s offset till the resume token has been updated\n\n#### Questions\n\n*   How does the consumer know which offset to start from?\n*   How do we prevent the consumer from updating the offset at which the read just happened?\n*   How does the consumer fetch the latest resume token?\n*   How does the publisher ensure all partition replicas are updated?\n*   How does the publisher batch the events without custom logic?\n*   How do we ensure a transaction between the event batch publish and resume token update?\n\n#### How does the consumer know which offset to start from?\n\nEach consumer is assigned to a kafka consumer group by default. Kafka maintains the latest offset read per consumer group. That’s one of the reasons why each partition can have only consumer in a consumer group. If the consumer group doesn’t have any offset maintained, then it starts from the earliest partition offset or the latest partition offset depending on the consumer’s configuration as shown below.\n\n\\# Update consumer config with the following  \nauto.offset.reset: “latest” # “earliest” is also an option\n\n#### How do we prevent the consumer from updating the offset at which the read just happened?\n\nBy default, whenever a consumer reads from the offset of a specific partition, the consumer asynchronously updates the Kafka cluster about the read offsets and the Kafka updates the mapping. If the consumer fails unceremoniously before the offset update is completed, then there is a chance of messages being re-read which would lead to duplicate messages does violating the **_Exactly Once_** delivery guarantee.\n\nThe other approach is to commit the offsets only after the underlying tasks like pushing to Kafka is completed. This requires us to make the consumer offset update synchronous and should be controlled via an API. This can be configured by doing the following:\n\n\\# Update consumer config to disable the auto offset commit  \nenable.offset.commit: False\n\n\\# When the connected tasks are completed, the offset can be updated by running  \nconsumer.commit(asynchronous=False)\n\n#### How does the consumer fetch the latest resume token?\n\nWe store the resume token per poller in a separate topic. So whenever the poller starts up, it should fetch the latest token from the topic. If the poller is starting up for the first time, then there will be no resume token, so the polling of the oplog will start from the last made change as per the mongo configuration.\n\nFor the consumer to jump directly to the last resume token, we can follow the below approach:\n\n*   Get the highest watermark offset of the partition in the topic.\n*   Get the consumer’s current offset of the partition.\n*   Assign the offset of the partition to **_(offset-1)_**\n*   Read from the offset\n\n#### How does the publisher ensure all partition replicas are updated?\n\nWhenever we publish any message to Kafka without calling flush, it does an asynchronous publish of messages. Why do we need to ensure that all in-sync partitions are updated when sending a message.\n\nIn Kafka, the reliability of the cluster is dependent on the partitions and their replication factors. Each partition replica set has a leader where the writes happen which asynchronously syncs it to the other partition replicas.\n\nIn cases where we don’t wait for the acknowledgement between all the partition replicas, if the leader partition goes down before syncing it with the other partition replicas, then loss of data is possible. Hence it’s important to receive an acknowledgement from all the partitions to avoid a split brain or message loss situation.\n\nKafka provides 3 levels of acknowledgements from the kafka brokers:\n\n*   No acknowledgement\n*   Acknowledgement from the leader partition\n*   Acknowledgement from all the in-sync partition replicas\n\nThe same can be configured by the following configuration in the publisher\n\n\"acks\": \"all\" # Also can be 0 and 1\n\n#### How does the publisher batch the events without custom logic?\n\nUsually when we have to batch multiple events for any tool, we have a size threshold (let’s call it **_x_**) and a time threshold (this is **_y_**). It means we are saying that we have buffer the events till we reach **_x_** events. This helps us in reducing continuous network calls and provides the benefits of batching, like block size allocations, lower compression overhead, etc. However, we also shouldn’t wait for more than a certain duration till the event is published to reduce the latency or lag observed by the service. So we also have **_y_** with a higher precedence over **_x_**.\n\nIn a Kafka producer’s configuration, **_x_** correlates with `batch.size`. The main difference in the batch.size compared to x is that the events’ count number is the frequency of the events where batch.size is the actual size in bytes. The default is 16384 bytes.\n\nThe **_y_** correlates with `linger.ms` . The linger.ms configuration controls the time duration till which the message is kept in the buffer unless the buffer size is exceeded.\n\nbatch.size: 1638400  \nlinger.ms: 2000\n\n#### How do we ensure a transaction between the event batch publish and resume token update?\n\nKafka has the `flush` call which forces the publisher to publish all the events to the Kafka brokers with the appropriate acknowledgement configurations.\n\nHowever, the problem remains that kafka publishers also publish asynchronously, the `flush` call just ensures that the local buffers are emptied. This can lead to multiple instances of the same message arriving on the cluster.\n\nAfter a little searching and to my honest amusement (this is incredible), Kafka has support for Transactions. It uses the 2 commit phase pattern to ensure the transaction.\n\nIMO, understanding all of the above should give you a good control over how to configure your Kafka components.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**",
            "url": "https://gauravsarma.com/posts/2022-09-18_Implement-your-own-CDC-using-Kafka-5ca716634126",
            "title": "Implement your own CDC using Kafka",
            "summary": "Most of the problems that people mention with their Kafka implementations is that they don’t have the complete visibility required over the configuration and the API usage.  Having more visibility into the commonly required tweaks can allow admins and developers to use Kafka as comfortably as a MySQL or PostgreSQL cluster...",
            "date_modified": "2022-09-18T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2022-09-15_Migrating-Kafka-topics-without-downtime-f863819cfb3d",
            "content_html": "\nEach kafka topic defines the number of partitions and replication factors when it’s created. However, once a topic is created, the partition count cannot be changed without affecting the ordering guarantees of the kafka partitions since kafka uses the following formula to calculate which partition a record should go to:\n\npartition\\_id = partition\\_key % number of partitions\n\nKafka partitions are the gateway to concurrency and scalability. Having too many partitions cause a management overhead to manage the partitions, the sync between the replicas and choosing the leader, etc on the control plane and having too little partitions can bottleneck the concurrency metrics of a consumer group as there can be only one consumer per consumer group reading from a single partition.\n\nI didn’t find any direct tool which can do this without downtime. If you find something from established sources, then this article is irrelevant.\n\n#### Versioning topics\n\nWe will version each topic by adding a version number suffixed to the original topic name. Every time we need to migrate the topic to another, we increase the version number suffixed to the topic name.\n\nFor example, if the original topic name is `myTopic` and we want to migrate the topic, the versioning that will happen is `myTopic.v0` -> `myTopic.v1` . Since `myTopic` didn’t have a suffix, by default, `v0` is added to it. If we want to migrate again, then it will transition from `myTopic.v1` -> `myTopic.v2` .\n\n#### Overall steps to migrate a kafka topic\n\n*   Create a new topic suffixed with a increased version compared to the previous topic\n*   Inform your publisher that a newer version of the topic has been created\n*   Point your publisher to the new topic with the latest prefix\n*   Drain the old topic by the respective consumer groups listening on the same topic\n*   Once the older topic is drained, point the consumers to the new topic\n\n#### How do we inform the publisher that a newer version of the topic has been created?\n\nWe use Kafka itself to store all the versions of the topic. We can call this topic `_meta_versions` . Whenever a newer version is required, push the version to the `_meta_versions` topic.\n\nIn every publisher, run an internal consumer in a consumer group specific to the publisher only that reads from the `_meta_versions` topic. This internal consumer shouldn’t be confused with the actual consumer which developers use to consume the actual messages. The internal consumer can read all the versions from the `_meta_versions` topic till there are no more messages left in the topic for a certain duration of time. Set the topic version to that version in the publisher.\n\nEven if there are multiple versions in the topic before the publisher starts, we need to ensure that at every instance, the publisher is writing only to the latest version in the topic.\n\nSynchronization between publishers and its internal consumer, publishing to the `_meta_versions` topic and configured retention period for the topic differs based on the use case.\n\n#### How does the consumer group behave when a newer version of the topic has been created?\n\nThe publisher shifts immediately to the newer version and starts publishing there. To maintain the ordering guarantees, the consumer has to drain the partitions in the older topic before moving on to the newer version.\n\nHow can the consumer detect that it has drained, or in other words, read all the messages in the kafka topic?\n\nThere are multiple options possible here.\n\nFirst option is to run the consumer polling on a topic till no messages are received till the timeout is exceeded. However, this may also result in bugs where the timeout is breached because of an intermittent network issue.\n\nThe second option is to fetch the highest offset of the partition of the topic. There are 2 types of offsets, **_watermark_** offsets and **_end of log_** offsets. Use the watermark offset and ensure that the consumers consume up to the offset.\n\nWe can check the offset till which the consumer has consumed for a partition in a topic. Comparing the consumer’s offset and the topic partition’s offset can be a good indication that the consumer has drained the topic.\n\nOnce the draining is complete, the consumer can move on to the newer version of the topic.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**",
            "url": "https://gauravsarma.com/posts/2022-09-15_Migrating-Kafka-topics-without-downtime-f863819cfb3d",
            "title": "Migrating Kafka topics without downtime",
            "summary": "Each kafka topic defines the number of partitions and replication factors when it’s created.  However, once a topic is created, the partition count cannot be changed without affecting the ordering guarantees of the kafka partitions since kafka uses the following formula to calculate which partition a record should go to: partition\\_id = partition\\_key % number of partitions Kafka partitions are the gateway to concurrency and scalability...",
            "date_modified": "2022-09-15T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2022-05-21_Kafka--KRaft-and-Storage-Tiers-b28850c4303a",
            "content_html": "\nI was recently looking at a managed Kafka service and came across services like AWS MSK and Kafka on Confluent Cloud. While comparing these services, I saw that there were limitations on the number of partitions allowed in a cluster. For example, the maximum number of partitions per broker in AWS MSK is 4000.\n\nI wanted to understand the underlying resource crunch for the partition limit.\n\nAs I proceeded with understanding the reasons behind the same, the underlying problem came out to be a non-uniform approach of dealing with metadata in Kafka.\n\nKafka elements use both Kafka Controller Nodes and Zookeeper to keep track of the metadata. This leads to lots of synchronization and management overhead between Kafka elements, Kafka Controller nodes and Zookeeper.\n\nAs mentioned [here](https://zookeeper.apache.org/),\n\n> ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.\n\n### Prerequisites\n\nPartitions and partition keys define the parallelism possible in a Kafka topic. In every Kafka consumer group, there can only be one consumer in the group which reads from a specific partition. So if you have a Kafka cluster and your consumer load is not evenly distributed, it can be a case of a bad partition key specific to your application.\n\nReplication for partitions is configured in the cluster. Each partition has a configured number of replicas and has a leader replica and backup replicas. Writes and reads happen via the leader replica, hence it is advised to distribute the leader partition replicas across brokers to minimise the load of redistribution of replicas if one of the brokers goes down.\n\nEach Kafka cluster has mainly the following elements which interact with the brokers:\n\n*   Producer\n*   Consumer\n*   AdminClient\n\n### Current System with Zookeeper\n\n![](/img/kraft_1.png)\n\n![](/img/kraft_2.png)\n\nAs we can see from the above system, metadata like the commit offset, group partition assignment is stored in Zookeeper and the other information is stored in Kafka.\n\nZookeeper maintains a map of all the brokers in the cluster and their status. Whenever a broker joins or leaves the cluster, Zookeeper has to track the change and broadcast it to the other nodes in the cluster. Updates to Zookeeper and the Controller node is synchronous but updates to the brokers is asynchronous, which may lead to race conditions.\n\nWhenever a broker node starts up, it tries to mark itself as the Controller and sends it to Zookeeper. The Zookeeper service responds with a Controller Already available message. Whenever a broker becomes unreachable to Zookeeper, it marks the broker as unreachable and removes the entry. When the broker becomes reachable again to Zookeeper, it has to fetch all the information of the cluster again as it has no concept of deltas and the previous information already present with the broker.\n\nWhen a Controller node goes down or is restarted, it has to read all the metadata for all brokers and partitions from ZooKeeper and then send this metadata to all brokers. This results in a **_n\\*m_** operation (where **n** is the number of brokers and **m** is the number of partitions).\n\nWhen a network partition happens for a broker and is unreachable to either the Controller node or Zookeeper, fencing kind of becomes difficult or expensive to have.\n\n### Proposed System with kRaft\n\nMore about Raft [here](https://raft.github.io/).\n\n> Raft is a consensus algorithm that is decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. Consensus is a fundamental problem in fault-tolerant distributed systems which involves multiple servers agreeing on values.\n\nKafka serves as a distributed commit log which can be used for various purposes like message queues, streaming, audits, etc.\n\nThe fundamental advantages of a log based system are:\n\n*   Logs are append only\n*   Different readers can store their own offset\n\nAll other advantages like caching, maintaining epochs, replays, audits, efficient backups, delta reads are a side effect of the above fundamentals.\n\nOnce the Kafka team started looking at the comparisons between the Raft and Kafka commit logs, they came up with the below image:\n\n![](/img/kraft_3.png)\n\nKeeping the above table in mind, the Kafka team introduced a new consensus based protocol called kRaft where metadata is stored as a commit log. More details on the same [here](https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum).\n\n![](/img/kraft_4.png)\n\nInstead of storing metadata like partition assignment, commit offsets, etc in Zookeeper, Kafka uses internal Kafka topics like **_\\_\\_offsets_** to store the metadata information.\n\nInternal APIs like **_OffsetCommit_**, **_OffsetFetch_**, **_JoinGroup_**, **_SyncGroup_**, **_Heartbeat_**, etc are now handled by Kafka itself instead of sending it to Zookeeper.\n\nInstead of ad-hoc message passing across the brokers, the brokers can now consume the messages in the logs and process them. As the messages are processed, the brokers keep themselves up to date.\n\nIf the broker becomes unreachable for a certain duration, once it is up, it can process the messages from the internal topics and once it has processed the messages, it can again be marked as active. Only the delta messages needs to be processed which results in a significant decrease in time required to rejoin the cluster.\n\n![](/img/kraft_5.png)\n\nThere are specific controller nodes (usually 3 to 5) which maintain a self managed quorum to decide the leader. If the leader controller goes down, the backup controller can almost instantly take over once its messages are up to date.\n\nAll of the above items reduce the load of partitioning and overall, the load of maintaining metadata across large clusters.\n\n### Storage Tiers in Kafka\n\nKafka mainly uses disks for log retention. The size and speed of the disks required is dependent on the retention period configured on the data. Because of Kafka underlying ability to replay messages from the start, different applications use the ability of configuring longer retention periods.\n\nHowever, having longer retention periods also has its own strain on the disk in terms of backups, migration, etc. This increases the risk on consumers on the same cluster which depend on the fairly recent messages and don’t need a longer retention period for more than 2–3 days.\n\nThe Kafka team has come up with a concept of local storage tier and remote storage tier to alleviate the problem.\n\nThe retention period and the disk type can be configured separately for the different tiers. For example, for the local storage tier, we can use a smaller 100 GB SSD disk with a retention of 2 days. For the remote tier, we can use HBase or S3 with a retention period of 6 months.\n\nFor applications which need older data then the data in the local storage, the Kafka brokers will itself talk to the remote tier and fetch the data accordingly. This allows the same Kafka cluster and topics to serve different kinds of applications.\n\n_References_\n\n*   [https://docs.confluent.io/platform/current/zookeeper/kraft.html](https://docs.confluent.io/platform/current/zookeeper/kraft.html)\n*   [https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-needs-no-keeper](https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-needs-no-keeper/?_ga=2.183644783.83533890.1653037520-610075882.1647320397&_gac=1.58336856.1652705262.CjwKCAjw7IeUBhBbEiwADhiEMTSMm6yrYRqXpYyjivAhuZcHRytg1d028hpW6U3156sEXwHkHDa7NhoCy_kQAvD_BwE)\n*   [https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum](https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum)\n*   [https://www.amazon.in/Kafka-Definitive-Guide-Neha-Narkhede/dp/1491936169](https://www.amazon.in/Kafka-Definitive-Guide-Neha-Narkhede/dp/1491936169)\n",
            "url": "https://gauravsarma.com/posts/2022-05-21_Kafka--KRaft-and-Storage-Tiers-b28850c4303a",
            "title": "Kafka, KRaft and Storage Tiers",
            "summary": "I was recently looking at a managed Kafka service and came across services like AWS MSK and Kafka on Confluent Cloud.  While comparing these services, I saw that there were limitations on the number of partitions allowed in a cluster...",
            "date_modified": "2022-05-21T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2021-05-20_Understanding-Monarch--Google-s-Planet-Scale-Monitoring-System-60e59b63ac0c",
            "content_html": "\nMonarch is a planet-scale in-memory time series database developed by Google. It is mainly used by as a reliable monitoring system by most of Google’s internal systems like Spanner, BigTable, Colossus, BlobStore.\n\nAs is the case with any Google service, it has to be designed for massive scale, highly available, support regional locality. Another use case that was important for Monarch was to depend on other Google services as little as possible since other services were using Monarch for their own monitoring and any outage in either would affect the other as well.\n\nMonarch is a service that has to be highly available and partitioned, hence it compromises the consistency by providing the required hints to the client service in cases of consolidating consistency delays.\n\nMonarch tries\n\n#### **Data stores:**\n\nThe data is stored in two formats:\n\n*   **Leaves** are the components where the actual monitoring data is stored in memory\n*   **Logs** are persistent stores that can be used to replay the events in case of component failures\n\n#### Data Ingestion\n\nThe data ingestion pipeline tries to follow the below guidelines:\n\n*   Store data of the client service as close to the service’s operating region as possible so that network latency is minimal\n*   Store data of client service in the same leaf as there is a high probability of data queries being clubbed and focussed on that leaf for faster query responses\n\n![](/img/monarch_1.png)\n\nThe data traversal will happen in the following approach:\n\n*   **Ingestion Router** routes the data to the leaf routers\n*   **Leaf Routers** routes the data to the leaves\n*   **Range Assigner** decides the leaf to store the data\n\nIngestion routers regionalize time series data into zones according to location fields, and leaf routers distribute data across leaves according to the range assigner\n\n![](/img/monarch_2.png)\n\nThe data received has the following categories:\n\n**Targets** are used to identify the node/service/component form which the data has generated. Based on the above diagram, a Target string **_ComputeTask::sql-dba::db.server::aa::0876_** represents the Borg task of a database server. The format of target strings is important in data placement among the leaves as target ranges are used for lexicographic sharding and load balancing among leaves.\n\n**Metrics** contain the metric information in the format of key-value pairs where keys are the type of metrics of a target and the value is time series based data points. The metric types supported are boolean, int64, double, string, distribution, or tuple of other types. The metric values can be cumulative or a gauge. The advantage of using cumulative points is the intermittent data losses don’t affect the distribution by much.\n\nThe data can be sent in **Delta Time Series** where only the delta in the time series data is sent instead of the whole metric. This reduces the continuous input of data and requires handling only there is a change in the data range.\n\n**Bucketing** helps to aggregate data points for a certain duration before sending them to the ingestion pipeline. This reduces the network handling and bulk inserts can be performed.\n\n**Admission windows** are used to reject queries that are received after a certain duration so that the pressure of handling data received after a certain duration can be avoided.\n\n#### Data Querying\n\nMonarch provides a globally federated query engine. All queries can be fired at the global level and Monarch takes care of routing the query to the leaves where the relevant data is stored and consolidates the responses from the leaves.\n\nComponents used in data querying which can be viewed in the above diagrams are as follows:\n\n*   **Mixers** break down the queries into subqueries and consolidate the responses from the subqueries. Root mixers receive the queries and fan them out to zone mixers which further fan it out to the leaves, thus forming a **Query Tree**. The Mixers also check the Index servers to limit the queries to the zones or leaves where the data resides in\n*   **Index servers** index the data for each zone and leaf which can be used to understand which leaves the queries are meant for\n*   **Evaluators** generate the responses from Standing queries and write the data back to the leaves\n\nMonarch’s **Query Language** supports the following keywords:\n\n*   Fetch\n*   Filter\n*   Join\n*   Align\n*   GroupBy\n\n**Ad-hoc queries** are queries that are from users outside of the system.\n\n**Standing queries** are queries that are similar to views in other database systems. The standing queries are periodically calculated and stored back into Monarch for faster query responses.\n\nStanding queries are also more performant since the evaluation can be done at the zone or root level depending on the breadth of the query. This minimizes the query space to region-specific leaves.\n\n**Level analysis** of the query is done which breaks the query based on various levels for authentication and better query locality. The levels can be defined based on the Query Tree mentioned above.\n\n**Replica Resolution** is used to figure out the best replica to answer the query as there may be differences in query load, system configuration, etc which makes a certain replica better suited for responses.\n\n**User Isolation** limits the amount of memory any user can use in the system so that other rule-abiding users are not affected.\n\n#### Performance\n\n*   Monarch runs on 38 zones spread across five continents. It has around 400,000 tasks\n*   As of July 2019, Monarch stored nearly 950 billion time series, consuming around 750TB of memory with a highly optimized data structure\n*   Monarch’s internal deployment ingested around 4.4 terabytes of data per second in July 2019\n*   Monarch has sustained exponential growth and was serving over 6 million queries per second as of July 2019.\n\n![](/img/monarch_3.png)\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\nReferences:\n\n*   [https://www.vldb.org/pvldb/vol13/p3181-adams.pdf](https://www.vldb.org/pvldb/vol13/p3181-adams.pdf)\n",
            "url": "https://gauravsarma.com/posts/2021-05-20_Understanding-Monarch--Google-s-Planet-Scale-Monitoring-System-60e59b63ac0c",
            "title": "Understanding Monarch, Google’s Planet-Scale Monitoring System",
            "summary": "Monarch is a planet-scale in-memory time series database developed by Google.  It is mainly used by as a reliable monitoring system by most of Google’s internal systems like Spanner, BigTable, Colossus, BlobStore...",
            "date_modified": "2021-05-20T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2021-04-10_Understanding-the-Concept-of-Virtual-Time-Using-the-Time-Warp-Algorithm-4579dfe5eca8",
            "content_html": "\nWhat is virtual time and why do we need it?\n\nAs distributed systems have progressed and been adopted over the last decade, there have been numerous technologies in different segments like databases, caches, message queues, etc which are built on top of other frameworks which abstract away the difficulty of managing distributed systems. One of the most important and difficult things to manage in distributed systems is managing synchronicity using time.\n\nSome common forms of synchronization techniques in distributed systems are **_block-resume_**, **_abortion-retry_**, **_lookahead-rollback_**. This post will cover the **_lookahead-rollback_** used by the **_Time Warp algorithm_** since, though not intuitive, it is leads to elegant and efficient solutions when the cons are compared for each of the techniques.\n\n> A virtual time system is a distributed system executing in coordination with an imaginary virtual clock that ticks virtual time. Virtual time itself is a global, one-dimensional, temporal coordinate system imposed on a distributed computation; it is used to measure computational progress and to define synchronization.\n\nThe main purpose is to have a single virtual time (which may also be in sync with the real time) across the system so that processes can always operate in an opaque manner when in actuality, it is an unpredictable entity.\n\nAll processes communicate with each other (either locally or remotely) via messages which mainly consist of 4 primary fields: sender, virtual send time, receiver, virtual receive time.\n\nSome fundamental rules which must be observed for virtual time:\n\n> The virtual send time of each message must be less than its virtual receive time.\n\n> The virtual time of each event in a process must be less than the virtual time of the next event at that process.\n\nIt should also be noted that the virtual times of any event A and B should follow the above rules only if there are events which directly or indirectly have causality amongst A and B.\n\n**Lamport’s logical Clock**\n\nLamport was one of the first to show that real time temporal order, causality between events had a strong connection to the concepts of relativity. He provides an algorithm which assigns ordered clock values to events once the execution of a distributed system starts. The Time Warp algorithm is an inverse of the Lamport alogrithm where they mainly have the time assigned to the event and rollback if any discrepancy is found\n\n**Reed’s pseudotime**\n\nReed came up with the concept of pseudotime which seems similar to virtual time but is different in the sense that pseudotime where events are assigned multi-version timestamps which are used for concurrency control to understand the atomic time of the event in distributed systems where virtual time is more relative in nature such that it mainly deals with events which have happened **_before or after_** in comparison to others. Reed uses abortion-retry for his algorithm where there may be starvation, unlimited retries, deadlocks, etc.\n\n**Schneiders’s work**\n\nSchneider’s algorithm mainly consists of broadcasting also synchronized messages to all processes and not proceeding till acknowledgements are received from each of them. Keeping all synchronized events in their local memory, processes are able to make decisions locally about the order of events. However, the algorithm doesn’t scale with the requirements of today’s scenario where the broadcast messages and acknowledgements needs to be performed for tens of thousands of servers for millions of messages.\n\n**Time Warp algorithm**\n\nEach process maintains its own virtual clock which is changed only between events. Each process has 3 queues; the **_input message queue_**, **_output message queue_** and the **_state queue._** Keeping in mind the virtual time rules mentioned in the previous section, the each event in the local queue of the process should always be in increasing order.\n\nThere are extremely probable cases where events with virtual time lesser than the current time in the process may arrive, which violates the rules of virtual time. In such cases, the time warp algorithm follows the lookahead-rollback algorithm, which results in rolling back to the point where the incoming message fits appropriately with the processes’ virtual time and remaining messages have to be replayed in the correct sequence.\n\nThe rollback of messages cannot be narrowed down to a single processor alone since the messages sent in the wrong sequence also affects other processes which maintain their own queues. Hence, there has to be a rollback for the error messages in other processes as well which are connected to each other. Other processes which don’t have any direct or indirect connection to the message with error sequence won’t be affected, thus leading to a lesser network rework effect.\n\nFor rollbacks, there is a concept of **antimessages** in the system which can also be used for other purposes. All fields of messages and antimessages are the same except for one field, which is the sign of the message. All messages which have been sent to other processes have a **(+)** sign and their antimessages have a **(-)** sign. Whenever a message is sent, the message is stored in the **receiver’s** input queue and the antimessage is stored in the **sender’s** output queue.\n\n> Whenever a message and antimessage exist in the same queue, both the messages cancel each other out and are therefore removed.\n\nThe messages and antimessages are created together and can exist in different queues.\n\nComing back to the rollback case where we had to rollback the sent messages to other processes to maintain the new correct order, the process generates and sends antimessages for all the incorrect messages sent to other processes which do the same locally as well, thus leading to ripple effect of efficient rollbacks. Even in cases where the antimessage arrives before the actual error message in another process, both the messages will be annihilated since the same queue cannot have both the messages ultimately bringing all processes to the correct order.\n\nAs we go through the above mechanism, we need to understand that all the steps listed above are happening in the **_local context of a process_**, which doesn’t have knowledge of a global virtual clock yet. Not having a global value doesn’t allow efficient memory management of the queues since we see that messages have to be kept in their queues indefinitely which can be a problem at scale.\n\nWe introduce the concent of **Global Virtual Time (GVT)**, which at real time **_r_**, is the minimum of the following\n\n*   All virtual times in all virtual clocks at **_r_**\n*   Virtual send times which have been sent but have not been processed/received at time **_r_**.\n\nFrom the above definition, it becomes evident that at any point in time in entire system, the GVT will also be the lowest value or the floor value in the system. The GVT symbolizes that the messages below the GVT have been processed in the correct order and can be forgotten about, which the local processes can use to clean up their queues.\n\nThe delay in calculating the GVT is the total delay in sending a broadcast message to the processes.\n\n**Conclusion**\n\nA common feedback of the Time Warp algorithm contains the argument that rolling back across thousands of processes may not be feasible in the real life use cases. However, there is a point to be made that the rollbacks are termed as exceptions in real life use cases and not the norm. Since processes folllow the **_temporal locality principle_**, it makes more sense that events would arrive in the actual order and the events which arrive in the past arrive in the recent past which means the rollbacks are lesser.\n\n> The only alternative to lookahead/rollback is for the process to be blocked (i.e., doing nothing) for the same length of real time as the lookahead computation, which is just as much of a “waste.”\n\nVirtual time is strongly analogous to Virtual Memory, the same concept of memory management where the most optimal pages are kept in the main memory. There are numerous efficient algorithms which try to determine the pages which are kept in the main memory, where lookahead section of data files prefetch a few pages in blocks and keep as it’s the usual trend of the user. In cases where a required page is not found in the main memory, there is a page fault which swaps out the least optimal pages from main memory and replaces it with the required page which is again similar to the rollback mechanism. The analogy can be extended in various ways to suggest that virtual time can be implemented in an elegant and efficient manner in distributed systems.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\nReferences:\n\n*   [http://cobweb.cs.uga.edu/~maria/pads/papers/p404-jefferson.pdf](http://cobweb.cs.uga.edu/~maria/pads/papers/p404-jefferson.pdf)\n*   [https://en.wikipedia.org/wiki/Lamport\\_timestamp](https://en.wikipedia.org/wiki/Lamport_timestamp)",
            "url": "https://gauravsarma.com/posts/2021-04-10_Understanding-the-Concept-of-Virtual-Time-Using-the-Time-Warp-Algorithm-4579dfe5eca8",
            "title": "Understanding the Concept of Virtual Time Using the Time Warp Algorithm",
            "summary": "What is virtual time and why do we need it.  As distributed systems have progressed and been adopted over the last decade, there have been numerous technologies in different segments like databases, caches, message queues, etc which are built on top of other frameworks which abstract away the difficulty of managing distributed systems...",
            "date_modified": "2021-04-10T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2020-08-09_Neo4j-storage-internals-be8d150028db",
            "content_html": "\nI was exploring Neo4j and came upon this [video](https://www.youtube.com/watch?v=BfPDZf2wmqg) where Jim Webber, Chief Scientist at Neo4j, explains these numbers:\n\n**125x = 48y = 3z** is the ratio of the **cluster size(number of instances)** required for a similar data store functionality where x=MongoDB, y=Cassandra and z=Neo4j\n\nand\n\n**20x = 50y = 0.33z** is the ratio of the **disk size** required for a similar data store functionality where x=MongoDB, y=Cassandra and z=Neo4j\n\nThe blog post will look to cover the internals of how the data is stored and accessed in Neo4j and why it is a serious contender for a certain type of data storage.\n\nBefore going further, let us explore how MySQL would handle nodes and relationships internally and the affects of it as the data size grows.\n\nSuppose we want to build the feature where **_users_** can **_like_** a **_post_** in facebook. Following tables will be required:\n\n*   Users (containing user information)- ID, Name, Email, etc\n*   Posts (containing post information)- ID, URL, Content\n*   Likes (is a table containing the many-to-many relationships)- UserID, PostID\n\nWe would ideally index the UserID and PostID columns belonging to the **_likes_** table. Let’s try to find the posts that a particular user has liked. First, we have to find the user account using the UserID from the users table. In order to find the posts on which the user has liked, MySQL follows the join based approach where it joins the **_likes_** table and the **_users_** table using the UserID column and that leads a scan of all the indexes of the table which becomes expensive as the table grows. This is a first level relationship. The cost becomes more expensive as the the degree of the relationship grows as there is a need of joining multiple tables which is not really efficient and doesn’t guarantee a **_O(1)_** complexity.\n\nOne of the major differences that a graph database like Neo4j provides is index free adjacency, which means that there is no requirement of having indexes and looking them up in order to match and form relationships. We will get to how Neo4j does it in some time.\n\nThe index style of finding relationships is also one way, i.e it will take separate queries to find the relationship in the opposite direction. The relationships are not bidirectional, whereas in graph databases, the bi-directional nature of the relationships are out of the box since it follows a more natural way of linking relationships.\n\n> Neo4j uses a fixed record size based pointer scheme to to store graph data thus, providing O(1) traversal\n\n> — Jim Webber\n\n**Neo4j entities:**\n\n*   Node\n*   Relationship\n*   RelationshipType\n*   Property\n*   Label\n\nThe attributes/properties of a node and the relationships are separated out so that they are not treated in the same manner. It is important to segregate the value and reference of the nodes in order to provide optimal storage and access for both of them.\n\nEach of the entities have their own store file in neo4j. For example, all the nodes are stored in the node store file which is **_neostore.nodestore.db_** for Neo4j. All relationships are stored in the relationship store file which is **_neostore.relationshipstore.db_** for Neo4j.\n\nAll records of any type in neo4j are fixed sized records, thus allowing easy and efficient mathematical formulas to access the record without the need for traversal.\n\n**Node store:**\n\nAll the nodes in the database are stored in the node store file.\n\nEach node record accounts for a fixed 15 bytes.\n\nThe layout is as follows:\n\n*   1st byte — isInUse\n*   Next 4 bytes — ID of the node\n*   Next byte — First relationship ID\n*   Next byte — First property ID\n*   Next 5 bytes — Label store\n*   Remaining byte — for future use\n\nThe first byte is used to determine whether the record is being used or has been deleted. If not, the record will be used for newer entries.\n\nThe next 3 sectors are the IDs of the node itself, first relationship ID, first property ID and label store. Some of the labels are stored in the node itself if possible for lesser jumps. The other bytes are to saved for future use.\n\n**Relationship store:**\n\nEach relationship record is a fixed record of 34 bytes\n\nThe relationship record’s layout is as follows:\n\n*   Start node ID\n*   End node ID\n*   Pointer to the relationship type\n*   Pointer to the next and previous relationship record for each of the start node and end node\n\nEach relationship record belongs to two nodes, the start node and the end node. Each relationship also has a type associated with it, which signifies which type of relationship is connecting the 2 nodes. The pointer to the relationship type helps to determine this.\n\nThe relationship node contains 4 other pointers or misdirections to relationship records. They point to the previous and next relationship of both the start node and end node similar to how doubly linked lists behave.\n\nNeo4j uses trees to provide indexing capabilities to reach the start node from where we can start the traversal.\n\nTo reach the appropriate node, we have to iterate through the relationship linked list from the start node, iterate till we find the appropriate the required relationship record, and then apply the formula to find the appropriate node from the relationship record.\n\n![](/img/neo4j_1.png)\n\nAs we see from the above description, the fixed sized records and pointer type traversal instead of scanning the table using indexes lead to much faster and efficient way of finding relationships.\n\nOnce we find the start node record, using the first relationship ID, we can find the relationship in the relationship store by multiplying the ID with the size of the relationship record. We can get the relationship record and from that we can find the second node in the node store using a similar formula.\n\nThe Property Store and the Label store are simpler stores similar to node store.\n\nThe Neo4j engine also has a LRU k-page cache which basically divides the cache into segments based on the different types of store files and keeps a fixed count of records in these segments by removing the least recently used records.\n\nBelow are some numbers taken from the Neo4j site:\n\n**Scenario #1 — Initial status**\n\n*   Node count: 4M nodes\n*   Each node has 3 properties (12M properties total)\n*   Relationship count: 2M relationships\n*   Each relationship has 1 property (2M properties total)\n\nThis is translated to the following size on disk:\n\n*   Nodes: 4.000.000x15B = 60.000.000B (60MB)\n*   Relationships: 2.000.000x34B = 68.000.000B (68MB)\n*   Properties: 14.000.000x41B = 574.000.000B (574MB)\n*   TOTAL: **703MB**\n\n**Scenario #2–4x growth + added properties + indexes on all properties**\n\n*   Node count: 16M nodes\n*   Each node has 5 properties (80M properties total)\n*   Relationship count: 8M relationships\n*   Each relationship has 2 properties (16M properties total)\n\nThis is translated to the following size on disk:\n\n*   Nodes: 16.000.000x15B = 240.000.000B (240MB)\n*   Relationships: 8.000.000x34B = 272.000.000B (272MB)\n*   Properties: 96.000.000x41B = 3.936.000.000B (3936MB)\n*   Indexes: 4448MB \\* ~33% = 1468MB\n*   TOTAL: **5916MB**\n\nReferences:\n\n*   O’Reilly Graph databases book\n*   [https://neo4j.com/developer/kb/understanding-data-on-disk/](https://neo4j.com/developer/kb/understanding-data-on-disk/)\n*   [https://www.youtube.com/watch?v=BfPDZf2wmqg](https://www.youtube.com/watch?v=BfPDZf2wmqg)\n*   [https://www.youtube.com/watch?v=NlT21Ceg3y0](https://www.youtube.com/watch?v=NlT21Ceg3y0)\n",
            "url": "https://gauravsarma.com/posts/2020-08-09_Neo4j-storage-internals-be8d150028db",
            "title": "Neo4j storage internals",
            "summary": "I was exploring Neo4j and came upon this [video](https://www. youtube...",
            "date_modified": "2020-08-09T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2020-06-11_Interacting-between-C-libraries-and-Go-using-Unsafe-cb8b460d4f0c",
            "content_html": "\nSuppose we have a C lib where we have defined various data structures and methods. Due to some reason or constraint, there needs to be a Golang process which has to reuse the structures mentioned in the C lib.\n\nApart from accessing the C defined elements from the Go code, another topic which is more important is to understand the difference between both the C and Go runtimes.\n\nGo’s runtime does all the memory management tasks like memory allocation and memory freeing for the processes and deciding which object’s memory should escape to the heap. In C, we have to explicitly allocate and free memory allocated on the heap.\n\nNaturally, when C objects are used in the context of Go runtime, the Go runtime has to disable accounting and management of those memory segments. The unsafe package helps to do that as well.\n\nThe unsafe package, is one of the Go packages which helps in accessing system internals and performing complex operations when necessary like accessing the object memory, modifying the object memory across data types which are not supported in Golang, firing system calls, etc. In this post, we will be mainly focussing on how the Unsafe package helps in providing a bridge between C and Go objects.\n\nThe main reason that we require unsafe package is because the pointers provided by Go doesn’t allow pointer arithmetic, type casting across different data types, etc to make the language safer for application programmers with lesser experience in the system programming side. Pointers normally in Go, are mainly to be used as references for objects. However, by using unsafe, we can leverage the power of a language which is able to properly utilize system constructs.\n\nLet us take the example of a short C snippet where we are trying to take a message object and call the receiver method where we are just printing it for now, but can potentially be a network call later as well.\n\n```c\n# include <stdio.h>\n# include <stdlib.h>\n# include <stdint.h>\n\ntypedef struct Message {\n    uint8_t     m_type;\n    uint32_t    buff_size;\n} message_t;\n\nchar * gl_msg = \"message_in_c_global\";\n\nvoid receiver(message_t* msg) {\n    printf(\"%d \\t %d \\n\", msg->m_type, msg->buff_size);\n}\n\nmessage_t* getMessage() {\n    message_t *msg;\n    msg = (message_t*)malloc(sizeof(message_t));\n    msg->m_type = 1;\n    msg->buff_size = 1024;\n    return msg;\n}\n\nvoid hello() {\n    printf(\"hello from C\\n\");\n}\n```\n\nIn order to call the C code in Go, take the C snippet and paste it in the Go code as shown below. There should not be any line space between the C block and the **_import C_** line.\n\n```go\npackage main\n\n/*\n# include <stdio.h>\n# include <stdlib.h>\n# include <stdint.h>\ntypedef struct Message {\n    uint8_t     m_type;\n    uint32_t    buff_size;\n} message_t;\nchar * gl_msg = \"message_in_c_global\";\nvoid receiver(message_t* msg) {\n    printf(\"%d \\t %d \\n\", msg->m_type, msg->buff_size);\n}\nmessage_t* getMessage() {\n    message_t *msg;\n    msg = (message_t*)malloc(sizeof(message_t));\n    msg->m_type = 1;\n    msg->buff_size = 1024;\n    return msg;\n}\nvoid hello() {\n    printf(\"hello from C\\n\");\n}\n*/\nimport \"C\"\n\nimport (\n    \"fmt\"\n    \"unsafe\"\n)\n\ntype (\n    Message struct {\n        MType    uint8\n        BuffSize uint32\n    }\n)\n\nfunc main() {\n\n    var (\n        msg string\n\n        msgC  *C.message_t\n        msgGo *Message\n    )\n\n    C.hello()\n\n    // Convert C char pointer to Go String\n    msg = C.GoString(C.gl_msg)\n    fmt.Println(\"Converted C string to Go string\", msg)\n\n    // Fetch C struct of same data type\n    msgC = C.getMessage()\n\n    // Convert the C pointer using unsafe and convert it back\n    // to the Go Message struct pointer\n    msgGo = (*Message)(unsafe.Pointer(msgC))\n\n    fmt.Println(\"Converted C struct to Go struct of similar data types\",\n        msgGo.MType, msgGo.BuffSize)\n\n    // Change the struct variable and cast it back to C struct\n    msgGo.BuffSize = 4096\n\n    // Convert the msgGo pointer to unsafe pointer\n    // and pass it back to the C receiver method\n    C.receiver((*C.message_t)(unsafe.Pointer(msgGo)))\n\n}\n\n```\nIn the first method inside the main function, we are converting C string which is a char pointer to a Go string.\n\nIn the second method, we initialize a **_\\*C.message\\_t_** type and assign the C data type to it. Using **_unsafe.Pointer_**, we convert it to a pointer which is then casted to a Go struct which contains the similar data types.\n\nThe structures are all padded to keep in mind the cache alignment. That’s why if you print the **_msgC_** variable, we get the output\n\n> &{1 \\[145 84 1\\] 1024}\n\nHere, 1 and 1024 were initialized in the C code. 145, 84, 1 are **_uint8_** size elements padded so that the total size comes up to 64.\n\nOnce we have the unsafe pointer to the C structure, we can cast it to any data type which contains similar alignment. The same can be done for the **_sk\\_buff_** packet structure which received from the network interfaces and used in Go accordingly.\n\nIn the third example, we take the go structure pointer, convert it to unsafe pointer again and then reconvert it to the C **_message\\_t_** pointer and pass it as an argument to the C receiver method.\n\nA very important thing to remember here is that **_msgGo_** is a Go pointer and becomes a candidate for garbage collection. But since we are passing that value to the C method, garbage collection may happen before the data is read which will lead to segmentation faults. To avoid this, we need to call unsafe.Pointer while passing it as the argument to the function itself. When the go runtime sees the unsafe pointer being passed in the same function call, it doesn’t garbage collect the pointer.\n\nThis is one of the uses of using the unsafe package. Using such techniques, the Go runtime can have access to other system and hardware level abstractions like device drivers, etc.\n\nReferences\n\n*   [https://golang.org/pkg/unsafe/](https://golang.org/pkg/unsafe/)\n*   [https://medium.com/a-journey-with-go/go-what-is-the-unsafe-package-d2443da36350](https://medium.com/a-journey-with-go/go-what-is-the-unsafe-package-d2443da36350)\n",
            "url": "https://gauravsarma.com/posts/2020-06-11_Interacting-between-C-libraries-and-Go-using-Unsafe-cb8b460d4f0c",
            "title": "Interacting between C libraries and Go using Unsafe",
            "summary": "Suppose we have a C lib where we have defined various data structures and methods.  Due to some reason or constraint, there needs to be a Golang process which has to reuse the structures mentioned in the C lib...",
            "date_modified": "2020-06-11T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2020-05-30_BKD-trees--used-in-Elasticsearch-40e8afd2a1a4",
            "content_html": "\nI had worked on Elasticsearch back in 2015, when it was more known for its text searching capabilities using inverted indexes. As I looked to pick it up again last year for another project, I saw that Elasticsearch had added core support for other data types from text like numbers, IP addresses, geospatial data types, etc.\n\nAs I looked to understand the main differences which could allow optimized search over such data types, I stumbled upon **_BKD trees_**. Surprisingly, there is not much written about BKD trees apart from a white paper and some blogs. The blog post will look to cover elements leading up to the development of BKD trees and its advantages starting from **_KD trees_**.\n\nWe will start with the **_BST_** **_(Binary Search Tree)_** which will be the base for our post. A BST is a binary tree which has lesser elements to its left and greater elements to its right for all nodes. The article will not contain more information regarding the insertion, deletion and searching of elements further since there are numerous sources out there.\n\nBST or other similar implementations of BST like **_AVL trees_**, leverage the capability of dividing the search space by 2 at each node during the traversal, thus resulting in a O(logN) search in the best case scenario. It is possible to balance BSTs by rotating the tree with the pivot.\n\nA major flaw or lack of ability with the BST is the ability to deal with multiple dimensions or spaces. For example, if we have a store of latitudes and longitudes, and we are asked to search for a specific set of latitude and longitude. It is easy for us to use BST to search for either the latitude or longitude but not both the elements together since BST is capable of handling only one dimension in its store.\n\nWhat do we do if we have multiple dimensions or multiple metrics across which we need to run our search queries?\n\n#### KD or K-Dimensional trees\n\nThis is where KD or K-Dimensional trees come into the picture. KD trees are similar to BST in the terms that it allows segregation of data to the left or right depending on the value. The main difference is the consideration of multiple planes or dimensions or spaces while constructing it. In a **_KD tree_** or **_K-NN_** problems, each traversal is able to divide a particular plane into 2 equal sub planes. As the traversals go deeper, the combination of division of planes is used to reach the point in space that was being searched for.\n\nA very good example of splitting of planes in a 1 dimensional structure is the **binary search** method on an array. For every jump, half of the array is taken out of consideration.\n\nFor a 2D structure, the structure can be split in the following way:\n\n![](/img/bkd_1.png)\n\nFor the point A, the X axis is split into 2. For the point B and C, the Y axis is split into 2 and so on.\n\nHow do we represent the 2D split in a tree?\n\nIn order to do this, we define something called as the discriminator. The discriminator is mainly used to figure out which plane is to be considered for the split during the jump. The formula to figure out the discriminator is\n\n**_discriminator = level % N,_** where level is the level of the tree and N is the number of dimensions\n\nEach dimension is assigned a key. The key is used against the discriminator value of each node.\n\n![](/img/bkd_2.png)\n\nWe will try to use the above knowledge to try to explain the tree.\n\nAt node A, discriminator = 0 % 2 = 0\n\nAt node B, discriminator = 1% 2 = 1\n\nAt node C, discriminator = 1% 2 = 1\n\nAt node D, discriminator = 2 % 2 = 0\n\nAt node E, discriminator = 3 % 2 = 1\n\nAt node F, discriminator = 3 % 2 = 1\n\nSince there are only 2 dimensions here, the discriminator value can be only 0 or 1 since the level starts from 0. We will call the first dimension as **_X_** and the second dimension as **_Y._** The **_X_** can very well be the latitude and **_Y_** the longitude.\n\nApplying BST strategy here, we select the next node based on the comparison of values of the dimension.\n\nLet us try to search the element with value **S(66, 85)**\n\nAt **_point A_**, discriminator is 0, hence we compare the **_X_** dimension of S and A. **_S(X)_** is 66 and **_A(X)_** is 40. Since **_66 > 40_**, we navigate to the right which is the C node.\n\nAt **_point C_**, discriminator is 1, hence we compare the **_Y_** dimension of S and A. **_S(Y)_** is 85 and **_C(Y)_** is 10. Since **_85 > 10_**, we navigate to the right which is the D node.\n\nAt **_point D_**, discriminator is 0, hence we compare the **_X_** dimension of S and A. **_S(X)_** is 66 and **_D(X)_** is 69. Since **_66 < 69_**, we navigate to the left which is the E node.\n\nWe finally reach the node that we were searching for.\n\nThe same algorithm can be used to a 3D or a 4D structure as well. The only change is the calculation of the discriminator and the number of discriminator values will be 3 or 4 respectively depending on the dimensions. If it’s a 3D structure, the space will repeat itself every 3 jumps.\n\nAs we can see above, the read operations tend to be optimized well keeping N-dimensions in mind.\n\nAny insertion or deletion on the tree can be a little more tricky than in a BST. In a BST, the rotation works since we are rotating along only one dimension. Rotation in a KD tree will not work easily since rotation along only one dimension will disrupt the other dimensions as well. Hence, write operations on a KD tree can become expensive.\n\nIn the above example, the space division selection is done in a round robin fashion. However, that is the simplest approach that can be taken to ensure proper cutting of all spaces. If we want to give more priority to specific dimensions, it can very easily be controlled by modifying the **_discriminator_** formula to reflect the same.\n\nThe BKD tree is a collection of multiple KD trees as children. This makes sense as the write operation propagations can be controlled to a single KD tree as the data increases. Since BKD trees were mainly built for disk operations, it borrows a leaf (pun intended) from B+ trees and stores the actual points only at the leaf nodes. The internal nodes are mainly used as pointers to reach the appropriate blocks. The tree is a combination of complete binary trees and B+ trees. Since it is a complete binary tree, array techniques using formulas like **_2n+1_** and **_2n+2_** can be used to fetch the appropriate child nodes which will be further optimized for disk IO. I have a hunch that the number of KD trees and the nodes in them can be directly correlated with the shard sizes used in Elasticsearch.\n\n#### **Searching on a N-dimensional space**\n\nIn this section, we try to use the above learnt knowledge of developing the required atomic query types using which other complex queries can be built.\n\n*   Filter based on **_X_** dimensions where **_1 <= X <= N_**\n*   Top hits based on **_X_** dimensions as the unique scope on a function of **_Y_** dimensions where **_1 <= X <= N and_** and **_1 <= Y <= N_**\n*   Range distribution or buckets formation with **_X_** dimensions as the unique scope on a function of **_Y_** dimensions falling within defined ranges where **_1 <= X <= N and_** and **_1 <= Y <= N_**\n*   Temporal listing of **_X_** dimensions with sampling functions performed on **_Y_** dimensions where **_1 <= X <= N and_** and **_1 <= Y <= N_**\n*   Datatable listing of **_X_** dimensions where **_1 <= X <= N_**\n\nI think being able to restrict the queries to be a collection of the above query types should be enough to answer further complex queries.\n\nReferences\n\n*   [https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/KDtree.html](https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/KDtree.html)\n*   [https://medium.com/@nickgerleman/the-bkd-tree-da19cf9493fb](https://medium.com/@nickgerleman/the-bkd-tree-da19cf9493fb)\n",
            "url": "https://gauravsarma.com/posts/2020-05-30_BKD-trees--used-in-Elasticsearch-40e8afd2a1a4",
            "title": "BKD trees, used in Elasticsearch",
            "summary": "I had worked on Elasticsearch back in 2015, when it was more known for its text searching capabilities using inverted indexes.  As I looked to pick it up again last year for another project, I saw that Elasticsearch had added core support for other data types from text like numbers, IP addresses, geospatial data types, etc...",
            "date_modified": "2020-05-30T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2020-05-24_Deep-Dive-into-Maglev--Google-s-Load-Balancer-f5fa943d578c",
            "content_html": "\nI recently heard about Maglev, the load balancer that Google uses in front of most of its services. I wanted to get a short gist on the matter to understand the reason why Google had to create its own load balancer and the optimizations that they took in order to actually run a load balancer at Google’s scale. To my surprise, I couldn’t find many articles which actually brought out the main reasons for Maglev’s existence. I had no other option but to go through the research paper submitted by the Google team. This post looks to list down the elements in the load balancer which actually make it what it is to help other readers ramp up their knowledge in a short span.\n\nA brief introduction about Maglev and some basic design principles adopted by the authors:\n\n*   Maglev is a software load balancer compared to multiple hardware load balancers to leverage the flexibility of software\n*   Maglev runs on commodity hardware similar to how most of Google’s infrastructure works\n*   Maglev is a distributed scale-out load balancer, meaning it scales by adding nodes to the cluster compared with other load balancers scaling up their machines and deploying clusters only in high availability modes\n*   Maglev uses connection tuples (source IP, destination IP, source port, destination port, protocol) to redirect an user to the appropriate Maglev instance where it keeps a track of the connection and the backend services\n*   Maglev looks to pass traffic at line rate, which is currently 10Gbps, limited by the NICs in their current machines\n*   Maglev assumes that the incoming packets will be smaller in size which means that packet fragmentation, though possible, will not be the norm and thus hashing based on the connection tuple becomes optimal\n*   Maglev assumes that the outgoing packets can be larger in size which means Maglev adopts Direct Server Return (DSR), which is a standard way for load balancers to offload the load to the actual servers instead of the load balancer bearing the brunt\n*   Maglev uses a Maglev hashing, which is derived from Consistent Hashing. Consistent hashing is useful to ensure that traffic restructuring across the cluster is limited when nodes crash. Though important for a load balancer, Maglev hashing gives a higher priority to ensure that the load is distributed across all instances.\n*   Maglev tries to avoid cross thread synchronizations to avoid the performance complexity of maintaining synchronous data structures\n*   Maglev keeps track of the health of the backend services and use this data to select the backend service for the required traffic\n\n![](/img/maglev_1.png)\n\nSome terminologies and modules listing before we start\n\n*   Backend Service\n*   Google Routers\n*   Maglev Controller\n*   Maglev Forwarder\n*   Magler Steering module: Part of Maglev Forwarder\n*   Magler Multiplexer module: Part of Maglev Forwarder\n\nEach backend service of Google, which can Gmail, youtube, etc hosts a VIP (virtual IP). Maglev broadcasts the VIP to the Google Router sitting in front of it and the Google Router broadcasts it to the Google backbone which in turn publishes the networks to the ISPs. In case of multiple shards of the same cluster, the VIP distribution is done accordingly based on performance and isolation decisions.\n\nSteering Module\n\n*   Calculates the 5 tuple hash of the packet and assigns it to receive queues which are listened to by packet rewriter threads\n*   Packet rewriter threads ensures that the packet belongs to the VIP otherwise is dropped\n*   The packet rewriter thread then calculates the connection hash again and checks whether it’s an existing connection or a new connection. Each packet rewriter thread manages its own connection hash association with the backend\n*   Once the backend is found, the packet rewriter thread encapsulates the packet using GRE and sends it to the Transmission Queue\n*   In cases where a particular receive queue becomes full, the Steering module resorts to round robin scheduling instead of connection hashing to ensure that the load is evenly distributed\n\nMuxing module\n\n*   The muxing module listens on the transmission queues, and forwards the packets towards the NIC\n\n**Fast Packet Processing**\n\n*   Maglev is an userspace application\n*   In a normal linux server, packets are received by the kernel and de-encapsulated or encapsulated (based on the direction) layer by layer with possible memcopy at various places. In order for Maglev to operate on a standard linux server, the kernel would have to copy packets back and forth the Maglev service which would be computationally expensive\n*   Since Maglev’s functionalities are very narrow, the authors took apart the entire Linux kernel networking stack and replaced it with their own packet processing logic to avoid redundant checks\n*   Maglev prevents any cases of data copying of packets entirely to prevent memory bloating and saves CPU cycles\n*   Maglev preallocates the entire packet pool depending on the instance size. All Maglev components use pointers towards the packets in the packet pool to maintain their business logic\n*   There are multiple pointers types which help the Maglev forwarder components to maintain state\n*   Received — When the packets are received from the NIC\n*   Processed — Steering module assigns the packet to the packet writer threads\n*   Reserved — Collects the unused packets and stores in the reserved pool\n*   Sent — The packets are sent by the NIC\n*   Ready — Muxing module sends the packets to the NIC\n*   Batch operations are preferred whenever possible to minimize boundary-crossing operations\n*   Each packet rewriter threads runs on a single CPU to prevent CPU multiplexing and context switches\n\nReferences:\n\n*   [https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf)\n*   [https://medium.com/martinomburajr/maglev-the-load-balancer-behind-googles-infrastructure-architectural-overview-part-1-3-3b9aab736f40](https://medium.com/martinomburajr/maglev-the-load-balancer-behind-googles-infrastructure-architectural-overview-part-1-3-3b9aab736f40)\n*   [https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/)\n",
            "url": "https://gauravsarma.com/posts/2020-05-24_Deep-Dive-into-Maglev--Google-s-Load-Balancer-f5fa943d578c",
            "title": "Deep Dive into Maglev, Google’s Load Balancer",
            "summary": "I recently heard about Maglev, the load balancer that Google uses in front of most of its services.  I wanted to get a short gist on the matter to understand the reason why Google had to create its own load balancer and the optimizations that they took in order to actually run a load balancer at Google’s scale...",
            "date_modified": "2020-05-24T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2019-12-14_Kubernetes-meets-SD-WAN-29376a974de2",
            "content_html": "\nKubernetes is the de-facto container management system for all sorts of distributed workloads. Known for its extensibility and community support, there are numerous plugins for multiple use cases.\n\nThe most unpredictable element for any distributed system is the network. A distributed system is as strong as its weakest link. As networks go down, it leads to various well known problems like thundering herd, split brain, etc. Most applications are today built with the assumption that anything and everything can go down. Though this leads to robust applications, the major reason for having it in-built in the application is that we seldom have control over the network.\n\nWith the recent introduction of [AWS Outposts](https://aws.amazon.com/blogs/aws/aws-outposts-now-available-order-your-racks-today/) and [AWS Transit Gateway](https://aws.amazon.com/blogs/aws/new-for-aws-transit-gateway-build-global-networks-and-centralize-monitoring-using-network-manager/) , it is evident that AWS is keen on moving to private data centers. There was introduction of [AWS Fargate](https://aws.amazon.com/fargate/) which means support for serverless containers in the same annual event. Though both the items seem contradictory to each other at the first glance, AWS is trying to tie up the elements over which it has very less control. As enterprises adopt this approach, there will be massive explosion of hybrid heterogeneous infrastructure deployments.\n\nAs the adoption of hybrid infrastructure deployment grows, infrastructure vendors like AWS, Azure will have to partner with networking companies like Cisco and Juniper as is evident by [this link](https://www.sdxcentral.com/articles/news/cisco-pushes-aci-to-aws-and-azure-embraces-data-center-anywhere-strategy/2019/01/). Networking companies will leverage their SD-WAN platforms for this transition and only companies with SD-WAN at their core will thrive compared to those providing ad-hoc SD-WAN features.\n\nSome prominent and core features of SD-WAN platforms are:\n\n*   VPN/VRFs over the internet and private networks alike\n*   Ability to identify complex applications\n*   Remote application breakout from a data center device\n*   Link aggregation to leverage data capable and quality capable links\n*   Application steering based on link quality and capacity\n*   Per packet load balancing to leverage the most of all the links\n*   Security features like IPS/IDS, Content Filtering,etc\n*   and many many more….\n\n> The most important point of all, is that for a SD-WAN network, every action is a function.\n\nFor a better introduction to SD-WAN, head over to [Lavelle Networks’ website](https://lavellenetworks.com/).\n\nAs the enterprises start discovering the advantages of moving to the cloud with applications running and scaling on serverless containers in their private and public datacenters, they will be encouraged and eager to move their legacy applications to the new way of deploying applications.\n\nIt is at that point of time where we will see a rise in networking integrations with Kubernetes.\n\nThis post seeks to jot the advantages of having a SD-WAN based networking runtime for Kubernetes. Kubernetes comes with CNI (Container Network Interface) by default and there are many worthwhile plugins like Calico, Kubenet, etc which does a fantastic job.\n\nImagine the following scenario…\n\n1.  You have deployed a crucial configuration database service with critical SLAs which requires minimal latency and is not bandwidth hungry\n2.  There is a monitoring service which has also been configured on the same node which is not latency sensitive and is bandwidth hungry\n\nWith a SD-WAN network, you can define specific QoS (Quality of Service) policies at runtime to always prioritize the database traffic in case of contention. There can be load balancing policies which are able to steer the database traffic to a low capacity high quality MPLS link with lesser drops and jitter. The monitoring service traffic can be configured to exit via a broadband link as it can bear the uncertainties of the internet. No more manually configuring policies on the nodes. Just define the database app and make an API call to affect your application traffic over the entire network irrespective of cloud environment.\n\nYou can configure network security groups for your applications across cloud service providers like AWS, Azure and also for private deployments, with just a function call.\n\nIt is amazing how almost every SD-WAN feature can be used to leverage a seamless and robust experience with how applications are delivered to the end user.\n\nThe network has always been treated as a second class citizen when compared to compute or storage. Similar to the introduction of memory optimized or compute optimized instances in AWS and other popular CSPs, there will be a need for the introduction of networking as a service for different application needs. Most applications today are data intensive applications which needs a reliable networking backbone to provide the optimal experience to the users.\n\nAs the industry moves towards the elimination of distinction between running applications in the private or public datacenters and treating containers as the atomic entity to deploy applications, it is inevitable that Kubernetes and SD-WAN will meet each other somewhere down the road and they will live on happily ever after.",
            "url": "https://gauravsarma.com/posts/2019-12-14_Kubernetes-meets-SD-WAN-29376a974de2",
            "title": "Kubernetes meets SD-WAN",
            "summary": "Kubernetes is the de-facto container management system for all sorts of distributed workloads.  Known for its extensibility and community support, there are numerous plugins for multiple use cases...",
            "date_modified": "2019-12-14T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2019-08-11_Packet-sniffer-and-parser-in-C-c86070081c38",
            "content_html": "\nThis post will cover a concise implementation of how to open live pcap sessions on any network device and reading the incoming packets on that interface. In the end, the post will display how to parse the packets appropriately to get the required information.\n\nWe use libpcap in the implementation to listen to the packets on the network device. The same can also be used to directly read from a pcap file instead of live sessions. In the implementation, we have turned the promiscuous mode to true so that we are able to listen to packets also not destined for the machine on which the session is being run on.\n\n```c\n#include <stdio.h>\n#include <pcap.h>\n\nint main() {\n    char *dev = argv[1], errbuf[PCAP_ERRBUF_SIZE];\n    int BUFSIZE = 1024;\n    pcap_t *handle;\n    struct bpf_program fp;\n    char filter_exp[] = \"port 22\";\n    bpf_u_int32 mask;\n    bpf_u_int32 net;\n    struct pcap_pkthdr header;\n    const u_char *packet;\n    \n    dev = pcap_lookupdev(errbuf);\n\n    if (dev == NULL) {\n        fprintf(stderr, \"Couldn't find default device: %s\\n\", errbuf);\n        return(2);\n    }\n\n    printf(\"\\nDevice: %s\\n\", dev);\n\n    handle = pcap_open_live(dev, BUFSIZE, 1, 1000, errbuf);\n\n    if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {\n        fprintf(stderr, \"Couldn't parse filter %s: %s\\n\", filter_exp, pcap_geterr(handle));\n        return(2);\n    }\n\n    if (pcap_setfilter(handle, &fp) == -1) {\n        fprintf(stderr, \"Couldn't install filter %s: %s\\n\", filter_exp, pcap_geterr(handle));\n        return(2);\n    }\n\n    packet = pcap_next(handle, &header);\n    printf(\"Jacked a packet with length of [%d]\\n\", header.len);\n    return (0);\n}\n```\nIn the above gist, we use bpf (Berkeley Packet Filters) along with libpcap to compile the required filters. BPF or eBPF can be used to run secure sandboxed code directly in the kernel. We can use the bpf filters to listen for the kind of traffic that we are interested in. In the above gist, we use this line to listen on port 22 which is mainly used for SSH.\n\n> char filter\\_exp\\[\\] = “port 22”;\n\nOnce the filters are compiled, we set the filter on the libpcap session to filter the packets appropriately.\n\nWe are using **_pcap\\_next_** here to get the next packet. Ideally, we should use **_pcap\\_loop_** so that can read the packets from the session in a loop.\n\nTo compile and run the code, try the following\n\n> gcc -o /tmp/pcapper packet\\_capture.c -lpcap\n\nNow that we have received the packet, we move on to the techniques used to parse the packet to retrieve the information required.\n\n```c\n#include <netinet/in.h>\n#include <pcap.h>\n\n/* Ethernet addresses are 6 bytes */\n#define ETHER_ADDR_LEN  6\n#define SIZE_ETHERNET 14\n\n/* Ethernet header */\nstruct sniff_ethernet {\n    u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */\n    u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */\n    u_short ether_type; /* IP? ARP? RARP? etc */\n};\n\n/* IP header */\nstruct sniff_ip {\n    u_char ip_vhl;      /* version << 4 | header length >> 2 */\n    u_char ip_tos;      /* type of service */\n    u_short ip_len;     /* total length */\n    u_short ip_id;      /* identification */\n    u_short ip_off;     /* fragment offset field */\n#define IP_RF 0x8000        /* reserved fragment flag */\n#define IP_DF 0x4000        /* dont fragment flag */\n#define IP_MF 0x2000        /* more fragments flag */\n#define IP_OFFMASK 0x1fff   /* mask for fragmenting bits */\n    u_char ip_ttl;      /* time to live */\n    u_char ip_p;        /* protocol */\n    u_short ip_sum;     /* checksum */\n    struct in_addr ip_src,ip_dst; /* source and dest address */\n};\n#define IP_HL(ip)       (((ip)->ip_vhl) & 0x0f)\n#define IP_V(ip)        (((ip)->ip_vhl) >> 4)\n\n/* TCP header */\ntypedef u_int tcp_seq;\n\nstruct sniff_tcp {\n    u_short th_sport;   /* source port */\n    u_short th_dport;   /* destination port */\n    tcp_seq th_seq;     /* sequence number */\n    tcp_seq th_ack;     /* acknowledgement number */\n    u_char th_offx2;    /* data offset, rsvd */\n#define TH_OFF(th)  (((th)->th_offx2 & 0xf0) >> 4)\n    u_char th_flags;\n#define TH_FIN 0x01\n#define TH_SYN 0x02\n#define TH_RST 0x04\n#define TH_PUSH 0x08\n#define TH_ACK 0x10\n#define TH_URG 0x20\n#define TH_ECE 0x40\n#define TH_CWR 0x80\n#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG|TH_ECE|TH_CWR)\n    u_short th_win;     /* window */\n    u_short th_sum;     /* checksum */\n    u_short th_urp;     /* urgent pointer */\n};\n\nint main(int argc, char *argv[]) {\n\n    printf(\"Launching Packet Capture\");\n\n    char *dev = argv[1], errbuf[PCAP_ERRBUF_SIZE];\n    int BUFSIZE = 1024;\n    pcap_t *handle;\n    struct bpf_program fp;\n    char filter_exp[] = \"port 80\";\n    bpf_u_int32 mask;\n    bpf_u_int32 net;\n    struct pcap_pkthdr header;\n    const u_char *packet;\n    const struct sniff_ethernet *ethernet; /* The ethernet header */\n    const struct sniff_ip *ip; /* The IP header */\n    const struct sniff_tcp *tcp; /* The TCP header */\n    const char *payload; /* Packet payload */\n\n    u_int size_ip;\n    u_int size_tcp;\n\n\n    dev = pcap_lookupdev(errbuf);\n\n    if (dev == NULL) {\n        fprintf(stderr, \"Couldn't find default device: %s\\n\", errbuf);\n        return(2);\n    }\n\n    printf(\"\\nDevice: %s\\n\", dev);\n\n    handle = pcap_open_live(dev, BUFSIZE, 1, 1000, errbuf);\n\n    if (pcap_compile(handle, &fp, filter_exp, 0, net) == -1) {\n        fprintf(stderr, \"Couldn't parse filter %s: %s\\n\", filter_exp, pcap_geterr(handle));\n        return(2);\n    }\n\n    if (pcap_setfilter(handle, &fp) == -1) {\n        fprintf(stderr, \"Couldn't install filter %s: %s\\n\", filter_exp, pcap_geterr(handle));\n        return(2);\n    }\n\n    packet = pcap_next(handle, &header);\n    printf(\"Jacked a packet with length of [%d]\\n\", header.len);\n\n    printf(\"Parsing Ethernet header\\n\");\n\n    ethernet = (struct sniff_ethernet*)(packet);\n    ip = (struct sniff_ip*)(packet + SIZE_ETHERNET);\n    size_ip = IP_HL(ip)*4;\n\n    if (size_ip < 20) {\n        printf(\"   * Invalid IP header length: %u bytes\\n\", size_ip);\n        return (0);\n    }\n        printf(\"Parsing TCP header\\n\");\n\n    tcp = (struct sniff_tcp*)(packet + SIZE_ETHERNET + size_ip);\n    size_tcp = TH_OFF(tcp)*4;\n    if (size_tcp < 20) {\n        printf(\"   * Invalid TCP header length: %u bytes\\n\", size_tcp);\n        return (0);\n    }\n\n    printf(\"Ether Type %d\\n\", ethernet->ether_type);\n    printf(\"Src Host %d\\n\", ethernet->ether_shost[ETHER_ADDR_LEN]);\n    printf(\"Dst Host %d\\n\", ethernet->ether_dhost[ETHER_ADDR_LEN]);\n    printf(\"Src Port %d\\n\", tcp->th_sport);\n    printf(\"Dst Port %d\\n\", tcp->th_dport);\n    printf(\"Protocol %d\\n\", ip->ip_p);\n\n    payload = (u_char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);\n\n    pcap_close(handle);\n\n    return(0);\n}\n```\n\nIn order to understand the above gist, it is important to understand pointer arithmetics and how information is laid across the memory.\n\nWe can see that the packet variable is a pointer of u\\_char which points to the starting byte of the data pointed by packet.\n\n> const u\\_char \\*packet;\n\nIn order to read the ethernet header, we need to typecast the packet to the ethernet variable of type sniff\\_ethernet\n\n> ethernet = (struct sniff\\_ethernet\\*)(packet);\n\nThe ethernet headers are of fixed length always. To parse the next layer, we do the following.\n\n> ip = (struct sniff\\_ip\\*)(packet + SIZE\\_ETHERNET);\n\nThe IP header starts from (packet + SIZE\\_ETHERNET). The TCP headers can be derived from the same format and so on.\n\nYou can also implement custom tunneling protocols via this format and parse the packet accordingly.\n\nThe entire post has been heavily derived from [https://www.tcpdump.org/pcap.html](https://www.tcpdump.org/pcap.html).\n\n_Please let me know if you have any queries regarding the article. Happy reading!!_\n\nReferences\n\n*   [https://www.tcpdump.org/pcap.html](https://www.tcpdump.org/pcap.html)\n",
            "url": "https://gauravsarma.com/posts/2019-08-11_Packet-sniffer-and-parser-in-C-c86070081c38",
            "title": "Packet sniffer and parser in C",
            "summary": "This post will cover a concise implementation of how to open live pcap sessions on any network device and reading the incoming packets on that interface.  In the end, the post will display how to parse the packets appropriately to get the required information...",
            "date_modified": "2019-08-11T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2019-06-09_Hooking-in-a-Stats-module-in-Rails-Active-Record-942c4fdbc0a9",
            "content_html": "\nMost full-fledged web frameworks come with ORMs built in. ORMs or Object Relational Mappings help to map the programming language data structures to actual data stores without having to worry about the underlying data source.\n\nThis helps to abstract the data store interfaces which helps in migrating to a separate data store more of a configuration knob and doesn’t require any change to the actual codebase. ORMs also help in connection pooling, managing database connections, validations, etc.\n\nIn this post, we will be assuming a base knowledge of ORMs and we will be looking at how to integrate a Statistics module into Active Record, an ORM layer used popularly by Ruby on Rails.\n\n**Problem**\n\nWe want to keep track of the CRUD operations happening at a model layer and we want to keep the stats layer separate of the data store.\n\n**Solution**\n\nThe first thing that I do in any Rails project is to define a base class which all models are derived from. The base model is derived from **_ActiveRecord::Base_** which is the class which defines the methods available to the models. Let’s say we have a **_User_** model.\n\n```ruby\nclass ApplicationModel < ActiveRecord::Base\n  self.abstract_class = true\nend\n```\n\n```ruby\nclass User < ApplicationModel\nend\n```\n\nWe need to now override the methods provided by Active Record for CRUD operations.\n\nSome of the common functions are\n\n*   create\n*   save\n*   update\n*   destroy\n*   delete\n*   find\n*   find\\_by\n*   where\n*   all\n\nFor people who have studied about OOPS, we will be overriding the methods defined in the ActiveRecord class. We will be overriding the create method and incrementing the counter stored inside a hash with the model name as the base key and the action as the nested key.\n\n```ruby\nclass ApplicationModel < ActiveRecord::Base\n\n  self.abstract_class = true\n  @@stats = {}\n  \n  def self.create args\n    \n    @@stats[self.to_s] = {} if not @@stats[self.to_s]\n    @@stats[self.to_s][__method__.to_s] = 0 if not @@stats[self.to_s][__method__.to_s]\n\n    @@stats[self.to_s][__method__.to_s] += 1\n    \n    super\n  end\n \nend\n```\n\nIn the above gist, we are storing the counters in the stats variable.\n\nSo if you call **_User.create(name: “user1”)_**, the stats class variable will have the following representation.\n\n> **{“User”=>{“create”=>0}}**\n\nWe can override most of the other methods in the same manner. For example, if we want to override the **_all_** method, we can define a **_self.all_** method in the application\\_model.rb class.\n\nHowever, the approach did we have taken here is not fool proof. There are usually multiple record chaining statements that we usually have to perform which is not derived on the ActiveRecord::Base class.\n\nFor example,\n\n> _User.includes(:customer).where(name: “user1”)_\n\nHere, the **_User.includes(:customer)_** returns a **_ActiveRecord::Relation_** object and the same is returned by the **_.where(name: “user1”)_** chain as well.\n\nFrom the ActiveRecord::Relation docs, it has the following methods.\n\n> CLAUSE\\_METHODS=\\[:where, :having, :from\\] INVALID\\_METHODS\\_FOR\\_DELETE\\_ALL=\\[:distinct, :group, :having\\] MULTI\\_VALUE\\_METHODS=\\[:includes, :eager\\_load, :preload, :select, :group, :order, :joins, :left\\_outer\\_joins, :references, :extending, :unscope\\] SINGLE\\_VALUE\\_METHODS=\\[:limit, :offset, :lock, :readonly, :reordering, :reverse\\_order, :distinct, :create\\_with, :skip\\_query\\_cache\\]\n\nIn order to override the above methods, we can override the ActiveRecord::Relation class by doing the following.\n```ruby\nclass ApplicationModel < ActiveRecord::Base\n  \n  self.abstract_class = true\n  \n  class ActiveRecord::Relation\n     \n    def where args\n      # Do your stuff here\n      super\n    end\n    \n  end\n  \nend\n```\n\nThe above set should be enough to override the ORM such that the stats layer can be reliably built in.\n\nWe also need to take care of ruby’s metaprogramming aspects in the stats module. Maybe that’s a blog post for another day.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\nReferences\n\n*   [https://api.rubyonrails.org/classes/ActiveRecord/Relation.html](https://api.rubyonrails.org/classes/ActiveRecord/Relation.html)\n",
            "url": "https://gauravsarma.com/posts/2019-06-09_Hooking-in-a-Stats-module-in-Rails-Active-Record-942c4fdbc0a9",
            "title": "Hooking in a Stats module in Rails Active Record",
            "summary": "Most full-fledged web frameworks come with ORMs built in.  ORMs or Object Relational Mappings help to map the programming language data structures to actual data stores without having to worry about the underlying data source...",
            "date_modified": "2019-06-09T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2019-05-23_Comparison-of-net-http-and-httprouter-df8edd1004e7",
            "content_html": "\nThis post will mainly revolve around the comparison between different implementations of Routers in the HTTP based frameworks.\n\nLet’s first go over what routers are in the context of a HTTP framework.\n\nMost frameworks today implement the MVC pattern or at least something similar to it. Even lightweight frameworks which don’t actually implement any design pattern have multiple built in features which usually doesn’t require custom logic by the application developer.\n\nWhen a request is received by the framework handler, the framework reads the URI path and dispatches the request to the user defined action along with the request context in the request object and expects and response object in return. It is the responsibility of the router to read the request URI and call the appropriate handler.\n\nIn this blog post, we are going to compare the routers in 2 packages:\n\n*   [net/http](https://golang.org/pkg/net/http/) (standard golang package)\n*   [httprouter](https://github.com/julienschmidt/httprouter)\n\n### Net/http\n\nGolang’s standard library comes with a pretty powerful package to handle and build HTTP applications. The best part of the package is that the modules can be plugged in and changed as required.\n\nBefore starting the HTTP server, it is required to register the handlers with the appropriate URI pattern. When a URI pattern is added, the mapping is added to a _map_ data structure which is equivalent to a hashmap.\n\nWhen a request is received by the _http_ module, it looks in the _mux_ data structure which contains a map containing the pattern and the registered handler.\n\n![](/img/httprouter_1.png)\n\nIn order to find the required handler for the received pattern, it calls the **_match_** method defined on the _ServeMux_ structure. In the below method, it first checks for a direct comparison in the _mux.m_ map. If no items are found, it tries to find the longest valid match and calls the appropriate handler.\n\n![](/img/httprouter_2.png)\n\n### Httprouter\n\nThere are multiple popular frameworks built on top of the _httprouter_ package like Gin, Ace, Neko, etc. The _httprouter’s_ [github](https://github.com/julienschmidt/httprouter) page has an excellent description of the way the module handles the routing.\n\nThe main distinction from the _net/http_ package is that _httprouter_ uses compact prefix trees to find the appropriate handler to the URI pattern.\n\n![](/img/httprouter_3.png)\n![](/img/httprouter_4.png)\n\nWhen a request is received by the _httprouter_ module, it finds the appropriate tree based on the API method and calls the **_addRoute_** method defined on the _node_ struct.\n\n![](/img/httprouter_5.png)\n\nTaking a sample out of the github page, a tree is formed based on the registered handlers.\n\n![](/img/httprouter_6.png)\n\n### Comparison between net/http and httprouter Routers\n\nSince it tries to find the longest valid match, the results from the _match_ method may be confusing when there are many similar patterns defined. This has caused confusion for developers based on the issues raised for the same.\n\nSince there are also multiple entries for each pattern defined, the memory usage will also be a little higher compared to _httprouter_ when there are multiple similar patterns.\n\nIt also has to spend more compute compared to _httprouter_ to reach the actual handler as the _httprouter_ router can have a direct walk to the required handler making the correct and more predictable decision.\n\n**_I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!_**\n\nReferences\n\n*   [https://github.com/julienschmidt/httprouter](https://github.com/julienschmidt/httprouter)\n*   [https://golang.org/pkg/net/http/](https://golang.org/pkg/net/http/)\n",
            "url": "https://gauravsarma.com/posts/2019-05-23_Comparison-of-net-http-and-httprouter-df8edd1004e7",
            "title": "Comparison of net/http and httprouter",
            "summary": "This post will mainly revolve around the comparison between different implementations of Routers in the HTTP based frameworks.  Let’s first go over what routers are in the context of a HTTP framework...",
            "date_modified": "2019-05-23T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2018-03-02_Linux-Address-Space-45e1d0aa8c86",
            "content_html": "\nLinux processes interact with virtual memory and not the physical memory. Every process has a notion that it is the only process running in the system and hence, has unlimited access to the memory present in the system.\n\nVarious processes may have the same virtual memory address space but it doesn’t collide because the kernel takes care of the virtual memory to physical memory mapping. An example when a process may have to share it’s virtual memory is when it spawns threads, or threads of execution.\n\nThe process doesn’t have permission to access certain parts of the address space which is reserved by the kernel. A process can access a memory address only if it is in the valid area. Memory addresses can have associated permissions that a process must respect. If this is not respected by the process, then the kernel throws a _Segmentation Fault_ message and kills the process.\n\nMemory areas may have the following content:\n\n*   Executable file’s code, which is known as the _text section_\n*   Executable file’s initialized global variables, which is known as the _data section_\n*   Uninitialized variables called the _bss (block started by symbol) section_\n*   _Stack_\n*   _Heap_\n\n**Memory Descriptor:**\n\nIn the linux kernel code, the processes’ address space can be defined in the following data structure.\n\n```\nstruct mm\\_struct {  \n        struct vm\\_area\\_struct  \\*mmap;               /\\* list of memory areas \\*/  \n        struct rb\\_root         mm\\_rb;               /\\* red-black tree of VMAs \\*/  \n        struct vm\\_area\\_struct  \\*mmap\\_cache;         /\\* last used memory area \\*/  \n        unsigned long          free\\_area\\_cache;     /\\* 1st address space hole \\*/  \n        pgd\\_t                  \\*pgd;                /\\* page global directory \\*/  \n        atomic\\_t               mm\\_users;            /\\* address space users \\*/  \n        atomic\\_t               mm\\_count;            /\\* primary usage counter \\*/  \n        int                    map\\_count;           /\\* number of memory areas \\*/  \n        struct rw\\_semaphore    mmap\\_sem;            /\\* memory area semaphore \\*/  \n        spinlock\\_t             page\\_table\\_lock;     /\\* page table lock \\*/  \n        struct list\\_head       mmlist;              /\\* list of all mm\\_structs \\*/  \n        unsigned long          start\\_code;          /\\* start address of code \\*/  \n        unsigned long          end\\_code;            /\\* final address of code \\*/  \n        unsigned long          start\\_data;          /\\* start address of data \\*/  \n        unsigned long          end\\_data;            /\\* final address of data \\*/  \n        unsigned long          start\\_brk;           /\\* start address of heap \\*/  \n        unsigned long          brk;                 /\\* final address of heap \\*/  \n        unsigned long          start\\_stack;         /\\* start address of stack \\*/  \n        unsigned long          arg\\_start;           /\\* start of arguments \\*/  \n        unsigned long          arg\\_end;             /\\* end of arguments \\*/  \n        unsigned long          env\\_start;           /\\* start of environment \\*/  \n        unsigned long          env\\_end;             /\\* end of environment \\*/  \n        unsigned long          rss;                 /\\* pages allocated \\*/  \n        unsigned long          total\\_vm;            /\\* total number of pages \\*/  \n        unsigned long          locked\\_vm;           /\\* number of locked pages \\*/  \n        unsigned long          def\\_flags;           /\\* default access flags \\*/  \n        unsigned long          cpu\\_vm\\_mask;         /\\* lazy TLB switch mask \\*/  \n        unsigned long          swap\\_address;        /\\* last scanned address \\*/  \n        unsigned               dumpable:1;          /\\* can this mm core dump? \\*/  \n        int                    used\\_hugetlb;        /\\* used hugetlb pages? \\*/  \n        mm\\_context\\_t           context;             /\\* arch-specific data \\*/  \n        int                    core\\_waiters;        /\\* thread core dump waiters \\*/  \n        struct completion      \\*core\\_startup\\_done;  /\\* core start completion \\*/  \n        struct completion      core\\_done;           /\\* core end completion \\*/  \n        rwlock\\_t               ioctx\\_list\\_lock;     /\\* AIO I/O list lock \\*/  \n        struct kioctx          \\*ioctx\\_list;         /\\* AIO I/O list \\*/  \n        struct kioctx          default\\_kioctx;      /\\* AIO default I/O context \\*/  \n};\n```\n\nThe number of processes/threads using the same address space can be checked via the _mm\\_users_ variable. The _mmap_ and _mm\\_rb_ point to the memory addresses in the address space. Both the variables point to the same information but in different representations. _mmap_ is a linked list whereas _mm\\_rb_ is a red black tree. This is done so that the _mmap_ can be used for simple traversal need and the _mm\\_rb_ can be used for searching purposes.\n\nThe kernel represents the process address space via the memory descriptor. The memory descriptor of the process is pointed to via the _mm_ field in the _task\\_struct_ structure.\n\n```\nstruct task\\_struct {\n\n  volatile long        state;          /\\* -1 unrunnable, 0 runnable, >0 stopped \\*/  \n  long                 counter;  \n  long                 priority;  \n  unsigned             long signal;  \n  unsigned             long blocked;   /\\* bitmap of masked signals \\*/  \n  unsigned             long flags;     /\\* per process flags, defined below \\*/  \n  int errno;  \n  long                 debugreg\\[8\\];    /\\* Hardware debugging registers \\*/  \n  struct exec\\_domain   \\*exec\\_domain;\n\n  struct linux\\_binfmt  \\*binfmt;  \n  struct task\\_struct   \\*next\\_task, \\*prev\\_task;  \n  struct task\\_struct   \\*next\\_run,  \\*prev\\_run;  \n  unsigned long        saved\\_kernel\\_stack;  \n  unsigned long        kernel\\_stack\\_page;  \n  int                  exit\\_code, exit\\_signal;\n\n  unsigned long        personality;  \n  int                  dumpable:1;  \n  int                  did\\_exec:1;  \n  int                  pid;  \n  int                  pgrp;  \n  int                  tty\\_old\\_pgrp;  \n  int                  session;  \n  /\\* boolean value for session group leader \\*/  \n  int                  leader;  \n  int                  groups\\[NGROUPS\\];\n\n  struct task\\_struct   \\*p\\_opptr, \\*p\\_pptr, \\*p\\_cptr,   \n                       \\*p\\_ysptr, \\*p\\_osptr;  \n  struct wait\\_queue    \\*wait\\_chldexit;    \n  unsigned short       uid,euid,suid,fsuid;  \n  unsigned short       gid,egid,sgid,fsgid;  \n  unsigned long        timeout, policy, rt\\_priority;  \n  unsigned long        it\\_real\\_value, it\\_prof\\_value, it\\_virt\\_value;  \n  unsigned long        it\\_real\\_incr, it\\_prof\\_incr, it\\_virt\\_incr;  \n  struct timer\\_list    real\\_timer;  \n  long                 utime, stime, cutime, cstime, start\\_time;\n\n  unsigned long        min\\_flt, maj\\_flt, nswap, cmin\\_flt, cmaj\\_flt, cnswap;  \n  int swappable:1;  \n  unsigned long        swap\\_address;  \n  unsigned long        old\\_maj\\_flt;    /\\* old value of maj\\_flt \\*/  \n  unsigned long        dec\\_flt;        /\\* page fault count of the last time \\*/  \n  unsigned long        swap\\_cnt;       /\\* number of pages to swap on next pass \\*/\n\n  struct rlimit        rlim\\[RLIM\\_NLIMITS\\];  \n  unsigned short       used\\_math;  \n  char                 comm\\[16\\];\n\n  int                  link\\_count;  \n  struct tty\\_struct    \\*tty;  \n  struct sem\\_undo      \\*semundo;  \n  struct sem\\_queue     \\*semsleeping;  \n  struct desc\\_struct \\*ldt;  \n  struct thread\\_struct tss;  \n  struct fs\\_struct     \\*fs;  \n  struct files\\_struct  \\*files;  \n  struct mm\\_struct     \\*mm;  \n  struct signal\\_struct \\*sig;  \n#ifdef \\_\\_SMP\\_\\_  \n  int                  processor;  \n  int                  last\\_processor;  \n  int                  lock\\_depth;       \n#endif     \n};\n```\n\nThe _current->mm_ points to the memory descriptor of the process. The _copy\\_mm()_ is used to copy the parent’s memory descriptor to the child during _fork()._ Each process receives a unique _mm\\_struct_, hence a unique address space. In some cases where the address space is shared by multiple processes, they are known as threads and are done by calling the _clone()_with _CLONE\\_VM_ flag set. This is why threads are just another process according to the linux kernel who happen to share the address space i.e some of its resources with another process.\n\nWhen the process exits, it calls the _exit\\_mm()_ function which in turn calls _free\\_mm()_ if the reference count of the process is 0 and does some housekeeping and statistics update.\n\n**Virtual memory areas**\n\nThe memory areas are represented in the kernel code via the _vm\\_area\\_struct_, which are also called virtual memory areas.\n\n```\nstruct vm\\_area\\_struct {  \n        struct mm\\_struct             \\*vm\\_mm;        /\\* associated mm\\_struct \\*/  \n        unsigned long                vm\\_start;      /\\* VMA start, inclusive \\*/  \n        unsigned long                vm\\_end;        /\\* VMA end , exclusive \\*/  \n        struct vm\\_area\\_struct        \\*vm\\_next;      /\\* list of VMA's \\*/  \n        pgprot\\_t                     vm\\_page\\_prot;  /\\* access permissions \\*/  \n        unsigned long                vm\\_flags;      /\\* flags \\*/  \n        struct rb\\_node               vm\\_rb;         /\\* VMA's node in the tree \\*/  \n        union {         /\\* links to address\\_space->i\\_mmap or i\\_mmap\\_nonlinear \\*/  \n                struct {  \n                        struct list\\_head        list;  \n                        void                    \\*parent;  \n                        struct vm\\_area\\_struct   \\*head;  \n                } vm\\_set;  \n                struct prio\\_tree\\_node prio\\_tree\\_node;  \n        } shared;  \n        struct list\\_head             anon\\_vma\\_node;     /\\* anon\\_vma entry \\*/  \n        struct anon\\_vma              \\*anon\\_vma;         /\\* anonymous VMA object \\*/  \n        struct vm\\_operations\\_struct  \\*vm\\_ops;           /\\* associated ops \\*/  \n        unsigned long                vm\\_pgoff;          /\\* offset within file \\*/  \n        struct file                  \\*vm\\_file;          /\\* mapped file, if any \\*/  \n        void                         \\*vm\\_private\\_data;  /\\* private data \\*/  \n};\n```\n\nIt describes a single memory area over a contiguous interval. Each memory area has certain associated permissions and flags which help to denote the type of memory area — for example, memory-mapped areas or the processes’s user-space stack.\n\nThe _vm\\_mm_ struct points to the corresponding _mm\\_struct_ that it belongs to which confirms the uniqueness of the address space of a process.\n\nAlthough the applications operate on the virtual memory address space, the processors operate on the physical memory. Therefore, whenever an application accesses a virtual memory address, it is first converted to the physical memory, i.e where the data actually resides. This lookup is done via page tables. Virtual memory is divided up into chunks and the index is stored. The index can point to another table or to the physical page.\n\nLinux, by default, maintains 3 levels of page tables to further optimize the page lookup. Even on systems which have no hardware support, it still optimizes the 3 level page table as it is necessary to have indexed page tables for faster lookups.\n\nThe top page table is known as the _Page Global Directory (PGD)_ which contains an array of unsigned long entries. The entry in the PGD point to the PMD.\n\nThe second page table is known as the _Page Middle Directory (PMD)_ which further points to the PTE.\n\nThe _Page Table Entries (PTE)_ point to the actual physical pages.\n\nEvery process has its own page tables and is pointed to the PGD via the _pgd_ data structure in the memory descriptor.\n\nEven after maintain 3 levels of page tables, the lookup can only be so fast as it is vast searchable area. In order to further improve upon this, most processors implement a _Translation Lookaside Buffer (TLB)_ which acts as a hardware cache between virtual to physical mappings. Therefore, if the cache is hit, it returns directly from the TLB or it further processes the virtual to physical memory mapping.\n\n**_Most of the data in the article is inspired by Linux Kernel Development book by Robert Love. This is a must read for anybody who wishes to actually understand the underneath workings of the linux kernel._**\n",
            "url": "https://gauravsarma.com/posts/2018-03-02_Linux-Address-Space-45e1d0aa8c86",
            "title": "Linux Address Space",
            "summary": "Linux processes interact with virtual memory and not the physical memory.  Every process has a notion that it is the only process running in the system and hence, has unlimited access to the memory present in the system...",
            "date_modified": "2018-03-02T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/posts/2017-10-06_A-peek-at-OpenStack-Neutron-8660a6905b2",
            "content_html": "\nFrom the OpenStack official [website](https://www.openstack.org/software/), OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.\n\nToday we will be concentrating more on the detailed working of the networking service of OpenStack, known as [Neutron](https://www.openstack.org/software/releases/ocata/components/neutron).\n\nOpenStack (nova) compute also has a legacy networking service which manages the networking part for it. However, it is not as extensive and flexible you would want if networking is one of the elements on which you want to have control on.\n\nThe article will also give you a better understanding of networking in virtual machines work.\n\nThere are certain prerequisites in order to gain a better understanding of the following article.\n\n[Linux Namespaces](https://en.wikipedia.org/wiki/Linux_namespaces) are a feature of linux via which help to isolate and virtualize the resources of the system. The types of namespaces are process based, networking based, mount based, IPC based, User Id based and control group based. For this article, understanding the networking namespace would suffice.\n\n[Tun/tap interfaces](http://backreference.org/2010/03/26/tuntap-interface-tutorial/) are a feature offered by Linux (and probably by other UNIX-like operating systems) that can do userspace networking, that is, allow userspace programs to see raw network traffic (at the ethernet or IP level) and do whatever they like with it.\n\n[VLAN interfaces](https://en.wikipedia.org/wiki/Switch_virtual_interface) are interfaces which are unique based on their VLAN ID, multiple VLANs can be set on top of a single interface.\n\n[Veth pair](http://www.opencloudblog.com/?p=66) are a feature which is usually used to provide direct connectivity between network namespaces.\n\n[Linux bridge](https://goyalankit.com/blog/linux-bridge) is a layer 2 virtual device that on its own cannot receive or transmit anything unless you bind one or more real devices to it.\n\n[OVS](http://openvswitch.org/) is virtual switch which helps in switching via flow tables and contains a database and daemon in order to match and carry out rules\n\n**Types of Networks:**\n\n*   Local — A local network is a network where the instances can only communicate with the instances in the same compute node if they are in the same network.\n*   Flat — A flat network doesn’t have any segregation of networks based on VLANs.\n*   VLAN — A VLAN network is a network where VLANs are used for segregation of networks.\n*   VXLAN/GRE — VXLAN and GRE are used to create overlay networks (network built on top of a network).\n\nOpenStack neutron has 2 main networking plugins.\n\n*   Linux Bridge\n*   OVS\n\nIn the previous version of OpenStack, only one of the plugins was usable per deployment. However, in the recent releases, both the versions can coexist. In this article, we will focus mainly on the Linux Bridge plugin and a little on the OVS.\n\nWe will now discuss how the different types of networks are implemented in Neutron.\n\n**Local:**\n\n![](/img/openstack_1.png)\n\nIn local networks, the instances don’t have communication with the external world. Hence, there is no physical interface in the bridge and the bridge only contains the tap interfaces which enables communication between the local instances.\n\n**Flat:**\n\n![](/img/openstack_2.png)\n\nIn a flat network, there are no VLAN segregations. Hence, the tap interfaces are directly put in a linux bridge with the physical interface which means that only a single network can exist.\n\n**VLAN:**\n\n![](/img/openstack_3.png)\n\nIn the above diagram, we have a physical interface eth0. Using VLANs, we create two more interfaces eth0.100 and eth0.101 on the interface eth0. We create tap interfaces Tap0, Tap1 and Tap2 for VM1, VM2 and VM3. We put Tap0, Tap1 and eth0.100 in the same bridge and Tap2 and eth0.101 in another bridge. When the traffic goes out of the tap interfaces, onto to the VLAN interfaces, the traffic is tagged with the VLAN ID so that ingress traffic can also be forwarded back to the VM.\n\nThe OVS plugin which Neutron uses is implemented in a different manner. OVS is a virtual switch which basically maintains flow tables which can be chained to others. Depending on the flow (source ip, destination ip, source port, destination port, protocol), it matches the flow with the appropriate action which can be dropping, tagging or sending the packet. It uses veth pairs, patch ports, tap interfaces, provider bridge, integration bridge and physical interfaces to achieve the above mentioned type of networks.\n\nThis is precisely how hypervisors implement networking in our host machines. I hope this article helps you to understand how the hypervisors implement multiple networking strategies like bridged mode, NAT mode and so on.\n\n_I hope you liked the article. I will be coming up with something on OVS as well soon. Please let me know if you have any queries regarding the article. Happy reading!!_\n",
            "url": "https://gauravsarma.com/posts/2017-10-06_A-peek-at-OpenStack-Neutron-8660a6905b2",
            "title": "A peek at OpenStack Neutron",
            "summary": "From the OpenStack official [website](https://www. openstack...",
            "date_modified": "2017-10-06T00:00:00.000Z"
        },
        {
            "id": "https://gauravsarma.com/random/2026-04-21_how-big-tech-interviews-work",
            "content_html": "\nHaving worked at a few big tech companies and having interviewed candidates for them as well, here are some things I wish more candidates knew going in.\n\n### How decisions are actually made\n\nEach interviewer is assigned a round and for every round, there are specific signals they need to collect. For example, in a coding round, they are looking for things like clean code practices, sufficient test coverage and edge case handling, awareness of algorithmic complexity, and how you communicate your thought process. In a system design round, they care about trade-off analysis, scalability thinking, and whether you can identify bottlenecks before being prompted.\n\nThe interviewers don't know how you performed in your other rounds. This is intentional. Each round is evaluated independently so that a rough start doesn't bias the rest of your loop.\n\nAfter the interview, they submit feedback as Strong Hire, Hire, No Hire, or Strong No Hire.\n\nOnce all rounds are over, there is a panel discussion where all the interviewers meet to discuss. This is where it gets interesting. Even if one interviewer marks you as No Hire, the panel checks whether the skill that was flagged was demonstrated in a different round. For example, if you got a No Hire in the coding round because your edge case handling was weak, but the system design interviewer noted that you proactively identified edge cases in your design, the panel takes that into account.\n\nSolving the question correctly is not the whole picture. I have seen candidates who solved the problem but got a No Hire because they couldn't explain their approach, or they wrote code that worked but was completely unreadable, or they didn't consider any edge cases until prompted. Conversely, I have seen candidates who didn't fully finish the problem but got a Hire because they demonstrated strong problem-solving instincts, communicated clearly, and showed they knew where they were heading.\n\nThe interviewers are collecting signals throughout the entire interview. If they have received all the required signals, congrats, you are in.\n\n### The rounds are standardised\n\nMost companies run the same set of rounds regardless of team: DSA, system design, and project deep dives. These common rounds exist to ensure every candidate clears a baseline bar of expertise. It doesn't matter if you are joining the payments team or the infrastructure team, the bar is the same.\n\nSome companies hold additional specialised rounds like API integration or bug squash, which are more representative of actual day-to-day work. These rounds tend to give a better signal of how the candidate will actually perform on the job, but they are harder to standardise and harder to evaluate without bias. That's why they are less common.\n\nBased on the role they are hiring for, there may also be team-specific rounds to understand which team the candidate is more suitable for. These usually happen after the common rounds and are more conversational.\n\n### Startup vs MNC interviews are fundamentally different\n\nThis is something a lot of candidates don't appreciate, and it costs them.\n\nStartup interviewers go in with open ended questions. The direction of the interview can be controlled by you based on your answers. If you bring up an interesting tangent about a past project, the interviewer might follow that thread for 15 minutes because they found it relevant. The interview is more of a conversation.\n\nMNC interviewers don't have that freedom. They have a specific set of questions, a specific set of signals to look for, and a structured rubric to fill out. They cannot deviate from the plan even if they want to. This is by design because structure removes bias and ensures consistency across thousands of interviews.\n\nWhy is this important to know? I have seen candidates who go into tangential answers during MNC interviews, spending 10 minutes on something that doesn't map to any signal the interviewer is collecting. The interviewer can't redirect you because they are not supposed to lead you. You just lost 10 minutes of a 45-minute interview, and those minutes are not coming back.\n\nKeep your answers to the point. If the interviewer asks about time complexity, give them the time complexity. Don't launch into a story about how you optimised a similar problem at your last job unless they ask. The interview is time-boxed and every minute counts.\n\n### The poker face is by design\n\nThis one throws off a lot of candidates.\n\nMNC interviewers go through extensive bias training. They are trained to not react positively or negatively to your answers because their reactions can influence your performance. A smile after a correct answer makes you confident. A frown after a wrong one makes you spiral. Both are forms of bias.\n\nWhat this means in practice is that the interviewer will maintain a completely neutral expression for the entire interview. They won't nod approvingly when you are on the right track. They won't give you a concerned look when you are going down the wrong path. They will just sit there, take notes, and ask their next question.\n\nThis makes it extremely hard to gauge how you are doing, and that is the point. The interview is designed to evaluate you in a controlled environment, not to give you real-time feedback.\n\nDon't get disheartened by it. The lack of reaction doesn't mean you are doing poorly. It means the interviewer is doing their job correctly. I have given interviews where I maintained a complete poker face for 45 minutes and then submitted a Strong Hire. The candidate had no idea.\n\n### Pick the right language\n\nThis sounds obvious but I see candidates get this wrong all the time.\n\nPick the language you use daily. I have seen candidates choose Rust or C++ because they think it \"looks impressive\" and then struggle with basic string operations during the interview. They spend 5 minutes trying to split a string or figure out how hashmaps work in that language. C++ doesn't even have a split function.\n\nEven if internet access is allowed during the interview, spending 2-3 minutes looking up a basic stdlib helper is not a good signal. It tells the interviewer that you are not comfortable with your tools, which raises questions about your day-to-day coding fluency.\n\nPython, Go, Java, JavaScript, pick whichever one you write code in every day. The interviewer does not care which language you use. They care about whether you can solve the problem correctly and communicate your thought process clearly within the time limit.\n\nIt doesn't matter how elegant your code looks if you can't finish the solution in time. Speed and correctness beat aesthetics every single time.\n\n### Don't underestimate the behavioural round\n\nA lot of engineers treat the behavioural round as a formality. It is not. This round often has veto power, especially at senior levels. The interviewer is trying to understand how you handle conflict, how you deal with ambiguity, how you influence without authority, and whether you have self-awareness about your own mistakes.\n\nPrepare concrete examples from your past work. The STAR format (Situation, Task, Action, Result) works well here, not because it is clever, but because it forces you to be specific instead of giving vague answers like \"I am a team player.\"\n\nThe worst thing you can do in a behavioural round is say you have never had a conflict with a colleague or that you have never failed at anything. Nobody believes that, and it signals a lack of self-awareness.\n\n### After the interview\n\nOnce you are done, let it go. You cannot change the outcome by replaying the interview in your head. The panel will meet, the decision will be made, and you will hear back.\n\nIf you get rejected, ask for feedback if the company offers it. Not all do, but when they do, the feedback is usually specific enough to be actionable. I have seen candidates come back after 6 months, address the exact gaps that were flagged, and clear the interview on their second attempt.\n\nThe process is not perfect. No interview process is. But understanding how it works from the inside gives you a real advantage over candidates who walk in blind.\n",
            "url": "https://gauravsarma.com/random/2026-04-21_how-big-tech-interviews-work",
            "title": "How big tech interviews actually work",
            "summary": "Having worked at a few big tech companies and having interviewed candidates for them as well, here are some things I wish more candidates knew going in.  How decisions are actually made Each interviewer is assigned a round and for every round, there are specific signals they need to collect...",
            "date_modified": "2026-04-21T00:00:00.000Z",
            "tags": [
                "career",
                "interview",
                "faang"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2026-04-15_running-the-first-10k",
            "content_html": "\nFinished another 10k. Felt way stronger this time. The main difference was refueling with an isotonic drink mid-run.\n\nI started running more frequently a couple of months back. As someone who absolutely hated running before that, I've surprised myself quite a bit.\n\nThe first 2 kms are always the hardest. There's a lot of inertia, the body resists movement. After that, something shifts. The pain and lethargy disappear, and the body enters cruising mode. From that point on, it's purely a mental game: how long can the mind sustain the resistance.\n\nSince I wasn't a regular runner until recently, I'm being deliberate about taking it slow. Listening to my body on run days and recovery days. Long distance running is well documented to be harder on the body, knee pain and ligament tears being common. I don't want to get injured while building up.\n\nSo far it feels comfortable, and I think I can push more. The things that have made a difference: refueling during and after runs, stretching religiously, and strength training specifically to support the running.\n\nTargeting a half marathon (21k) by June. Still a long way to go!\n",
            "url": "https://gauravsarma.com/random/2026-04-15_running-the-first-10k",
            "title": "Running the first 10k",
            "summary": "Finished another 10k.  Felt way stronger this time...",
            "date_modified": "2026-04-15T00:00:00.000Z",
            "tags": [
                "running",
                "fitness"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-12-02_running-towards-fire",
            "content_html": "\nFIRE stands for Financially Independent Retire Early.\nThere is a growing trend of people trying to achieve FIRE before the age of 40.\n\nEveryone has a FIRE number based on their current lifestyle and supposed future expenses based\non which they calculate a networth number which would allow them to retire.\n\nAs I complete a decade in the software industry, I often had contradictory opinions about the\nentire concept as a whole. In this article, I try to argue both for and against FIRE and then hope\nto come to a conclusion.\n\nI have often grown up hearing the phrase \"Karma is dharma\", which roughly translates to \"Work is Duty\".\nWhenever I talk to my father about work, he only says \"be sincere in your work and everything else will\nfall into place\". He gave 37 years of his life to a single company as have many of our previous generation\nand have led happy fulfilled respectable lives.\n\nWhy then, has this trend suddenly become so popular? Why did people develop this overarching urge to stop\nworking and just be \"free\"? What is the definition of \"free\"?\n\nDoes being free equate to owning your time? If so, what would one want to do with their time? Play a sport,\nlearn an instrument, tend their garden, travel around the world?\n\nBut if you do something frequently and continuously, isn't that similar to work? It's just that you \"like\" doing it,\nso maybe you thought that you could do that more frequently. But would you be willing to sacrifice for it? For example,\na sportperson has to sacrifice their entire childhood and major parts of their adulthood to be in the top percentile\nat their fields. A musician has to spend hours practising their craft to be the best.\n\nMaybe you don't want to sacrifice so much for a specific thing? You just want to do it for the fun of it. You are\nsearching for work-life balance. Maybe you are content with not being the best version of yourself in a particular field?\n\nI believe overwhelmingly in being `Financially Independent`. I revolt even more strongly at `Retire Early`.\nBeing financially independent means having the ability to take more risks. Having the ability to take more risks means being\nable to stretch into areas that one may have been interested in the past, but didn't get time to explore. Maybe that risk\ntaking capability allows you to start a new career which was under-invested in your life for some reason.\n",
            "url": "https://gauravsarma.com/random/2025-12-02_running-towards-fire",
            "title": "Running towards FIRE",
            "summary": "FIRE stands for Financially Independent Retire Early.  There is a growing trend of people trying to achieve FIRE before the age of 40...",
            "date_modified": "2025-12-02T00:00:00.000Z",
            "tags": [
                "fire",
                "retirement"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-11-09_is-devops-dying",
            "content_html": "\nOriginal Tweet - https://x.com/sarmag77/status/1987482331683140014\n\nBack in 2019, I was in charge of hiring 5-6 engineers for a new team.\n\nThis team was supposed to build automation around the existing APIs so that we could scale faster. The automation was supposed to be around all steps of the product, from CI/CD to production monitoring to automatic healing.\n\nThe entire product was used to manage 10s of thousands of SD-WAN routers in distributed locations with complex topologies. Building an automation tool for it was going to be a complex and fun affair. It may have been more complicated than building the entire product because it would require understanding the product completely and to build proper interfaces for data and actions which was not already exposed.\n\nBeing a company where there were only around 40-50 engineers, they didn't have a DevOps team previously.\n\nSo I put the hired engineers in the new team called the \"DevOps\" team. All of them protested saying that they didn't want to do manual work. That's when I realised that the entire industry hadn't settled on the actual scope of work that a DevOps engineer has to do.\n\nIn some companies, DevOps engineers do only manual stuff, basically checking logs, deployments, creating pipelines, etc. In some companies, they build the entire infrastructure around managing the product efficiently.\n\nI believe that's where the distinction started coming in. The ability to build a system or platform to manage a product compared to just writing ad-hoc scripts or yaml configurations.\n\nToday, there are different specialisations of platform/infrastructure roles.\n\nLet's take the example of Stripe. Stripe hires specialised engineers for it's infrastructure teams. There are different infrastructure teams for different platforms, for eg: Mongo, Elasticsearch, Ruby. All these teams build tooling and platforms around these tools.\n\nThe advantage of these teams is that these teams are highly specialised teams whose main focus is to standardise and scale the overall platform so that the product engineer building features wouldn't have to worry about whether the data is being backed up properly or if there is insecure code being written.\n\nHave the manual DevOps or Platform jobs gone away? No. But are they decreasing. Yes.\n\nSo if you are someone who is looking to become a platform or devops engineer, better focus on building specialisation on a specific part of the process.\n\nI worked for database specific teams in the last 2 roles and it has always been exciting to understand what affects a distributed high throughput database from scaling.\n",
            "url": "https://gauravsarma.com/random/2025-11-09_is-devops-dying",
            "title": "Is DevOps dying",
            "summary": "Original Tweet - https://x. com/sarmag77/status/1987482331683140014 Back in 2019, I was in charge of hiring 5-6 engineers for a new team...",
            "date_modified": "2025-11-09T00:00:00.000Z",
            "tags": [
                "career",
                "devops",
                "platform"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-07-03_find-the-right-mentor",
            "content_html": "\nOriginal Tweet - https://x.com/sarmag77/status/1952147743935521188\n\nSomething similar happened to me when I was working at a startup right out of college.\n\nI was building a service on top of ZeroMQ using Python and was working along with a system programmer with 15 YoE who had used only C his whole life.\n\nHe hadn't worked on python ever or any other language except for C and a little bit of lua.\n\nHe saw a part of the code where I was using Python's dict and asked me if this was a hashmap. I said yes.\nThen he asked me the type of hashmap it was, mainly separate chaining or open addressing. No clue.\nAsked me how large was the hashmap's footprint? No clue.\nHow did the hashmap grow when new keys are added?\nIs the memory freed when the hashmap keys are deleted?\nAre the hashmap allocations NUMA aware?\n\nI had no answers to his questions and this was the first time I felt like I didn't know the actual internals of what I was working on.\n\nHe is one of the main reasons why I focus on internals instead of the high level stuff people keep spitting out.\n\nI worked with him for the next 5 years and it was an amazing experience. I learnt to question everything till I understood it inside out.\n\nFind the right mentors and it will change your life.\n\nArchimedes' principe: Give me a large enough lever and I will move the world.\n\nHis principle: Give me vim and some coffee and I will rewrite the entire operating system.\n",
            "url": "https://gauravsarma.com/random/2025-07-03_find-the-right-mentor",
            "title": "Find the right mentor",
            "summary": "Original Tweet - https://x. com/sarmag77/status/1952147743935521188 Something similar happened to me when I was working at a startup right out of college...",
            "date_modified": "2025-07-03T00:00:00.000Z",
            "tags": [
                "career",
                "mentor"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-06-25_invest-in-the-right-tech",
            "content_html": "\nOriginal Tweet - https://x.com/sarmag77/status/1948641731966369923\n\nBack in 2018-2019, Go experienced a spike massive adoption with popular projects like Kubernetes, Docker, Terraform.\n\nAnybody who knew Go back then were sought after by a lot of companies. Today it's already a maintstream language and most folks are learning, using or at least evaluating Go in their jobs.\n\nIf you already a seasoned developer and looking to pick a new language, it makes sense to pick a language which is gathering steam in few interesting products.\n\nThere will be people trying to troll the language for the syntax, runtime, safety, etc which may be off putting for anyone learning the language.\nBut remember, the true test of a language is the rate of adoption by companies.\n",
            "url": "https://gauravsarma.com/random/2025-06-25_invest-in-the-right-tech",
            "title": "Invest in the right future tech",
            "summary": "Original Tweet - https://x. com/sarmag77/status/1948641731966369923 Back in 2018-2019, Go experienced a spike massive adoption with popular projects like Kubernetes, Docker, Terraform...",
            "date_modified": "2025-06-25T00:00:00.000Z",
            "tags": [
                "career",
                "tech"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-06-11_dont-be-hasty",
            "content_html": "\nWhen I was new to the tech world and didn't have much experience with production systems, my main focus was to build fast.\n\nThat was the way I would measure how good a developer is.\n\nDidn't care about coding practices, tests, logs, monitoring, CI/CD pipelines and performance as well.\n\nBecause of this, I had to spend a lot of time in refactoring modules or just fixing stuff.\n\nIt seemed like I was moving fast, but it was just because I didn't pay attention to a lot of things that I probably should have.\n\nA senior came to me and asked me to \"take my time\".\nHe said \"harbari mei kyu ho\".\n\nIt's been some years now and I have changed the way I operate. It's much less stressful.\n\nLearn to productionize your code.\n\nIt seems slower, but I accomplish way more with a lesser amount of code than the younger me.\n",
            "url": "https://gauravsarma.com/random/2025-06-11_dont-be-hasty",
            "title": "Don't be hasty. Slow down",
            "summary": "When I was new to the tech world and didn't have much experience with production systems, my main focus was to build fast.  That was the way I would measure how good a developer is...",
            "date_modified": "2025-06-11T00:00:00.000Z",
            "tags": [
                "career",
                "tech"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-05-30_a-decade-in-the-tech-industry",
            "content_html": "\nI just realised that I completed a decade in the tech industry this month.\n17th May 2015, was my joining date in my first ever big boy job.\nI remember the date because I completed my college exams on 15th May and shifted\nfrom Chennai to Bangalore within a day to start the new job.\n\nWhen I was in college, the Software Development field was still in a nascent stage.\nMechanical, Electrical and Electronic branches were still in way more demand than any\nof the computer related disciplines.\n\nCS Jobs were also not available in abundance. Most of us had thought our destiny was to work\nin one of the WITCH companies (Wipro, Infosys, TCS, CTS, HCL) for less than Rs 3.5 LPA.\nCompanies like Google, Amazon and Microsoft only had support verticals in India or were doing\nmenial work for the other US based product companies.\n\nMost companies that came to college had an annual CTC of 4 LPA. Companies paying 6LPA were termed\nas `Dream companies`. For me, my only goal in college was to get a job, because Rs 30,000 per month\nsounded pretty sweet to me.\n\nI was pretty stoked to get an internship with a pre-placement offer from a startup for Rs 4.5 LPA in the\nfourth year of my college. I didn't go on to join the company after the internship ended because I received\na better offer from another company.\n\nIn those days, not a lot of folks understood the pros and cons of working in a startup. The first startup\nI worked in, was similar to a 3 bhk apartment on Indiranagar Double Road, Bangalore. I knew people who\nmade fun of people working in startups because we didn't get a lot of the benefits that they had access to,\nour work life balance was so skewed that we didn't have anything in our lives apart from work.\n\nIn those days, it was not a matter of choice but of chance. I was taught to give my all on anything I worked\non, and I did give it my all.\n\nThe fanciest tech stack you could work on back then was Ruby on Rails and Angular 1.x. That gave you the best kind\nof jobs.\n\nOver the years, I have had the opportunity of working in multiple companies with different environments, and I am\ngrateful for it all.\n\nLooking back, it has been a pretty fun and exciting ride.\nI still have the zeal of a fresher, with the added maturity of a senior engineer. And I am still as excited as ever.\n",
            "url": "https://gauravsarma.com/random/2025-05-30_a-decade-in-the-tech-industry",
            "title": "A decade in the tech industry",
            "summary": "I just realised that I completed a decade in the tech industry this month.  17th May 2015, was my joining date in my first ever big boy job...",
            "date_modified": "2025-05-30T00:00:00.000Z",
            "tags": [
                "career",
                "experience"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-05-27_the-right-way-to-interview",
            "content_html": "\nOriginal Tweet - https://x.com/sarmag77/status/1938429909338362348\n\nThe best kind of interview is the bug squash or pair programmer round where both the interviewer and the candidate team up to solve a common problem.\n\nIt provides so many insights into how the developer reacts when they are wrong, how they contradict your hypothesis and how they work in real life.\nIf I am in charge of hiring for a company, then this is the only technical interview I will hold.\n\nMaybe another round for behavioural fitment.\n\nThe main problem in making this more common is that it takes a lot of effort to prepare these kind of questions and the interviewer may also need to practise before the interview.\n\nAnother issue is that the judgement becomes a little vague and interview bias may kick in with the interviewer.\n\nOne thing DSA is good at is removing bias. Either you have the answer or you don't.\n\nThe best thing about this is that you don't have to prepare for long hours to interview.\nIt's fun for both the parties.\n\nThe company gets a great signal of how the candidate is going to perform in their actual job as well.\n",
            "url": "https://gauravsarma.com/random/2025-05-27_the-right-way-to-interview",
            "title": "The right way to interview",
            "summary": "Original Tweet - https://x. com/sarmag77/status/1938429909338362348 The best kind of interview is the bug squash or pair programmer round where both the interviewer and the candidate team up to solve a common problem...",
            "date_modified": "2025-05-27T00:00:00.000Z",
            "tags": [
                "career",
                "experience",
                "interview"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-05-19_why-go-for-faang",
            "content_html": "\nI have worked in early stage, late stage startups and bigger companies as well.\n\n\"FAANG in your 20s often makes your skills deteriorate\".\n\nThis is a completely untrue statement. Big tech folks are exposed to highly scalable distributed systems from the day they join the company. To release any feature, they have to be aware of standardised release guidelines, the CI/CD process, the right way to monitor and log your metrics, etc.\nBecause the surface area of their code is so high and the possibility of affecting existing revenue is omnipresent, big tech folks have to do more work to release a similar kind of feature than what an engineer in a startup has to do.\n\nAnd this happens to startups as well. Startups have rapid delivery processes in the beginning. But then they start seeing a growing trend of bugs and resiliency issues. This is where new processes get introduced and the time to deliver new features starts increasing.\n\nNow there are few companies, regardless of startups or MNCs, who continually evaluate their processes and trim the steps which no longer make sense.\n\n\"Lifestyle creep and golden handcuffs\".\nI agree completely with the above statement.\nBig tech used to pay so high with continuous stock refreshers and the valuation increases of actual shares that you can sell, that it did seem inconvenient to go anyplace else where you don't get these kind of rewards.\n\nI know Staff and Senior Staff engineers working in India for the past decade who have accumulated RSUs worth million dollars or more. You get to see this kind of a payout in a startup only if you had joined very early and the startup is successful.\n\nJoining late stage startups usually don't have that big of RoI since the stock doesn't grow as much as the company readies itself for an IPO. There are also cost trimming steps that the company undertakes in terms of employee count and compensation that may not make it a lucrative place to stay. When and if the company does IPO, the shares won't have the much of an upside compared to other big tech companies.\n",
            "url": "https://gauravsarma.com/random/2025-05-19_why-go-for-faang",
            "title": "Why should you go for FAANG if you can",
            "summary": "I have worked in early stage, late stage startups and bigger companies as well.  \"FAANG in your 20s often makes your skills deteriorate\"...",
            "date_modified": "2025-05-19T00:00:00.000Z",
            "tags": [
                "career",
                "experience",
                "faang"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-03-18_building-leverage-in-your-career",
            "content_html": "\nIf you have read the `Alamanack of Naval Ravikant`, then you must have come across this famous quote\n\n```\nIt doesn't take money to make money, it takes leverage to make money.\n```\n\nRetaining something after reading something is just as important and somehow, this quote along with\nfew other quotes of his, has stuck with me. I have found real life examples of this playing out so\nmany times that it's astounding that most people don't use this often.\n\nIn this blog, I will try to show how building leverage can help you in your career.\nWe will do that by going through a few real life examples with details changed and then we will\nregroup to collect our thoughts.\n\nHere we go.\n\n## Example 1 - Amy\n\n\nAmy, 19 yo, is studying humanities in Delhi University, one of the best colleges of Delhi University, India.\n\nShe feels something is missing and she has always liked programming, so she decides to pursue programming\nagain. She fiddles around with some tools and then she discovers WebAssembly(WA).\n\nFor some reason, she loves the idea of WA and decides to build something on top of it. She learns\nRust, builds a few examples and decides that the ecosystem is still too unstable. Instead of setting\nit aside, she decides that she can fix those issues and she starts contributing regularly to the WA\nsource code.\n\nShe is 22 now and has been regularly pushing patches to the WA source code. Mozilla takes note of her\ncontributions as they are also building in the same space and decides to hire her. They offer her a\njob and she accepts the offer since it means that she will be getting paid to do the same work, but from\ninside the company, and maybe also help the company out.\n\nShe joins as a SDE-1 and based on her continuous contributions, she is able to become an important member\nof the team and she is promoted to SDE-2 within 6 months.\n\nAt the age of 24, she becomes one of the maintainers of the project. Google takes note of her and offers to\nhire her. She notices that the pay and position are both better than the current one at Mozilla. She decides\nto leave, but Mozilla can't afford to let her resign. They immediately give her a Senior Software Engineer\nequivalent role.\n\nAt the age of 25, she is promoted to Staff Engineer and is one of the youngest engineers to become a Staff\nEngineer at one of the reputed companies.\n\nFor reference, Staff Engineers at reputed companies usually have a pay ranging between Rs 1.5-2.3 crores.\n\n\n## Example 2 - Barry\n\nBarry, 29 is working as a Senior Software Engineer in Coinbase on crypto payments.\nElon wants people to buy Teslas using DOGE coin and everybody starts going crazy in buying cryptocurrencies.\n\nCoinbase's database, Postgres is unable to handle the tremendous load and things start failing.\nBarry, along with a team of few other folks, offers to help fix these issues as he has some experience with\na database.\n\nHe, along with his team of just 3 engineers, are able to partition and scale the entire Postgres database\nin a year to handle Coinbase's growing popularity. \n\nBarry is immediately promoted to the position of Staff Engineer.\nNew teams come up with less experience in database systems, so Barry offers to help them build a data layer\nfor the product teams to use so that they don't have to deal with the database directly.\n\nThis helps Coinbase in building products faster and with more resiliency.\n\nThe database team, which was almost non-existent 1.5 years back, has now grown to a team of 20 engineers.\n\nBarry is promoted to Senior Staff Engineer at the age of 31.\n\n\n## Example 3 - Dhruv\n\nDhruv joins a small startup right out of his tier-3 college. \nDhruv is the 11th person to join the company.\n\nDhruv, since he is just out of college, doesn't know a lot of things, but is willing to work on almost everything.\nHe learns backend, frontend and everything under the Sun, wherever help is required.\n\nAs his breadth of experience grows and he becomes accustomed to working with senior folks, he is able to now\nmake decisions since he has more breadth than the original founding engineers.\n\nThe company grows to 50 engineers and Dhruv is promoted to be the team lead for his vertical at the age of 25.\nHis vertical becomes a scale bottleneck, but also the money earner. He spends time in understanding how to scale\nboth the team and the code, and with much experimentation, he is able to achieve it.\n\nIn the meantime, he is now the lead for a 40 member team and is promoted to Director of Engineering.\n\nThe company raises multiple rounds of investment during his tenure and his stock portfolio has grown above 30 crores.\n\nThe next few years, he spends time in building his social media community and promotes the company in his channels.\nThe company is acquired and he gets a Rs 70 crore payout and a huge social media following, which he leverages to build\nnew products in his own company now.\n\n\n## Let's regroup\n\nAll the folks listed in the above examples have had different backgrounds, different goals and different starting points\nwhen their career started to fly.\n\nWhat's the common point between all of their stories?\n\nLet's change a few minor things in their stories and see how it would turned out otherwise.\n\nIf Amy, after getting the offer from Mozilla, had decided to stop working on WebAssembly, Google wouldn't have made her\nthe offer and she would not have been promoted within Mozilla to Staff Engineer. Her persistence with WebAssembly, a growing\narea of interest in multiple companies helped her gain leverage over her company allowing her to get promoted much earlier\ncompared to many of her colleagues and seniors.\n\nIf Barry wouldn't have offered to help build the database layer, even though he was a Web3 engineer, he would have had a similar\ntrajectory as his peers and wouldn't have become the Senior Staff Engineer, which many people don't achieve in their entire\ncareers as well.\n\nIf Dhruv had decided to stop learning and not increase his breadth of experience and would have relied on his seniors to just do\nwhat he was told, he wouldn't have been able to gather the experience required to run his team and wouldn't have been promoted\nto Director of Engineering.\n\n\n## What if it didn't work out?\n\nAll the above examples are positive cases, or the happy path as we call it.\nIt also had the potential of turning out to be a negative scenario.\n\nFor example, if the company decided that they no longer want to improve WebAssembly and Amy still kept working on it, then\nshe wouldn't have gotten the double promotion in a short time. If the tool she was focussing on didn't have takers outside\nMozilla, then Google wouldn't have hired her as well.\n\nIf Barry had scaled the Postgres layer but the company decided to leverage a cloud database instead of managing it in-house,\nthen the entire year would have gone to waste for Barry without much leverage.\n\nIf the startup Dhruv was working in didn't do well and didn't find funding, then there would have been a salary cut or firing\nand Dhruv wouldn't have had the exponential rise as well.\n\n\n## What to look out for?\n\n### Industry Norms\n\nWhen Barry was scaling the database layer, there were not a lot of options apart from AWS RDS and they too didn't have that\nbig of a customer base that could immediately solve the problem. So what Barry built was definitely a niche use case that you\nwouldn't find generally and was tailor made for Coinbase, which meant that Coinbase wouldn't be able to just swap the data layer out.\n\n### Funding\n\nThis is similar to what companies face as well. But if the problem that you are solving is not a money maker for the company resulting\nin significant earnings either directly or indirectly, then it may not be a problem worth solving. For example, if the database\nlayer wasn't resulting in customer churn for Coinbase, then it wouldn't have recognised Barry's efforts and rewarded him.\n```\nTo fix something, identifying it is broken is paramount.\n```\n\n### Direction\n\nDoes this align with the company's long term vision?\nDhruv could try to increase his breadth in a bigger company all he wants, but it would not be rewarded in the same manner. The bigger\ncompany already has the ability to hire more people for specific tasks and having a generalist may not be that big of a win most of the\ntimes. On the other hand, a startup cannot afford to hire so many people and the more breadth you build, the faster you can leverage it.\n\n\n\nThe common point behind all these stories is Leverage.\nLeverage comes in different shapes and sizes. Identifying them is the main path to win in your career.\nThe power to force uncommon events because that event is more important than the routine flow of nature.\n\n\nDifferent people have different points of strength. It just takes time and effort to build on this.\n\nTake the time to invest in yourself, and build leverage in your life.\n\n",
            "url": "https://gauravsarma.com/random/2025-03-18_building-leverage-in-your-career",
            "title": "Building leverage in your career",
            "summary": "If you have read the Alamanack of Naval Ravikant, then you must have come across this famous quote It doesn't take money to make money, it takes leverage to make money.  Retaining something after reading something is just as important and somehow, this quote along with few other quotes of his, has stuck with me...",
            "date_modified": "2025-03-18T00:00:00.000Z",
            "tags": [
                "team",
                "career"
            ]
        },
        {
            "id": "https://gauravsarma.com/random/2025-03-10_migrating-from-hugo-to-nextjs",
            "content_html": "\nSo it's been a week since I have been trying to use and compare AI code editors and tools extensively to integrate into my coding routine.\n\n\nClaude Code has been the most successful, but it's way too expensive compared to Cursor or Windsurf.\nSo Cursor is the winner here.\n\nThe judgement was based on the ability to convert my markdown based blog to a NextJS application with search and firebase functionality.\nIt was able to provide elaborate steps on converting markdown to HTML, creating the right routes, changing the theme of the website, and so much more.\n\nOne major complaint that I have with this type of development is that you don't own any part of the app. It's all owned by the tool. So even if you have worked on the app for sometime, chances are that you would still not know a lot of things about the codebase.\nSo the incremental pace and value that you pick up as you grow with a codebase is going to decrease even further since changes are going to come way faster than you can absorb it. \n\nIn so many places, the AI tends to take shortcuts to reach the end goal. One major flaw was the most of the tools tried to disable type checks completely for the app since it was unable to fix the problem. \nIn other cases, for few problems, it tends to make the problem even worse by getting into a loopy state.\n",
            "url": "https://gauravsarma.com/random/2025-03-10_migrating-from-hugo-to-nextjs",
            "title": "Migrating from Hugo to NextJS",
            "summary": "So it's been a week since I have been trying to use and compare AI code editors and tools extensively to integrate into my coding routine.  Claude Code has been the most successful, but it's way too expensive compared to Cursor or Windsurf...",
            "date_modified": "2025-03-10T00:00:00.000Z",
            "tags": [
                "hugo",
                "nextjs",
                "blog"
            ]
        }
    ]
}