Storage Engine

The ParticleDB storage engine is built around an LSM-tree with a write-ahead log (WAL) for durability, layered caches for fast reads, and zone maps that carry precomputed aggregates so the query engine can skip or resolve entire chunks without touching individual rows.

All writes flow through a log-structured merge tree:

  1. Memtable — incoming writes land in an in-memory sorted structure.
  2. Immutable memtable — when the active memtable reaches its size threshold it becomes immutable and is queued for flush.
  3. SSTable flush — immutable memtables are serialized to sorted string table (SST) files on disk.
  4. Compaction — background compaction merges overlapping SST files to bound read amplification and reclaim space from deleted or overwritten rows.
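The write path above can be sketched in miniature. This is an illustrative model, not ParticleDB's actual code: a `BTreeMap` stands in for the memtable, and each flush produces one sorted in-memory run in place of an on-disk SST file.

```rust
use std::collections::BTreeMap;

const MEMTABLE_LIMIT: usize = 4; // tiny threshold, for illustration only

struct Lsm {
    memtable: BTreeMap<String, String>,
    sstables: Vec<Vec<(String, String)>>, // each flush produces one sorted run
}

impl Lsm {
    fn new() -> Self {
        Lsm { memtable: BTreeMap::new(), sstables: Vec::new() }
    }

    fn put(&mut self, key: &str, val: &str) {
        self.memtable.insert(key.to_string(), val.to_string());
        if self.memtable.len() >= MEMTABLE_LIMIT {
            self.flush(); // the full memtable becomes an immutable sorted run
        }
    }

    fn flush(&mut self) {
        // BTreeMap iterates in key order, so the run is already sorted.
        let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.sstables.push(run);
    }

    // Newest-first lookup: memtable, then runs from most recent to oldest.
    fn get(&self, key: &str) -> Option<&String> {
        self.memtable.get(key).or_else(|| {
            self.sstables.iter().rev().find_map(|run| {
                run.binary_search_by(|(k, _)| k.as_str().cmp(key))
                    .ok()
                    .map(|i| &run[i].1)
            })
        })
    }
}
```

Compaction would merge the sorted runs in `sstables` into fewer, larger runs; that step is omitted here.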

Every mutation is first appended to the WAL before being applied in memory. On crash, the WAL is replayed to recover committed transactions that had not yet been flushed to SSTables.

ParticleDB supports three WAL synchronization modes, configured with --wal-sync-mode:

Mode        Behavior                                       Durability Guarantee
sync        fsync after every WAL entry                    No data loss on crash
groupsync   Batched fsync across concurrent transactions   Sub-millisecond window on crash
nosync      WAL writes skipped entirely                    No crash recovery; max throughput

In nosync mode, the WAL codepath is gated by an AtomicBool flag — no lock contention, no serialization overhead, and no fsync syscalls. This mode is useful for bulk loading or ephemeral analytics where durability is not required.
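A minimal sketch of this gating pattern, assuming an illustrative flag name and a `Vec<u8>` standing in for the real WAL append-and-fsync path:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative flag; in nosync mode it is false and the WAL path is skipped.
static WAL_ENABLED: AtomicBool = AtomicBool::new(true);

fn apply_mutation(wal: &mut Vec<u8>, entry: &[u8]) {
    // A relaxed load costs one uncontended atomic read: no mutex, no fsync.
    if WAL_ENABLED.load(Ordering::Relaxed) {
        wal.extend_from_slice(entry); // stand-in for WAL append + fsync
    }
    // ...apply the mutation to the memtable here...
}
```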

A thread-local WAL buffer (TL_WAL_BUF) eliminates contention on the shared WAL mutex during normal transaction processing.
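A sketch of the thread-local buffering pattern, keeping the `TL_WAL_BUF` name from above but with an otherwise illustrative structure: each thread stages entries locally and the shared lock is taken once per batch rather than once per entry.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

static SHARED_WAL: Mutex<Vec<u8>> = Mutex::new(Vec::new());

thread_local! {
    // Per-thread staging buffer: appends here are contention-free.
    static TL_WAL_BUF: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn wal_append(entry: &[u8]) {
    TL_WAL_BUF.with(|buf| buf.borrow_mut().extend_from_slice(entry));
}

// The shared mutex is taken once per batch, not once per entry.
fn wal_flush() {
    TL_WAL_BUF.with(|buf| {
        let mut local = buf.borrow_mut();
        if !local.is_empty() {
            SHARED_WAL.lock().unwrap().extend_from_slice(&local);
            local.clear();
        }
    });
}
```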

Recently ingested data is held in an in-memory batch cache as Apache Arrow RecordBatch arrays. Queries that touch recent data hit this cache directly without reading SSTables from disk.

Cache invalidation rules:

Operation   Batch cache   Zone maps     Projected cache   Flat columns   Running stats
INSERT      Appended      Appended      Invalidated       Invalidated    Appended
UPDATE      Invalidated   Invalidated   Invalidated       Invalidated    Invalidated
DELETE      Invalidated   Invalidated   Invalidated       Invalidated    Invalidated

INSERT is incremental: new rows are built into RecordBatch chunks and appended to the batch cache, zone maps, and running stats without rebuilding from scratch. Only the projected cache and flat column cache are invalidated (they are cheaper to rebuild on-demand).
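The invalidation rules reduce to a small decision per operation. The sketch below uses illustrative validity flags, not ParticleDB's real cache types:

```rust
enum Dml { Insert, Update, Delete }

struct TableCaches {
    batch_cache_valid: bool,
    zone_maps_valid: bool,
    projected_valid: bool,
    flat_columns_valid: bool,
    running_stats_valid: bool,
}

impl TableCaches {
    fn on_dml(&mut self, op: Dml) {
        // Projected and flat-column caches are always dropped:
        // they are cheap to rebuild on demand.
        self.projected_valid = false;
        self.flat_columns_valid = false;
        match op {
            // INSERT appends to batch cache, zone maps, and running
            // stats in place, so those remain valid.
            Dml::Insert => {}
            // UPDATE/DELETE can change or remove existing rows, so
            // every derived structure is invalidated.
            Dml::Update | Dml::Delete => {
                self.batch_cache_valid = false;
                self.zone_maps_valid = false;
                self.running_stats_valid = false;
            }
        }
    }
}
```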

Every chunk of rows (default 8,192 rows per chunk) has an associated ColumnZoneMap that stores precomputed statistics:

Statistic    Type             Purpose
min          Column-native    Range filter elimination
max          Column-native    Range filter elimination
sum_i128     128-bit integer  Ungrouped SUM / AVG without row scan
sum_f64      64-bit float     Ungrouped SUM / AVG for float columns
count        u64              Row count per chunk
null_count   u64              Null tracking for correct AVG denominator
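A sketch of building these statistics for one i64 column chunk. Field names mirror the table above; the rest is illustrative:

```rust
// Illustrative zone-map layout for one i64 column chunk.
struct ColumnZoneMap {
    min: i64,
    max: i64,
    sum_i128: i128, // 128-bit sum cannot overflow when summing 8,192 i64 values
    count: u64,
    null_count: u64,
}

fn build_zone_map(chunk: &[Option<i64>]) -> ColumnZoneMap {
    let mut zm = ColumnZoneMap {
        min: i64::MAX,
        max: i64::MIN,
        sum_i128: 0,
        count: chunk.len() as u64,
        null_count: 0,
    };
    for v in chunk {
        match v {
            Some(x) => {
                zm.min = zm.min.min(*x);
                zm.max = zm.max.max(*x);
                zm.sum_i128 += *x as i128;
            }
            None => zm.null_count += 1,
        }
    }
    zm
}

// Correct AVG uses the non-null row count as the denominator.
fn zone_avg(zm: &ColumnZoneMap) -> f64 {
    zm.sum_i128 as f64 / (zm.count - zm.null_count) as f64
}
```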

The query engine uses zone maps in several tiers:

  1. Running stats (ColumnRunningStats) — O(1) table-wide SUM / AVG / MIN / MAX computed from zone map summaries. No iteration over chunks at all.
  2. Zone aggregate — O(chunks) aggregation by summing zone-level statistics.
  3. Tri-state filter classification — each chunk is classified as All (every row passes), None (no rows pass), or Partial (some rows pass). ALL-match chunks resolve from zone stats; NONE-match chunks are skipped; only PARTIAL chunks are scanned row-by-row.
  4. Chunk group stats (ChunkGroupStats) — for columns with 32 or fewer distinct values, per-group SUM / COUNT / MIN / MAX are precomputed at build time, enabling GROUP BY resolution from O(chunks x groups) summaries.
For example, with WHERE x > 200:

Chunk 0: min=1,   max=100 → NONE (skip)
Chunk 1: min=150, max=300 → PARTIAL (scan)
Chunk 2: min=400, max=500 → ALL (use zone stats)
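The classification for a `x > threshold` predicate follows directly from the chunk's min/max. A minimal sketch (enum and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum ZoneMatch { All, None, Partial }

// Classify a chunk for the predicate `x > threshold` using only zone stats.
fn classify_gt(min: i64, max: i64, threshold: i64) -> ZoneMatch {
    if min > threshold {
        ZoneMatch::All      // every row passes: resolve from zone stats
    } else if max <= threshold {
        ZoneMatch::None     // no row passes: skip the chunk
    } else {
        ZoneMatch::Partial  // mixed: scan row-by-row
    }
}
```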

Low-cardinality string columns (e.g., country codes, status enums) are stored with Arrow dictionary encoding. Each distinct string value is stored once in a dictionary array, and rows reference it by a compact u8 or u16 key.

Benefits:

  • Reduced memory footprint — one copy of each string instead of N copies.
  • Direct-index aggregation — GROUP BY on dictionary columns uses the integer key as a direct array index, avoiding hash table lookups entirely.
  • Cache-friendly scans — iterating over u8 keys is far more compact than iterating over variable-length strings.

Dictionary encoding is applied automatically during batch construction when a column’s cardinality falls below the encoding threshold. High-cardinality columns that exceed the threshold during build are converted back to plain string arrays.
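A sketch of direct-index aggregation over a dictionary-encoded column. The struct is a simplified stand-in for an Arrow dictionary array, not the real type:

```rust
// Dictionary-encoded string column: each distinct value stored once,
// rows reference it by a compact u8 key.
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<u8>,
}

// GROUP BY count: the integer key indexes a dense counts array directly,
// so no per-row hash table lookup is needed.
fn group_count(col: &DictColumn) -> Vec<(String, u64)> {
    let mut counts = vec![0u64; col.dictionary.len()];
    for &k in &col.keys {
        counts[k as usize] += 1;
    }
    col.dictionary.iter().cloned().zip(counts).collect()
}
```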

For analytical queries that scan an entire column, the flat column cache concatenates all batch-cache chunks for a given column into a single contiguous Vec<i64> or Vec<f64> (FlatColumn enum).

Batch 0 [8192 i64] ──┐
Batch 1 [8192 i64] ──┼──▶ FlatColumn::I64(Vec<i64>) [total rows]
Batch 2 [8192 i64] ──┘

This eliminates per-batch TLB misses and enables the hardware prefetcher to stream data efficiently. The flat column cache is most effective for queries with one or two memory streams (e.g., a single-column filter + single-column aggregate). For three or more concurrent streams, the per-batch 64 KB chunks fit in L1 and perform better than large contiguous arrays.
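The diagram above can be sketched as a simple concatenation. The enum mirrors the `FlatColumn` name from the text; everything else is illustrative:

```rust
// Illustrative stand-in for the FlatColumn enum described above.
enum FlatColumn {
    I64(Vec<i64>),
    F64(Vec<f64>),
}

// Concatenate per-batch i64 chunks into one contiguous allocation so a
// full-column scan walks a single memory stream.
fn flatten_i64(batches: &[Vec<i64>]) -> FlatColumn {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    let mut flat = Vec::with_capacity(total); // one allocation up front
    for b in batches {
        flat.extend_from_slice(b);
    }
    FlatColumn::I64(flat)
}
```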

Cache key format: "{table}:{column}". Invalidated alongside projected_cache and flat_columns on any DML operation.

Queries that repeatedly access the same subset of columns benefit from the projected batch cache, keyed on (table, column_set). Instead of calling batch.project() on every execution, the projected slices are cached and reused.

The default CHUNK_SIZE is 8,192 rows. At 8 bytes per value, a single column chunk occupies 64 KB — sized to fit in the CPU L1 data cache. Larger chunk sizes push columns into L2/L3, increasing cache pressure and reducing throughput for scan-heavy analytical queries.
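The sizing arithmetic, spelled out (constant names are illustrative):

```rust
const CHUNK_SIZE: usize = 8192; // rows per chunk
const VALUE_BYTES: usize = 8;   // one i64 or f64 value

// 8,192 rows x 8 bytes = 65,536 bytes = 64 KB, matching a typical
// L1 data cache size.
fn chunk_bytes() -> usize {
    CHUNK_SIZE * VALUE_BYTES
}
```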

ParticleDB supports three compression modes for on-disk SSTables:

Mode    Trade-off
none    No CPU overhead; largest files
lz4     Fast compression/decompression; moderate reduction
zstd    Best compression ratio; higher CPU cost

Each table maintains an in-memory primary key index (PkIndexMap) that maps PK values to row locations. Point lookups by primary key resolve in O(1) via this index without scanning any data.

The PK index is backed by a DashMap — a sharded concurrent hash map that provides lock-free reads and fine-grained shard-level write locks. Multiple threads can look up different keys simultaneously without contention. This is critical for high-concurrency OLTP workloads where many connections perform point lookups in parallel.
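The sharding idea can be sketched with standard-library types: N independent shards, each behind its own lock, so operations on different shards never contend. This stands in for what DashMap does internally and is not ParticleDB's actual code:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Sharded PK index: pk -> row position, split across independent shards.
struct ShardedPkIndex {
    shards: Vec<RwLock<HashMap<u64, usize>>>,
}

impl ShardedPkIndex {
    fn new(n: usize) -> Self {
        ShardedPkIndex {
            shards: (0..n).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, pk: u64) -> &RwLock<HashMap<u64, usize>> {
        &self.shards[(pk as usize) % self.shards.len()]
    }

    fn insert(&self, pk: u64, row: usize) {
        self.shard(pk).write().unwrap().insert(pk, row);
    }

    // Point lookup: O(1), touching only one shard's read lock.
    fn get(&self, pk: u64) -> Option<usize> {
        self.shard(pk).read().unwrap().get(&pk).copied()
    }
}
```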

When a row is deleted, the PK index entry is removed using the swap_remove technique:

  1. The row at position i is swapped with the last row in the table.
  2. The PK index entry for the swapped row is updated to point to position i.
  3. The table length is decremented.

This avoids the cost of shifting all subsequent index entries (which would be O(N)) and keeps DELETE operations O(1) regardless of table size.
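The three steps above can be sketched with `Vec::swap_remove` and a plain `HashMap` standing in for the PK index (types and names are illustrative):

```rust
use std::collections::HashMap;

// Delete the row with primary key `pk` in O(1): swap the victim with the
// last row, repoint the moved row's index entry, then truncate.
fn delete_row(
    rows: &mut Vec<(u64, String)>,
    pk_index: &mut HashMap<u64, usize>,
    pk: u64,
) {
    if let Some(i) = pk_index.remove(&pk) {
        rows.swap_remove(i); // the former last row now sits at position i
        if i < rows.len() {
            let moved_pk = rows[i].0;
            pk_index.insert(moved_pk, i); // update the swapped row's position
        }
    }
}
```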

On INSERT, new PK entries are appended to the existing index rather than rebuilding it from scratch. For batch inserts, direct_pk_select_batch_with resolves multiple PKs under a single lock acquisition, amortizing lock overhead across the batch.

Summary of operations:

  • INSERT: new entries appended to the index (incremental).
  • DELETE: entries removed via swap_remove with O(1) index position update.
  • Batch lookup: direct_pk_select_batch_with resolves multiple PKs under a single lock acquisition.

ParticleDB supports a tiered storage model:

  • Hot tier — in-memory batch cache and flat column cache for recent and frequently-accessed data.
  • Cold tier — compressed SSTables on disk for historical data.

Data moves between tiers automatically based on access patterns and configured retention policies.