Storage Engine

The ParticleDB storage engine is built around an LSM-tree with a write-ahead log (WAL) for durability, layered caches for fast reads, and zone maps that carry precomputed aggregates so the query engine can skip or resolve entire chunks without touching individual rows.

All writes flow through a log-structured merge tree:

  1. Memtable — incoming writes land in an in-memory sorted structure.
  2. Immutable memtable — when the active memtable reaches its size threshold it becomes immutable and is queued for flush.
  3. SSTable flush — immutable memtables are serialized to sorted string table (SST) files on disk.
  4. Compaction — background compaction merges overlapping SST files to bound read amplification and reclaim space from deleted or overwritten rows.
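The write path above can be sketched in miniature. This is an illustrative model, not ParticleDB's actual code: a `BTreeMap` stands in for the memtable, and each flush produces one sorted in-memory run in place of an on-disk SST file.

```rust
use std::collections::BTreeMap;

const MEMTABLE_LIMIT: usize = 4; // tiny threshold, for illustration only

struct Lsm {
    memtable: BTreeMap<String, String>,
    sstables: Vec<Vec<(String, String)>>, // each flush produces one sorted run
}

impl Lsm {
    fn new() -> Self {
        Lsm { memtable: BTreeMap::new(), sstables: Vec::new() }
    }

    fn put(&mut self, key: &str, val: &str) {
        self.memtable.insert(key.to_string(), val.to_string());
        if self.memtable.len() >= MEMTABLE_LIMIT {
            self.flush(); // the full memtable becomes an immutable sorted run
        }
    }

    fn flush(&mut self) {
        // BTreeMap iterates in key order, so the run is already sorted.
        let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.sstables.push(run);
    }

    // Newest-first lookup: memtable, then runs from most recent to oldest.
    fn get(&self, key: &str) -> Option<&String> {
        self.memtable.get(key).or_else(|| {
            self.sstables.iter().rev().find_map(|run| {
                run.binary_search_by(|(k, _)| k.as_str().cmp(key))
                    .ok()
                    .map(|i| &run[i].1)
            })
        })
    }
}
```

Compaction would merge the sorted runs in `sstables` into fewer, larger runs; that step is omitted here.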

Every mutation is first appended to the WAL before being applied in memory. On crash, the WAL is replayed to recover committed transactions that had not yet been flushed to SSTables.

ParticleDB supports three WAL synchronization modes, configured with --wal-sync-mode:

Mode        Behavior                                       Durability Guarantee
sync        fsync after every WAL entry                    No data loss on crash
groupsync   Batched fsync across concurrent transactions   Sub-millisecond window on crash
nosync      WAL writes skipped entirely                    No crash recovery; max throughput

In nosync mode, the WAL codepath is gated by an AtomicBool flag — no lock contention, no serialization overhead, and no fsync syscalls. This mode is useful for bulk loading or ephemeral analytics where durability is not required.
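A minimal sketch of this gating pattern, assuming an illustrative flag name and a `Vec<u8>` standing in for the real WAL append-and-fsync path:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative flag; in nosync mode it is false and the WAL path is skipped.
static WAL_ENABLED: AtomicBool = AtomicBool::new(true);

fn apply_mutation(wal: &mut Vec<u8>, entry: &[u8]) {
    // A relaxed load costs one uncontended atomic read: no mutex, no fsync.
    if WAL_ENABLED.load(Ordering::Relaxed) {
        wal.extend_from_slice(entry); // stand-in for WAL append + fsync
    }
    // ...apply the mutation to the memtable here...
}
```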

A thread-local WAL buffer (TL_WAL_BUF) eliminates contention on the shared WAL mutex during normal transaction processing.
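A sketch of the thread-local buffering pattern, keeping the `TL_WAL_BUF` name from above but with an otherwise illustrative structure: each thread stages entries locally and the shared lock is taken once per batch rather than once per entry.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

static SHARED_WAL: Mutex<Vec<u8>> = Mutex::new(Vec::new());

thread_local! {
    // Per-thread staging buffer: appends here are contention-free.
    static TL_WAL_BUF: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn wal_append(entry: &[u8]) {
    TL_WAL_BUF.with(|buf| buf.borrow_mut().extend_from_slice(entry));
}

// The shared mutex is taken once per batch, not once per entry.
fn wal_flush() {
    TL_WAL_BUF.with(|buf| {
        let mut local = buf.borrow_mut();
        if !local.is_empty() {
            SHARED_WAL.lock().unwrap().extend_from_slice(&local);
            local.clear();
        }
    });
}
```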

Recently ingested data is held in an in-memory batch cache as Apache Arrow RecordBatch arrays. Queries that touch recent data hit this cache directly without reading SSTables from disk.

Cache invalidation rules:

Operation   Batch cache   Zone maps     Projected cache   Flat columns   Running stats
INSERT      Appended      Appended      Invalidated       Invalidated    Appended
UPDATE      Invalidated   Invalidated   Invalidated       Invalidated    Invalidated
DELETE      Invalidated   Invalidated   Invalidated       Invalidated    Invalidated

INSERT is incremental: new rows are built into RecordBatch chunks and appended to the batch cache, zone maps, and running stats without rebuilding from scratch. Only the projected cache and flat column cache are invalidated (they are cheaper to rebuild on-demand).
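The invalidation rules reduce to a small decision per operation. The sketch below uses illustrative validity flags, not ParticleDB's real cache types:

```rust
enum Dml { Insert, Update, Delete }

struct TableCaches {
    batch_cache_valid: bool,
    zone_maps_valid: bool,
    projected_valid: bool,
    flat_columns_valid: bool,
    running_stats_valid: bool,
}

impl TableCaches {
    fn on_dml(&mut self, op: Dml) {
        // Projected and flat-column caches are always dropped:
        // they are cheap to rebuild on demand.
        self.projected_valid = false;
        self.flat_columns_valid = false;
        match op {
            // INSERT appends to batch cache, zone maps, and running
            // stats in place, so those remain valid.
            Dml::Insert => {}
            // UPDATE/DELETE can change or remove existing rows, so
            // every derived structure is invalidated.
            Dml::Update | Dml::Delete => {
                self.batch_cache_valid = false;
                self.zone_maps_valid = false;
                self.running_stats_valid = false;
            }
        }
    }
}
```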

Every chunk of rows (default 8,192 rows per chunk) has an associated ColumnZoneMap that stores precomputed statistics:

Statistic    Type             Purpose
min          Column-native    Range filter elimination
max          Column-native    Range filter elimination
sum_i128     128-bit integer  Ungrouped SUM / AVG without row scan
sum_f64      64-bit float     Ungrouped SUM / AVG for float columns
count        u64              Row count per chunk
null_count   u64              Null tracking for correct AVG denominator
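A sketch of building these statistics for one i64 column chunk. Field names mirror the table above; the rest is illustrative:

```rust
// Illustrative zone-map layout for one i64 column chunk.
struct ColumnZoneMap {
    min: i64,
    max: i64,
    sum_i128: i128, // 128-bit sum cannot overflow when summing 8,192 i64 values
    count: u64,
    null_count: u64,
}

fn build_zone_map(chunk: &[Option<i64>]) -> ColumnZoneMap {
    let mut zm = ColumnZoneMap {
        min: i64::MAX,
        max: i64::MIN,
        sum_i128: 0,
        count: chunk.len() as u64,
        null_count: 0,
    };
    for v in chunk {
        match v {
            Some(x) => {
                zm.min = zm.min.min(*x);
                zm.max = zm.max.max(*x);
                zm.sum_i128 += *x as i128;
            }
            None => zm.null_count += 1,
        }
    }
    zm
}

// Correct AVG uses the non-null row count as the denominator.
fn zone_avg(zm: &ColumnZoneMap) -> f64 {
    zm.sum_i128 as f64 / (zm.count - zm.null_count) as f64
}
```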

The query engine uses zone maps in several tiers:

  1. Running stats (ColumnRunningStats) — O(1) table-wide SUM / AVG / MIN / MAX computed from zone map summaries. No iteration over chunks at all.
  2. Zone aggregate — O(chunks) aggregation by summing zone-level statistics.
  3. Tri-state filter classification — each chunk is classified as All (every row passes), None (no rows pass), or Partial (some rows pass). ALL-match chunks resolve from zone stats; NONE-match chunks are skipped; only PARTIAL chunks are scanned row-by-row.
  4. Chunk group stats (ChunkGroupStats) — for columns with 32 or fewer distinct values, per-group SUM / COUNT / MIN / MAX are precomputed at build time, enabling GROUP BY resolution from O(chunks x groups) summaries.
For example, with WHERE x > 200:

Chunk 0: min=1,   max=100 → NONE (skip)
Chunk 1: min=150, max=300 → PARTIAL (scan)
Chunk 2: min=400, max=500 → ALL (use zone stats)
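The classification for a `x > threshold` predicate follows directly from the chunk's min/max. A minimal sketch (enum and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum ZoneMatch { All, None, Partial }

// Classify a chunk for the predicate `x > threshold` using only zone stats.
fn classify_gt(min: i64, max: i64, threshold: i64) -> ZoneMatch {
    if min > threshold {
        ZoneMatch::All      // every row passes: resolve from zone stats
    } else if max <= threshold {
        ZoneMatch::None     // no row passes: skip the chunk
    } else {
        ZoneMatch::Partial  // mixed: scan row-by-row
    }
}
```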

Low-cardinality string columns (e.g., country codes, status enums) are stored with Arrow dictionary encoding. Each distinct string value is stored once in a dictionary array, and rows reference it by a compact u8 or u16 key.

Benefits:

  • Reduced memory footprint — one copy of each string instead of N copies.
  • Direct-index aggregation — GROUP BY on dictionary columns uses the integer key as a direct array index, avoiding hash table lookups entirely.
  • Cache-friendly scans — iterating over u8 keys is far more compact than iterating over variable-length strings.

Dictionary encoding is applied automatically during batch construction when a column’s cardinality falls below the encoding threshold. High-cardinality columns that exceed the threshold during build are converted back to plain string arrays.
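A sketch of direct-index aggregation over a dictionary-encoded column. The struct is a simplified stand-in for an Arrow dictionary array, not the real type:

```rust
// Dictionary-encoded string column: each distinct value stored once,
// rows reference it by a compact u8 key.
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<u8>,
}

// GROUP BY count: the integer key indexes a dense counts array directly,
// so no per-row hash table lookup is needed.
fn group_count(col: &DictColumn) -> Vec<(String, u64)> {
    let mut counts = vec![0u64; col.dictionary.len()];
    for &k in &col.keys {
        counts[k as usize] += 1;
    }
    col.dictionary.iter().cloned().zip(counts).collect()
}
```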

For analytical queries that scan an entire column, the flat column cache concatenates all batch-cache chunks for a given column into a single contiguous Vec<i64> or Vec<f64> (FlatColumn enum).

Batch 0 [8192 i64] ──┐
Batch 1 [8192 i64] ──┼──▶ FlatColumn::I64(Vec<i64>) [total rows]
Batch 2 [8192 i64] ──┘

This eliminates per-batch TLB misses and enables the hardware prefetcher to stream data efficiently. The flat column cache is most effective for queries with one or two memory streams (e.g., a single-column filter + single-column aggregate). For three or more concurrent streams, the per-batch 64 KB chunks fit in L1 and perform better than large contiguous arrays.
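The diagram above can be sketched as a simple concatenation. The enum mirrors the `FlatColumn` name from the text; everything else is illustrative:

```rust
// Illustrative stand-in for the FlatColumn enum described above.
enum FlatColumn {
    I64(Vec<i64>),
    F64(Vec<f64>),
}

// Concatenate per-batch i64 chunks into one contiguous allocation so a
// full-column scan walks a single memory stream.
fn flatten_i64(batches: &[Vec<i64>]) -> FlatColumn {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    let mut flat = Vec::with_capacity(total); // one allocation up front
    for b in batches {
        flat.extend_from_slice(b);
    }
    FlatColumn::I64(flat)
}
```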

Cache key format: "{table}:{column}". Invalidated alongside projected_cache and flat_columns on any DML operation.

Queries that repeatedly access the same subset of columns benefit from the projected batch cache, keyed on (table, column_set). Instead of calling batch.project() on every execution, the projected slices are cached and reused.

The default CHUNK_SIZE is 8,192 rows. At 8 bytes per value, a single column chunk occupies 64 KB — sized to fit in the CPU L1 data cache. Larger chunk sizes push columns into L2/L3, increasing cache pressure and reducing throughput for scan-heavy analytical queries.
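The sizing arithmetic, spelled out (constant names are illustrative):

```rust
const CHUNK_SIZE: usize = 8192; // rows per chunk
const VALUE_BYTES: usize = 8;   // one i64 or f64 value

// 8,192 rows x 8 bytes = 65,536 bytes = 64 KB, matching a typical
// L1 data cache size.
fn chunk_bytes() -> usize {
    CHUNK_SIZE * VALUE_BYTES
}
```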

ParticleDB supports three compression modes for on-disk SSTables:

Mode    Trade-off
none    No CPU overhead; largest files
lz4     Fast compression/decompression; moderate reduction
zstd    Best compression ratio; higher CPU cost

Each table maintains an in-memory primary key index (PkIndexMap) that maps PK values to row locations. Point lookups by primary key resolve in O(1) via this index without scanning any data.

The PK index is backed by a DashMap — a sharded concurrent hash map that provides lock-free reads and fine-grained shard-level write locks. Multiple threads can look up different keys simultaneously without contention. This is critical for high-concurrency OLTP workloads where many connections perform point lookups in parallel.
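The sharding idea can be sketched with standard-library types: N independent shards, each behind its own lock, so operations on different shards never contend. This stands in for what DashMap does internally and is not ParticleDB's actual code:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Sharded PK index: pk -> row position, split across independent shards.
struct ShardedPkIndex {
    shards: Vec<RwLock<HashMap<u64, usize>>>,
}

impl ShardedPkIndex {
    fn new(n: usize) -> Self {
        ShardedPkIndex {
            shards: (0..n).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, pk: u64) -> &RwLock<HashMap<u64, usize>> {
        &self.shards[(pk as usize) % self.shards.len()]
    }

    fn insert(&self, pk: u64, row: usize) {
        self.shard(pk).write().unwrap().insert(pk, row);
    }

    // Point lookup: O(1), touching only one shard's read lock.
    fn get(&self, pk: u64) -> Option<usize> {
        self.shard(pk).read().unwrap().get(&pk).copied()
    }
}
```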

When a row is deleted, the PK index entry is removed using the swap_remove technique:

  1. The row at position i is swapped with the last row in the table.
  2. The PK index entry for the swapped row is updated to point to position i.
  3. The table length is decremented.

This avoids the cost of shifting all subsequent index entries (which would be O(N)) and keeps DELETE operations O(1) regardless of table size.
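The three steps above can be sketched with `Vec::swap_remove` and a plain `HashMap` standing in for the PK index (types and names are illustrative):

```rust
use std::collections::HashMap;

// Delete the row with primary key `pk` in O(1): swap the victim with the
// last row, repoint the moved row's index entry, then truncate.
fn delete_row(
    rows: &mut Vec<(u64, String)>,
    pk_index: &mut HashMap<u64, usize>,
    pk: u64,
) {
    if let Some(i) = pk_index.remove(&pk) {
        rows.swap_remove(i); // the former last row now sits at position i
        if i < rows.len() {
            let moved_pk = rows[i].0;
            pk_index.insert(moved_pk, i); // update the swapped row's position
        }
    }
}
```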

On INSERT, new PK entries are appended to the existing index rather than rebuilding it from scratch. For batch inserts, direct_pk_select_batch_with resolves multiple PKs under a single lock acquisition, amortizing lock overhead across the batch.

Summary of operations:

  • INSERT: new entries appended to the index (incremental).
  • DELETE: entries removed via swap_remove with O(1) index position update.
  • Batch lookup: direct_pk_select_batch_with resolves multiple PKs under a single lock acquisition.

ParticleDB supports a tiered storage model:

  • Hot tier — in-memory batch cache and flat column cache for recent and frequently-accessed data.
  • Cold tier — compressed SSTables on disk for historical data.

Data moves between tiers automatically based on access patterns and configured retention policies.