# Storage Engine
The ParticleDB storage engine is built around an LSM-tree with a write-ahead log (WAL) for durability, layered caches for fast reads, and zone maps that carry precomputed aggregates so the query engine can skip or resolve entire chunks without touching individual rows.
## LSM-Tree

All writes flow through a log-structured merge tree:
- Memtable — incoming writes land in an in-memory sorted structure.
- Immutable memtable — when the active memtable reaches its size threshold it becomes immutable and is queued for flush.
- SSTable flush — immutable memtables are serialized to sorted string table (SST) files on disk.
- Compaction — background compaction merges overlapping SST files to bound read amplification and reclaim space from deleted or overwritten rows.
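The memtable rotation in the first two steps can be sketched as follows. This is an illustrative model only; the type and field names (`LsmState`, `flush_queue`, the byte-count threshold) are assumptions, not ParticleDB's actual internals.

```rust
use std::collections::BTreeMap;

/// Minimal sketch of the LSM write-path state.
struct LsmState {
    active: BTreeMap<Vec<u8>, Vec<u8>>,           // in-memory sorted memtable
    active_bytes: usize,                           // approximate size of `active`
    flush_queue: Vec<BTreeMap<Vec<u8>, Vec<u8>>>,  // immutable memtables awaiting SSTable flush
    threshold: usize,                              // rotation size threshold
}

impl LsmState {
    fn new(threshold: usize) -> Self {
        LsmState {
            active: BTreeMap::new(),
            active_bytes: 0,
            flush_queue: Vec::new(),
            threshold,
        }
    }

    /// Apply a write; rotate the active memtable once it crosses the threshold.
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.active_bytes += key.len() + value.len();
        self.active.insert(key, value);
        if self.active_bytes >= self.threshold {
            // The full memtable becomes immutable and is queued for flush;
            // a fresh empty memtable takes its place.
            let full = std::mem::take(&mut self.active);
            self.flush_queue.push(full);
            self.active_bytes = 0;
        }
    }
}
```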
## Write-Ahead Log (WAL)

Every mutation is first appended to the WAL before being applied in memory. On crash, the WAL is replayed to recover committed transactions that had not yet been flushed to SSTables.
ParticleDB supports three WAL synchronization modes, configured with `--wal-sync-mode`:
| Mode | Behavior | Durability Guarantee |
|---|---|---|
| `sync` | fsync after every WAL entry | No data loss on crash |
| `groupsync` | Batched fsync across concurrent transactions | Sub-millisecond loss window on crash |
| `nosync` | WAL writes skipped entirely | No crash recovery; maximum throughput |
In `nosync` mode, the WAL codepath is gated by an `AtomicBool` flag — no lock contention, no serialization overhead, and no fsync syscalls. This mode is useful for bulk loading or ephemeral analytics where durability is not required.

A thread-local WAL buffer (`TL_WAL_BUF`) eliminates contention on the shared WAL mutex during normal transaction processing.
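The `nosync` gating can be sketched as a single atomic check in front of the log append. The struct shape and method below are illustrative assumptions, not ParticleDB's actual WAL API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Sketch of an AtomicBool-gated WAL append.
struct Wal {
    enabled: AtomicBool,      // false in nosync mode
    log: Mutex<Vec<Vec<u8>>>, // in-memory stand-in for the on-disk log
}

impl Wal {
    fn append(&self, entry: &[u8]) {
        // In nosync mode this single atomic load is the entire WAL cost:
        // no lock acquisition, no serialization, no fsync syscall.
        if !self.enabled.load(Ordering::Relaxed) {
            return;
        }
        self.log.lock().unwrap().push(entry.to_vec());
        // A real implementation would fsync here (sync mode) or batch
        // fsyncs across concurrent transactions (groupsync mode).
    }
}
```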
## Batch Cache

Recently ingested data is held in an in-memory batch cache as Apache Arrow `RecordBatch` arrays. Queries that touch recent data hit this cache directly without reading SSTables from disk.
Cache invalidation rules:
| Operation | Batch cache | Zone maps | Projected cache | Flat columns | Running stats |
|---|---|---|---|---|---|
| INSERT | Appended | Appended | Invalidated | Invalidated | Appended |
| UPDATE | Invalidated | Invalidated | Invalidated | Invalidated | Invalidated |
| DELETE | Invalidated | Invalidated | Invalidated | Invalidated | Invalidated |
INSERT is incremental: new rows are built into `RecordBatch` chunks and appended to the batch cache, zone maps, and running stats without rebuilding from scratch. Only the projected cache and flat column cache are invalidated, since they are cheaper to rebuild on demand.
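The invalidation rules in the table can be sketched as a single match on the DML operation. The cache flags and enum below are illustrative names, not ParticleDB types.

```rust
/// Validity flags for each cache layer (illustrative).
#[derive(Default)]
struct Caches {
    batch_cache_valid: bool,
    zone_maps_valid: bool,
    projected_valid: bool,
    flat_columns_valid: bool,
    running_stats_valid: bool,
}

enum Dml {
    Insert,
    Update,
    Delete,
}

fn apply_dml(c: &mut Caches, op: Dml) {
    match op {
        // INSERT appends incrementally; only the cheap-to-rebuild
        // derived caches are dropped.
        Dml::Insert => {
            c.projected_valid = false;
            c.flat_columns_valid = false;
        }
        // UPDATE and DELETE invalidate every cache layer.
        Dml::Update | Dml::Delete => {
            *c = Caches::default(); // all flags false = all invalidated
        }
    }
}
```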
## Zone Maps

Every chunk of rows (default 8,192 rows per chunk) has an associated `ColumnZoneMap` that stores precomputed statistics:
| Statistic | Type | Purpose |
|---|---|---|
| `min` | Column-native | Range filter elimination |
| `max` | Column-native | Range filter elimination |
| `sum_i128` | 128-bit integer | Ungrouped SUM / AVG without row scan |
| `sum_f64` | 64-bit float | Ungrouped SUM / AVG for float columns |
| `count` | `u64` | Row count per chunk |
| `null_count` | `u64` | Null tracking for correct AVG denominator |
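A sketch of building these statistics for one `i64` chunk. The field names mirror the table above; the builder function itself is hypothetical.

```rust
/// Per-chunk zone-map statistics for an i64 column (sketch).
struct ColumnZoneMap {
    min: i64,
    max: i64,
    sum_i128: i128, // 128-bit so summing 8K i64 values cannot overflow
    count: u64,
    null_count: u64,
}

fn build_zone_map(chunk: &[Option<i64>]) -> ColumnZoneMap {
    let mut zm = ColumnZoneMap {
        min: i64::MAX,
        max: i64::MIN,
        sum_i128: 0,
        count: chunk.len() as u64,
        null_count: 0,
    };
    for v in chunk {
        match v {
            Some(x) => {
                zm.min = zm.min.min(*x);
                zm.max = zm.max.max(*x);
                zm.sum_i128 += *x as i128;
            }
            // Tracked separately so AVG can use (count - null_count)
            // as its denominator.
            None => zm.null_count += 1,
        }
    }
    zm
}
```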
## Zone-Level Aggregation

The query engine uses zone maps in several tiers:

- Running stats (`ColumnRunningStats`) — O(1) table-wide SUM / AVG / MIN / MAX computed from zone map summaries, with no iteration over chunks at all.
- Zone aggregate — O(chunks) aggregation by summing zone-level statistics.
- Tri-state filter classification — each chunk is classified as `All` (every row passes), `None` (no rows pass), or `Partial` (some rows pass). ALL-match chunks resolve from zone stats; NONE-match chunks are skipped; only PARTIAL chunks are scanned row-by-row.
- Chunk group stats (`ChunkGroupStats`) — for columns with 32 or fewer distinct values, per-group SUM / COUNT / MIN / MAX are precomputed at build time, enabling GROUP BY resolution from O(chunks × groups) summaries.
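The tri-state classification can be sketched from a chunk's min/max alone. This example handles only an `x > threshold` predicate; the enum and function names are illustrative.

```rust
/// Tri-state outcome of testing a chunk's zone map against a predicate.
#[derive(Debug, PartialEq)]
enum Class {
    All,     // every row passes: resolve from zone stats
    None,    // no row passes: skip the chunk entirely
    Partial, // some rows pass: must scan row-by-row
}

/// Classify a chunk against `x > threshold` using only zone-map min/max.
fn classify_gt(min: i64, max: i64, threshold: i64) -> Class {
    if min > threshold {
        Class::All
    } else if max <= threshold {
        Class::None
    } else {
        Class::Partial
    }
}
```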
## Example: Zone Map Skip

    Chunk 0: min=1,   max=100 → WHERE x > 200 → NONE    (skip)
    Chunk 1: min=150, max=300 → WHERE x > 200 → PARTIAL (scan)
    Chunk 2: min=400, max=500 → WHERE x > 200 → ALL     (use zone stats)

## Dictionary Encoding
Low-cardinality string columns (e.g., country codes, status enums) are stored with Arrow dictionary encoding. Each distinct string value is stored once in a dictionary array, and rows reference it by a compact `u8` or `u16` key.
Benefits:

- Reduced memory footprint — one copy of each string instead of N copies.
- Direct-index aggregation — GROUP BY on dictionary columns uses the integer key as a direct array index, avoiding hash table lookups entirely.
- Cache-friendly scans — iterating over `u8` keys is far more compact than iterating over variable-length strings.
Dictionary encoding is applied automatically during batch construction when a column’s cardinality falls below the encoding threshold. High-cardinality columns that exceed the threshold during build are converted back to plain string arrays.
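A sketch of the encoding step and the direct-index GROUP BY it enables. Both helper functions are hypothetical; ParticleDB's actual implementation operates on Arrow dictionary arrays.

```rust
use std::collections::HashMap;

/// Dictionary-encode a low-cardinality string column into u8 keys.
/// Returns (dictionary of distinct values, per-row keys).
fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u8>) {
    let mut dict: Vec<String> = Vec::new();
    let mut index: HashMap<&str, u8> = HashMap::new();
    let mut keys = Vec::with_capacity(values.len());
    for v in values {
        let k = *index.entry(*v).or_insert_with(|| {
            dict.push(v.to_string());
            (dict.len() - 1) as u8
        });
        keys.push(k);
    }
    (dict, keys)
}

/// GROUP BY COUNT via direct array indexing: the hot loop touches
/// only a compact u8 key stream and a small counts array, with no
/// hash table lookups.
fn group_count(dict_len: usize, keys: &[u8]) -> Vec<u64> {
    let mut counts = vec![0u64; dict_len];
    for &k in keys {
        counts[k as usize] += 1;
    }
    counts
}
```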
## Flat Column Cache

For analytical queries that scan an entire column, the flat column cache concatenates all batch-cache chunks for a given column into a single contiguous `Vec<i64>` or `Vec<f64>` (the `FlatColumn` enum).

    Batch 0 [8192 i64] ──┐
    Batch 1 [8192 i64] ──┼──▶ FlatColumn::I64(Vec<i64>) [total rows]
    Batch 2 [8192 i64] ──┘

This eliminates per-batch TLB misses and enables the hardware prefetcher to stream data efficiently. The flat column cache is most effective for queries with one or two memory streams (e.g., a single-column filter + single-column aggregate). For three or more concurrent streams, the per-batch 64 KB chunks fit in L1 and perform better than large contiguous arrays.
Cache key format: `"{table}:{column}"`. Invalidated alongside `projected_cache` and `flat_columns` on any DML operation.
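The flattening step above can be sketched as follows. The enum mirrors the doc's `FlatColumn` (shown for `i64` only); the helper function is hypothetical.

```rust
/// A whole column flattened into one contiguous vector (sketch).
enum FlatColumn {
    I64(Vec<i64>),
}

/// Concatenate per-batch chunks into a single contiguous allocation,
/// so scans walk one linear address range.
fn flatten_i64(batches: &[Vec<i64>]) -> FlatColumn {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total); // one allocation up front
    for b in batches {
        out.extend_from_slice(b);
    }
    FlatColumn::I64(out)
}
```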
## Projected Batch Cache

Queries that repeatedly access the same subset of columns benefit from the projected batch cache, keyed on `(table, column_set)`. Instead of calling `batch.project()` on every execution, the projected slices are cached and reused.
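The idea can be sketched as a cache keyed on the table name plus a canonicalized column set. The shapes here are assumptions; the cached value is a column-ordinal list standing in for projected Arrow slices.

```rust
use std::collections::HashMap;

/// Sketch of a projected-batch cache keyed on (table, sorted column set).
struct ProjectedCache {
    map: HashMap<(String, Vec<String>), Vec<usize>>,
}

impl ProjectedCache {
    /// Return the cached projection, computing it at most once per key.
    fn get_or_project<F>(&mut self, table: &str, mut cols: Vec<String>, project: F) -> &Vec<usize>
    where
        F: FnOnce() -> Vec<usize>,
    {
        cols.sort(); // canonicalize so column order does not fragment the cache
        self.map.entry((table.to_string(), cols)).or_insert_with(project)
    }
}
```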
## Chunk Size Tuning

The default `CHUNK_SIZE` is 8,192 rows. At 8 bytes per value, a single column chunk occupies 64 KB (8,192 × 8 bytes) — sized to fit in the CPU L1 data cache. Larger chunk sizes push columns into L2/L3, increasing cache pressure and reducing throughput for scan-heavy analytical queries.
## Compression

ParticleDB supports three compression modes for on-disk SSTables:

| Mode | Trade-off |
|---|---|
| `none` | No CPU overhead; largest files |
| `lz4` | Fast compression/decompression; moderate reduction |
| `zstd` | Best compression ratio; higher CPU cost |
## Primary Key Index

Each table maintains an in-memory primary key index (`PkIndexMap`) that maps PK values to row locations. Point lookups by primary key resolve in O(1) via this index, without scanning any data.
### DashMap-Based Concurrent Access

The PK index is backed by a `DashMap` — a sharded concurrent hash map that provides lock-free reads and fine-grained shard-level write locks. Multiple threads can look up different keys simultaneously without contention. This is critical for high-concurrency OLTP workloads where many connections perform point lookups in parallel.
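The sharding idea behind `DashMap` can be sketched with plain standard-library locks: keys hash to one of N shards, each with its own lock, so operations on different shards never contend. This is an illustration of the design, not DashMap's implementation.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Sketch of a sharded PK index: one RwLock-protected map per shard.
struct ShardedPkIndex {
    shards: Vec<RwLock<HashMap<i64, usize>>>, // PK → row position
}

impl ShardedPkIndex {
    fn new(n: usize) -> Self {
        ShardedPkIndex {
            shards: (0..n).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    /// Pick the shard for a key (a real implementation hashes the key).
    fn shard(&self, pk: i64) -> &RwLock<HashMap<i64, usize>> {
        &self.shards[(pk as usize) % self.shards.len()]
    }

    fn insert(&self, pk: i64, row: usize) {
        // Write-locks only one shard; other shards stay available.
        self.shard(pk).write().unwrap().insert(pk, row);
    }

    fn get(&self, pk: i64) -> Option<usize> {
        // Read locks are shared, so concurrent lookups do not block each other.
        self.shard(pk).read().unwrap().get(&pk).copied()
    }
}
```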
### O(1) DELETE Index Fixup

When a row is deleted, the PK index entry is removed using the swap-remove technique:

1. The row at position `i` is swapped with the last row in the table.
2. The PK index entry for the swapped row is updated to point to position `i`.
3. The table length is decremented.
This avoids the cost of shifting all subsequent index entries (which would be O(N)) and keeps DELETE operations O(1) regardless of table size.
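The three steps can be sketched as follows; the row storage and index shapes are illustrative, not ParticleDB's actual layout.

```rust
use std::collections::HashMap;

/// O(1) DELETE via swap-remove with index fixup (sketch).
fn delete_row(rows: &mut Vec<(i64, &str)>, pk_index: &mut HashMap<i64, usize>, pk: i64) {
    if let Some(pos) = pk_index.remove(&pk) {
        // Swap the victim with the last row and pop: no shifting of
        // subsequent entries.
        rows.swap_remove(pos);
        // If a row was moved into `pos`, repoint its index entry.
        if pos < rows.len() {
            let moved_pk = rows[pos].0;
            pk_index.insert(moved_pk, pos);
        }
    }
}
```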
### Incremental PK Index Building

On INSERT, new PK entries are appended to the existing index rather than rebuilding it from scratch. For batch inserts, `direct_pk_select_batch_with` resolves multiple PKs under a single lock acquisition, amortizing lock overhead across the batch.
Summary of operations:

- INSERT: new entries appended to the index (incremental).
- DELETE: entries removed via `swap_remove` with an O(1) index position update.
- Batch lookup: `direct_pk_select_batch_with` resolves multiple PKs under a single lock acquisition.
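The single-lock batch lookup can be sketched as below. The helper's name and signature are assumptions for illustration, not ParticleDB's actual `direct_pk_select_batch_with` API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Resolve a batch of PKs under one lock acquisition, amortizing
/// lock overhead across N lookups instead of paying it N times.
fn pk_select_batch(index: &Mutex<HashMap<i64, usize>>, pks: &[i64]) -> Vec<Option<usize>> {
    let guard = index.lock().unwrap(); // single acquisition for the whole batch
    pks.iter().map(|pk| guard.get(pk).copied()).collect()
}
```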
## Storage Tiers

ParticleDB supports a tiered storage model:
- Hot tier — in-memory batch cache and flat column cache for recent and frequently-accessed data.
- Cold tier — compressed SSTables on disk for historical data.
Data moves between tiers automatically based on access patterns and configured retention policies.