Query Engine

The ParticleDB query engine executes physical plans using vectorized operators over Apache Arrow columnar arrays. Execution is parallelized across chunks via the Rayon thread pool, and multiple optimization tiers — from zone-level precomputed aggregates down to monomorphized SIMD loops — minimize the work performed per row.

SQL text
  → Parse → AST → Logical Plan
  → Optimizer (pushdown, folding, join reorder)
  → Physical Plan
  → Vectorized Execution (Arrow RecordBatch chunks, 8,192 rows each)
  → Result batches → wire format (PG / gRPC / HTTP / RESP / WS)

Each physical operator processes data in chunks of 8,192 rows. At 8 bytes per value, one column chunk occupies 64 KB — sized to fit in the CPU L1 data cache. This chunk size balances per-chunk overhead against cache efficiency.
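The sizing arithmetic is straightforward; a minimal sketch (constants taken from this page, names illustrative):

```rust
// Chunk-size arithmetic from the text: 8,192 rows of 8-byte values.
const CHUNK_ROWS: usize = 8192;
const BYTES_PER_VALUE: usize = 8; // one i64/f64 value

fn chunk_bytes() -> usize {
    CHUNK_ROWS * BYTES_PER_VALUE
}

fn main() {
    // 8192 rows * 8 B = 65,536 B = 64 KB, sized to fit in a typical L1 data cache.
    assert_eq!(chunk_bytes(), 64 * 1024);
}
```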

Repeated identical queries hit a plan cache keyed on normalized SQL text. The cache stores the compiled physical plan so parsing and optimization are skipped on subsequent executions. The cache is invalidated on DDL changes (CREATE TABLE, DROP TABLE, CREATE INDEX).

A query result cache sits on top of the plan cache for fully deterministic queries: identical SQL text returns the cached result set in O(1) without re-execution. The result cache is invalidated on any DML operation that touches the referenced tables.
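A minimal sketch of that caching scheme, assuming a normalize-then-lookup key and per-table invalidation; `ResultCache` and `normalize` are illustrative names, not ParticleDB's actual types:

```rust
use std::collections::HashMap;

// Hypothetical result cache keyed on normalized SQL text.
// Each entry remembers which tables it reads so DML can evict it.
struct ResultCache {
    entries: HashMap<String, (Vec<String> /* tables read */, String /* result */)>,
}

impl ResultCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // O(1) hit for a previously executed, identical (post-normalization) query.
    fn get(&self, sql: &str) -> Option<&String> {
        self.entries.get(&normalize(sql)).map(|(_, r)| r)
    }

    fn put(&mut self, sql: &str, tables: Vec<String>, result: String) {
        self.entries.insert(normalize(sql), (tables, result));
    }

    // Any DML touching `table` evicts every cached result that references it.
    fn invalidate_table(&mut self, table: &str) {
        self.entries
            .retain(|_, (tables, _)| !tables.iter().any(|t| t == table));
    }
}

// Illustrative normalization: collapse whitespace, lowercase.
fn normalize(sql: &str) -> String {
    sql.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}
```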

The engine uses cost-based dispatch to select the fastest aggregation strategy for each query, based on the GROUP BY column type, the number of distinct groups, the sort order, and the number of aggregate functions.

Queries without GROUP BY (e.g., SELECT SUM(x) FROM t) resolve through a tiered fast path:

  1. Running stats — O(1) lookup from precomputed table-wide statistics.
  2. Zone aggregate — O(chunks) summation of per-chunk zone map statistics.
  3. Raw-value accumulation — a tight loop over the raw values (for &v in vals) that LLVM auto-vectorizes into SIMD instructions, bypassing Arrow compute-kernel overhead.
  4. COUNT(*) — resolved from metadata in ~20 µs, with no row scan.
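The tier ordering above can be sketched as a single dispatch function. This is an illustration only: `TableStats` and `ZoneMap` are hypothetical stand-ins for the engine's metadata structures.

```rust
// Hypothetical metadata carried alongside a table and its chunks.
struct TableStats { sum: Option<i64> } // tier 1: precomputed running stats
struct ZoneMap { sum: i64 }            // tier 2: per-chunk zone statistics

fn sum_fast_path(stats: &TableStats, zones: &[ZoneMap], vals: &[i64]) -> i64 {
    // Tier 1: O(1) lookup from table-wide running stats.
    if let Some(s) = stats.sum {
        return s;
    }
    // Tier 2: O(chunks) summation of zone-map statistics.
    if !zones.is_empty() {
        return zones.iter().map(|z| z.sum).sum();
    }
    // Tier 3: tight raw-value loop that LLVM can auto-vectorize.
    let mut acc = 0i64;
    for &v in vals {
        acc += v;
    }
    acc
}
```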

For GROUP BY columns with a small number of distinct values (groups × aggs ≤ 8), the engine builds a boolean mask per group value and accumulates each aggregate using bit-scan iteration over the mask. This strategy is used for dictionary-encoded string columns below the cardinality threshold.
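A hedged sketch of the mask-plus-bit-scan idea, restricted to a single 64-row word for brevity (the function name and layout are illustrative):

```rust
// One boolean mask per group value, then bit-scan iteration
// (trailing_zeros + clear-lowest-bit) visits only the set bits.
fn sum_per_group(keys: &[u8], vals: &[i64], num_groups: usize) -> Vec<i64> {
    assert!(keys.len() <= 64, "sketch handles one 64-row word");
    let mut sums = vec![0i64; num_groups];
    for g in 0..num_groups {
        // Build this group's bitmask over the chunk.
        let mut mask: u64 = 0;
        for (i, &k) in keys.iter().enumerate() {
            if k as usize == g {
                mask |= 1 << i;
            }
        }
        // Bit-scan: pop the lowest set bit until the mask is empty.
        let mut m = mask;
        while m != 0 {
            let i = m.trailing_zeros() as usize;
            sums[g] += vals[i];
            m &= m - 1; // clear lowest set bit
        }
    }
    sums
}
```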

When the GROUP BY column is an integer with a bounded range, the engine allocates a dense accumulator array indexed directly by the group key value. Each row maps to its accumulator slot in O(1) — no hashing, no collision handling.
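A minimal sketch of dense-array aggregation under those assumptions (bounded key range known up front; function name illustrative):

```rust
// Dense-array GROUP BY SUM for a bounded integer key:
// the key value indexes the accumulator directly, so there is
// no hashing and no collision handling.
fn group_sum_dense(keys: &[i64], vals: &[i64], min: i64, max: i64) -> Vec<i64> {
    let width = (max - min + 1) as usize;
    let mut acc = vec![0i64; width];
    for (&k, &v) in keys.iter().zip(vals) {
        acc[(k - min) as usize] += v; // O(1) slot lookup per row
    }
    acc
}
```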

Variants:

Strategy            | Condition
Global dense        | Per-thread memory ≤ 64 MB
Range-limited dense | Sorted data with per-thread range ≤ 32 MB
Radix-partition hash| Unsorted high-cardinality exceeding dense thresholds

Range-limited dense is especially effective for naturally sorted columns (e.g., session_id = row_index / 20). Each Rayon thread accumulates into a small dense array spanning only its local [min, max] range, then results are merged via parallel tree reduction with into_par_iter().reduce() in log2(N) levels.
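The merge step can be sketched without Rayon as a sequential pairwise tree reduction; the real engine does this in parallel with into_par_iter().reduce(), and this sketch simplifies the range-limited case by assuming equal-length per-thread partials:

```rust
// Merge two per-thread partial accumulator arrays element-wise.
fn merge(a: Vec<i64>, b: Vec<i64>) -> Vec<i64> {
    a.into_iter().zip(b).map(|(x, y)| x + y).collect()
}

// Pairwise tree reduction: each pass halves the number of partials,
// so N partials are merged in log2(N) levels.
fn tree_reduce(mut parts: Vec<Vec<i64>>) -> Vec<i64> {
    while parts.len() > 1 {
        let mut next = Vec::with_capacity((parts.len() + 1) / 2);
        let mut it = parts.into_iter();
        while let Some(a) = it.next() {
            match it.next() {
                Some(b) => next.push(merge(a, b)),
                None => next.push(a), // odd one out carries to the next level
            }
        }
        parts = next;
    }
    parts.pop().unwrap_or_default()
}
```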

For unsorted high-cardinality GROUP BY columns that exceed dense-array memory budgets, the engine uses radix-partitioned open-addressing hash tables (CompactPart). Each partition fits in cache, and partitions are merged in parallel using a field_merge vector of pre-resolved AccAction operations.
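The partitioning step itself (not CompactPart, which is ParticleDB-specific) can be illustrated with a low-bits radix split; each partition's hash table then stays small enough to be cache-resident:

```rust
// Split keys into 2^bits partitions by their low bits, so each
// partition can be aggregated in its own cache-resident hash table.
fn radix_partition(keys: &[u64], bits: u32) -> Vec<Vec<u64>> {
    let parts = 1usize << bits;
    let mask = parts as u64 - 1;
    let mut out = vec![Vec::new(); parts];
    for &k in keys {
        out[(k & mask) as usize].push(k);
    }
    out
}
```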

Dictionary-encoded columns use the raw integer dictionary key as a direct array index into accumulators — no hash table, no string comparison. The u8 key slice is accessed directly for maximum cache efficiency.

The engine fuses filter evaluation and aggregate accumulation into a single pass over the data, eliminating intermediate RecordBatch materialization.

Before scanning rows, each chunk is classified using zone map min/max values:

Classification | Action
ALL            | Every row passes — use zone-level stats
NONE           | No rows pass — skip entirely
PARTIAL        | Some rows pass — scan with mask

This classification runs before any row-level work, and for many filtered aggregates the majority of chunks resolve as ALL or NONE.
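For a predicate of the form x > c, the classification reduces to two comparisons against the chunk's zone-map min/max; a minimal sketch (names illustrative):

```rust
#[derive(Debug, PartialEq)]
enum ZoneClass {
    All,     // every row passes: answer from zone-level stats
    None,    // no row passes: skip the chunk entirely
    Partial, // mixed: scan the chunk with a row mask
}

// Classify one chunk for the predicate `x > c` using its zone map.
fn classify_gt(zone_min: i64, zone_max: i64, c: i64) -> ZoneClass {
    if zone_min > c {
        ZoneClass::All
    } else if zone_max <= c {
        ZoneClass::None
    } else {
        ZoneClass::Partial
    }
}
```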

Filter predicates are dispatched once through a macro-generated monomorphization layer. Each comparison operator (LT, GT, LE, GE, EQ, NE) generates a specialized closure that compiles to a direct SIMD comparison instruction, with no per-element match dispatch.

For compound AND filters, a nested macro generates 36 monomorphized variants (6 × 6 operator pairs), enabling LLVM to fully auto-vectorize the fused filter+aggregate loop.
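A hedged sketch of the macro-generated monomorphization pattern: each operator stamps out its own specialized kernel, so the inner loop contains a bare comparison that LLVM can vectorize. The macro and function names here are illustrative, not ParticleDB's.

```rust
// One specialized comparison kernel per operator: the operator is a
// compile-time token, so the per-element loop has no dispatch at all.
macro_rules! cmp_kernel {
    ($name:ident, $op:tt) => {
        fn $name(vals: &[i64], c: i64, out: &mut [bool]) {
            for (o, &v) in out.iter_mut().zip(vals) {
                *o = v $op c; // straight-line comparison, SIMD-friendly
            }
        }
    };
}

cmp_kernel!(cmp_lt, <);
cmp_kernel!(cmp_gt, >);
cmp_kernel!(cmp_eq, ==);
```

A compound-AND layer would nest the same idea, pairing two operator tokens per variant to produce the 36 fused kernels.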

Memory Streams | Best Path                 | Reason
1-2            | Contiguous flat column    | Hardware prefetcher handles 1-2 streams
3+             | Per-batch (64 KB chunks)  | L1-resident chunks beat 3+ large scans

The engine selects the path automatically based on the number of distinct column references in the filter and aggregate expressions.

When the filter column and aggregate column are identical (e.g., SELECT AVG(rating) FROM t WHERE rating > 3), only one memory stream is needed. The filtered values are accumulated directly without loading a second column, halving memory bandwidth.
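A minimal sketch of that single-stream fused path (function name illustrative): filter and accumulate read the same column once, using a 0/1 mask so the loop stays branchless.

```rust
// Single-stream fused filter + AVG for a query shaped like
// SELECT AVG(rating) FROM t WHERE rating > c: one pass, one column.
fn avg_where_gt(vals: &[f64], c: f64) -> Option<f64> {
    let (mut sum, mut n) = (0.0f64, 0u64);
    for &v in vals {
        let m = (v > c) as u64;  // 0 or 1
        sum += v * m as f64;     // branchless multiply-by-mask accumulate
        n += m;
    }
    (n > 0).then(|| sum / n as f64)
}
```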

On ARM (Apple Silicon), float aggregation uses val * m as f64 where m is 0 or 1. This avoids branch misprediction and costly general-purpose-to-NEON register file transfers that would result from bit-masking approaches like f64::from_bits(...).

For integer aggregation, val & mask.wrapping_neg() provides branchless selection without any floating-point conversion.
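Both branchless tricks fit in a few lines; a sketch with illustrative helper names:

```rust
// Float path: multiply by a 0.0/1.0 mask. No branch, and on ARM no
// general-purpose-to-NEON register transfer from bit-masking.
fn masked_add_f64(acc: f64, val: f64, pass: bool) -> f64 {
    let m = pass as u64; // 0 or 1
    acc + val * m as f64
}

// Integer path: mask.wrapping_neg() turns 1 into all-ones and 0 into
// all-zeros, so the AND keeps or drops val with no float conversion.
fn masked_add_i64(acc: i64, val: i64, pass: bool) -> i64 {
    let mask = pass as i64;
    acc + (val & mask.wrapping_neg())
}
```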

The engine builds a hash table on the smaller side of the join and streams the larger (probe) side batch by batch. This prevents OOM on large joins — only the build side plus one probe batch need to fit in memory at any time.

For inputs exceeding 100K rows, the probe phase runs in parallel using par_iter over probe batches against the shared read-only hash table.

When the join key is Int64, a specialized probe_int64_inner path avoids generic hashing overhead and uses direct integer comparison.

For queries like SELECT SUM(b.amount) FROM a JOIN b ON a.id = b.id, the engine fuses the join probe and aggregation into a single pass using scalar accumulators — no per-group arrays, no intermediate join output materialization.
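A hedged sketch of that fusion, simplified to a COUNT-free SUM with unique build-side keys (duplicate build keys would need per-key multiplicities; names are illustrative):

```rust
use std::collections::HashSet;

// Fused join-probe + SUM for SELECT SUM(b.amount) FROM a JOIN b ON a.id = b.id:
// each probe hit feeds a scalar accumulator directly, so no joined rows
// are ever materialized.
fn fused_join_sum(build_ids: &[i64], probe_ids: &[i64], amounts: &[i64]) -> i64 {
    // Build side: the smaller input becomes a hash set of join keys.
    let build: HashSet<i64> = build_ids.iter().copied().collect();
    let mut sum = 0i64;
    // Probe side streams through batch by batch; only the accumulator grows.
    for (&id, &amt) in probe_ids.iter().zip(amounts) {
        if build.contains(&id) {
            sum += amt;
        }
    }
    sum
}
```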

Built-in functions are evaluated on whole Arrow columns rather than row-by-row:

Function | Vectorized? | Notes
ABS      | Yes         | Arrow compute kernel
UPPER    | Yes         | String array transform
LOWER    | Yes         | String array transform
LENGTH   | Yes         | String array transform
COALESCE | Yes         | Null-aware column merge
This eliminates per-row allocation of Vec<(String, ScalarValue)> intermediaries. Functions not yet vectorized fall back to a row-by-row eval_expr_to_array_fallback path.

When the plan contains a Limit(Sort(input)) pattern, the engine passes the combined limit + offset to Arrow’s lexsort_to_indices for O(N log K) partial sort instead of O(N log N) full sort. This is particularly effective for ORDER BY ... LIMIT queries where K is much smaller than N.
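The O(N log K) idea behind the partial sort can be illustrated with a bounded heap; this is the general Top-K technique, not Arrow's lexsort_to_indices implementation itself:

```rust
use std::collections::BinaryHeap;

// Keep only the K smallest values in a max-heap of size K:
// each of the N elements costs at most O(log K), versus
// O(log N) per element for a full sort.
fn smallest_k(vals: &[i64], k: usize) -> Vec<i64> {
    let mut heap = BinaryHeap::with_capacity(k); // max-heap of the K smallest so far
    for &v in vals {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(&top) = heap.peek() {
            if v < top {
                heap.pop(); // evict the largest of the current K
                heap.push(v);
            }
        }
    }
    heap.into_sorted_vec() // ascending, like ORDER BY ... LIMIT k
}
```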

The Rayon thread pool is used throughout execution:

  • Chunk-level parallelism — scan, filter, and aggregate operators process chunks concurrently.
  • Parallel projection — execute_project uses par_iter for multi-batch inputs.
  • Parallel min/max pre-scan — key range detection uses par_iter().reduce().
  • Parallel tree reduction — dense-array merge across threads uses into_par_iter().reduce() in log2(N) levels instead of sequential accumulation.
  • Parallel join probe — probe batches processed concurrently against the shared build-side hash table.

An is_likely_point_select heuristic classifies incoming queries. Point selects execute on the async Tokio thread, while analytical queries are routed through tokio::task::spawn_blocking to the Rayon pool, preventing long-running OLAP queries from blocking OLTP throughput.