Transaction Engine

ParticleDB provides full ACID transactions using multi-version concurrency control (MVCC) with snapshot isolation. The transaction engine supports three WAL synchronization modes, concurrent INSERT appends, two-phase commit, and row-level locking with FOR UPDATE / FOR SHARE semantics.

Every transaction operates on a consistent snapshot of the database taken at transaction start time. Readers never block writers, and writers never block readers.

BEGIN; -- snapshot taken at this point
SELECT * FROM orders; -- sees data as of snapshot time
-- concurrent INSERT into orders by another transaction
SELECT * FROM orders; -- still sees original snapshot (repeatable read)
COMMIT;

Each row version carries a transaction ID indicating which transaction created it. A transaction can only see row versions created by transactions that committed before its snapshot was taken. Uncommitted and later-committed versions are invisible.
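The visibility rule can be sketched in a few lines. This is an illustrative model, not the engine's actual types: RowVersion, Snapshot, and is_visible are hypothetical names, and the snapshot is simplified to a set of committed transaction IDs.

```rust
// Hypothetical sketch of the snapshot-visibility rule: a row version is
// visible iff its creating transaction committed before the reader's
// snapshot was taken, or the reader wrote it itself.

use std::collections::HashSet;

struct RowVersion {
    created_by: u64, // transaction ID that wrote this version
}

struct Snapshot {
    committed_before: HashSet<u64>, // txn IDs committed when the snapshot was taken
}

fn is_visible(version: &RowVersion, snap: &Snapshot, own_txn: u64) -> bool {
    // A transaction always sees its own writes; otherwise the writer must
    // have committed before this snapshot was taken.
    version.created_by == own_txn || snap.committed_before.contains(&version.created_by)
}

fn main() {
    let snap = Snapshot { committed_before: [1, 2].into_iter().collect() };
    assert!(is_visible(&RowVersion { created_by: 2 }, &snap, 7));  // committed earlier
    assert!(!is_visible(&RowVersion { created_by: 9 }, &snap, 7)); // later / uncommitted
    assert!(is_visible(&RowVersion { created_by: 7 }, &snap, 7));  // own write
}
```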

If two concurrent transactions attempt to modify the same row, the second writer detects the conflict and aborts. The application can retry the aborted transaction, which will acquire a new snapshot.
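A first-writer-wins check of this kind can be sketched as follows. The sketch simplifies the snapshot to a single high-water mark (real MVCC snapshots track a set of in-flight transactions); try_write and WriteOutcome are illustrative names, not engine API.

```rust
// Hypothetical sketch of first-writer-wins conflict detection: a writer
// aborts if the row's newest version was created by a transaction that is
// not visible in the writer's snapshot (a concurrent writer got there first).

#[derive(Debug, PartialEq)]
enum WriteOutcome {
    Ok,
    Conflict, // abort and let the application retry with a new snapshot
}

fn try_write(newest_version_txn: u64, snapshot_high_water: u64, own_txn: u64) -> WriteOutcome {
    // Versions at or below the snapshot's high-water mark are visible;
    // anything newer was written concurrently.
    if newest_version_txn == own_txn || newest_version_txn <= snapshot_high_water {
        WriteOutcome::Ok
    } else {
        WriteOutcome::Conflict
    }
}

fn main() {
    // Snapshot taken when txn 10 had committed; our txn is 12.
    assert_eq!(try_write(9, 10, 12), WriteOutcome::Ok); // old version: safe to update
    assert_eq!(try_write(11, 10, 12), WriteOutcome::Conflict); // concurrent txn 11 won
}
```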

The write-ahead log guarantees durability by persisting every mutation before it is applied in memory. Three synchronization modes let you trade durability for throughput:

particledb start --wal-sync-mode sync

Every WAL entry is followed by an fsync system call. This guarantees zero data loss on crash at the cost of one fsync per write operation. Best for workloads where every transaction must survive a power failure.

particledb start --wal-sync-mode groupsync

WAL entries from concurrent transactions are batched together and flushed with a single fsync. This amortizes the cost of fsync across multiple transactions, providing a sub-millisecond durability window. In TPC-C benchmarks, groupsync reaches 175K TPS peak — up to 1.5x the throughput of PostgreSQL 17 at 8 workers.

particledb start --wal-sync-mode nosync

The WAL is disabled entirely. An AtomicBool flag gates all WAL codepaths — no serialization, no buffer allocation, no fsync. Additional optimizations activate in this mode:

  • txn_table_locks skip — the global transaction table-lock mutex is bypassed via an AtomicBool flag, eliminating ~640K mutex acquisitions/sec at 8 workers.
  • Thread-local WAL buffer disabled — no buffer allocation or flush.

This mode achieves 159K TPS at 8 workers but provides no crash recovery. Suitable for bulk loading, ephemeral analytics, or workloads with external durability guarantees (e.g., replay from an upstream event stream).
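The AtomicBool gate described above amounts to a single relaxed load on the hot path. A minimal sketch, assuming illustrative names (WAL_ENABLED, log_mutation are not the engine's actual identifiers):

```rust
// Minimal sketch of gating all WAL codepaths behind an AtomicBool: when the
// flag is off, a write skips serialization, buffer allocation, and fsync
// after one cheap atomic load.

use std::sync::atomic::{AtomicBool, Ordering};

static WAL_ENABLED: AtomicBool = AtomicBool::new(true);

fn log_mutation(record: &[u8]) -> usize {
    // Fast path: one relaxed load and out.
    if !WAL_ENABLED.load(Ordering::Relaxed) {
        return 0;
    }
    // (serialize `record`, append to the WAL buffer, fsync per sync mode ...)
    record.len()
}

fn main() {
    assert_eq!(log_mutation(b"insert"), 6);
    WAL_ENABLED.store(false, Ordering::Relaxed); // --wal-sync-mode nosync
    assert_eq!(log_mutation(b"insert"), 0);
}
```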

INSERT operations use a read lock plus an append lock rather than an exclusive write lock on the table:

Traditional:  td.write()              -- exclusive lock, blocks all readers
Concurrent:   td.read() + append_lock -- read lock + Mutex on append position

try_concurrent_push and try_concurrent_extend append new rows under the shared read lock, with a Mutex serializing only the append position update. This means:

  • INSERTs do not block concurrent SELECTs — readers hold the same read lock.
  • INSERTs do not block concurrent UPDATEs — updates operate on existing rows.
  • Falls back to an exclusive write lock only when the underlying storage needs capacity growth (rare, amortized).

Row counts are maintained with fetch_add (atomic increment) instead of load + store, which is safe for concurrent appenders.
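The scheme above can be sketched with std primitives. This is a simplified model, not the engine's storage layer: Table, concurrent_push, and the fixed-capacity slot vector are assumptions, and the capacity-growth fallback to a write lock is elided.

```rust
// Sketch of concurrent append: rows go in under a shared read lock, a small
// Mutex serializes only the append position, and the row count is maintained
// with an atomic fetch_add.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

struct Table {
    // Fixed-capacity storage; a real engine falls back to an exclusive
    // write lock only when capacity must grow.
    slots: RwLock<Vec<Mutex<Option<u64>>>>,
    append_pos: Mutex<usize>,
    row_count: AtomicUsize,
}

fn concurrent_push(table: &Table, row: u64) {
    let slots = table.slots.read().unwrap(); // shared with readers and other appenders
    let idx = {
        let mut pos = table.append_pos.lock().unwrap(); // serialize only the position
        let i = *pos;
        *pos += 1;
        i
    };
    *slots[idx].lock().unwrap() = Some(row);
    table.row_count.fetch_add(1, Ordering::Relaxed); // safe for concurrent appenders
}

fn main() {
    let table = Arc::new(Table {
        slots: RwLock::new((0..1000).map(|_| Mutex::new(None)).collect()),
        append_pos: Mutex::new(0),
        row_count: AtomicUsize::new(0),
    });
    let handles: Vec<_> = (0..4)
        .map(|t| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                for i in 0..250 {
                    concurrent_push(&table, t * 250 + i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // All 1000 appends landed, with no exclusive table lock taken.
    assert_eq!(table.row_count.load(Ordering::Relaxed), 1000);
}
```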

ParticleDB supports three concurrency-control strategies, selected at server start with the --txn-mode flag:

particledb start --txn-mode table-2pl # default
particledb start --txn-mode row-2pl
particledb start --txn-mode fast
Mode        Locking granularity                       Isolation        Best for
table-2pl   Table-level two-phase locking             Serializable     Simplest correctness model (default)
row-2pl     Row-level via DashMap (PK-based locks)    Serializable     High-concurrency OLTP (TPC-C)
fast        No locking                                Read-committed   Maximum throughput, single-writer workloads

In row-2pl mode, row locks are tracked in a DashMap keyed by primary key value. DashMap is a concurrent hash map with shard-level locking, providing lock-free read access and fine-grained write contention — far less overhead than a global table mutex.

Transactions acquire row locks on first access and release them at commit or rollback. The DashMap approach means two transactions modifying different rows in the same table never contend with each other.
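The PK-keyed locking protocol can be sketched as below. Note the stand-in: to keep the example dependency-free it uses a single Mutex<HashMap> where the engine uses DashMap's sharded map, so only the protocol (not the contention behavior) is illustrated; RowLocks, try_lock, and release_all are assumed names.

```rust
// Sketch of row-level 2PL keyed by primary key: locks are taken on first
// access and released in bulk at commit or rollback.

use std::collections::HashMap;
use std::sync::Mutex;

struct RowLocks {
    held: Mutex<HashMap<u64, u64>>, // primary key -> owning transaction ID
}

impl RowLocks {
    fn try_lock(&self, pk: u64, txn: u64) -> bool {
        let mut held = self.held.lock().unwrap();
        match held.get(&pk) {
            Some(&owner) => owner == txn, // re-entrant for the owner, denied otherwise
            None => {
                held.insert(pk, txn);
                true
            }
        }
    }

    fn release_all(&self, txn: u64) {
        // Called at COMMIT or ROLLBACK.
        self.held.lock().unwrap().retain(|_, owner| *owner != txn);
    }
}

fn main() {
    let locks = RowLocks { held: Mutex::new(HashMap::new()) };
    assert!(locks.try_lock(42, 1)); // txn 1 locks row 42
    assert!(!locks.try_lock(42, 2)); // txn 2 contends on the same row
    assert!(locks.try_lock(7, 2)); // different rows never contend
    locks.release_all(1);
    assert!(locks.try_lock(42, 2)); // lock freed at commit
}
```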

Read transactions use a lightweight wait_for_writer barrier: if a reader’s snapshot overlaps with an in-flight writer, the reader waits for the writer to commit or abort rather than acquiring a read lock. This avoids the overhead of shared-lock bookkeeping while still preserving snapshot consistency.
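A barrier of this shape can be sketched with a Mutex and Condvar. WriterGate and its methods are illustrative names, and the in-flight bookkeeping is reduced to a counter:

```rust
// Sketch of a wait_for_writer barrier: a reader overlapping an in-flight
// writer parks on a condition variable until the writer commits or aborts,
// instead of taking a shared lock of its own.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct WriterGate {
    in_flight: Mutex<usize>, // number of overlapping writers
    done: Condvar,
}

impl WriterGate {
    fn wait_for_writer(&self) {
        let mut n = self.in_flight.lock().unwrap();
        while *n > 0 {
            n = self.done.wait(n).unwrap(); // park until commit/abort
        }
    }

    fn writer_finished(&self) {
        *self.in_flight.lock().unwrap() -= 1;
        self.done.notify_all();
    }
}

fn main() {
    let gate = Arc::new(WriterGate { in_flight: Mutex::new(1), done: Condvar::new() });
    let g = Arc::clone(&gate);
    let writer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10)); // writer still committing
        g.writer_finished();
    });
    gate.wait_for_writer(); // reader blocks until the writer is done
    writer.join().unwrap();
    assert_eq!(*gate.in_flight.lock().unwrap(), 0);
}
```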

In groupsync WAL mode, ParticleDB uses a leader/follower group commit pattern:

  1. The first transaction to reach the WAL flush point becomes the leader.
  2. Subsequent transactions that arrive while the leader’s fsync is in flight register as followers and park on a condition variable.
  3. When fsync completes, the leader wakes all followers — their WAL entries were included in the same physical flush.

This batches multiple transaction commits into a single fsync syscall. In TPC-C benchmarks, group commit provides a +76% throughput improvement over per-transaction sync at 8 workers.
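The leader/follower handoff can be sketched with a Mutex and Condvar. GroupCommit, State, and the generation counter are illustrative, and the fsync is simulated with a sleep; the real engine flushes actual WAL buffers at this point.

```rust
// Sketch of leader/follower group commit: the first committer becomes the
// leader, performs one (simulated) fsync for everything queued so far, and
// wakes followers whose entries rode along in the same physical flush.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct GroupCommit {
    state: Mutex<State>,
    flushed: Condvar,
}

struct State {
    leader_active: bool,
    generation: u64,    // bumped once per fsync
    flush_count: usize, // fsync calls issued
}

fn commit(gc: &GroupCommit) {
    let mut s = gc.state.lock().unwrap();
    if s.leader_active {
        // Follower: our WAL entry is queued; park until the leader's fsync
        // (generation bump) covers it.
        let my_gen = s.generation;
        while s.generation == my_gen {
            s = gc.flushed.wait(s).unwrap();
        }
    } else {
        // Leader: fsync outside the lock so followers can register.
        s.leader_active = true;
        drop(s);
        thread::sleep(Duration::from_millis(5)); // simulated fsync
        s = gc.state.lock().unwrap();
        s.flush_count += 1;
        s.generation += 1;
        s.leader_active = false;
        gc.flushed.notify_all();
    }
}

fn main() {
    let gc = Arc::new(GroupCommit {
        state: Mutex::new(State { leader_active: false, generation: 0, flush_count: 0 }),
        flushed: Condvar::new(),
    });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let gc = Arc::clone(&gc);
            thread::spawn(move || commit(&gc))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let s = gc.state.lock().unwrap();
    // Concurrent committers share flushes: never more fsyncs than commits.
    assert!(s.flush_count >= 1 && s.flush_count <= 8);
    println!("fsyncs for 8 commits: {}", s.flush_count);
}
```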

All mutations within a transaction are accumulated in a TXN_BUFFERED_OPS buffer rather than being applied immediately:

  • INSERTs, UPDATEs, and DELETEs are recorded as buffered operations.
  • On COMMIT, the buffer is replayed in order against the storage engine.
  • On ROLLBACK, the buffer is discarded with no storage side effects.

This design ensures that partial transactions never leave visible state and simplifies conflict detection — the engine checks the buffer for write-write conflicts at commit time.
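The buffer-and-replay lifecycle can be sketched as follows. BufferedOp, Txn, and the HashMap storage are simplifications standing in for TXN_BUFFERED_OPS and the real storage engine:

```rust
// Sketch of commit-time replay: mutations accumulate in a per-transaction
// buffer and touch storage only on COMMIT; ROLLBACK drops the buffer with
// no storage side effects.

use std::collections::HashMap;

enum BufferedOp {
    Insert { pk: u64, val: String },
    Update { pk: u64, val: String },
    Delete { pk: u64 },
}

struct Txn {
    buffered_ops: Vec<BufferedOp>, // cf. TXN_BUFFERED_OPS
}

fn commit(txn: Txn, storage: &mut HashMap<u64, String>) {
    for op in txn.buffered_ops {
        // Replayed in order against the storage engine.
        match op {
            BufferedOp::Insert { pk, val } | BufferedOp::Update { pk, val } => {
                storage.insert(pk, val);
            }
            BufferedOp::Delete { pk } => {
                storage.remove(&pk);
            }
        }
    }
}

fn rollback(txn: Txn) {
    drop(txn.buffered_ops); // discard; storage never sees partial state
}

fn main() {
    let mut storage = HashMap::new();
    let txn = Txn {
        buffered_ops: vec![
            BufferedOp::Insert { pk: 1, val: "a".into() },
            BufferedOp::Update { pk: 1, val: "b".into() },
            BufferedOp::Delete { pk: 1 },
            BufferedOp::Insert { pk: 2, val: "c".into() },
        ],
    };
    commit(txn, &mut storage);
    assert_eq!(storage.get(&2).map(String::as_str), Some("c"));
    assert!(!storage.contains_key(&1)); // insert+delete in one txn nets out
    rollback(Txn { buffered_ops: vec![BufferedOp::Delete { pk: 2 }] });
    assert!(storage.contains_key(&2)); // rollback left storage untouched
}
```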

The TXN_OVERLAY provides a read-your-writes view within an open transaction. When a query runs inside a transaction, the engine merges the overlay (buffered but uncommitted writes) with the base snapshot so that statements like INSERT ... ON CONFLICT DO UPDATE can see rows inserted earlier in the same transaction.
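The merge can be sketched as an overlay-first lookup. This is a simplified model of what TXN_OVERLAY does, with assumed types: the overlay maps primary keys to Option values, where None marks a row deleted within the open transaction.

```rust
// Sketch of read-your-writes: lookups consult the transaction's buffered
// but uncommitted writes first, falling back to the base snapshot.

use std::collections::HashMap;

/// `None` in the overlay marks a row deleted within the open transaction.
fn read<'a>(
    pk: u64,
    overlay: &'a HashMap<u64, Option<String>>,
    snapshot: &'a HashMap<u64, String>,
) -> Option<&'a String> {
    match overlay.get(&pk) {
        Some(Some(v)) => Some(v),  // written earlier in this transaction
        Some(None) => None,        // deleted earlier in this transaction
        None => snapshot.get(&pk), // fall back to the base snapshot
    }
}

fn main() {
    let snapshot = HashMap::from([(1, "committed".to_string())]);
    let overlay = HashMap::from([
        (1, None),                            // DELETE in this txn
        (2, Some("uncommitted".to_string())), // INSERT in this txn
    ]);
    assert_eq!(read(2, &overlay, &snapshot).map(String::as_str), Some("uncommitted"));
    assert_eq!(read(1, &overlay, &snapshot), None);
}
```

This is what lets INSERT ... ON CONFLICT DO UPDATE observe a row inserted by an earlier statement in the same transaction.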

ParticleDB supports SQL-standard row-level locking within transactions:

-- Exclusive lock: block other writers and lockers
SELECT * FROM orders WHERE id = 42 FOR UPDATE;
-- Shared lock: allow other readers, block writers
SELECT * FROM orders WHERE id = 42 FOR SHARE;
-- Skip rows locked by other transactions
SELECT * FROM orders WHERE status = 'pending'
FOR UPDATE SKIP LOCKED;
-- Fail immediately if the row is locked
SELECT * FROM orders WHERE id = 42
FOR UPDATE NOWAIT;
Clause        Behavior
FOR UPDATE    Acquires an exclusive row lock; blocks other FOR UPDATE/FOR SHARE
FOR SHARE     Acquires a shared row lock; allows other FOR SHARE, blocks FOR UPDATE
SKIP LOCKED   Silently skips rows held by other transactions
NOWAIT        Returns an error immediately if the row is already locked

Row locks are released when the transaction commits or rolls back.

ParticleDB supports the two-phase commit (2PC) protocol for distributed transactions that span multiple systems:

-- Phase 1: Prepare
PREPARE TRANSACTION 'txn_abc123';
-- Phase 2: Commit (or Rollback)
COMMIT PREPARED 'txn_abc123';
-- or
ROLLBACK PREPARED 'txn_abc123';

Prepared transactions survive server restarts — they are persisted in the WAL and recovered on startup. A monitoring process or coordinator can then issue the final COMMIT or ROLLBACK.

BEGIN
├── Snapshot acquired (read timestamp)
├── Read operations (see snapshot)
├── Write operations (buffered + WAL)
│   ├── Conflict detection on write
│   └── Row locks acquired (FOR UPDATE/FOR SHARE)
├── COMMIT
│   ├── WAL flush (per mode: sync/groupsync/nosync)
│   ├── Writes applied to storage
│   ├── Row locks released
│   └── Cache invalidation (batch_cache, zone_maps, etc.)
└── ROLLBACK
    ├── Buffered writes discarded
    └── Row locks released
Optimization                    Impact
WAL disabled (nosync)           2.6x throughput (40K to 106K TPS)
Concurrent INSERT append        INSERTs no longer block SELECTs/UPDATEs
txn_table_locks skip            Eliminates ~640K mutex acquisitions/sec
Thread-local WAL buffer         Eliminates WAL mutex contention
AHashMap for table lookups      2-4x faster hash lookups for short string keys
has_prefix_indexes flag         Skips prefix index lock reads when no indexes exist
PK index single lookup          One get() instead of contains_key() + get()
Batch insert prefix index       Single lock per batch instead of per-row
Batch undo string allocation    Single to_string() per batch instead of per-row
Atomic row count increment      fetch_add safe for concurrent appenders
Configuration           TPS (8 workers)
sync + row-2pl          10,069
groupsync + row-2pl     83,520
groupsync + table-2pl   132,597
nosync + table-2pl      159,016

ParticleDB is being validated with Jepsen-style append tests:

  • 65% of append tests pass under the current isolation model.
  • 0 info violations — no ambiguous or lost-update anomalies detected.

Work is ongoing to close the remaining 35% (primarily related to serializable ordering edge cases under high contention). Known remaining per-transaction overheads:

  • Per-row String allocation for column names (~111 clones per transaction).
  • WAL serialization cost (~3.5 us per transaction in sync/groupsync modes).
  • PK index write-lock contention under heavy concurrent INSERT + UPDATE workloads.