Lakehouse (External Data)

ParticleDB can read Parquet and CSV files from local paths or Amazon S3 directly in SQL queries, enabling lakehouse-style analytics without ingesting data into managed tables.

Reading Parquet Files

Local Files

SELECT * FROM read_parquet('/data/events.parquet');

Amazon S3

SELECT * FROM read_parquet('s3://bucket/path/data.parquet');

Paths of the form s3://bucket/key are parsed automatically into a bucket and an object key. The region is resolved from the standard AWS configuration (environment variables or credentials file).
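
As a rough illustration of the path handling described above, the following Python sketch splits an S3 path into its bucket and key. The helper name parse_s3_path is hypothetical and is not part of ParticleDB; the real parser may accept more forms or validate more strictly.

```python
def parse_s3_path(path: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key).

    Hypothetical helper for illustration only; not ParticleDB's parser.
    """
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"not an S3 path: {path!r}")
    # Everything up to the first '/' is the bucket; the rest is the object key.
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 path needs both a bucket and a key: {path!r}")
    return bucket, key


bucket, key = parse_s3_path("s3://bucket/path/data.parquet")
print(bucket, key)
```

Local paths such as /data/events.parquet carry no s3:// prefix, so a reader can use the prefix alone to decide which backend handles a given path.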

Filtered Queries

Standard WHERE clauses and column selections work on external files:

SELECT user_id, event_type, timestamp
FROM read_parquet('/data/events.parquet')
WHERE event_type = 'purchase' AND timestamp > 1700000000;

Reading CSV Files

SELECT * FROM read_csv('/data/logs.csv');

The CSV reader automatically infers the schema from the file header and first rows.
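
To make the idea of header-plus-sample inference concrete, here is a minimal Python sketch. It assumes that column names come from the header row and that each column's type is the narrowest of int, float, or str that fits every sampled value. The function names, the three-type lattice, and the sample size of 100 rows are assumptions for illustration, not ParticleDB's actual rules.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'str' that parses every value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the wider type
    return "str"


def infer_schema(text, sample_rows=100):
    """Infer {column: type} from the header and the first `sample_rows` rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                      # column names
    sample = list(islice(reader, sample_rows))  # rows used for type inference
    return {
        name: infer_type([row[i] for row in sample])
        for i, name in enumerate(header)
    }


schema = infer_schema(
    "user_id,amount,event_type\n"
    "1,9.99,purchase\n"
    "2,5,refund\n"
)
print(schema)  # {'user_id': 'int', 'amount': 'float', 'event_type': 'str'}
```

One consequence of sampling only the first rows is that a value further down the file may not fit the inferred type, so a real reader needs a policy for such rows.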

Pushdown Optimizations

ParticleDB pushes work down to the file reader layer to minimize I/O and memory usage.

Predicate Pushdown

Filter predicates are evaluated at the Parquet row group level using row group statistics (min/max values). Row groups that cannot contain matching rows are skipped entirely without reading their data.

Supported predicate operators:

Operator	Description
`=`	Equality
`!=`	Not equal
`<`	Less than
`<=`	Less than or equal
`>`	Greater than
`>=`	Greater than or equal
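
The pruning rule for each operator can be sketched in Python as follows. The sketch assumes each row group exposes a (min, max) pair for the filtered column and ignores nulls and missing statistics; the function name row_group_may_match is hypothetical and this is not ParticleDB's implementation.

```python
def row_group_may_match(op, lo, hi, value):
    """Return True if some x with lo <= x <= hi could satisfy `x op value`.

    A False result means the row group can be skipped without reading it.
    """
    if op == "=":
        return lo <= value <= hi
    if op == "!=":
        # Skippable only when every row is exactly `value`.
        return not (lo == hi == value)
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError(f"unsupported operator: {op!r}")


# (min, max) of a timestamp column for three illustrative row groups.
stats = [
    (1_600_000_000, 1_650_000_000),
    (1_690_000_000, 1_710_000_000),
    (1_720_000_000, 1_730_000_000),
]

# Predicate: timestamp > 1700000000
kept = [i for i, (lo, hi) in enumerate(stats)
        if row_group_may_match(">", lo, hi, 1_700_000_000)]
print(kept)  # [1, 2]
```

Statistics can only rule a row group out. A row group that is kept still has its rows filtered individually, since its range may include values that fail the predicate.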

Projection Pushdown

When a query selects only a subset of columns, only those columns are read from the file. This is particularly effective for wide Parquet files where only a few columns are needed.

-- Only reads the user_id and amount columns from the file
SELECT user_id, amount FROM read_parquet('/data/transactions.parquet');
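
The effect can be illustrated with a toy column-oriented file in Python. ColumnarFile is a hypothetical stand-in that records which columns were touched; it does not model the Parquet format or ParticleDB's reader.

```python
class ColumnarFile:
    """Toy column-oriented file: one list of values per column."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = []  # records which columns were actually fetched

    def read(self, projection):
        """Fetch only the columns named in `projection`."""
        out = {}
        for name in projection:
            self.columns_read.append(name)
            out[name] = self._columns[name]
        return out


transactions = ColumnarFile({
    "user_id": [1, 2, 3],
    "amount": [9.99, 5.0, 12.5],
    "merchant": ["a", "b", "c"],
    "note": ["", "gift", ""],
})

# Equivalent of: SELECT user_id, amount FROM ...
result = transactions.read(["user_id", "amount"])
print(transactions.columns_read)  # ['user_id', 'amount']
```

Because each column is stored separately, the merchant and note columns are never touched, which is where the I/O saving on wide files comes from.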

Configuration

The lakehouse reader exposes the following settings, each with a default that can be overridden:

Setting	Default	Description
`max_row_groups_per_read`	128	Maximum row groups to read in a single batch
`predicate_pushdown`	`true`	Enable predicate pushdown on row groups
`projection_pushdown`	`true`	Enable projection pushdown (read only needed columns)
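
As one possible reading of max_row_groups_per_read, the Python sketch below splits a file's row groups into batches no larger than the limit. The class LakehouseReaderConfig and the function plan_batches are hypothetical names that mirror the table above; how ParticleDB actually stores and applies these settings is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakehouseReaderConfig:
    """Hypothetical container mirroring the documented settings and defaults."""
    max_row_groups_per_read: int = 128
    predicate_pushdown: bool = True
    projection_pushdown: bool = True


def plan_batches(num_row_groups, config):
    """Split row group indices 0..num_row_groups-1 into batches of bounded size."""
    step = config.max_row_groups_per_read
    return [
        range(start, min(start + step, num_row_groups))
        for start in range(0, num_row_groups, step)
    ]


batches = plan_batches(300, LakehouseReaderConfig())
print([len(b) for b in batches])  # [128, 128, 44]
```

Under this reading, a smaller limit reduces how much data is held per batch at the cost of more read calls.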

S3 Backend

The S3 reader accesses object storage through the S3Source trait, which makes the backend pluggable. The production implementation uses the AWS SDK. For testing, MockS3Source stores objects in an in-memory HashMap, so tests run without network access or credentials.
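
The pattern can be sketched in Python as an analogy: a Protocol plays the role of the trait and a dict plays the role of the HashMap. The method names get_object and put_object and the helper object_size are assumptions for illustration; the actual S3Source trait methods are not documented in this section.

```python
from typing import Protocol


class S3Source(Protocol):
    """Python analogue of a storage trait; the method name is hypothetical."""

    def get_object(self, bucket: str, key: str) -> bytes: ...


class MockS3Source:
    """In-memory backend for tests: objects live in a dict keyed by (bucket, key)."""

    def __init__(self):
        self._objects: dict[tuple[str, str], bytes] = {}

    def put_object(self, bucket: str, key: str, data: bytes) -> None:
        self._objects[(bucket, key)] = data

    def get_object(self, bucket: str, key: str) -> bytes:
        try:
            return self._objects[(bucket, key)]
        except KeyError:
            raise FileNotFoundError(f"s3://{bucket}/{key}") from None


def object_size(source: S3Source, bucket: str, key: str) -> int:
    """Reader-side code depends only on the interface, not on a concrete backend."""
    return len(source.get_object(bucket, key))


mock = MockS3Source()
mock.put_object("bucket", "path/data.parquet", b"PAR1")
print(object_size(mock, "bucket", "path/data.parquet"))  # 4
```

Because the reader-side function takes any object that satisfies the interface, the same code path can be exercised against the mock in tests and against a real backend in production.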

Supported File Formats

Format	Function	Schema Discovery	Pushdown
Parquet	`read_parquet()`	From file metadata	Predicate + Projection
CSV	`read_csv()`	Inferred from header and first rows	Projection only