Lakehouse (External Data)
ParticleDB can read Parquet and CSV files from local paths or Amazon S3 directly in SQL queries, enabling lakehouse-style analytics without ingesting data into managed tables.
Reading Parquet Files
Local Files
SELECT * FROM read_parquet('/data/events.parquet');
Amazon S3
SELECT * FROM read_parquet('s3://bucket/path/data.parquet');
S3 paths are parsed automatically, and the region is resolved from the standard AWS configuration (environment variables or the credentials file).
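As a rough illustration of what handling an S3 path involves, the Python sketch below splits an s3:// path into a bucket and an object key and looks up a region from the environment. It is not ParticleDB's implementation: the function names are hypothetical, checking AWS_REGION before AWS_DEFAULT_REGION is an assumption about precedence, and the credentials file fallback is left out.

```python
import os
from typing import Mapping, Optional, Tuple


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"not an S3 path: {path!r}")
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in: {path!r}")
    return bucket, key


def resolve_region(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the region from common AWS environment variables, if set.

    Assumption: AWS_REGION wins over AWS_DEFAULT_REGION. A real resolver
    would also fall back to the AWS configuration files.
    """
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


bucket, key = parse_s3_path("s3://bucket/path/data.parquet")
print(bucket, key)
```

Passing the environment as a plain mapping keeps the lookup easy to exercise without touching the real process environment.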
Filtered Queries
Standard WHERE clauses and column selections work on external files:
SELECT user_id, event_type, timestamp
FROM read_parquet('/data/events.parquet')
WHERE event_type = 'purchase' AND timestamp > 1700000000;
Reading CSV Files
SELECT * FROM read_csv('/data/logs.csv');
The CSV reader automatically infers the schema from the file header and the first rows.
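To make the idea of header-plus-sample inference concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the type names (INTEGER, DOUBLE, TEXT), the sample size of 100 rows, and the "narrowest type that parses every sampled value" rule are all hypothetical and may differ from what ParticleDB actually does.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that parses every sampled value."""
    for name, cast in (("INTEGER", int), ("DOUBLE", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "TEXT"


def infer_schema(text, sample_rows=100):
    """Map each header column to a type inferred from the first rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Take at most `sample_rows` data rows after the header.
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {
        col: infer_type([row[i] for row in sample])
        for i, col in enumerate(header)
    }


sample_csv = (
    "ts,level,latency_ms\n"
    "1700000000,WARN,12.5\n"
    "1700000060,ERROR,48\n"
)
print(infer_schema(sample_csv))
```

Because only a sample is inspected, a value further down the file that does not fit the inferred type is a case any sampling-based reader has to handle separately.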
Pushdown Optimizations
ParticleDB pushes work down to the file reader layer to minimize I/O and memory usage:
Predicate Pushdown
Filter predicates are evaluated at the Parquet row group level using row group statistics (min/max values). Row groups that cannot contain matching rows are skipped entirely without reading their data.
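The skipping decision can be sketched in a few lines of Python. This is a simplified model, not ParticleDB's code: it assumes integer min/max statistics for a single column, ignores NULLs, and uses hypothetical names (RowGroupStats, may_match). The function returns False only when the statistics prove that no row in the group can satisfy the comparison.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (hypothetical)."""
    min_value: int
    max_value: int


def may_match(stats, op, value):
    """Return False only if no row in the group can satisfy `col op value`."""
    if op == "=":
        return stats.min_value <= value <= stats.max_value
    if op == "!=":
        # Skippable only when every row holds exactly `value`.
        return not (stats.min_value == stats.max_value == value)
    if op == "<":
        return stats.min_value < value
    if op == "<=":
        return stats.min_value <= value
    if op == ">":
        return stats.max_value > value
    if op == ">=":
        return stats.max_value >= value
    raise ValueError(f"unsupported operator: {op}")


row_groups = [
    RowGroupStats(min_value=1_690_000_000, max_value=1_699_999_999),
    RowGroupStats(min_value=1_700_000_000, max_value=1_700_000_000),
    RowGroupStats(min_value=1_700_000_001, max_value=1_710_000_000),
]

# Predicate: timestamp > 1700000000
kept = [i for i, rg in enumerate(row_groups) if may_match(rg, ">", 1_700_000_000)]
print(kept)  # [2]
```

Statistics can only rule row groups out. A group that passes this check may still contain no matching rows, so the filter is applied again to the rows that are actually read.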
Supported predicate operators:
| Operator | Description |
|---|---|
| = | Equality |
| != | Not equal |
| < | Less than |
| <= | Less than or equal |
| > | Greater than |
| >= | Greater than or equal |
Projection Pushdown
When a query selects only a subset of columns, only those columns are read from the file. This is particularly effective for wide Parquet files where only a few columns are needed.
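The saving comes from the columnar layout: each column is stored as its own chunk, so unneeded chunks never have to be fetched. The Python sketch below models that with a hypothetical in-memory "file" whose chunk sizes are invented purely for illustration; it is not how ParticleDB stores or reads data.

```python
def read_columns(column_chunks, wanted):
    """Read only the requested column chunks; return (data, bytes_read)."""
    data = {}
    bytes_read = 0
    for name in wanted:
        chunk = column_chunks[name]
        bytes_read += len(chunk)
        data[name] = chunk
    return data, bytes_read


# Hypothetical wide file: one byte chunk per column, sizes made up.
file_chunks = {
    "user_id": bytes(400),
    "amount": bytes(800),
    "description": bytes(50_000),
    "metadata": bytes(120_000),
}

_, projected = read_columns(file_chunks, ["user_id", "amount"])
_, full = read_columns(file_chunks, list(file_chunks))
print(projected, full)  # 1200 171200
```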
-- Only reads the user_id and amount columns from the file
SELECT user_id, amount FROM read_parquet('/data/transactions.parquet');
Configuration
The lakehouse reader is configured with sensible defaults that can be tuned:
| Setting | Default | Description |
|---|---|---|
| max_row_groups_per_read | 128 | Maximum row groups to read in a single batch |
| predicate_pushdown | true | Enable predicate pushdown on row groups |
| projection_pushdown | true | Enable projection pushdown (read only needed columns) |
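One way to picture these settings is as a small configuration object. The Python sketch below mirrors the three documented defaults and shows one plausible reading of max_row_groups_per_read, namely splitting the surviving row groups into batches of at most that size. The class name, the function name, and the batching behaviour are assumptions made for illustration; how the settings are actually applied is not specified here.

```python
from dataclasses import dataclass


@dataclass
class LakehouseReaderConfig:  # hypothetical name
    max_row_groups_per_read: int = 128
    predicate_pushdown: bool = True
    projection_pushdown: bool = True


def plan_batches(row_group_ids, config):
    """Split row groups into batches of at most max_row_groups_per_read."""
    step = config.max_row_groups_per_read
    return [row_group_ids[i:i + step] for i in range(0, len(row_group_ids), step)]


config = LakehouseReaderConfig()
batches = plan_batches(list(range(300)), config)
print([len(b) for b in batches])  # [128, 128, 44]
```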
S3 Backend
The S3 reader uses a pluggable backend via the S3Source trait. The production
implementation uses the AWS SDK. For testing, a MockS3Source is available that stores
objects in an in-memory HashMap.
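The pluggable-backend pattern can be sketched in Python as an interface plus a dictionary-backed test double. The names S3Source and MockS3Source come from the text, but the methods shown here (get_object, put_object) and their signatures are hypothetical, since the real trait's methods are not documented on this page.

```python
from typing import Dict, Protocol, Tuple


class S3Source(Protocol):
    """Interface the reader depends on (method name is an assumption)."""

    def get_object(self, bucket: str, key: str) -> bytes: ...


class MockS3Source:
    """Test double that keeps objects in an in-memory dict."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, bucket: str, key: str, data: bytes) -> None:
        self._objects[(bucket, key)] = data

    def get_object(self, bucket: str, key: str) -> bytes:
        try:
            return self._objects[(bucket, key)]
        except KeyError:
            raise FileNotFoundError(f"s3://{bucket}/{key}") from None


def object_size(source: S3Source, bucket: str, key: str) -> int:
    """Reader-side code sees only the interface, never the backend."""
    return len(source.get_object(bucket, key))


mock = MockS3Source()
mock.put_object("bucket", "path/data.parquet", b"PAR1")
print(object_size(mock, "bucket", "path/data.parquet"))  # 4
```

Because the reader depends only on the interface, tests can run without network access or AWS credentials.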
Supported File Formats
| Format | Function | Schema Discovery | Pushdown |
|---|---|---|---|
| Parquet | read_parquet() | From file metadata | Predicate + Projection |
| CSV | read_csv() | Inferred from header | Projection only |