Lakehouse (External Data)

ParticleDB can read Parquet and CSV files from local paths or Amazon S3 directly in SQL queries, enabling lakehouse-style analytics without ingesting data into managed tables.

SELECT * FROM read_parquet('/data/events.parquet');
SELECT * FROM read_parquet('s3://bucket/path/data.parquet');

S3 paths are parsed automatically. The region is resolved from the standard AWS configuration (environment variables or credentials file).
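As a rough illustration of what parsing an S3 path involves, the sketch below splits an `s3://` URI into a bucket and an object key. The `S3Location` type and `parse_s3_path` function are hypothetical names chosen for this example; they are not ParticleDB's actual API, and region resolution is left out.

```rust
/// Bucket and object key extracted from an `s3://` URI.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct S3Location {
    pub bucket: String,
    pub key: String,
}

/// Hypothetical helper: split `s3://bucket/key` into its two parts.
/// Returns `None` for local paths and for URIs missing a bucket or key.
pub fn parse_s3_path(path: &str) -> Option<S3Location> {
    // Anything without the scheme is treated as a local path by the caller.
    let rest = path.strip_prefix("s3://")?;
    // The bucket is everything up to the first '/', the key is the remainder.
    let (bucket, key) = rest.split_once('/')?;
    if bucket.is_empty() || key.is_empty() {
        return None;
    }
    Some(S3Location {
        bucket: bucket.to_string(),
        key: key.to_string(),
    })
}
```

With this sketch, `parse_s3_path("s3://bucket/path/data.parquet")` yields bucket `bucket` and key `path/data.parquet`, while a local path such as `/data/events.parquet` yields `None`.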

Standard WHERE clauses and column selections work on external files:

SELECT user_id, event_type, timestamp
FROM read_parquet('/data/events.parquet')
WHERE event_type = 'purchase' AND timestamp > 1700000000;

CSV files are read with read_csv():

SELECT * FROM read_csv('/data/logs.csv');

The CSV reader automatically infers the schema from the file header and first rows.
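The exact inference rules are not described here, so the following is only a sketch of how header-plus-sample inference can work in general. It assumes a three-type lattice (`Int`, `Float`, `Text`), naive comma splitting with no quoting support, and a hypothetical `infer_schema` function; ParticleDB's real type set and sampling size may differ.

```rust
/// Simplified column types, assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int,
    Float,
    Text,
}

/// Narrowest type that can represent a single field.
fn type_of(field: &str) -> ColumnType {
    if field.parse::<i64>().is_ok() {
        ColumnType::Int
    } else if field.parse::<f64>().is_ok() {
        ColumnType::Float
    } else {
        ColumnType::Text
    }
}

/// Combine two observations: Int + Float widens to Float, anything + Text to Text.
fn widen(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (Int, Int) => Int,
        (Text, _) | (_, Text) => Text,
        _ => Float,
    }
}

/// Hypothetical helper: take column names from the header line and infer each
/// column's type from at most `sample_rows` data rows.
pub fn infer_schema(csv: &str, sample_rows: usize) -> Vec<(String, ColumnType)> {
    let mut lines = csv.lines();
    let header: Vec<String> = match lines.next() {
        Some(h) => h.split(',').map(|s| s.trim().to_string()).collect(),
        None => return Vec::new(),
    };
    let mut types: Vec<Option<ColumnType>> = vec![None; header.len()];
    for line in lines.take(sample_rows) {
        for (i, field) in line.split(',').enumerate().take(header.len()) {
            let field = field.trim();
            if field.is_empty() {
                continue; // an empty field gives no evidence about the type
            }
            let observed = type_of(field);
            types[i] = Some(match types[i] {
                Some(prev) => widen(prev, observed),
                None => observed,
            });
        }
    }
    // Columns with no sampled values fall back to Text.
    header
        .into_iter()
        .zip(types)
        .map(|(name, t)| (name, t.unwrap_or(ColumnType::Text)))
        .collect()
}
```

Sampling only the first rows keeps inference cheap, at the cost of mis-typing a column whose later rows contain wider values.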

ParticleDB pushes work down to the file reader layer to minimize I/O and memory usage:

Predicate pushdown: filter predicates are evaluated at the Parquet row group level using each row group's min/max statistics. Row groups that cannot contain matching rows are skipped entirely, without reading their data.
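The pruning decision can be sketched as a check of one comparison against a row group's min/max range. The code below is a simplified illustration for integer columns covering `=`, `!=`, `<`, `<=`, `>`, and `>=`. The `Op`, `ColumnStats`, `may_match`, and `row_groups_to_read` names are hypothetical, and null handling is ignored.

```rust
/// Comparison operators considered in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Min/max statistics for one column within one row group.
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// Returns true when the row group might contain a row satisfying
/// `column <op> value`. A false result means the group can be skipped.
pub fn may_match(stats: &ColumnStats, op: Op, value: i64) -> bool {
    match op {
        // The value must fall inside [min, max].
        Op::Eq => stats.min <= value && value <= stats.max,
        // Skippable only when every row in the group equals the value.
        Op::NotEq => !(stats.min == value && stats.max == value),
        // Some row is below the value only if the minimum is.
        Op::Lt => stats.min < value,
        Op::LtEq => stats.min <= value,
        // Some row is above the value only if the maximum is.
        Op::Gt => stats.max > value,
        Op::GtEq => stats.max >= value,
    }
}

/// Indices of the row groups that still have to be read.
pub fn row_groups_to_read(groups: &[ColumnStats], op: Op, value: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_match(stats, op, value))
        .map(|(index, _)| index)
        .collect()
}
```

For a filter like `timestamp > 1700000000`, a row group whose maximum timestamp is 1699999999 fails the `Gt` check and is never read. The check is conservative: a group that passes may still contain no matching rows, so the filter is applied again to the rows actually read.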

Supported predicate operators:

| Operator | Description |
|----------|-------------|
| = | Equality |
| != | Not equal |
| < | Less than |
| <= | Less than or equal |
| > | Greater than |
| >= | Greater than or equal |

Projection pushdown: when a query selects only a subset of columns, only those columns are read from the file. This is particularly effective for wide Parquet files where only a few columns are needed.

-- Only reads the user_id and amount columns from the file
SELECT user_id, amount FROM read_parquet('/data/transactions.parquet');
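Conceptually, projection pushdown means resolving the selected column names to positions in the file schema and decoding only those column chunks. The following is a minimal sketch of that name-to-index step. `projection_indices` is a hypothetical function and the column list is invented for the example.

```rust
/// Hypothetical helper: map the columns named in a query to their positions
/// in the file schema, so that only those column chunks need to be read.
/// Returns an error if the query names a column the file does not have.
pub fn projection_indices(
    file_columns: &[&str],
    selected: &[&str],
) -> Result<Vec<usize>, String> {
    selected
        .iter()
        .map(|name| {
            file_columns
                .iter()
                .position(|column| column == name)
                .ok_or_else(|| format!("unknown column: {}", name))
        })
        .collect()
}
```

For a file with columns `tx_id, user_id, merchant, amount, currency`, selecting `user_id` and `amount` resolves to positions 1 and 3, so the other three columns are never touched.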

The lakehouse reader has the following tunable settings, shown with their defaults:

| Setting | Default | Description |
|---------|---------|-------------|
| max_row_groups_per_read | 128 | Maximum row groups to read in a single batch |
| predicate_pushdown | true | Enable predicate pushdown on row groups |
| projection_pushdown | true | Enable projection pushdown (read only needed columns) |
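How these settings are set is not shown here. The sketch below models them as a hypothetical `LakehouseConfig` struct carrying the documented defaults, and illustrates what a per-batch limit on row groups implies. Both the struct and the `read_batches` function are assumptions made for illustration, not ParticleDB's API.

```rust
/// Hypothetical container for the settings listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct LakehouseConfig {
    pub max_row_groups_per_read: usize,
    pub predicate_pushdown: bool,
    pub projection_pushdown: bool,
}

impl Default for LakehouseConfig {
    /// Defaults taken from the settings table.
    fn default() -> Self {
        Self {
            max_row_groups_per_read: 128,
            predicate_pushdown: true,
            projection_pushdown: true,
        }
    }
}

/// Split the row groups selected for reading into batches of at most
/// `max_row_groups_per_read` entries each.
pub fn read_batches(row_groups: &[usize], config: &LakehouseConfig) -> Vec<Vec<usize>> {
    // `chunks` panics on a size of zero, so clamp to at least one.
    let batch_size = config.max_row_groups_per_read.max(1);
    row_groups
        .chunks(batch_size)
        .map(|batch| batch.to_vec())
        .collect()
}
```

With the default limit, a file with 300 row groups would be read in three batches of 128, 128, and 44 row groups.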

The S3 reader uses a pluggable backend via the S3Source trait. The production implementation uses the AWS SDK. For testing, a MockS3Source is available that stores objects in an in-memory HashMap.
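The pluggable-backend pattern might look roughly like the sketch below. Only the names `S3Source` and `MockS3Source` and the in-memory `HashMap` come from the description above. The method names (`get_object`, `put_object`), the error type, and the `object_size` helper are assumptions, and the real trait is likely to differ (for example, it may be asynchronous).

```rust
use std::collections::HashMap;

/// Assumed shape of the backend trait; the real `S3Source` signature may differ.
pub trait S3Source {
    /// Fetch the full contents of an object.
    fn get_object(&self, bucket: &str, key: &str) -> Result<Vec<u8>, String>;
}

/// In-memory stand-in for S3, keyed by (bucket, key).
pub struct MockS3Source {
    objects: HashMap<(String, String), Vec<u8>>,
}

impl MockS3Source {
    pub fn new() -> Self {
        Self { objects: HashMap::new() }
    }

    /// Hypothetical test helper: store an object so later reads can find it.
    pub fn put_object(&mut self, bucket: &str, key: &str, data: Vec<u8>) {
        self.objects
            .insert((bucket.to_string(), key.to_string()), data);
    }
}

impl S3Source for MockS3Source {
    fn get_object(&self, bucket: &str, key: &str) -> Result<Vec<u8>, String> {
        self.objects
            .get(&(bucket.to_string(), key.to_string()))
            .cloned()
            .ok_or_else(|| format!("no such object: s3://{}/{}", bucket, key))
    }
}

/// Reader code written against the trait works with either backend.
pub fn object_size(source: &dyn S3Source, bucket: &str, key: &str) -> Result<usize, String> {
    source.get_object(bucket, key).map(|bytes| bytes.len())
}
```

Because `object_size` depends only on the trait, a test can populate a `MockS3Source` and exercise the reader logic without network access or AWS credentials.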

| Format | Function | Schema Discovery | Pushdown |
|--------|----------|------------------|----------|
| Parquet | read_parquet() | From file metadata | Predicate + Projection |
| CSV | read_csv() | Inferred from header | Projection only |