
Vector Search

ParticleDB has native vector support for similarity search, enabling semantic search, recommendation systems, and retrieval-augmented generation (RAG) directly in SQL.

The VECTOR(n) type stores a dense array of n 32-bit floats. Specify the dimension when creating the column:

CREATE TABLE documents (
    id BIGINT PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    embedding VECTOR(384)
);

Insert vectors as JSON-style array strings:

INSERT INTO documents (id, title, content, embedding) VALUES
    (1, 'Introduction to Machine Learning', 'ML is a subset of AI...',
     '[0.021, -0.034, 0.056, ...]'),
    (2, 'Database Internals', 'Modern databases use columnar storage...',
     '[0.089, 0.012, -0.045, ...]'),
    (3, 'Neural Networks Explained', 'A neural network consists of layers...',
     '[-0.011, 0.067, 0.023, ...]');
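The JSON-style literal can be produced from a language-native list. A minimal Python sketch (the 3-dimensional vector here is a stand-in for a real 384-dimensional embedding):

```python
import json

# Hypothetical 3-dimensional embedding standing in for a real 384-dim one.
embedding = [0.021, -0.034, 0.056]

# json.dumps produces exactly the JSON-style array string shown above.
literal = json.dumps(embedding)
print(literal)  # [0.021, -0.034, 0.056]
```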

ParticleDB provides three distance/similarity functions for comparing vectors:

L2_DISTANCE measures the straight-line (Euclidean) distance between two vectors. Lower values mean more similar. Best for embeddings where magnitude matters.

SELECT
    id,
    title,
    L2_DISTANCE(embedding, '[0.021, -0.034, 0.056, ...]') AS distance
FROM documents
ORDER BY distance
LIMIT 5;

COSINE_SIMILARITY measures the cosine of the angle between two vectors. It returns values between -1 and 1, where 1 means identical direction. Best for normalized embeddings, which most text embedding models produce.

SELECT
    id,
    title,
    COSINE_SIMILARITY(embedding, '[0.021, -0.034, 0.056, ...]') AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 5;

INNER_PRODUCT computes the dot product of two vectors. Higher values indicate greater similarity. Best for maximum inner product search (MIPS) use cases.

SELECT
    id,
    title,
    INNER_PRODUCT(embedding, '[0.021, -0.034, 0.056, ...]') AS score
FROM documents
ORDER BY score DESC
LIMIT 5;
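The three functions can be sketched in plain Python to make their definitions concrete (toy 3-dimensional vectors stand in for real embeddings):

```python
import math

# Toy 3-dimensional embeddings; real ones would have 384+ dimensions.
a = [0.021, -0.034, 0.056]
b = [0.089, 0.012, -0.045]

def l2_distance(u, v):
    """Euclidean distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def inner_product(u, v):
    """Dot product: higher means more similar (MIPS)."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle: 1 = same direction, -1 = opposite."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return inner_product(u, v) / (norm_u * norm_v)

print(l2_distance(a, b), cosine_similarity(a, b), inner_product(a, b))
```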

The <=> shorthand distance operator can be used directly in ORDER BY:

SELECT id, title
FROM documents
ORDER BY embedding <=> '[0.021, -0.034, 0.056, ...]'
LIMIT 10;

Without an index, vector search performs an exact brute-force scan (O(n) per query). For large datasets, create an approximate nearest-neighbor (ANN) index to get sub-millisecond queries.
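What the unindexed path does can be sketched in a few lines of Python (a hypothetical in-memory table; the database's actual scan runs over stored tuples):

```python
import heapq
import math

# Hypothetical in-memory table: (id, embedding) rows.
rows = [
    (1, [0.1, 0.2, 0.3]),
    (2, [0.9, 0.8, 0.7]),
    (3, [0.1, 0.25, 0.31]),
]

def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def brute_force_knn(query, rows, k):
    """Exact O(n) scan: distance to every row, keep the k nearest."""
    return heapq.nsmallest(k, ((l2(query, emb), row_id) for row_id, emb in rows))

print(brute_force_knn([0.1, 0.2, 0.3], rows, 2))
```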

HNSW builds a Hierarchical Navigable Small World graph. It is the best general-purpose vector index: high recall, fast queries, and support for incremental inserts.

CREATE INDEX idx_doc_embedding ON documents
USING hnsw (embedding)
WITH (
    m = 16,
    ef_construction = 200,
    metric = 'cosine'
);
| Parameter | Default | Description |
| --- | --- | --- |
| m | 16 | Maximum number of connections per node per layer. Higher values improve recall but increase memory and build time. Typical range: 8-64. |
| ef_construction | 200 | Search width during index construction. Higher values produce a better graph but take longer to build. Typical range: 100-500. |
| metric | l2 | Distance metric: l2 (Euclidean), cosine, or inner_product. Must match the distance function used in queries. |

IVF Flat is an Inverted File index with flat (exact) distance computation within each cluster. It is faster to build than HNSW and suitable for very large datasets where some recall trade-off is acceptable.

CREATE INDEX idx_doc_embedding_ivf ON documents
USING ivf_flat (embedding)
WITH (
    num_lists = 100,
    metric = 'l2'
);
| Parameter | Default | Description |
| --- | --- | --- |
| num_lists | 100 | Number of Voronoi cells (clusters). More lists = faster queries but lower recall. Rule of thumb: sqrt(num_rows). |
| metric | l2 | Distance metric: l2, cosine, or inner_product. |
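The sqrt rule of thumb for num_lists is easy to compute up front; a small Python helper (illustrative only, not part of ParticleDB):

```python
import math

# Rule of thumb from the table above: num_lists ~ sqrt(number of rows).
def suggested_num_lists(num_rows: int) -> int:
    return max(1, round(math.sqrt(num_rows)))

print(suggested_num_lists(1_000_000))  # 1000
```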
| Factor | HNSW | IVF Flat |
| --- | --- | --- |
| Query latency | Lower (sub-ms) | Higher (depends on nprobe) |
| Build time | Slower | Faster |
| Memory | Higher (graph structure) | Lower (centroids only) |
| Recall | Higher (typically > 95%) | Depends on num_lists and nprobe |
| Incremental inserts | Supported | Requires periodic rebuild |
| Best for | < 10M vectors, high recall | > 10M vectors, acceptable recall trade-off |

Find the K most similar items:

-- Top 10 most similar documents to a query vector
SELECT id, title, L2_DISTANCE(embedding, '[0.02, -0.03, ...]') AS dist
FROM documents
ORDER BY dist
LIMIT 10;

When an HNSW or IVF index exists on the embedding column, ParticleDB’s optimizer automatically rewrites this pattern into an indexed ANN search.

Combine similarity search with traditional SQL filters:

-- Find similar documents, but only in the "science" category
SELECT id, title, COSINE_SIMILARITY(embedding, '[0.02, -0.03, ...]') AS sim
FROM documents
WHERE category = 'science'
ORDER BY sim DESC
LIMIT 10;

Return all items within a distance threshold:

-- Find all documents with cosine similarity > 0.8
SELECT id, title, COSINE_SIMILARITY(embedding, '[0.02, -0.03, ...]') AS sim
FROM documents
WHERE COSINE_SIMILARITY(embedding, '[0.02, -0.03, ...]') > 0.8
ORDER BY sim DESC;

Find, for each row, its most similar rows with a self-join. Note that a bare LIMIT would cap the total result set at 3 rows, so a window function is used to keep the top 3 per product:

-- For each product, find the 3 most similar products
SELECT product, similar_product, distance
FROM (
    SELECT
        p1.name AS product,
        p2.name AS similar_product,
        L2_DISTANCE(p1.embedding, p2.embedding) AS distance,
        ROW_NUMBER() OVER (
            PARTITION BY p1.id
            ORDER BY L2_DISTANCE(p1.embedding, p2.embedding)
        ) AS rn
    FROM products p1
    CROSS JOIN products p2
    WHERE p1.id != p2.id
) ranked
WHERE rn <= 3
ORDER BY product, distance;

For retrieval-augmented generation workflows, ParticleDB provides CREATE RAG PIPELINE to define reusable search configurations.

CREATE RAG PIPELINE product_search
    TABLE products
    VECTOR_COLUMN embedding
    TEXT_COLUMN description
    TOP_K 5
    MODE vector
    MODEL 'text-embedding-3-small';

Combine vector similarity with keyword relevance for better recall:

CREATE RAG PIPELINE doc_search
    TABLE documents
    VECTOR_COLUMN embedding
    TEXT_COLUMN content
    TOP_K 10
    MODE hybrid
    VECTOR_WEIGHT 0.7
    MODEL 'text-embedding-3-small';
| Parameter | Description |
| --- | --- |
| TABLE | Source table containing the knowledge base |
| VECTOR_COLUMN | Column with vector embeddings |
| TEXT_COLUMN | Column with text content (used for BM25 in hybrid mode) |
| TOP_K | Number of results to retrieve |
| MODE | vector (similarity only) or hybrid (vector + BM25) |
| VECTOR_WEIGHT | Weight for vector score in hybrid mode (0.0 to 1.0). BM25 weight = 1 - vector_weight. |
| MODEL | Embedding model name for automatic query embedding |
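How hybrid mode blends the two scores can be sketched in Python. This is an illustrative weighted sum based on the VECTOR_WEIGHT description above, assuming both scores are normalized to [0, 1]; ParticleDB's exact fusion internals are not documented here:

```python
# Illustrative hybrid fusion: BM25 weight is 1 - vector_weight,
# per the VECTOR_WEIGHT parameter. Assumes scores in [0, 1].
def hybrid_score(vector_score: float, bm25_score: float, vector_weight: float) -> float:
    return vector_weight * vector_score + (1.0 - vector_weight) * bm25_score

print(hybrid_score(0.9, 0.4, 0.7))
```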
To delete a pipeline:

DROP RAG PIPELINE product_search;

A complete semantic search application:

-- 1. Create table with vector column
CREATE TABLE articles (
    id BIGINT PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    category TEXT,
    embedding VECTOR(768)
);

-- 2. Insert data (embeddings from your embedding model)
INSERT INTO articles (id, title, body, category, embedding) VALUES
    (1, 'Understanding HNSW', 'HNSW is a graph-based index...', 'databases',
     '[0.12, -0.05, ...]'),
    (2, 'Intro to Transformers', 'The transformer architecture...', 'ml',
     '[0.08, 0.11, ...]');

-- 3. Create HNSW index for fast search
CREATE INDEX idx_articles_emb ON articles
USING hnsw (embedding)
WITH (m = 16, ef_construction = 200, metric = 'cosine');

-- 4. Search: find articles about "approximate nearest neighbor search"
SELECT
    title,
    category,
    COSINE_SIMILARITY(embedding, '[0.10, -0.03, ...]') AS relevance
FROM articles
ORDER BY relevance DESC
LIMIT 5;

-- 5. Filtered search: only in "databases" category
SELECT title, COSINE_SIMILARITY(embedding, '[0.10, -0.03, ...]') AS relevance
FROM articles
WHERE category = 'databases'
ORDER BY relevance DESC
LIMIT 5;

  • Always create an index for datasets larger than ~10K vectors. Brute-force scan is O(n) per query.
  • Match the metric between your index (WITH (metric = ...)) and your query function (L2_DISTANCE, COSINE_SIMILARITY, or INNER_PRODUCT).
  • Normalize embeddings before storing when using cosine similarity. For unit-length vectors, cosine similarity equals the inner product, which is cheaper to compute.
  • Tune m and ef_construction for your recall/latency trade-off. Start with m=16, ef_construction=200 and adjust based on benchmarking.
  • Use filtered search to reduce the candidate set before similarity computation when possible.
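The normalization tip above can be verified in a few lines of Python: once vectors have unit length, the cosine denominator is 1 and cosine similarity reduces to the dot product.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Unit-length vectors: cosine similarity and inner product agree.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])
print(abs(cosine(a, b) - dot(a, b)) < 1e-12)  # True
```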