Full-Text Search

ParticleDB provides text search through SQL pattern matching operators and string functions. For semantic search, combine these with Vector Search using hybrid RAG pipelines.

LIKE

Case-sensitive pattern matching with two wildcards:

% matches any sequence of zero or more characters
_ matches exactly one character

-- Find emails from a specific domain
SELECT name, email
FROM users
WHERE email LIKE '%@company.com';

-- Find 3-letter product codes starting with "A"
SELECT * FROM products
WHERE sku LIKE 'A__';

-- Find names starting with "Jo"
SELECT * FROM users
WHERE name LIKE 'Jo%';

-- Combine wildcards: names containing "son" anywhere
SELECT * FROM users
WHERE name LIKE '%son%';

NOT LIKE

Negate the pattern:

-- Users NOT from gmail
SELECT name, email
FROM users
WHERE email NOT LIKE '%@gmail.com';

ILIKE

Case-insensitive version of LIKE. Same wildcard rules, but ignores case when comparing:

-- Find "smith", "Smith", "SMITH", etc.
SELECT * FROM users
WHERE last_name ILIKE '%smith%';

-- Case-insensitive domain search
SELECT * FROM users
WHERE email ILIKE '%@Company.COM';

ILIKE is a PostgreSQL extension supported by ParticleDB. Use it whenever case-insensitive matching is needed without manually calling LOWER() on both sides.

Escape Characters

To match a literal % or _ in a LIKE pattern, use a backslash escape:

-- Find discounts stored as "10%", "20%", etc.
SELECT * FROM promotions
WHERE label LIKE '%\%%';

-- Find column names containing underscores
SELECT * FROM metadata
WHERE column_name LIKE '%\_%';

String Functions for Text Search

STARTS_WITH / ENDS_WITH

Test whether a string begins or ends with a given prefix or suffix. These return a boolean and are often more readable than equivalent LIKE patterns.

-- Find URLs starting with "https"
SELECT * FROM links
WHERE STARTS_WITH(url, 'https');

-- Find files ending with ".csv"
SELECT * FROM uploads
WHERE ENDS_WITH(filename, '.csv');

-- Equivalent LIKE patterns (STARTS_WITH = 'prefix%', ENDS_WITH = '%suffix')
SELECT * FROM links  WHERE url LIKE 'https%';
SELECT * FROM uploads WHERE filename LIKE '%.csv';

POSITION / STRPOS

Find the position of a substring within a string. Returns a 1-based index, or 0 if not found.

-- Find the position of "@" in email addresses
SELECT email, POSITION(email, '@') AS at_pos
FROM users;

-- STRPOS is an alias
SELECT email, STRPOS(email, '@') AS at_pos
FROM users;

-- Filter: only rows where a substring exists
SELECT * FROM logs
WHERE POSITION(message, 'ERROR') > 0;

SUBSTRING / SUBSTR

Extract a portion of a string. Uses 1-based indexing.

-- Extract domain from email (after the @)
SELECT
    email,
    SUBSTRING(email, POSITION(email, '@') + 1) AS domain
FROM users;

-- First 10 characters
SELECT SUBSTRING(title, 1, 10) AS title_preview
FROM articles;

REPLACE

Replace all occurrences of a substring:

-- Normalize whitespace
SELECT REPLACE(title, '  ', ' ') AS clean_title
FROM articles;

-- Redact sensitive data
SELECT REPLACE(phone, SUBSTRING(phone, 4, 4), '****') AS masked_phone
FROM users;

REGEXP_REPLACE

Pattern-based replacement using regular expressions:

-- Extract domain from URL
SELECT
    url,
    REGEXP_REPLACE(url, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS domain
FROM pages;

-- Remove all non-alphanumeric characters
SELECT REGEXP_REPLACE(title, '[^a-zA-Z0-9 ]', '') AS clean_title
FROM articles;

ParticleDB automatically detects common regex patterns (like URL domain extraction) and uses optimized string-slicing fast paths instead of the regex engine.

Combining String Functions

Build complex text processing pipelines by composing functions:

-- Normalize and search
SELECT *
FROM products
WHERE LOWER(TRIM(name)) LIKE '%wireless%mouse%';

-- Extract and aggregate
SELECT
    LOWER(SUBSTRING(email, POSITION(email, '@') + 1)) AS domain,
    COUNT(*) AS user_count
FROM users
GROUP BY domain
ORDER BY user_count DESC
LIMIT 10;

Text Search with Aggregation

STRING_AGG

Concatenate matching text values across rows:

-- Collect all tags for each article into a comma-separated string
SELECT
    article_id,
    STRING_AGG(tag, ', ') AS all_tags
FROM article_tags
GROUP BY article_id;

Search within Aggregated Text

-- Find articles whose tags include "machine-learning"
SELECT article_id, STRING_AGG(tag, ', ') AS tags
FROM article_tags
GROUP BY article_id
HAVING STRING_AGG(tag, ', ') LIKE '%machine-learning%';

Hash Functions for Text Fingerprinting

Use hash functions to create text fingerprints for deduplication or integrity checks:

-- MD5 fingerprint for deduplication
SELECT
    MD5(content) AS content_hash,
    COUNT(*) AS duplicates
FROM articles
GROUP BY MD5(content)
HAVING COUNT(*) > 1;

-- SHA-256 for content integrity
SELECT id, title, SHA256(content) AS content_sha
FROM documents;

Available hash functions: MD5, SHA1, SHA256, SHA512.

Hybrid Search with RAG Pipelines

For production search applications, combine keyword matching with vector similarity using a hybrid RAG pipeline:

-- Create a hybrid pipeline that combines BM25 keyword scoring
-- with vector similarity
CREATE RAG PIPELINE article_search
    TABLE articles
    VECTOR_COLUMN embedding
    TEXT_COLUMN content
    TOP_K 10
    MODE hybrid
    VECTOR_WEIGHT 0.7
    MODEL 'text-embedding-3-small';

In hybrid mode:

Vector component (weight 0.7): Finds semantically similar content even when exact keywords don’t match
BM25 component (weight 0.3): Boosts results that contain the exact query terms

This gives better results than either approach alone — vector search handles paraphrases and synonyms, while BM25 ensures exact keyword matches rank highly.

See Vector Search for full details on RAG pipelines.

Performance Tips

LIKE patterns starting with % (e.g., LIKE '%keyword%') require a full table scan. When possible, anchor the pattern at the start (e.g., LIKE 'prefix%').
ILIKE internally lower-cases both sides. For frequently queried columns, consider storing a pre-lowered version.
POSITION/STRPOS is vectorized — it operates on entire Arrow string columns at once, not row by row.
For large-scale text search, hybrid RAG pipelines (vector + BM25) provide both precision and semantic understanding.
REGEXP_REPLACE with common URL extraction patterns uses an optimized fast path that avoids the regex engine entirely.