Skip to content

Full-Text Search

ParticleDB provides text search through SQL pattern matching operators and string functions. For semantic search, combine these with Vector Search using hybrid RAG pipelines.

Case-sensitive pattern matching with two wildcards:

  • % matches any sequence of zero or more characters
  • _ matches exactly one character
-- Find emails from a specific domain
SELECT name, email
FROM users
WHERE email LIKE '%@company.com';
-- Find 3-letter product codes starting with "A"
SELECT * FROM products
WHERE sku LIKE 'A__';
-- Find names starting with "Jo"
SELECT * FROM users
WHERE name LIKE 'Jo%';
-- Combine wildcards: names containing "son" anywhere
SELECT * FROM users
WHERE name LIKE '%son%';

Negate the pattern:

-- Users NOT from gmail
SELECT name, email
FROM users
WHERE email NOT LIKE '%@gmail.com';

Case-insensitive version of LIKE. Same wildcard rules, but ignores case when comparing:

-- Find "smith", "Smith", "SMITH", etc.
SELECT * FROM users
WHERE last_name ILIKE '%smith%';
-- Case-insensitive domain search
SELECT * FROM users
WHERE email ILIKE '%@Company.COM';

ILIKE is a PostgreSQL extension supported by ParticleDB. Use it whenever case-insensitive matching is needed without manually calling LOWER() on both sides.


To match a literal % or _ in a LIKE pattern, use a backslash escape:

-- Find discounts stored as "10%", "20%", etc.
SELECT * FROM promotions
WHERE label LIKE '%\%%';
-- Find column names containing underscores
SELECT * FROM metadata
WHERE column_name LIKE '%\_%';

Test whether a string begins or ends with a given prefix or suffix. These return a boolean and are often more readable than equivalent LIKE patterns.

-- Find URLs starting with "https"
SELECT * FROM links
WHERE STARTS_WITH(url, 'https');
-- Find files ending with ".csv"
SELECT * FROM uploads
WHERE ENDS_WITH(filename, '.csv');
-- Equivalent LIKE patterns (STARTS_WITH = 'prefix%', ENDS_WITH = '%suffix')
SELECT * FROM links WHERE url LIKE 'https%';
SELECT * FROM uploads WHERE filename LIKE '%.csv';

Find the position of a substring within a string. Returns a 1-based index, or 0 if not found.

-- Find the position of "@" in email addresses
SELECT email, POSITION(email, '@') AS at_pos
FROM users;
-- STRPOS is an alias
SELECT email, STRPOS(email, '@') AS at_pos
FROM users;
-- Filter: only rows where a substring exists
SELECT * FROM logs
WHERE POSITION(message, 'ERROR') > 0;

Extract a portion of a string. Uses 1-based indexing.

-- Extract domain from email (after the @)
SELECT
email,
SUBSTRING(email, POSITION(email, '@') + 1) AS domain
FROM users;
-- First 10 characters
SELECT SUBSTRING(title, 1, 10) AS title_preview
FROM articles;

Replace all occurrences of a substring:

-- Normalize whitespace
SELECT REPLACE(title, ' ', ' ') AS clean_title
FROM articles;
-- Redact sensitive data
SELECT REPLACE(phone, SUBSTRING(phone, 4, 4), '****') AS masked_phone
FROM users;

Pattern-based replacement using regular expressions:

-- Extract domain from URL
SELECT
url,
REGEXP_REPLACE(url, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS domain
FROM pages;
-- Remove all non-alphanumeric characters
SELECT REGEXP_REPLACE(title, '[^a-zA-Z0-9 ]', '') AS clean_title
FROM articles;

ParticleDB automatically detects common regex patterns (like URL domain extraction) and uses optimized string-slicing fast paths instead of the regex engine.


Build complex text processing pipelines by composing functions:

-- Normalize and search
SELECT *
FROM products
WHERE LOWER(TRIM(name)) LIKE '%wireless%mouse%';
-- Extract and aggregate
SELECT
LOWER(SUBSTRING(email, POSITION(email, '@') + 1)) AS domain,
COUNT(*) AS user_count
FROM users
GROUP BY domain
ORDER BY user_count DESC
LIMIT 10;

Concatenate matching text values across rows:

-- Collect all tags for each article into a comma-separated string
SELECT
article_id,
STRING_AGG(tag, ', ') AS all_tags
FROM article_tags
GROUP BY article_id;
-- Find articles whose tags include "machine-learning"
SELECT article_id, STRING_AGG(tag, ', ') AS tags
FROM article_tags
GROUP BY article_id
HAVING STRING_AGG(tag, ', ') LIKE '%machine-learning%';

Use hash functions to create text fingerprints for deduplication or integrity checks:

-- MD5 fingerprint for deduplication
SELECT
MD5(content) AS content_hash,
COUNT(*) AS duplicates
FROM articles
GROUP BY MD5(content)
HAVING COUNT(*) > 1;
-- SHA-256 for content integrity
SELECT id, title, SHA256(content) AS content_sha
FROM documents;

Available hash functions: MD5, SHA1, SHA256, SHA512.


For production search applications, combine keyword matching with vector similarity using a hybrid RAG pipeline:

-- Create a hybrid pipeline that combines BM25 keyword scoring
-- with vector similarity
CREATE RAG PIPELINE article_search
TABLE articles
VECTOR_COLUMN embedding
TEXT_COLUMN content
TOP_K 10
MODE hybrid
VECTOR_WEIGHT 0.7
MODEL 'text-embedding-3-small';

In hybrid mode:

  • Vector component (weight 0.7): Finds semantically similar content even when exact keywords don’t match
  • BM25 component (weight 0.3): Boosts results that contain the exact query terms

This gives better results than either approach alone — vector search handles paraphrases and synonyms, while BM25 ensures exact keyword matches rank highly.

See Vector Search for full details on RAG pipelines.


  • LIKE patterns starting with % (e.g., LIKE '%keyword%') require a full table scan. When possible, anchor the pattern at the start (e.g., LIKE 'prefix%').
  • ILIKE internally lower-cases both sides. For frequently queried columns, consider storing a pre-lowered version.
  • POSITION/STRPOS is vectorized — it operates on entire Arrow string columns at once, not row by row.
  • For large-scale text search, hybrid RAG pipelines (vector + BM25) provide both precision and semantic understanding.
  • REGEXP_REPLACE with common URL extraction patterns uses an optimized fast path that avoids the regex engine entirely.