# Monitoring
ParticleDB exposes a Prometheus-compatible metrics endpoint and includes a pre-built Grafana dashboard. Metrics collection is designed for zero hot-path impact — counters use cache-line-aligned atomics (~1ns per observation) and histograms use thread-local shards with no cross-thread synchronization.
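The thread-local sharding idea can be sketched in Python. This is an illustrative model of the pattern only (ParticleDB's implementation is native code; the class and method names here are invented): each thread writes to its own shard on the hot path, and the scrape path sums the shards.

```python
import threading

class ShardedCounter:
    """Sketch of a sharded counter: each thread increments its own shard,
    so the hot path never synchronizes with other threads. A scrape sums
    all shards. Illustrative only; not ParticleDB's actual implementation."""

    def __init__(self):
        self._local = threading.local()          # holds this thread's shard
        self._shards = []                        # every shard, read at scrape time
        self._register_lock = threading.Lock()   # taken once per thread, not per increment

    def inc(self, n=1):
        shard = getattr(self._local, "shard", None)
        if shard is None:
            # First increment on this thread: create and register a shard.
            shard = self._local.shard = [0]
            with self._register_lock:
                self._shards.append(shard)
        shard[0] += n  # hot path: no lock, no shared writes

    def value(self):
        # Scrape path: sum over shards. A concurrent increment may be
        # missed until the next scrape, which is fine for metrics.
        return sum(s[0] for s in self._shards)

c = ShardedCounter()
threads = [threading.Thread(target=lambda: [c.inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.value())  # 4000
```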
## Prometheus Metrics Endpoint

ParticleDB runs a dedicated HTTP server for metrics, separate from the main HTTP API so it can be independently firewalled.
```sh
# Default: port 9090
particledb start --metrics-port 9090
```

Scrape metrics at:
```
GET http://<host>:9090/metrics
```

The endpoint returns metrics in Prometheus text exposition format.
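The text exposition format is straightforward to consume from scripts. A minimal standard-library parser sketch (the sample lines below are illustrative, not actual ParticleDB output):

```python
import re

def parse_metrics(text):
    """Parse Prometheus text exposition format into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)', line)
        if m:
            name, labels, value = m.groups()
            samples[(name, labels or "")] = float(value)
    return samples

sample = """\
# HELP pdb_queries_total Total queries by type
# TYPE pdb_queries_total counter
pdb_queries_total{type="SELECT"} 10234
pdb_queries_total{type="INSERT"} 522
pdb_uptime_seconds 86400
"""
metrics = parse_metrics(sample)
print(metrics[("pdb_queries_total", '{type="SELECT"}')])  # 10234.0
```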
### Scrape Configuration

Add ParticleDB to your prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'particledb'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
```

For Kubernetes with service discovery:
```yaml
scrape_configs:
  - job_name: 'particledb'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: particledb
        action: keep
      # Rewrite the scrape address to use the port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

## Core Metrics
### Database-Level Gauges

These are refreshed on every /metrics scrape from the execution context.
| Metric | Type | Description |
|---|---|---|
| `particledb_tables_total` | gauge | Number of tables |
| `particledb_rows_total` | gauge | Total rows across all tables |
| `particledb_table_rows{table="..."}` | gauge | Rows per table |
| `particledb_uptime_seconds` | gauge | Server uptime in seconds |
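On a scrape, these gauges appear in the exposition output along these lines (values illustrative):

```
particledb_tables_total 12
particledb_rows_total 4823901
particledb_table_rows{table="events"} 4200000
particledb_uptime_seconds 86400
```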
### Runtime Metrics (`pdb_` prefix)

These are registered at server startup and updated on the hot path.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `pdb_queries_total` | counter | type | Total queries by type (SELECT, INSERT, UPDATE, DELETE) |
| `pdb_query_duration_seconds` | histogram | | Query execution latency (100us to 30s buckets) |
| `pdb_commit_duration_seconds` | histogram | | Transaction commit latency |
| `pdb_wal_fsync_duration_seconds` | histogram | | WAL fsync latency (10us to 1s buckets) |
| `pdb_connections_active` | gauge | protocol | Active connections by protocol (pg, redis, grpc, http) |
| `pdb_memtable_bytes` | gauge | | Current memtable size in bytes |
| `pdb_sst_files` | gauge | level | SST file count per LSM level |
| `pdb_disk_usage_bytes` | gauge | | Total disk usage |
| `pdb_memory_rss_bytes` | gauge | | Process RSS memory |
| `pdb_uptime_seconds` | gauge | | Server uptime in seconds |
| `pdb_version_info` | gauge | version, edition, os, arch | Server version metadata (value always 1) |
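On the Prometheus side, the counters and histograms above are typically consumed through `rate()` and `histogram_quantile()`. A sketch of recording rules over these metrics (the rule names are suggestions, not something ParticleDB ships):

```yaml
groups:
  - name: particledb_derived
    rules:
      # Queries per second, broken down by statement type
      - record: pdb:queries:rate5m
        expr: sum by (type) (rate(pdb_queries_total[5m]))
      # p99 query latency over a 5-minute window
      - record: pdb:query_latency:p99_5m
        expr: histogram_quantile(0.99, sum by (le) (rate(pdb_query_duration_seconds_bucket[5m])))
```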
### Connection Metrics

The PG wire layer automatically maintains `pg_active_connections` in the global registry, which is included in every /metrics export.
## System Metrics Collector

ParticleDB runs a background collector that periodically gathers system and database metrics:
- Collection interval: 15 seconds (configurable)
- System metrics: CPU usage, memory usage, disk I/O
- PDB metrics: active queries, cache sizes, table statistics, connection counts
- Retention: 1000 snapshots (~4.2 hours at 15s intervals)
Historical metric snapshots are queryable via `__pdb_stat_*` virtual tables:
```sql
-- Recent CPU and memory usage
SELECT * FROM __pdb_stat_system ORDER BY timestamp DESC LIMIT 10;
```

## Health Checks
ParticleDB exposes a health endpoint on the metrics server:
```
GET http://<host>:9090/health
```

Returns 200 OK when the server is healthy. Use this for Kubernetes liveness and readiness probes:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
```

## Grafana Dashboard
ParticleDB ships with a pre-built Grafana dashboard covering:
- Query Performance: QPS, latency p50/p95/p99, error rate, slow queries
- Resource Utilization: CPU usage, memory usage, disk I/O, active connections
- Storage Metrics: table sizes, zone map coverage, cache hit rate, compaction progress
- Replication Health: Raft commit lag, leader elections, follower status
- Alert Status: active alerts and alert history
### Deploying the Dashboard

Use the built-in dashboard manager to push the dashboard to Grafana:
```sh
pdb dashboard deploy --grafana-url http://localhost:3000 --api-key <GRAFANA_API_KEY>
```

Or programmatically:
```rust
use spanner_metrics::dashboard::{DashboardManager, DashboardConfig};

let config = DashboardConfig {
    grafana_url: "http://localhost:3000".to_string(),
    api_key: "glsa_xxx".to_string(),
    ..Default::default()
};
let manager = DashboardManager::new(config);
manager.provision_datasource("http://localhost:9090").await?;
manager.deploy_dashboard().await?;
```

### Manual Import
The dashboard JSON template can also be imported directly into Grafana:
- Open Grafana and navigate to Dashboards > Import.
- Paste the dashboard JSON or upload the file.
- Select your Prometheus data source.
- Save.
The template expects a Prometheus data source pointed at ParticleDB’s metrics port (default 9090).
## External Monitoring Integrations

ParticleDB can forward metrics to external monitoring systems via a background exporter.
### Datadog (StatsD/DogStatsD)

```sql
SET monitoring_integration = 'datadog';
SET datadog_agent_host = 'localhost:8125';
SET monitoring_tags = 'env:production,service:particledb';
SET monitoring_interval_seconds = '10';
```

### New Relic (OTLP)
```sql
SET monitoring_integration = 'newrelic';
-- Configure OTLP endpoint and API key via environment or config file
```

### Grafana Cloud (Prometheus Remote Write)
```sql
SET monitoring_integration = 'grafana_cloud';
-- Configure remote_write endpoint via environment or config file
```

## Query Performance Tracking
Section titled “Query Performance Tracking”ParticleDB tracks per-query statistics including fingerprinting, top-N tracking, and slow query logging.
### Query Stats Virtual Table

```sql
-- Top 10 queries by total execution time
SELECT fingerprint, count, total_time, avg_time, max_time
FROM __pdb_query_stats
ORDER BY total_time DESC
LIMIT 10;
```

### Slow Query Log
Slow queries are captured in a lock-free ring buffer and exposed via:
```sql
SELECT query_text, duration, rows_scanned, timestamp
FROM __pdb_slow_queries
ORDER BY timestamp DESC
LIMIT 20;
```

### Audit Log
When `--audit-log` is enabled, all DDL and DML operations are recorded:
```sql
SELECT * FROM __pdb_audit_log
WHERE timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC;
```

Add `--audit-log-selects` to also log SELECT queries (high volume).
## Diagnostic Environment Variables

For ad-hoc debugging, enable runtime diagnostics via environment variables:
| Variable | Description |
|---|---|
| `PDB_WIRE_TRACE=1` | Log every PG wire message with timing |
| `PDB_WIRE_DUMP=1` | Dump raw wire bytes |
| `PDB_REDIS_PROFILE=N` | Log Redis commands (sample every Nth) |
| `PDB_OLTP_PROFILE=N` | Log OLTP operations (sample every Nth) |
| `PDB_AGG_TELEMETRY=1` | Log aggregation strategy selection |
```sh
PDB_WIRE_TRACE=1 PDB_AGG_TELEMETRY=1 particledb start
```

## Key Alerts to Configure
Recommended Prometheus alerting rules for production deployments:
```yaml
groups:
  - name: particledb
    rules:
      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(pdb_query_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 query latency above 1s"
      - alert: HighConnectionCount
        # Sum across protocols, since pdb_connections_active is labeled by protocol
        expr: sum(pdb_connections_active) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Active connections nearing limit"
      - alert: WALFsyncSlow
        expr: histogram_quantile(0.99, rate(pdb_wal_fsync_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WAL fsync p99 above 100ms -- check disk health"
      - alert: HighMemoryUsage
        expr: pdb_memory_rss_bytes > 30e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RSS memory above 30GB"
```