Nightingale v9 Recording Rules: pre-aggregate complex high-cardinality PromQL into new metrics, bringing alert and dashboard query times from minutes down to seconds.
Overview
Recording Rules = run a PromQL statement at a fixed frequency in advance, write the result back as a new metric, and have subsequent alerts/dashboards query this new metric, saving the cost of recomputing each time.
Sidebar path: Data Query → Metrics → Recording Rules Tab, URL /recording-rules.
Why is it needed? Complex queries (multi-join, high-cardinality sum, histogram_quantile) are expensive every time they execute. A recording rule pays off when:
- The same query is reused by multiple alert rules and dashboards;
- The query itself is slow (several seconds, or tens of seconds);
- Long-range queries ([1h], [1d]) are recomputed every minute during alert evaluation.
In these cases, turn the query into a recording rule that pre-computes it every minute (or at the frequency you specify) and stores the result as a new metric. Subsequent alerts and dashboards only query this lightweight new metric and return in milliseconds.
Classic scenarios:
- node:cpu:usage_pct = 100 - (avg by (instance) (rate(cpu_idle[5m])) * 100), shared by alerts and dashboards;
- service:http_latency:p99_5m = histogram_quantile(0.99, sum by (service, le) (rate(http_duration_bucket[5m]))), the basis for all related alerts;
- business:order_success_rate:1m = a business metric, normalized after joining across multiple data sources.
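To make the first scenario concrete, here is a minimal Python sketch of the arithmetic that one evaluation of node:cpu:usage_pct performs. It is an illustration only, not Nightingale's implementation: the function name, the input shape, and the sample dictionaries are assumptions made for this example, and the input stands in for the result of rate(cpu_idle[5m]).

```python
from collections import defaultdict


def evaluate_cpu_usage(idle_rates):
    """Simulate one evaluation of node:cpu:usage_pct (illustrative only).

    idle_rates: list of (labels, rate) pairs standing in for the result of
    rate(cpu_idle[5m]); each rate is the idle fraction of one CPU core.
    Returns one output sample per instance.
    """
    # avg by (instance): group the per-core idle rates by instance label.
    per_instance = defaultdict(list)
    for labels, rate in idle_rates:
        per_instance[labels["instance"]].append(rate)

    samples = []
    for instance, rates in sorted(per_instance.items()):
        avg_idle = sum(rates) / len(rates)
        samples.append({
            "name": "node:cpu:usage_pct",      # the new metric name
            "labels": {"instance": instance},  # only the grouped label survives
            "value": 100 - avg_idle * 100,     # idle fraction -> usage percent
        })
    return samples
```

Note that the per-core `cpu` label disappears in the output: the recorded metric has one series per instance, which is what makes it cheap to query later.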
Create / Edit a Recording Rule
The form is in four sections:
① Basic Configuration
| Field | Required | Description |
|---|---|---|
| Business group | Yes | Rule ownership, for permission isolation |
| Notes | No | Short description of the rule’s purpose, for team collaboration |
| Extra labels | No | key=value format, additional labels written to the new metric. e.g. attach team=infra for downstream filtering |
② Metric Configuration: Define the Output New Metric
Each “metric configuration” block outputs one new metric. Multiple blocks can be added — one rule can output multiple metrics.
| Field | Required | Description |
|---|---|---|
| Write back to data source toggle | Default on | When off, the rule only runs in memory (for debugging), no metric is produced |
| Query A / B / … | At least 1 | Defines one or more PromQL queries; each has its own data source type, data source, and query statement |
| New metric name | Yes | Metric name to output; Prometheus naming convention recommended: <level>:<metric>:<operations>, e.g. node:cpu:usage_pct |
| Expression | Yes | References the previous query results for composition. Default $A uses A’s result directly; can write $A / $B * 100 for secondary computation |
| Target time series source | Yes | Which Prometheus-compatible data source the new metric writes to |
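The Expression field can be pictured as a per-series computation over the query results. The Python sketch below shows one plausible reading of $A / $B * 100. The source does not specify how Nightingale matches series between A and B, so exact label-set matching is an assumption here, as are the function name and the (labels, value) input shape.

```python
def apply_expression(a, b):
    """Compute $A / $B * 100 for series present in both results.

    a, b: lists of (labels, value) pairs, one per series.
    Assumption: series are paired only when their label sets are identical.
    """
    b_by_labels = {frozenset(labels.items()): value for labels, value in b}
    out = []
    for labels, value in a:
        denom = b_by_labels.get(frozenset(labels.items()))
        # Skip series with no partner in B, and avoid dividing by zero.
        if denom is None or denom == 0:
            continue
        out.append((labels, value / denom * 100))
    return out
```

For example, with A as error counts and B as request totals per service, the output is an error percentage per service, and services missing from B produce no output series.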
③ Multi-Query + Cross-Data-Source
The “Add Metric Config” button adds another block — one recording rule can output multiple metrics simultaneously, fitting “one fetch, multiple derived metrics” scenarios.
If you add multiple queries (A, B, C) to one metric config, you can also join across data sources: A from Prometheus, B from MySQL, expression $A * $B. Use this cautiously, since it may not perform better than direct federation.
④ Other Configuration
| Field | Description |
|---|---|
| Execution frequency | cron expression, default @every 15s. Don’t set it shorter than the raw metric scrape interval: if metrics are scraped every 15s, running a recording rule every 5s is meaningless. Common values: @every 1m / @every 15s |
Hands-on: Replace Slow Alert Queries with Recording Rules
A common example. Originally an alert queries:
histogram_quantile(0.99, sum by (service, le) (rate(http_duration_seconds_bucket[5m]))) > 1
This query has high cardinality + long range, and the alert engine runs it every 30s — heavy. Refactor steps:
- Go to /recording-rules/add and create a new rule;
- New metric name: service:http_latency:p99_5m;
- Query A: the PromQL above (without the > 1 threshold);
- Expression: $A;
- Target time series source: choose a data source the alert engine can query;
- Execution frequency: @every 30s (same as alert evaluation);
- Save.
After creation, wait a moment (enough for one run), then go to Ad-hoc Query and verify service:http_latency:p99_5m{} returns data.
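If you prefer to script this check rather than use Ad-hoc Query, the sketch below queries the target data source directly. It assumes the target exposes the standard Prometheus HTTP API instant-query endpoint (/api/v1/query); the base URL in the usage note is a placeholder, and the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def series_count(response_body):
    """Count the series in an instant-query JSON response (0 on failure)."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return 0
    return len(payload["data"]["result"])


def metric_has_data(base_url, metric_name):
    """Return True if the recorded metric currently has at least one series."""
    with urlopen(build_query_url(base_url, metric_name), timeout=10) as resp:
        return series_count(resp.read()) > 0
```

Usage, with a placeholder address: `metric_has_data("http://localhost:9090", "service:http_latency:p99_5m")` should return True once the rule has run at least once.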
Finally, in the alert rule, change PromQL to service:http_latency:p99_5m > 1 — the query reads from a lightweight metric, and alert evaluation finishes in milliseconds.
FAQ
Q1: Recording rule is configured but the new metric can’t be queried?
A: Troubleshoot in order:
- Wait for one execution cycle (default 15s, max 1m);
- Use the “Data Preview” button to see if the PromQL can query the original data — if not, the source PromQL itself has issues;
- Check whether the “target time series source” supports remote_write (most Prometheus-compatible data sources support it, but read-only PromQL gateways may not);
- Check Nightingale Server logs for recording rule eval failed errors;
- Check whether the recording rule’s “Enable” switch is on.
Q2: How to choose the best execution frequency?
A: Recommendation: align with the alert evaluation frequency, and never go shorter than the raw metric scrape interval:
- Raw scrape 15s + alert evaluation 30s → recording rule @every 30s is optimal;
- Scrape 1m + alert evaluation 1m → recording rule @every 1m;
- Want second-level alerts? You need to raise the raw metric scrape frequency to seconds first; otherwise running the recording rule faster only yields the same data points.
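The recommendation above reduces to a small rule of thumb, sketched here in Python. The function name is hypothetical and the logic is only a restatement of the two constraints (align with alert evaluation, never go below the scrape interval), not a Nightingale feature.

```python
def recommended_frequency(scrape_seconds, alert_eval_seconds):
    """Suggest an execution frequency string for a recording rule.

    Aligns with the alert evaluation interval, but never goes below the
    raw metric scrape interval.
    """
    seconds = max(scrape_seconds, alert_eval_seconds)
    # Use minutes when the interval is a whole number of minutes.
    if seconds % 60 == 0:
        return f"@every {seconds // 60}m"
    return f"@every {seconds}s"
```

With a 15s scrape and a 5s alert evaluation, this returns @every 15s: evaluating more often than the scrape interval would only re-record the same data points.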
Q3: Can a recording rule backfill historical data?
A: No. Recording rules only apply to future data: they produce metrics on schedule from the moment they are enabled, so historical intervals have no new metric. For backfilling:
- Query directly on the raw data source with PromQL (accept the slowness);
- Or use the replay (rules backfilling) mode of VictoriaMetrics’ vmalert to replay history offline.
Q4: Will making recording rules for high-cardinality metrics actually be slower?
A: Yes — if done wrong. Recording rules should actively reduce cardinality:
- Use sum by (subset of dimensions): consolidate hundreds of labels to only keep the critical few;
- Don’t retain instance/pod and other ephemeral labels (unless business-required);
- The new metric series count should be 1-2 orders of magnitude less than input.
Otherwise the recording rule only turns “slow query” into “slow write + slightly faster query”.
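A quick way to sanity-check a planned rule is to estimate how many output series a given sum by (...) would produce. The Python sketch below does this over a list of input label sets; the function name and the sample data are assumptions for illustration.

```python
def series_after_sum_by(series_labels, keep):
    """Count the output series of `sum by (keep...)` over the input series.

    series_labels: list of label dicts, one per input series.
    keep: label names retained by the aggregation.
    """
    # Each distinct combination of the kept label values is one output series.
    return len({tuple(labels.get(k) for k in keep) for labels in series_labels})


# Hypothetical input: 3 services x 50 pods = 150 series.
inputs = [
    {"service": f"svc-{s}", "pod": f"pod-{s}-{p}"}
    for s in range(3)
    for p in range(50)
]
reduction = len(inputs) / series_after_sum_by(inputs, ["service"])
```

Here dropping the pod label shrinks 150 input series to 3, a 50x reduction, which is within the 1-2 orders of magnitude the guideline asks for. Keeping pod in the by clause would leave all 150 series and gain nothing.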