- 快猫星云Flashcat

Nightingale v9 Recording Rules: pre-aggregate complex high-cardinality PromQL into new metrics, bringing alert and dashboard query times from minutes down to seconds.

Overview

A recording rule runs a PromQL statement ahead of time at a fixed frequency and writes the result back as a new metric. Alerts and dashboards then query the new metric instead of recomputing the original expression each time.

Sidebar path: Data Query → Metrics → Recording Rules Tab, URL /recording-rules.

Why is it needed? Complex queries (multi-way joins, high-cardinality sum, histogram_quantile) are expensive every time they execute. A recording rule pays off when:

The same query is reused by multiple alert rules and dashboards;
The query itself is slow (several seconds, or tens of seconds);
Long-range queries ([1h], [1d]) are recomputed every minute during alert evaluation.

In these cases, turn the query into a recording rule: it is pre-computed every minute (or at the frequency you specify) and the result is stored as a new metric. Alerts and dashboards then query only this lightweight metric, which typically returns in milliseconds.

Classic scenarios:

node:cpu:usage_pct = 100 - (avg by (instance) (rate(cpu_idle[5m])) * 100), shared by alerts and dashboards;
service:http_latency:p99_5m = histogram_quantile(0.99, sum by (service, le) (rate(http_duration_bucket[5m]))), basis for all related alerts;
business:order_success_rate:1m = a business metric, normalized after joining data from multiple data sources.
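
To make the first scenario concrete, the following minimal Python sketch mimics what one evaluation of node:cpu:usage_pct computes. It is only an illustration: the function name `evaluate_cpu_usage`, the sample data, and the assumption that rate(cpu_idle[5m]) yields a per-core idle fraction between 0 and 1 are not taken from Nightingale, which evaluates the PromQL against the data source itself.

```python
from collections import defaultdict

def evaluate_cpu_usage(idle_rates):
    """Toy equivalent of: 100 - (avg by (instance) (rate(cpu_idle[5m])) * 100).

    idle_rates: list of (labels, value) pairs, where value is the per-core
    idle fraction (0..1) that rate(cpu_idle[5m]) is assumed to return.
    Returns one output sample per instance under the new metric name.
    """
    per_instance = defaultdict(list)
    for labels, value in idle_rates:
        per_instance[labels["instance"]].append(value)

    output = []
    for instance, values in sorted(per_instance.items()):
        avg_idle = sum(values) / len(values)
        output.append((
            {"__name__": "node:cpu:usage_pct", "instance": instance},
            100 - avg_idle * 100,
        ))
    return output

# Hypothetical input: four per-core series collapse into two output series.
samples = [
    ({"instance": "10.0.0.1", "cpu": "0"}, 0.90),
    ({"instance": "10.0.0.1", "cpu": "1"}, 0.70),
    ({"instance": "10.0.0.2", "cpu": "0"}, 0.50),
    ({"instance": "10.0.0.2", "cpu": "1"}, 0.50),
]
for labels, value in evaluate_cpu_usage(samples):
    print(labels, round(value, 2))
```

The point of the sketch is the shape of the result: the per-core `cpu` label disappears, and consumers read one pre-computed value per instance.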

Create / Edit a Recording Rule

The form is in four sections:

① Basic Configuration

Field	Required	Description
Business group	Yes	Rule ownership, for permission isolation
Notes	No	Short description of the rule’s purpose, for team collaboration
Extra labels	No	`key=value` format, additional labels written to the new metric. e.g. attach `team=infra` for downstream filtering
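
As a rough illustration of what extra labels do to the output, the sketch below parses `key=value` pairs and merges them into a series' label set. The helper names, the whitespace separator between pairs, and the rule that an extra label overrides an existing one are all assumptions made for this example; the form itself only specifies the `key=value` format.

```python
def parse_extra_labels(text):
    """Parse whitespace-separated key=value pairs into a dict.

    Assumption: pairs are separated by whitespace; only the key=value
    shape comes from the form description.
    """
    labels = {}
    for item in text.split():
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {item!r}")
        labels[key] = value
    return labels

def attach_labels(series_labels, extra):
    """Return a copy of series_labels with the extra labels added.

    Assumption: extra labels win on conflict. How Nightingale resolves a
    clash with an existing label is not stated in this article.
    """
    merged = dict(series_labels)
    merged.update(extra)
    return merged

extra = parse_extra_labels("team=infra")
print(attach_labels({"__name__": "node:cpu:usage_pct", "instance": "10.0.0.1"}, extra))
```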

② Metric Configuration: Define the New Output Metric

Each “metric configuration” block outputs one new metric. Multiple blocks can be added — one rule can output multiple metrics.

Field	Required	Description
Write back to data source toggle	Default on	When off, the rule runs only in memory (for debugging) and no metric is written
Query A / B / …	At least 1	Defines one or more PromQL queries; each has its own data source type, data source, and query statement
New metric name	Yes	Metric name to output; Prometheus naming convention recommended: `<level>:<metric>:<operations>`, e.g. `node:cpu:usage_pct`
Expression	Yes	References the results of the queries above. The default `$A` uses A’s result directly; write `$A / $B * 100` to derive a value from several queries
Target time series source	Yes	Which Prometheus-compatible data source the new metric writes to
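
The sketch below illustrates what an expression such as `$A / $B * 100` means when A and B each return several series. It assumes, in the style of PromQL vector matching, that values are paired only when their label sets are identical; the article does not describe Nightingale's actual matching rules, and `combine` and the sample data are hypothetical.

```python
def combine(a, b, op):
    """Apply op(a_value, b_value) to series whose label sets match exactly.

    a, b: lists of (labels, value) pairs, one pair per series.
    Series in `a` without a counterpart in `b` are dropped.
    """
    b_index = {frozenset(labels.items()): value for labels, value in b}
    result = []
    for labels, a_value in a:
        key = frozenset(labels.items())
        if key in b_index:
            result.append((labels, op(a_value, b_index[key])))
    return result

# Hypothetical results: A = failed requests, B = total requests, per service.
query_a = [({"service": "checkout"}, 50.0), ({"service": "search"}, 5.0),
           ({"service": "payments"}, 1.0)]
query_b = [({"service": "checkout"}, 200.0), ({"service": "search"}, 40.0)]

# Equivalent of the expression `$A / $B * 100`.
error_pct = combine(query_a, query_b, lambda x, y: x / y * 100)
for labels, value in error_pct:
    print(labels, value)
```

In this model the `payments` series produces no output because B has no matching series, and a zero value in B would need explicit handling before dividing.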

③ Multi-Query + Cross-Data-Source

The “Add Metric Config” button adds another block, so one recording rule can output several metrics at once. This fits “one fetch, multiple derived metrics” scenarios.

If a metric config contains several queries (A, B, C), they can also be joined across data sources: for example, A from Prometheus, B from MySQL, and the expression $A * $B. Use this cautiously, since it may not perform better than direct federation.

④ Other Configuration

Field	Description
Execution frequency	Cron expression, default `@every 15s`. Do not set it shorter than the raw metric scrape interval: if metrics are scraped every 15s, running a recording rule every 5s only recomputes the same data points. Common values: `@every 1m` / `@every 15s`

Hands-on: Replace Slow Alert Queries with Recording Rules

A common example. Originally an alert queries:

histogram_quantile(0.99, sum by (service, le) (rate(http_duration_seconds_bucket[5m]))) > 1

This query is high-cardinality and scans a range on every run, and the alert engine runs it every 30s, which is expensive. Refactor steps:

Go to /recording-rules/add, create a new rule;
New metric name: service:http_latency:p99_5m;
Query A: the PromQL above (without the > 1 threshold);
Expression: $A;
Target time series source: choose a data source the alert engine can query;
Execution frequency: @every 30s (same as alert evaluation);
Save.

After creation, wait a moment (enough for one run), then go to Ad-hoc Query and verify service:http_latency:p99_5m{} returns data.

Finally, in the alert rule, change PromQL to service:http_latency:p99_5m > 1 — the query reads from a lightweight metric, and alert evaluation finishes in milliseconds.
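
The verification step can also be scripted. The sketch below assumes the target data source exposes the standard Prometheus HTTP API (`/api/v1/query`) and that it is reachable at a base URL of your choosing; the host name shown is a placeholder, and the helper names are hypothetical. It is an alternative to the Ad-hoc Query page, not part of Nightingale.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url, promql):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params

def series_count(response_body):
    """Return the number of series in an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])

def metric_has_data(base_url, promql, timeout=5):
    """Query the data source and report whether the metric returns any series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return series_count(resp.read()) > 0

# Example call (placeholder host, not executed here):
# metric_has_data("http://prometheus.example:9090", "service:http_latency:p99_5m")
```

If the function returns False after at least one execution cycle has passed, continue with the troubleshooting list in Q1.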

FAQ

Q1: Recording rule is configured but the new metric can’t be queried?

A: Troubleshoot in order:

Wait for one execution cycle (default 15s, max 1m);
Use the “Data Preview” button to see if the PromQL can query the original data — if not, the source PromQL itself has issues;
Check whether the “target time series source” supports remote_write (most Prometheus-compatible data sources support it, but read-only PromQL gateways may not);
Check Nightingale Server logs for recording rule eval failed errors;
Check whether the recording rule’s “Enable” switch is on.

Q2: How to choose the best execution frequency?

A: Align it with the alert evaluation frequency, and never set it shorter than the raw metric scrape interval:

Raw scrape 15s + alert evaluation 30s → recording rule @every 30s is optimal;
Scrape 1m + alert evaluation 1m → recording rule @every 1m;
Want second-level alerts? You need to raise raw metric scrape frequency to seconds first; otherwise running the recording rule faster only yields the same data points.
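
The rule of thumb above can be written down as a small helper. This is a sketch only: `parse_every` and `recommend_frequency` are hypothetical names, and the parser handles just the single-unit forms that appear in this article (such as `@every 15s` and `@every 1m`), not every expression the scheduler may accept.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def parse_every(spec):
    """Convert a spec such as '@every 30s' into a number of seconds."""
    match = re.fullmatch(r"@every\s+(\d+)([smh])", spec.strip())
    if not match:
        raise ValueError(f"unsupported frequency spec: {spec!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]

def recommend_frequency(scrape_seconds, alert_eval_seconds):
    """Align with alert evaluation, but never go below the scrape interval."""
    seconds = max(scrape_seconds, alert_eval_seconds)
    if seconds % 60 == 0:
        return f"@every {seconds // 60}m"
    return f"@every {seconds}s"

print(recommend_frequency(15, 30))
print(recommend_frequency(60, 60))
```

Taking the maximum of the two intervals covers both rules at once: the result matches the alert evaluation frequency whenever that is not faster than scraping, and falls back to the scrape interval otherwise.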

Q3: Can a recording rule backfill historical data?

A: No. A recording rule only affects future data: it produces the metric on schedule from the moment it is enabled, so earlier intervals contain no samples of the new metric. To cover historical ranges:

Query directly on the raw data source with PromQL (accept the slowness);
Or use the replay mode of VictoriaMetrics’ vmalert to backfill the rule over a historical range.

Q4: Can a recording rule on high-cardinality metrics actually make things slower?

A: Yes, if done wrong. A recording rule should actively reduce cardinality:

Use sum by (<subset of dimensions>) to aggregate away most labels and keep only the critical few;
Do not retain instance, pod, or other ephemeral labels (unless the business requires them);
The new metric should have 1-2 orders of magnitude fewer series than the input.

Otherwise the recording rule only turns a “slow query” into a “slow write plus a slightly faster query”.
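
To estimate the effect before creating a rule, you can count how many series would remain after a `sum by (...)`. The sketch below does this on synthetic label sets; the service names, pod count, and bucket boundaries are invented for illustration, and `count_series_after_sum_by` is a hypothetical helper.

```python
from itertools import product

def count_series_after_sum_by(series, keep):
    """Count the distinct label combinations left after `sum by (keep...)`.

    series: iterable of label dicts, one per input series.
    keep:   names of the labels that the aggregation preserves.
    """
    return len({tuple(labels.get(name, "") for name in keep) for labels in series})

# Synthetic input: 3 services x 50 pods x 6 histogram buckets.
services = ["checkout", "search", "payments"]
pods = [f"pod-{i}" for i in range(50)]
buckets = ["0.1", "0.5", "1", "2.5", "5", "+Inf"]
raw = [{"service": s, "pod": p, "le": le} for s, p, le in product(services, pods, buckets)]

print(len(raw))                                                   # 900 input series
print(count_series_after_sum_by(raw, ("service", "le")))          # 18 output series
print(count_series_after_sum_by(raw, ("service", "pod", "le")))   # 900: pod kept, nothing saved
```

Dropping `pod` shrinks the output by a factor of 50 in this example, while keeping it leaves the series count unchanged, which is the “slow write” case described above.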