- 快猫星云Flashcat

Nightingale supports metric alerting. Based on user-configured alert rules, it periodically queries data sources and triggers an alert when the data meets the configured threshold.

Nightingale splits alerting into two parts: alerting + notification. Alerting refers to the periodic rule evaluation that finally produces alert events; notification refers to the downstream pipeline and delivery process for those events. This chapter covers the alerting part — once we can produce alert events, we consider it a success.

How Alerting Works

Nightingale supports two alerting modes: normal mode and advanced mode (advanced mode is not yet open-sourced; we plan to open-source it later):

Normal mode: The threshold is included in the PromQL — query and threshold are together. Unless you have special needs, use normal mode. Its alerting logic is the same as Prometheus, and performance is good. The only inconvenience is that obtaining the value at recovery time is a bit tricky.
Advanced mode: query and threshold are separate. If you have multiple queries that need arithmetic operations between them, use advanced mode. The trigger values of each query are shown in the alert event’s “live values”, and the recovery value can be easily obtained when the alert recovers.

How Normal Mode Works

In normal mode, Nightingale periodically queries the data source according to the user-configured execution frequency. The query condition is the user-configured PromQL, and the query method is an instant query — i.e. it calls the data source’s /api/v1/query endpoint. Each returned data point produces one alert event. For example, with PromQL cpu_usage_active > 80, Nightingale queries the TSDB with this PromQL; the TSDB returns the data points whose CPU usage exceeds 80% — those are the points that have crossed the threshold — so Nightingale produces alert events for them.
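
A minimal sketch of one normal-mode evaluation round, assuming a Prometheus-compatible data source. The `/api/v1/query` endpoint and its response shape follow the Prometheus HTTP API. The function names, the `base_url` parameter, and the event dictionary layout are illustrative assumptions, not Nightingale's actual code.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url, promql, timeout=10):
    """Run an instant query against a Prometheus-compatible data source."""
    query = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def events_from_response(payload):
    """Turn each returned series into one (hypothetical) alert event dict."""
    if payload.get("status") != "success":
        return []
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        return []
    events = []
    for sample in data.get("result", []):
        _ts, value = sample["value"]  # [unix_timestamp, "value as string"]
        events.append({"labels": sample["metric"], "trigger_value": float(value)})
    return events


# Example payload in the shape returned for: cpu_usage_active > 80
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "cpu_usage_active", "ident": "host-a"},
             "value": [1700000000, "93.5"]},
        ],
    },
}
print(events_from_response(sample_payload))
```

An empty `result` list yields no events, which is what the recovery logic relies on.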

If the user has configured a duration greater than 0 in the alert rule, it gets more complex: Nightingale will execute the query multiple times during the duration window at the execution frequency, and only generate an alert if the data is present every time. If duration is 0, an alert is generated as soon as one query returns data.
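
The duration check can be pictured as a per-series streak tracker. This is a sketch under the assumption that a series must appear in every consecutive evaluation for at least the configured duration; the class and method names are hypothetical.

```python
class DurationTracker:
    """Tracks how long each series has continuously matched the PromQL."""

    def __init__(self, duration_seconds):
        self.duration = duration_seconds
        self.first_seen = {}  # series key -> timestamp of first consecutive hit

    def evaluate(self, now, returned_series):
        """Return the series that should fire after this evaluation round."""
        returned = set(returned_series)
        # A series missing from this round breaks its streak.
        for key in list(self.first_seen):
            if key not in returned:
                del self.first_seen[key]
        firing = []
        for key in sorted(returned):
            start = self.first_seen.setdefault(key, now)
            if now - start >= self.duration:
                firing.append(key)
        return firing


tracker = DurationTracker(duration_seconds=60)
for now in (0, 15, 30, 45, 60):
    print(now, tracker.evaluate(now, ["host-a"]))
```

With a duration of 0 the condition `now - start >= 0` holds on the first hit, so the series fires immediately.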

If an alert event was previously generated and a later query returns no data, a recovery event is generated: an empty result means no data points in the TSDB satisfy the threshold any longer. There is also an advanced recovery setting called observe duration. When it is set, Nightingale does not emit the recovery event as soon as a query comes back empty; it keeps observing for the configured period. If data is found again within the observe duration, the recovery event is not produced and the alert stays firing. Only if every check within the observe duration returns no data is the recovery event finally produced.
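
The recovery rule with an observe duration can be sketched as follows. The names are hypothetical, and the sketch assumes the observe window starts at the first empty result for a firing series.

```python
class RecoveryTracker:
    """Delays recovery until a series has been absent for the observe duration."""

    def __init__(self, observe_seconds):
        self.observe = observe_seconds
        self.missing_since = {}  # firing series key -> time of first empty result

    def evaluate(self, now, firing, returned_series):
        """Return the subset of `firing` series that recover in this round."""
        returned = set(returned_series)
        recovered = []
        for key in sorted(firing):
            if key in returned:
                # Data came back: cancel any pending recovery.
                self.missing_since.pop(key, None)
                continue
            start = self.missing_since.setdefault(key, now)
            if now - start >= self.observe:
                recovered.append(key)
                del self.missing_since[key]
        return recovered
```

With `observe_seconds=0` a firing series recovers on the first empty result, which matches the default behavior described above.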

From the analysis above, you can see that when an alert recovers in normal mode, the TSDB returns no data, so Nightingale cannot obtain the value at recovery time. This is a common pain point for many users using normal mode. Nightingale designed a workaround for this; see How to get the recovery value at alert recovery time.

How Advanced Mode Works

In advanced mode, the threshold condition is not part of the PromQL — the PromQL contains only filter conditions. For example:

cpu_usage_active{cpu="cpu-total"}

When Nightingale queries the TSDB with this PromQL, the TSDB always returns all CPU usage data points (slightly worse performance). Then Nightingale evaluates the returned data in memory against the user-configured threshold rules, as shown below:

Figure: Advanced mode alerting

The key difference between advanced and normal mode is whether the threshold check happens in the PromQL (delegated to the TSDB) or in Nightingale’s memory. In advanced mode, when a recovery event is triggered, the TriggerValue in the recovery event is automatically populated with the value at recovery time — much easier than in normal mode.
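
To make the in-memory evaluation concrete, here is a small sketch. It assumes the query result has already been reduced to a mapping from series key to latest value; the function name, the operator table, and the event layout are illustrative, not Nightingale's implementation.

```python
import operator

# Comparison operators a threshold rule might use (assumed set).
OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def evaluate_thresholds(samples, op, threshold, firing):
    """Split samples into alert events and recovery events.

    samples: dict of series key -> float value returned by the query
    firing:  set of series keys that are currently alerting
    """
    compare = OPS[op]
    alerts, recoveries = [], []
    for key, value in sorted(samples.items()):
        if compare(value, threshold):
            alerts.append({"series": key, "trigger_value": value})
        elif key in firing:
            # The value is still available, so it can be attached on recovery.
            recoveries.append({"series": key, "trigger_value": value})
    return alerts, recoveries


alerts, recoveries = evaluate_thresholds(
    {"host-a": 91.0, "host-b": 42.0}, ">", 80, firing={"host-b"}
)
print(alerts)
print(recoveries)
```

Because the TSDB returns every series regardless of the threshold, the recovery branch still has a value to report.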

Advanced mode also supports a missing-data evaluation, commonly known as NoData alerting. Nightingale periodically queries the data source and caches the returned series in memory. As long as later queries keep returning the same series, nothing happens. If a subsequent query no longer returns a series that was seen before, that series triggers an alert.
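
The NoData check amounts to a set difference between series seen earlier and series returned now. A minimal sketch with hypothetical names, which keeps every seen series in the cache indefinitely (a real system would presumably expire old entries):

```python
class NoDataDetector:
    """Remembers every series seen so far and reports the ones that vanish."""

    def __init__(self):
        self.seen = set()

    def evaluate(self, returned_series):
        """Return previously seen series that are absent from this round."""
        returned = set(returned_series)
        missing = sorted(self.seen - returned)
        self.seen |= returned
        return missing


detector = NoDataDetector()
print(detector.evaluate(["host-a", "host-b"]))  # first round, nothing cached yet
print(detector.evaluate(["host-a"]))            # host-b stopped reporting
```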

Feature Description

With the principles understood, let’s configure an alert rule to demonstrate Nightingale’s metric alerting.

Where to Create

The menu entry is Alerting - Rule Management - Alert Rules, as shown below:

Figure: Rule creation entry

First, choose a business group on the left. If there is no business group, you need to create one first — alert rules can be numerous, requiring categorized management and access control, so alert rules are bound to business groups.

Business groups are stored as a flat list, but they can be rendered as a tree. Any business group whose name contains / is rendered as a tree node. For example, DBA/MySQL and DBA/Redis are rendered in the tree style shown above. This requires that, in the System Configuration - Site Configuration menu, the business group display mode is set to tree and the business group separator is set to /.
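
Rendering a flat list of names as a tree is a matter of splitting each name on the separator. The following sketch illustrates the idea only; the function name is hypothetical and this is not how Nightingale's frontend is implemented.

```python
def build_group_tree(names, separator="/"):
    """Build a nested dict from flat business group names."""
    tree = {}
    for name in names:
        node = tree
        for part in name.split(separator):
            # Descend into the child node, creating it on first use.
            node = node.setdefault(part, {})
    return tree


print(build_group_tree(["DBA/MySQL", "DBA/Redis", "Network"]))
# {'DBA': {'MySQL': {}, 'Redis': {}}, 'Network': {}}
```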

Below we walk through the meaning of each alert rule configuration item.

🟢 Tip: On the rule configuration page, every form field has a tooltip (the small question-mark icon — hover to see usage hints). Be sure to read them.

Basic Configuration

Rule name: The name of the alert rule, e.g. “High machine load”. Variables like {{ $labels.instance }} can be referenced in the rule name, but this is strongly discouraged because it makes the resulting alert event names differ from one another, making it inconvenient to aggregate alert events.
Additional labels: Labels configured here are appended to the labels of generated alert events. They can later be used for alert event aggregation and filtering.
Notes: A more detailed description of the alert rule. Supports variables like $labels and $value.

Rule Configuration

Data source: Choose the data source type and filter conditions to specify which data sources this alert rule applies to. Since many companies have multiple Prometheus deployments, this makes rules easier to manage.
Alert condition: This is the PromQL. You can include filter conditions and arithmetic in the PromQL. For example, this PromQL: http_api_request_success{region="beijing"} / http_api_request_total{region="beijing"} < 0.995 means: compute the success rate of all HTTP requests in the beijing region — if the success rate is less than 99.5%, alert. If the alert engine retrieves data with this PromQL, it means anomalies exist; if the anomaly persists across multiple queries and finally meets the duration, an alert event is produced.
Multiple alert conditions and severity suppression: A single alert rule can include multiple PromQL queries; in that case a severity suppression toggle appears automatically. If severity suppression is enabled and two conditions both produce alerts, only the higher-severity alert is sent — the lower-severity one is suppressed to reduce noise.
Execution frequency and duration: Both have tooltips on the page; hover for usage hints. Execution frequency is equivalent to Prometheus’ evaluation_interval; duration is equivalent to Prometheus’ for. When duration is 0, an alert event is generated as soon as the query returns data once.
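
The severity suppression described above can be sketched as picking the most severe event per series. This sketch assumes that a smaller severity number means a more severe alert and that suppression applies to events sharing the same label set; both are assumptions for illustration, and the function name is hypothetical.

```python
def suppress_lower_severity(events):
    """Keep only the most severe event for each label set."""
    best = {}
    for event in events:
        key = tuple(sorted(event["labels"].items()))
        # Assumption: smaller number = higher severity.
        if key not in best or event["severity"] < best[key]["severity"]:
            best[key] = event
    return list(best.values())


events = [
    {"labels": {"ident": "host-a"}, "severity": 2},
    {"labels": {"ident": "host-a"}, "severity": 1},
    {"labels": {"ident": "host-b"}, "severity": 3},
]
print(suppress_lower_severity(events))
```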

Event relabel

This section provides a usage doc on the page; please refer to it. Prometheus has a relabel mechanism that many readers will be familiar with (if not, it is worth looking up, as it is a useful design). Prometheus applies relabeling to time-series data; Nightingale applies it to generated alert events.

For example, suppose there is originally a label instance=10.1.2.3:9090. You can use relabel to extract the IP from it and produce a new label ident=10.1.2.3. Nightingale’s alert self-healing feature needs to extract machine info from alert events — specifically, the value of the ident label. Using relabel to write the machine info into the ident label makes subsequent self-healing easier (this assumes you have configured hostname="$ip" in Categraf).
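
The instance-to-ident example can be sketched as a Prometheus-style "replace" action. The function below is a hypothetical illustration of the idea, not Nightingale's relabel engine; note that Python writes the capture group as `\1` where Prometheus configuration uses `$1`.

```python
import re


def relabel_replace(labels, source_label, regex, target_label, replacement):
    """Return a copy of `labels`, adding target_label if the regex fully matches."""
    match = re.fullmatch(regex, labels.get(source_label, ""))
    out = dict(labels)
    if match is not None:
        out[target_label] = match.expand(replacement)
    return out


labels = {"instance": "10.1.2.3:9090"}
print(relabel_replace(labels, "instance", r"([^:]+):\d+", "ident", r"\1"))
# {'instance': '10.1.2.3:9090', 'ident': '10.1.2.3'}
```

Labels that do not match the pattern are returned unchanged, so the rule is safe to apply to every event.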

Effective Configuration

This section also has on-page usage info — please refer to it. The most important setting here is the effective time window — for example, an alert rule that only fires during the day, or one that only fires at night.
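
A day-only or night-only rule comes down to a time-window check that can wrap past midnight. A minimal sketch, with a hypothetical function name and a half-open window as an assumed convention:

```python
from datetime import time


def in_effective_window(now, start, end):
    """True if `now` falls in [start, end); the window may wrap past midnight.

    In this sketch, start == end yields an empty window.
    """
    if start <= end:
        return start <= now < end
    # Wrapping window, e.g. 22:00 to 06:00.
    return now >= start or now < end


print(in_effective_window(time(10, 0), time(9, 0), time(18, 0)))   # True
print(in_effective_window(time(2, 0), time(22, 0), time(6, 0)))    # True
print(in_effective_window(time(12, 0), time(22, 0), time(6, 0)))   # False
```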

Notification Configuration

Figure: Nightingale notification configuration

In old versions, alert recipients and notification media were configured directly in the alert rule, which made batch modifications cumbersome. The new version extracts notification logic into a separate concept — notification rules — which handle everything after an alert event is produced. We’ll go through notification rules in detail later.

In the notification configuration section, other fields have tooltips — hover for hints. We won’t repeat them here.

Alert self-healing means: after an alert is generated, automatically run a specific script on the alerting machine (or a designated control machine). Where does the “alerting machine” info come from? From the ident label of the alert event. Which self-healing script to execute? It is specified by the alert self-healing field under the notification configuration.

Additional information is similar to Prometheus alert rule Annotations. After an alert event is generated, Nightingale appends this additional information to the event, which can later be referenced in message templates and finally rendered in DingTalk, Feishu, email, and other notifications.

Hands-on Demo

To generate an alert event quickly, I configured a PromQL that will definitely trigger:

cpu_usage_active > 0

🟢 cpu_usage_active is a metric collected by Categraf, representing CPU usage. CPU usage is obviously always greater than 0, so this rule fires very quickly. If you don’t use Categraf, you won’t have this metric — please use a metric from your own TSDB for testing.

In the example above, to speed up alert event generation, I set the execution frequency to 15s and the duration to 0. This way Nightingale queries the data source every 15s; if data is returned, an alert event is generated.

After a short wait, you’ll notice that the status field on the left of the alert rule turns into a red exclamation mark, indicating that an alert event has been triggered. Clicking it opens a side panel showing the alert events generated by this rule. You can also see the currently active (unrecovered) alerts and all historical alerts in the alert events menu.

Figure: Nightingale alert events

The first alert event above is the one we just tested; the others are from earlier tests and can be ignored. Now that an alert event has been generated, the alert rule configuration is working. The next step is to configure the notification rule, which specifies which alert events go to whom and through which notification media (phone, SMS, email, Feishu, DingTalk, WeCom, and so on).