- 快猫星云Flashcat

Introduces the principles and data flow of the Nightingale alert engine to help users understand the alert process and troubleshoot alert issues.

Nightingale's core function is its alert engine. To stay flexible, the alerting process involves many moving parts. This article explains them in terms of principles and data flow, which should help you use Nightingale and troubleshoot alerting problems.

Data Flow Principle Overview

Figure: Nightingale alert data flow overview

The user configures alert rules in the Web UI, and the rules are saved in the DB (usually MySQL).
The alert engine syncs the alert rules from the DB into memory. Both the n9e process and, in edge mode, the n9e-edge process have a built-in alert engine. n9e-edge usually cannot read the DB directly, so it obtains the rules by calling the API of the central n9e.
The alert engine creates a goroutine (a coroutine, which can be thought of as a lightweight thread) for each alert rule. At the frequency configured in the rule, it queries the storage, checks the returned data for anomalies, and generates alert events.
Once an alert event is generated, it is first persisted to the DB (usually MySQL) and then passed through the notification rules.
A notification rule has two parts: a set of event processors (such as relabel, event update, event drop, and AI summary) and a set of notification configurations (for example, Critical events go to phone and SMS media, while Warning events go to email only).
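
The last two steps can be sketched as follows. This is a minimal illustration, not Nightingale's implementation: the dictionary fields, the `handle_event` and `media_for` helpers, and the in-memory list standing in for MySQL are all assumptions made for the example.

```python
def media_for(event, notify_configs):
    """Return the notification media whose severity filter matches the event."""
    media = []
    for config in notify_configs:
        if event["severity"] in config["severities"]:
            media.extend(config["media"])
    return media


def handle_event(event, db, notify_configs):
    """Persist the event first, then work out where to notify."""
    db.append(dict(event))  # stand-in for the MySQL write (assumption)
    return media_for(event, notify_configs)


# Hypothetical configuration mirroring the example in the text.
NOTIFY_CONFIGS = [
    {"severities": {"Critical"}, "media": ["phone", "sms"]},
    {"severities": {"Warning"}, "media": ["email"]},
]

db = []
print(handle_event({"severity": "Critical"}, db, NOTIFY_CONFIGS))
```

The point of the ordering is that the event is already stored before any notification logic runs.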

Alert Rules

The core of an alert rule is a query condition. For a Prometheus data source you configure a PromQL expression, and for a ClickHouse data source you configure a SQL statement. You then configure a threshold (with Prometheus, the threshold is part of the PromQL expression and is not configured separately). When the threshold is reached and the duration requirement is met, an alert is triggered.

The alert engine creates a goroutine (coroutine) for each rule, periodically queries the data source, and judges whether the alert conditions are met. Taking the Prometheus data source as an example, the principle is:

Nightingale periodically calls the data source’s /api/v1/query interface, passing the current time and PromQL as query conditions to this interface.
If the data source returns multiple records, multiple alert events are likely to be generated. Next, the duration is checked. If the duration is 0, an alert event is generated immediately. If the duration is greater than 0, the record is put into a cache. When the duration condition is met, an alert event is generated. During the duration, if no data is found in subsequent execution cycles, the record will be removed from the cache and no alert event will be generated.
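
The duration check described above can be sketched as a small state machine. This is a simplified model under stated assumptions: the `evaluate` and `build_query_url` names, the use of a plain dict as the cache, and identifying each returned record by a string key are illustrative choices, and details such as de-duplicating events that are already firing are ignored.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql, now):
    """Instant-query URL: the PromQL and the evaluation time are the parameters."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": promql, "time": now})


def evaluate(pending, series_keys, now, for_duration):
    """Run one evaluation cycle.

    pending      -- dict mapping record key -> time it was first seen (the cache)
    series_keys  -- keys of the records returned by this cycle's query
    Returns the keys that should generate an alert event in this cycle.
    """
    # Records that no longer come back fall out of the cache without firing.
    for key in list(pending):
        if key not in series_keys:
            del pending[key]

    firing = []
    for key in series_keys:
        first_seen = pending.setdefault(key, now)
        # With for_duration == 0 this is true on the first cycle.
        if now - first_seen >= for_duration:
            firing.append(key)
    return firing


pending = {}
for t, returned in [(0, ["host-a"]), (30, ["host-a"]), (60, ["host-a"])]:
    print(t, evaluate(pending, returned, t, for_duration=60))
```

If the middle cycle returned nothing, `host-a` would be removed from the cache and the clock would restart on its next appearance, which is why a single empty query result can suppress an expected alert.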

A common problem: the alert engine found no data when it ran the query, so no alert event was generated, yet later investigation shows data that met the threshold at that point in time. There are two likely causes:

Monitoring data was reported late. Nightingale is only a client here and the data source is the server, so if the data source returned no data, the question has to be answered on the server side. Usually the data had not yet arrived when the query ran, for any of several reasons.
The query timed out. Relevant entries can usually be found in the log file. You can increase the query timeout on the data source configuration page, or investigate why the data source responds slowly. Hardware problems are also possible, such as packet loss on the network cards on the client or server side. To find timeout logs, search for the keyword: alert-${datasource-id}-${alert-rule-id}

Where:

${datasource-id} is the ID of the data source, which can be seen on the data source details page
${alert-rule-id} is the ID of the alert rule, which can be seen in the URL when editing the alert rule
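
As a small aid for that search, the sketch below builds the keyword and filters log lines with it. The helper names and the sample log lines are made up for illustration; only the keyword format comes from the text above. A trailing-digit guard is added so that rule ID 2 does not also match rule ID 23.

```python
import re


def timeout_log_keyword(datasource_id, alert_rule_id):
    """Build the search keyword in the alert-<datasource-id>-<alert-rule-id> form."""
    return f"alert-{datasource_id}-{alert_rule_id}"


def matching_lines(lines, datasource_id, alert_rule_id):
    """Keep only the lines that mention exactly this data source and rule."""
    keyword = timeout_log_keyword(datasource_id, alert_rule_id)
    # (?!\d) rejects longer rule IDs that merely start with the same digits.
    pattern = re.compile(re.escape(keyword) + r"(?!\d)")
    return [line for line in lines if pattern.search(line)]


# Hypothetical log lines, not real Nightingale output.
sample = [
    "WARNING alert-1-23 query timeout",
    "WARNING alert-1-2 query timeout",
]
print(matching_lines(sample, 1, 2))
```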

When troubleshooting alert issues, first check whether an alert event was generated. If it was, the alert rule is working and the problem lies in the notification stage. If it was not, the problem lies in the alert rule or the data source, so start by confirming the alert rule configuration before looking anywhere else.

Event Persistence

After an alert event is generated, it is written to the DB (usually MySQL) so that it appears in the alert event list. The write occasionally fails, and the failure is usually recorded in the logs, so check the WARNING and ERROR entries.

Associating Alert Rules with Notification Rules

After an alert event is generated, which notification rules does it go through? In other words, how are alert rules associated with notification rules? There are two ways:

Configure the notification rule directly in the alert rule. That is, all alert events generated by this alert rule will go through these notification rules.
Don’t configure notification rules in the alert rule, but configure subscription rules instead. That is, in the subscription rule, filter alert events according to various conditions, and the filtered alert events will go through the notification rules configured in the subscription rule.

Both methods work. The former is more intuitive and is recommended unless you have special requirements. The latter suits global event handling. For example, if you want every alert event generated in Nightingale to pass through a Callback processor, create a subscription rule that subscribes to all alert events, associate it with a global notification rule, and configure the Callback processor in that notification rule.
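
The two association paths can be modeled roughly as below. The field names (`notify_rule_ids`, `match_labels`), label-only filtering, and the de-duplication of rule IDs are assumptions for the sake of a compact example; real subscription rules filter on more conditions.

```python
def notification_rules_for(event, alert_rule, subscriptions):
    """Collect notification rule IDs for an event from both association paths."""
    # Path 1: notification rules configured directly on the alert rule.
    rule_ids = list(alert_rule.get("notify_rule_ids", []))

    # Path 2: subscription rules whose filters match the event.
    for sub in subscriptions:
        wanted = sub.get("match_labels", {})
        # An empty filter matches every event, i.e. a global subscription.
        if all(event["labels"].get(k) == v for k, v in wanted.items()):
            for rule_id in sub["notify_rule_ids"]:
                if rule_id not in rule_ids:
                    rule_ids.append(rule_id)
    return rule_ids


event = {"labels": {"team": "db"}}
alert_rule = {"notify_rule_ids": [1]}
subscriptions = [
    {"match_labels": {}, "notify_rule_ids": [9]},              # global
    {"match_labels": {"team": "web"}, "notify_rule_ids": [5]}, # no match
]
print(notification_rules_for(event, alert_rule, subscriptions))
```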

Notification Rule Configuration

The figure below shows the notification rule editing page, with the role of each section annotated:

Figure: Nightingale notification rule configuration

Most form fields have a small question-mark icon next to their title. Hover over it to see a hint, and configure the field accordingly.

💡 This page contains several notification test buttons. Click one and select an existing alert event to test the notification rule, which makes it easy to verify that the rule behaves as expected. Also note that alert events are persisted before the notification rule runs, so the event processors in a notification rule do not modify the alert events stored in the DB.

Event Processors

💡 Event Pipeline has no separate menu entry. It is part of the notification rule: on the notification rule editing page, click the small gear icon in the “Event Processing” section to open the configuration drawer for event processors.

Event processors are an advanced mechanism for applying various kinds of processing to alert events, for example:

Relabel alert events: split labels, modify labels, and so on.
Update alert events. Nightingale passes the alert event to a third-party interface (such as a CMDB), the third party modifies the event and returns it, and the modified event continues through the remaining processing logic. This makes integration with external systems easier.
Drop alert events. Some alert events do not need a notification, and you can apply complex conditions here to drop the ones that match.
Generate AI summaries. The alert event is passed to an AI service such as DeepSeek to generate a summary and suggested solutions. The AI-generated content is added to the event and then sent out through the notification media.

Two concepts are worth distinguishing here:

Pipeline: clicking the button on the right side of Notification Rule - Event Processing opens a drawer that contains the list of Pipelines.
Processor: each Pipeline can contain multiple Processors. To maximize reusability, you can also keep it simple and put only one Processor in each Pipeline.
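
The relationship between Pipelines and Processors can be sketched as nested lists of functions. Everything here is illustrative: the processor functions, the event fields, and the convention that returning `None` drops the event and stops further processing are assumptions, not Nightingale's actual interfaces. The deep copy reflects the point made earlier that processors do not change the event already stored in the DB.

```python
import copy


def relabel(event):
    """Example processor: rename the 'instance' label to 'host'."""
    labels = event["labels"]
    if "instance" in labels:
        labels["host"] = labels.pop("instance")
    return event


def drop_info(event):
    """Example processor: drop low-severity events by returning None."""
    return None if event["severity"] == "Info" else event


def run_pipelines(stored_event, pipelines):
    """Run every processor of every pipeline in order on a copy of the event."""
    event = copy.deepcopy(stored_event)  # the stored event stays untouched
    for pipeline in pipelines:
        for processor in pipeline:
            event = processor(event)
            if event is None:  # dropped: nothing left to notify
                return None
    return event


stored = {"severity": "Critical", "labels": {"instance": "10.0.0.1"}}
# Two single-processor pipelines, the reusable layout suggested above.
print(run_pipelines(stored, [[relabel], [drop_info]]))
print(stored)
```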

Each processor has a documentation link on the page. After clicking, you can view the detailed documentation. You can also refer to the materials at the following two links:

Notification Configuration

This part has been explained before and will not be repeated here. Please refer to:

Notification Rule Design Intent and Usage Description

FAQ

How to import Prometheus alert rules?

Many people in the Prometheus ecosystem share alert rules, such as this project:

https://github.com/samber/awesome-prometheus-alerts/tree/master/dist/rules

Each directory contains alert rules in yaml format. For example, the host-and-hardware directory contains common node-exporter alert rules. To import these rules directly into Nightingale, follow the steps below.

Version Notes

Please use Nightingale v8.2.0 or higher.

Import Steps

As shown in the screenshot above, select Import on the alert rule page to import alert rules in Prometheus format. The yaml content starts with a top-level groups key that holds a list of groups. Each group has a name and a rules array containing the actual alert rules. Nightingale ignores the group name and imports the contents of rules directly.
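
To make the expected structure concrete, the sketch below shows a parsed rule file as a Python dict and flattens it the way the text describes. The rule names, expressions, and the `flatten_rule_groups` helper are hypothetical; this only models the "ignore the group name, keep the rules" behavior and is not Nightingale's import code.

```python
def flatten_rule_groups(doc):
    """Return every rule from every group, ignoring the group names."""
    rules = []
    for group in doc.get("groups", []):
        rules.extend(group.get("rules", []))
    return rules


# What a Prometheus-format rule file looks like once the yaml is parsed.
doc = {
    "groups": [
        {
            "name": "host",
            "rules": [{"alert": "HostDown", "expr": "up == 0", "for": "1m"}],
        },
        {
            "name": "disk",
            "rules": [{"alert": "DiskFull", "expr": "disk_free_percent < 10"}],
        },
    ]
}

for rule in flatten_rule_groups(doc):
    print(rule["alert"], "->", rule["expr"])
```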

After the import, you usually need to associate notification rules so that notifications are actually sent. To do so, select the alert rules in bulk, click More Operations in the upper right corner, and choose batch update.

In the batch update dialog, set the field to notification rule, select the desired notification rule, and click OK.

Common Questions

Q1: What should I do if there are so many alerts that some get missed?

Use Alert Aggregation to merge similar alerts into one;
Use Mute Rules to suppress known noise;
Use Notification Escalation to make important alerts go through higher-priority channels;
Integrate with FlashDuty for alert on-duty and scheduling.

Q2: How to avoid alert storms?