- 快猫星云Flashcat

Nightingale v9 Active Alerts page: centrally view currently unrecovered alert events, filter by business group / severity / data source, handle via claim / mute / delete, and categorize bulk alerts via aggregation rules.

Overview

Active Alerts = the set of events that are still in an alerting state and have not recovered.

Sidebar path: Alerts → Alert Events → Active Alerts tab, URL /alert-cur-events. At the top of the same page there are two tabs, Active Alerts / Historical Alerts:

Active Alerts: events with status Triggered that have not recovered. Once an alert recovers it disappears from here and moves to historical alerts.
Historical Alerts: all alert events that have ever occurred (including recovered ones), used for retrospective analysis.

Applicable scenarios:

On-call engineers can see at a glance which failures are currently firing
Quickly narrow down bulk alerts by business group / severity / data source
Aggregate by label dimensions (e.g., by business group, by host ident) to get the overall picture when alerts break out across the board
Handling flow: view details → claim → (if needed) mute related derived alerts → notify for collaboration

Active alerts only show events that have not recovered. If an alert from a rule has recovered, it’s no longer in this list; to see history, switch to the Historical Alerts tab.

List Page

[Figure: Active Alerts list page]

Top Filter Bar

My Business Groups / All Business Groups: single-choice switch; “My Business Groups” only shows events from business groups the current account belongs to.
Business Group dropdown: further select a specific business group within the chosen range.
Fuzzy search: search across event rule names + labels. Multiple keywords are separated by spaces (AND logic). For example, entering disk dev-doris-001 will match events whose rule name contains “disk” AND labels contain “dev-doris-001”. Double-clicking a label in the list also adds that label as a keyword to the search.
Auto refresh: in the upper right, Off / 5s / 15s / 30s / 1min / 5min; 30s is recommended for on-call wall displays.
Time range: defaults to unlimited (shows all unrecovered events, including long-running ones that have been firing for weeks). Can be set to the last 1 hour, etc., to view only recently triggered events.
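
The fuzzy-search rule described above (space-separated keywords, AND logic, matched against rule name plus labels) can be sketched as follows. This is only an illustration: the `Event` type, the `matches` function, and the choice of case-insensitive substring matching are assumptions for the sketch, not Nightingale's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, trimmed-down alert event used only for this sketch.
type Event struct {
	RuleName string
	Tags     map[string]string
}

// matches reports whether every space-separated keyword in query occurs in
// the rule name or in at least one "key=value" label (AND logic).
func matches(e Event, query string) bool {
	haystack := []string{strings.ToLower(e.RuleName)}
	for k, v := range e.Tags {
		haystack = append(haystack, strings.ToLower(k+"="+v))
	}
	for _, kw := range strings.Fields(strings.ToLower(query)) {
		found := false
		for _, h := range haystack {
			if strings.Contains(h, kw) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	e := Event{
		RuleName: "Disk usage above 90%",
		Tags:     map[string]string{"ident": "dev-doris-001", "service": "doris"},
	}
	fmt.Println(matches(e, "disk dev-doris-001")) // both keywords hit
	fmt.Println(matches(e, "disk dev-doris-002")) // second keyword misses
}
```

Adding a keyword can only shrink the result set, which is why double-clicking a label is a quick way to narrow the list.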

Category	Options	Description
Monitoring type	Metric / Host / Log	Corresponds to the `prod` field of the alert rule
Alert severity	S1 (Critical) / S2 (Warning) / S3 (Info)	Multi-select
Data source	All enabled data sources in the current instance	Multi-select, supports search

List Columns

Column	Meaning
Event	First line: data source type logo + data source name + rule name (click to open the details drawer); second line: visual display of event labels
Trigger time	The most recent time this alert state was detected
Duration	Time elapsed since the first trigger, shown with a 12-cell color bar: the longer the alert has persisted, the further right the bar extends and the redder it gets (0-8h green, 8-16h yellow, 16-24h+ red)
Claimer	Unclaimed / shows the claimer; click “Claim” to take ownership
Action	The three-dot menu at the end of the row: mute / delete / claim / unclaim

The bottom of the list supports pagination and “30/50/100 rows per page” switching.

Aggregation Rules: Fold Hundreds of Alerts into a Few Cards

When there are more than 50 active alerts, a flat list makes it hard to see which type of problem is most common. Aggregation rules use Go templates to categorize events, folding events of the same type into a card displayed above the table.

[Figure: Aggregating alerts by rule name]

As shown above, after choosing “By RuleName”, 24 events are grouped into 11 aggregation results: 5 under tes1, 4 under Host Unreachable Alert… Click any card and the table below only shows events corresponding to that card.

Common aggregation expressions (entered under “Add Rule”):

Scenario	Template
By business group + severity	`Group:{{.GroupName}} Severity:{{.Severity}}`
By rule name	`{{.RuleName}}`
By host ident	`{{.TagsMap.instance}}` or `{{.TagsMap.ident}}`
By service label	`{{.TagsMap.service}}`

Available fields (common): .GroupName (business group), .RuleName (rule name), .Severity (severity number), .TagsMap.<key> (any label value).

If you’re not familiar with Go templates, you can start from the built-in examples offered under the “Add Rule” button and customize them.
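
To see how such a template folds events into cards, the sketch below renders a rule against each event with Go's standard `text/template` package and counts events per rendered string. The `Event` struct here is a hypothetical stand-in that carries only the fields listed above, and `aggregate` is an illustrative helper, not the product's code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"text/template"
)

// Event carries only the fields listed above; the real event has more.
type Event struct {
	GroupName string
	RuleName  string
	Severity  int
	TagsMap   map[string]string
}

// aggregate renders rule against each event and counts events per rendered key.
func aggregate(rule string, events []Event) (map[string]int, error) {
	tmpl, err := template.New("agg").Parse(rule)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, e := range events {
		var sb strings.Builder
		if err := tmpl.Execute(&sb, e); err != nil {
			return nil, err
		}
		counts[sb.String()]++
	}
	return counts, nil
}

func main() {
	events := []Event{
		{GroupName: "infra", RuleName: "Host Unreachable Alert", Severity: 1, TagsMap: map[string]string{"ident": "web-01"}},
		{GroupName: "infra", RuleName: "Host Unreachable Alert", Severity: 1, TagsMap: map[string]string{"ident": "web-02"}},
		{GroupName: "db", RuleName: "Disk usage high", Severity: 2, TagsMap: map[string]string{"ident": "db-01"}},
	}
	counts, err := aggregate("Group:{{.GroupName}} Severity:{{.Severity}}", events)
	if err != nil {
		panic(err)
	}
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s -> %d events\n", k, counts[k])
	}
}
```

Each distinct rendered string corresponds to one card, so a template with fewer varying fields produces fewer, larger cards.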

Bulk Operations

After checking list rows, bulk operation buttons appear above the list:

[Figure: Bulk operation buttons]

Bulk delete (OSS + PLUS)
Bulk claim / bulk unclaim (PLUS exclusive)

Suitable for “claiming dozens of derived events under the same rule at once” or “clearing out completely invalid zombie events all at once.”

Details Drawer

Clicking on the event title (blue link) slides out the alert details drawer from the right:

[Figure: Alert details drawer]

Detail Fields

Field	Description
Trend chart	The curve at the top, showing actual values of the alert metric before and after the trigger; red dashed line = trigger moment; time range and Step can be adjusted in the upper right
Rule title	Click to jump to the corresponding alert rule edit page `/alert-rules/edit/:id`
Hash	Unique fingerprint of the event (generated from rule + labels); copy it to compare events when troubleshooting duplicate alerts
Business group	The business group the event belongs to
Rule note	The note filled in when the alert rule was created (“why is this rule alerting”)
Data source	The data source instance name that triggered the event
Alert severity	S1 / S2 / S3
Event status	`Triggered` (active) or `Recovered` (recovered, only visible in historical alerts)
Event labels	All key=value labels attached to the data point at trigger time
First trigger time / Latest detection time	The time it first entered the alerting state vs. the most recent time it was confirmed to still be alerting
Trigger value	The metric value at the trigger moment
PromQL (or the query statement for the corresponding data source)	The query used by the alert rule; the ▶️ button beside it re-runs the query against the data source
Evaluation frequency / Duration	The query cycle of the rule; the duration the condition must be satisfied before triggering
Notification rule / Notification record	The matched notification rule; “view details” shows which channels this event was pushed to and whether it succeeded
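
The Hash row says the fingerprint is generated from rule + labels. The sketch below shows one common way to build such a fingerprint so that the same rule and label set always yield the same value. The exact inputs and hash function Nightingale uses are not described here, so the rule ID parameter, the `|` and `,` separators, and SHA-256 are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// fingerprint derives a stable ID from a rule ID and a label set.
// Labels are sorted first so that map iteration order cannot change the result.
// Hypothetical scheme, not Nightingale's actual algorithm.
func fingerprint(ruleID int64, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	sum := sha256.Sum256([]byte(strconv.FormatInt(ruleID, 10) + "|" + strings.Join(pairs, ",")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := fingerprint(42, map[string]string{"ident": "web-01", "service": "api"})
	b := fingerprint(42, map[string]string{"service": "api", "ident": "web-01"})
	c := fingerprint(42, map[string]string{"ident": "web-02", "service": "api"})
	fmt.Println(a == b) // same rule and labels, different insertion order
	fmt.Println(a == c) // one label value differs
}
```

Under a scheme like this, two events with identical hashes come from the same rule and the same label set, which is what makes the value useful for spotting duplicates.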

Action Buttons

The four action buttons at the bottom of the drawer:

Mute: pre-fills the mute rule form with the current event’s labels, navigates to /alert-mutes/add. Commonly used to quickly suppress derived alerts that erupt in bulk (e.g., N derived events caused by a single machine going down).
Delete: permanently deletes the current event record.

⚠️ Only delete when you’re sure the metric will never be reported again (label changes / machine decommissioned). Such events will not recover automatically and are pure noise if kept. Otherwise, let the event recover naturally.
Claim (PLUS): assign the event to yourself; the “Claimer” column in the list will show the current user. Used in on-call scenarios for “I’ve taken over, others don’t need to handle it again.”
Generate share link: generates a login-free accessible event details link (with token, valid for 7 days by default), to be sent in IM groups for collaborators to view; the link is a read-only snapshot showing the event state at generation time.

FAQ

Q1: Why is an alert in the list still shown as active when it’s clearly recovered?

A: Whether an event counts as active depends on whether the alert engine has determined recovery, not on “whether the collected data is still abnormal”. The two most common causes:

The metric has stopped being reported (machine powered off / labels changed): the alert engine cannot query data and cannot determine “recovery”; it stays in Triggered state. In this case manually delete the event, or add a host unreachable alert rule to cover the scenario.
The rule has “notify recovery” but no recovery condition configured: check the “recovery threshold” and “duration” of the alert rule to ensure the data can be determined as recovered within the specified duration after returning to normal.
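
The first cause can be pictured as a small decision function: with no data, the engine has nothing to base a recovery on, so the state stays as it was. This is a deliberately simplified sketch of the behavior described above. The `Status` type, the `evaluate` function, and the single greater-than threshold are hypothetical, and the duration requirement is left out.

```go
package main

import "fmt"

// Status mirrors the two event states named in this document.
type Status string

const (
	Triggered Status = "Triggered"
	Recovered Status = "Recovered"
)

// evaluate returns the next status of an active event after one query cycle.
// sample is nil when the query returned no data for this label set.
func evaluate(current Status, sample *float64, threshold float64) Status {
	if sample == nil {
		// No data: recovery cannot be determined, so the state is kept.
		return current
	}
	if *sample > threshold {
		return Triggered
	}
	return Recovered
}

func main() {
	high, normal := 95.0, 40.0
	fmt.Println(evaluate(Triggered, &high, 90))   // still above the threshold
	fmt.Println(evaluate(Triggered, &normal, 90)) // back to normal
	fmt.Println(evaluate(Triggered, nil, 90))     // metric stopped reporting
}
```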

Q2: What’s the difference between “claim” and “mute”?

Claim: marks “I’m handling this alert” and doesn’t block any subsequent actions. The event remains in the list, and notification rules continue to push (if the rule is configured with repeated notifications). It’s mainly a collaboration signal that lets other on-call colleagues know someone has taken over.
Mute: label-based event filtering; events hitting a mute rule no longer generate alert notifications, and they don’t appear in the alert event list. Used for known issues that should stop notifying for the time being.

See Mute Rules for details.
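
The difference can be sketched in a few lines: a claim is just a field on the event, while a mute rule filters events by their labels. The types below are hypothetical, and the matcher only supports exact label equality, which is an assumption made to keep the sketch short.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down alert event.
type Event struct {
	RuleName string
	Tags     map[string]string
	Claimer  string // empty means unclaimed
}

// MuteRule is hypothetical: it only supports exact label equality.
type MuteRule struct {
	Match map[string]string
}

// muted reports whether every label required by the rule is present
// on the event with the same value. A rule without labels matches nothing.
func muted(e Event, r MuteRule) bool {
	if len(r.Match) == 0 {
		return false
	}
	for k, want := range r.Match {
		if got, ok := e.Tags[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// visible keeps events that no mute rule hits; claiming does not hide anything.
func visible(events []Event, rules []MuteRule) []Event {
	var out []Event
	for _, e := range events {
		hit := false
		for _, r := range rules {
			if muted(e, r) {
				hit = true
				break
			}
		}
		if !hit {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		{RuleName: "Host Unreachable Alert", Tags: map[string]string{"ident": "web-01"}, Claimer: "alice"},
		{RuleName: "Disk usage high", Tags: map[string]string{"ident": "web-02"}},
	}
	rules := []MuteRule{{Match: map[string]string{"ident": "web-02"}}}
	for _, e := range visible(events, rules) {
		fmt.Println(e.RuleName, "claimed by", e.Claimer)
	}
}
```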

Q3: After deleting an event, will the alert rule regenerate this event?

A: Yes. Deleting just clears the current event record; the alert rule itself is not disabled. At the next query cycle, if the condition is still met, a new active alert event will be regenerated (the Hash may be the same).

So don’t use deletion as “muting”: for long-term silence, go to Mute Rules or directly disable the alert rule.

Q4: The “Duration” in the list shows a week, a month, or even years. Is this normal?

A: These are usually forgotten “zombie alerts”, caused by:

The data source was deleted or disabled, and the event can never receive a “recovery” signal;
After a label schema change, old events can no longer be covered by new rules;
Early test rules were not cleaned up.

Periodic cleanup approach: narrow the list with the “Data Source” filter on the left, sort in reverse chronological order, and bulk-delete events that have been active longer than a reasonable SLA.