Nightingale v9 Active Alerts page: centrally view currently unrecovered alert events, filter by business group / severity / data source, handle via claim / mute / delete, and categorize bulk alerts via aggregation rules.
Overview
Active Alerts = the set of events that are still in an alerting state and have not recovered.
Sidebar path: Alerts → Alert Events → Active Alerts tab, URL /alert-cur-events. At the top of the same page there are two tabs, Active Alerts / Historical Alerts:
- Active Alerts: events with status Triggered that have not recovered. Once an alert recovers it disappears from here and moves to historical alerts.
- Historical Alerts: all alert events that have ever occurred (including recovered ones), used for retrospective analysis.
Applicable scenarios:
- On-call engineers immediately see “what failures are currently burning”
- Quickly converge bulk alerts by business group / severity / data source
- Aggregate by label dimensions (e.g., by business group, by host ident) to see the overall picture of “outbreaks across the board”
- Handling flow: view details → claim → (if needed) mute related derived alerts → notify for collaboration
Active alerts only show events that have not recovered. If an alert from a rule has recovered, it’s no longer in this list; to see history, switch to the Historical Alerts tab.
List Page

Top Filter Bar
- My Business Groups / All Business Groups: single-choice switch; “My Business Groups” only shows events from business groups the current account belongs to.
- Business Group dropdown: further select a specific business group within the chosen range.
- Fuzzy search: search across event rule names + labels. Multiple keywords are separated by spaces (AND logic). For example, entering disk dev-doris-001 will match events whose rule name contains “disk” AND labels contain “dev-doris-001”. Double-clicking a label in the list also adds that label as a keyword to the search.
- Auto refresh: in the upper right, Off / 5s / 15s / 30s / 1min / 5min; 30s is recommended for on-call wall displays.
- Time range: defaults to unlimited (shows all unrecovered events, including “old veterans” that have been burning for weeks). Can be set to the last 1 hour, etc., to view only newly erupted ones.
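
The AND logic of the fuzzy search can be sketched as follows. This is only an illustration, not Nightingale's actual implementation: the Event type, the matchesQuery function, and the case-insensitive substring comparison are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, simplified view of an active alert event.
type Event struct {
	RuleName string
	Tags     []string // labels rendered as "key=value"
}

// matchesQuery reports whether every space-separated keyword appears
// somewhere in the rule name or the labels (AND logic across keywords).
// Case-insensitive matching is an assumption of this sketch.
func matchesQuery(e Event, query string) bool {
	haystack := strings.ToLower(e.RuleName + " " + strings.Join(e.Tags, " "))
	for _, kw := range strings.Fields(strings.ToLower(query)) {
		if !strings.Contains(haystack, kw) {
			return false
		}
	}
	return true
}

func main() {
	e := Event{
		RuleName: "disk usage high",
		Tags:     []string{"ident=dev-doris-001", "mount=/data"},
	}
	fmt.Println(matchesQuery(e, "disk dev-doris-001")) // true
	fmt.Println(matchesQuery(e, "disk dev-doris-002")) // false
}
```

Because every keyword must match, each extra keyword can only narrow the result set, never widen it.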
Left Filter Panel
| Category | Options | Description |
|---|---|---|
| Monitoring type | Metric / Host / Log | The prod field of the alert rule; metric, host, log types |
| Alert severity | S1 (Critical) / S2 (Warning) / S3 (Info) | Multi-select |
| Data source | All enabled data sources in the current instance | Multi-select, supports search |
List Columns
| Column | Meaning |
|---|---|
| Event | First line: data source type logo + data source name + rule name (click to open the details drawer); second line: visual display of event labels |
| Trigger time | The most recent time this alert state was detected |
| Duration | The time from the “first trigger” until now, with a 12-cell color bar — the more to the right and the redder, the longer it has persisted (0-8h green, 8-16h yellow, 16-24h+ red) |
| Claimer | Unclaimed / shows the claimer; click “Claim” to take ownership |
| Action | The three-dot menu at the end of the row: mute / delete / claim / unclaim |
The bottom of the list supports pagination and “30/50/100 rows per page” switching.
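
To make the Duration column's color bands concrete, here is a minimal sketch that maps a duration in hours to a band and a number of filled cells. The 8h/16h band limits and the 12 cells come from the table above. The 2-hours-per-cell mapping and the lower-inclusive boundaries are assumptions, and durationBar is a hypothetical name.

```go
package main

import "fmt"

const totalCells = 12

// durationBar returns how many of the 12 cells are filled and the color band.
// Assumptions: each cell covers 2 hours, and a boundary value (8h, 16h)
// belongs to the later band.
func durationBar(hours float64) (cells int, color string) {
	cells = int(hours/2) + 1
	if cells > totalCells {
		cells = totalCells // anything past 24h fills the whole bar
	}
	switch {
	case hours < 8:
		color = "green"
	case hours < 16:
		color = "yellow"
	default:
		color = "red"
	}
	return cells, color
}

func main() {
	cells, color := durationBar(10)
	fmt.Println(cells, color) // 6 yellow
}
```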
Aggregation Rules: Fold Hundreds of Alerts into a Few Cards
When there are more than 50 active alerts, the flat list makes it hard to see “which type of problem is most common.” Aggregation rules use Go templates to categorize events: events that render to the same text are folded into one card displayed above the table.

As shown above, after choosing “By RuleName”, 24 events are grouped into 11 aggregation results: 5 under tes1, 4 under Host Unreachable Alert… Click any card and the table below only shows events corresponding to that card.
Common aggregation expressions (entered under “Add Rule”):
| Scenario | Template |
|---|---|
| By business group + severity | Group:{{.GroupName}} Severity:{{.Severity}} |
| By rule name | {{.RuleName}} |
| By host ident | {{.TagsMap.instance}} or {{.TagsMap.ident}} |
| By service label | {{.TagsMap.service}} |
Available fields (common): .GroupName (business group), .RuleName (rule name), .Severity (severity number), .TagsMap.<key> (any label value).
If you’re not familiar with Go templates, you can start from the built-in examples offered under the “Add Rule” button and customize them.
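
The grouping idea can be reproduced outside the product with Go's standard text/template package. The sketch below renders a template once per event and counts how many events produce each result. The Event struct contains only the fields listed above, and the aggregate function and sample data are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Event mirrors only the template fields listed above; real events carry more.
type Event struct {
	GroupName string
	RuleName  string
	Severity  int
	TagsMap   map[string]string
}

// aggregate renders tmplText for each event and counts events per rendered key.
func aggregate(tmplText string, events []Event) (map[string]int, error) {
	tmpl, err := template.New("agg").Parse(tmplText)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, e := range events {
		var sb strings.Builder
		if err := tmpl.Execute(&sb, e); err != nil {
			return nil, err
		}
		counts[sb.String()]++
	}
	return counts, nil
}

func main() {
	events := []Event{
		{GroupName: "infra", RuleName: "Host Unreachable Alert", Severity: 1, TagsMap: map[string]string{"ident": "web-01"}},
		{GroupName: "infra", RuleName: "Host Unreachable Alert", Severity: 1, TagsMap: map[string]string{"ident": "web-02"}},
		{GroupName: "db", RuleName: "Disk Full", Severity: 2, TagsMap: map[string]string{"ident": "db-01"}},
	}
	counts, err := aggregate("Group:{{.GroupName}} Severity:{{.Severity}}", events)
	if err != nil {
		panic(err)
	}
	for key, n := range counts {
		fmt.Printf("%s -> %d event(s)\n", key, n)
	}
}
```

Each distinct rendered string corresponds to one card, so a template that includes a high-cardinality label (such as a host ident) produces many small cards, while one built from the business group and severity produces a few large ones.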
Bulk Operations
After checking list rows, bulk operation buttons appear above the list:

- Bulk delete (OSS + PLUS)
- Bulk claim / bulk unclaim (PLUS exclusive)
Suitable for “claiming dozens of derived events under the same rule at once” or “clearing out completely invalid zombie events all at once.”
Details Drawer
Clicking on the event title (blue link) slides out the alert details drawer from the right:

Detail Fields
| Field | Description |
|---|---|
| Trend chart | The curve at the top, showing actual values of the alert metric before and after the trigger; red dashed line = trigger moment; time range and Step can be adjusted in the upper right |
| Rule title | Click to jump to the corresponding alert rule edit page /alert-rules/edit/:id |
| Hash | Event unique fingerprint (generated from rule + labels), used to directly copy and compare when troubleshooting duplicate alerts |
| Business group | The business group the event belongs to |
| Rule note | The note filled in when the alert rule was created (“why is this rule alerting”) |
| Data source | The data source instance name that triggered the event |
| Alert severity | S1 / S2 / S3 |
| Event status | Triggered (active) or Recovered (recovered, only visible in historical alerts) |
| Event labels | All key=value labels attached to the data point at trigger time |
| First trigger time / Latest detection time | The time it first entered the alerting state vs. the most recent time it was confirmed to still be alerting |
| Trigger value | The metric value at the trigger moment |
| PromQL (or the query statement for the corresponding data source) | The query used by the alert rule; the ▶️ button beside it directly executes a re-computation in the data source |
| Evaluation frequency / Duration | The query cycle of the rule; the duration the condition must be satisfied before triggering |
| Notification rule / Notification record | The matched notification rule; “view details” shows which channels this event was pushed to and whether it succeeded |
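
To illustrate why a fingerprint built from rule + labels is useful for spotting duplicates, here is a hypothetical sketch. It is not Nightingale's real hashing scheme: the fingerprint function, the use of SHA-256, and the input layout are assumptions chosen only to show that the same rule and label set always yield the same value.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// fingerprint derives a stable ID from a rule ID and a label set.
// Hypothetical scheme for illustration; the product's real Hash may differ.
func fingerprint(ruleID int64, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs) // sort so map iteration order cannot change the result
	input := strconv.FormatInt(ruleID, 10) + "|" + strings.Join(pairs, ",")
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := fingerprint(42, map[string]string{"ident": "web-01", "mount": "/data"})
	b := fingerprint(42, map[string]string{"mount": "/data", "ident": "web-01"})
	fmt.Println(a == b) // true
}
```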
Action Buttons
The four action buttons at the bottom of the drawer:
- Mute: pre-fills the mute rule form with the current event’s labels, navigates to /alert-mutes/add. Commonly used to quickly suppress derived alerts that erupt in bulk (e.g., N derived events caused by a single machine going down).
- Delete: physically delete the current event.
⚠️ Only delete when you’re sure the metric will never be reported again (label changes / machine decommissioned) — such events will not automatically recover and are pure noise if kept. Otherwise let it recover naturally.
- Claim (PLUS): assign the event to yourself; the “Claimer” column in the list will show the current user. Used in on-call scenarios for “I’ve taken over, others don’t need to handle it again.”
- Generate share link: generates a login-free accessible event details link (with token, valid for 7 days by default), to be sent in IM groups for collaborators to view; the link is a read-only snapshot showing the event state at generation time.
FAQ
Q1: Why is an alert in the list still shown as active when it’s clearly recovered?
A: Whether an event stays active depends on whether the alert engine has determined recovery, not on “whether the collected data is still abnormal”. The two most common causes:
- The metric has stopped being reported (machine powered off / labels changed): the alert engine cannot query data and cannot determine “recovery”; it stays in Triggered state. In this case manually delete the event, or add a host unreachable alert rule to cover the scenario.
- The rule has “notify recovery” but no recovery condition configured: check the “recovery threshold” and “duration” of the alert rule to ensure the data can be determined as recovered within the specified duration after returning to normal.
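
The first cause can be pictured with a deliberately simplified evaluation step. This sketch assumes a single "value above threshold" condition and ignores durations and recovery thresholds. The State type and nextState function are hypothetical and do not reflect the alert engine's actual code.

```go
package main

import "fmt"

// State is the event status shown in the details drawer.
type State string

const (
	Triggered State = "Triggered"
	Recovered State = "Recovered"
)

// nextState evaluates one query cycle. value is nil when the query
// returned no data (for example, the metric is no longer reported).
func nextState(current State, value *float64, threshold float64) State {
	if value == nil {
		return current // no data: recovery cannot be confirmed
	}
	if *value > threshold {
		return Triggered
	}
	return Recovered
}

func main() {
	normal := 40.0
	fmt.Println(nextState(Triggered, &normal, 90)) // Recovered
	fmt.Println(nextState(Triggered, nil, 90))     // Triggered
}
```

With no data point to evaluate, the event simply keeps its previous state, which is why it lingers in the active list until it is deleted manually.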
Q2: What’s the difference between “claim” and “mute”?
A:
- Claim: marks “I’m handling this alert”; doesn’t block any subsequent actions — the event remains in the list, and notification rules continue to push (if the rule is configured with repeated notifications). It’s mainly a collaboration signal: letting other on-call colleagues know someone has taken over.
- Mute: label-based event filtering; events hitting a mute rule no longer generate alert notifications, and they don’t appear in the alert event list. Used for “known issues, temporarily don’t bother again.”
See Mute Rules for details.
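
As a rough sketch of label-based muting, the check below treats a mute rule as a set of exact key=value matchers that must all be present on the event. Real mute rules may support more operators and time windows. The muted function and the equality-only matching are assumptions.

```go
package main

import "fmt"

// muted reports whether every matcher label equals the event's label value.
// Assumption: an empty matcher set mutes nothing.
func muted(eventTags, matchers map[string]string) bool {
	if len(matchers) == 0 {
		return false
	}
	for key, want := range matchers {
		if got, ok := eventTags[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	tags := map[string]string{"ident": "web-01", "service": "api"}
	fmt.Println(muted(tags, map[string]string{"ident": "web-01"})) // true
	fmt.Println(muted(tags, map[string]string{"ident": "web-02"})) // false
}
```

Claiming involves no such filter at all: a claimed event still flows through notification rules unchanged.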
Q3: After deleting an event, will the alert rule regenerate this event?
A: Yes. Deleting just clears the current event record; the alert rule itself is not disabled. At the next query cycle, if the condition is still met, a new active alert event will be regenerated (the Hash may be the same).
So don’t use deletion as “muting”: for long-term silence, go to Mute Rules or directly disable the alert rule.
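
The delete-then-regenerate behavior can be modeled with a tiny in-memory simulation. This is a sketch under simplifying assumptions (one boolean condition per cycle, events keyed by Hash). The Engine type and its methods are hypothetical.

```go
package main

import "fmt"

// Engine is a toy model of the active-event set, keyed by event Hash.
type Engine struct {
	active map[string]bool
}

func NewEngine() *Engine {
	return &Engine{active: make(map[string]bool)}
}

// Evaluate models one query cycle for the rule that produces this hash.
func (e *Engine) Evaluate(hash string, conditionMet bool) {
	if conditionMet {
		e.active[hash] = true // creates the event if it is missing
	} else {
		delete(e.active, hash) // condition cleared: event recovers
	}
}

// Delete removes the event record only; the rule keeps running.
func (e *Engine) Delete(hash string) {
	delete(e.active, hash)
}

// Active reports whether an event with this hash is currently listed.
func (e *Engine) Active(hash string) bool {
	return e.active[hash]
}

func main() {
	eng := NewEngine()
	eng.Evaluate("h1", true)
	eng.Delete("h1")
	fmt.Println(eng.Active("h1")) // false
	eng.Evaluate("h1", true)      // next cycle, condition still met
	fmt.Println(eng.Active("h1")) // true
}
```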
Q4: The “Duration” in the list shows a week, a month, or even years — is this normal?
A: These are usually forgotten “zombie alerts”, caused by:
- The data source was deleted or disabled, and the event can never receive a “recovery” signal;
- After a label schema change, old events can no longer be covered by new rules;
- Early test rules were not cleaned up.
Periodic cleanup approach: narrow the list with the “Data Source” filter on the left, sort in reverse chronological order, and batch-delete events whose duration exceeds a reasonable SLA.
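
If the event list is available as data (for example, from an export), the selection step of this cleanup can be sketched as below. The Event fields, the staleEvents function, and the use of Unix timestamps are assumptions for illustration. The sketch only picks candidates and does not call any Nightingale API.

```go
package main

import "fmt"

// Event is a hypothetical, minimal record of an active alert event.
type Event struct {
	ID               int64
	Datasource       string
	FirstTriggerUnix int64 // first trigger time, in Unix seconds
}

// staleEvents returns the events of one data source that have been
// active for longer than maxAgeSec as of nowUnix.
func staleEvents(events []Event, datasource string, nowUnix, maxAgeSec int64) []Event {
	var out []Event
	for _, e := range events {
		if e.Datasource == datasource && nowUnix-e.FirstTriggerUnix > maxAgeSec {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	const day = int64(24 * 60 * 60)
	now := int64(100 * day)
	events := []Event{
		{ID: 1, Datasource: "prom-old", FirstTriggerUnix: now - 40*day},
		{ID: 2, Datasource: "prom-old", FirstTriggerUnix: now - 2*day},
		{ID: 3, Datasource: "prom-new", FirstTriggerUnix: now - 90*day},
	}
	for _, e := range staleEvents(events, "prom-old", now, 30*day) {
		fmt.Println("candidate for deletion:", e.ID)
	}
}
```

Reviewing the candidate list before deleting is worthwhile, since a long-running event can also be a genuine unresolved failure.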