- 快猫星云Flashcat

Self-healing processor (Event Recover) — runs a remediation script after an alert fires to collect more context or attempt automatic recovery.

Overview

The self-healing processor (Event Recover) is a self-healing mechanism for alerts. After an alert fires, it executes a configured script that can collect more alert-related information or run a recovery task.

Use Cases

Case 1: Slow service response — query host load

If a “slow service response” alert fires, the script can query metrics related to the machine’s load.

Case 2: Host disk usage alert — query or clean up via script

If a disk-usage alert fires, in some operations scenarios (for example, logs that need to be cleaned up periodically), the script can query the size of related log directories and automatically perform log cleanup.

Configuration

1. Configure the Event Recover processor

Configuration screenshot

Fill in the self-healing template, the target machine, and parameters. If the target machine is empty, execution uses the machine identified by the ident tag of the alert. If “save execution result” is enabled, the result is saved into the alert message. The current mechanism waits synchronously for the result before sending the alert message. If the self-healing task is not finished within the configured wait time, Nightingale will not wait longer and will send the alert message immediately.

2. Configure the self-healing template

Configuration screenshot If no self-healing template is configured, you can configure one under the Self-Healing menu.

3. Execution results

After the self-healing task finishes, you can view the result on the alert detail page. Configuration screenshot

Notes

In practice, do not configure the wait time for the self-healing task too long. The alert is only sent after the wait expires; a long wait hurts the timeliness of alerts.

FAQ

Q1: The self-healing task ran, but the alert still went out — didn’t it work?

A: The self-healing processor does not automatically suppress the alert notification — it only runs a script and waits for the result. Whether the notification is sent is decided by the notification rules. Common patterns:

Want “try to fix first, do not notify if the fix succeeds”: use Inhibit by Query Data — after the script fixes the issue, have it write a marker metric; if that metric exists in the next cycle, inhibit the alert.
Just want to attach diagnostic information to the alert: enable “save execution result” in this processor and the result will appear as an annotation in the notification.

Q2: Where does the self-healing script’s execution machine come from?

A: Filled in when configuring the processor:

Specify host: pick one or more machines manually;
Leave empty: default to the host identified by the alert event’s ident tag — “run on whichever machine triggered the alert”. The event tags must include ident.

Q3: Does the task keep running after timeout?

A: Yes. The processor only waits for the “configured wait time” and then releases the alert notification; the self-healing task keeps running on the target machine. The result can be viewed later in Task History or Self-Healing Scripts.

Q4: Can multiple self-healing scripts be triggered at the same time?

A: A single processor node only configures one script. To run multiple scripts in parallel, add multiple Event Recover nodes in the same pipeline.