- 快猫星云Flashcat

Nightingale v9 hires an on-duty SRE co-pilot for your ops team — never off-shift, never quits, and remembers every rule. This article walks through 6 real-world ops scenarios: on-call true/false triage, fault localization, new business onboarding, daily operations, new-hire training, and alert self-healing — explaining where Nightingale AI plugs into your workflow and what problems it solves.

Nightingale AI in One Sentence

Nightingale v9 gives your team a senior SRE co-pilot who is online 7x24.

He never goes off shift, never quits, remembers every alert rule, every machine, every historical alert in this Nightingale instance; he knows which data sources the company uses, which metrics matter most, how notification templates are composed; most importantly — he learns the experience and methodology of your team’s most senior SRE, and the next time a newcomer takes over, he guides them through tasks the way that “old hand” would.

This co-pilot shows up in the 6 most common ops scenarios below.

Scenario 1: On-call Wakes You Up at Midnight — Triage True/False First

How You Handle It Today

At 3 AM, a “server-01 is down” alert blows up your phone. You open Nightingale:

target_up == 0, heartbeat has been gone for 2 minutes;
but ping server-01 is still reachable;
agent stuck? network glitch? or really down?

You SSH in, check processes, check dmesg, check switch logs… you wrestle with it for 20 minutes before daring to draw a conclusion. If you call ops out to the data center and it turns out to be a false alarm, that’s even more awkward.

What It Looks Like With Nightingale AI

Open Copilot right from the notification card or alert event detail page, and ask:

“Is this alert real? Should we act immediately?”

Nightingale AI pulls four layers of evidence to deliver a verdict:

Heartbeat layer — last BeatTime, lag_seconds, last reported CPU/memory values from Redis;
Metrics layer — last value and recent trend of cpu_usage_active / mem_available_percent / system_load1 / network bytes over the last 10 minutes;
Neighbor layer — are other machines in the same business group also abnormal right now (individual fault vs cluster fault vs network partition);
Mute layer — is this machine currently in a maintenance window.

Outputs a clear verdict: “real outage / agent zombie / network jitter / under maintenance”, with key evidence + suggested action + possible counter-examples (to avoid false confidence).

What You Save

Probability of a midnight drive to the data center → significantly reduced;
Single connectivity-loss triage from 20 minutes → 30 seconds;
Even new on-call staff can render senior-SRE-level judgments.

“agent loss ≠ host outage” — a community-proven root cause of high-frequency false alerts. AI combines multi-layer evidence to drive that false-positive rate down.

Scenario 2: Fault Localization — Finding the Root Cause in a Mess

How You Handle It Today

The business reports: “the order API is slow”. You open Nightingale:

look at which alerts are firing;
switch to the dashboard to look at P99 trend;
dig through database slow query logs;
switch back to alerts to see if any related infrastructure alerts are firing;
look up similar historical timestamps…

An experienced engineer can conclude in 15 minutes; a newcomer might not pinpoint anything in an hour.

What It Looks Like With Nightingale AI

“There was a problem with the order business just now — help me check if this MySQL primary-replica lag alert caused it”

AI runs through a senior SRE’s standard playbook:

Run PromQL to pull the historical trend of the metrics involved in the lag alert;
Read the alert rule definition (PromQL, threshold, duration);
Check whether other alerts fired in the same business group within the same time window (data source side, network side, application side);
Check whether the host’s recent CPU / IO / network metrics show anomalies;
Output “timeline + key evidence + most likely cause + mitigation suggestions”.

What You Save

A comprehensive triage from 30 minutes → 1 minute to get the evidence chain (you still make the final call, but the evidence is laid out);
Cross-data-source / cross-rule queries no longer require manual switching back and forth;
Newcomers can also get a senior SRE’s view of the evidence combination.

AI’s fault localization does not draw conclusions for you — it organizes the evidence chain for you to inspect. The final “is this the cause, do we take this action” call is still made by humans.

Scenario 3: New Business Online — Set Up the Monitoring System in 5 Minutes

How You Handle It Today

The order service is going to production. SRE standard actions:

create 4 alert rules (CPU, memory, disk, request success rate);
configure notification rules (DingTalk during the day, phone calls at night);
create a monitoring dashboard;
configure a few mute rules for change windows;
add self-healing scripts (clean up when disk is full).

Manually filling forms takes 1-2 hours. If you have to tune PromQL thresholds back and forth, possibly half a day.

What It Looks Like With Nightingale AI

Open Nightingale AI and state your need in plain language:

“Add a P2 alert for all production hosts (label env=prod) when CPU usage > 90% for 5 minutes; DingTalk during business hours, phone call outside business hours”

AI does:

automatically picks the business group (uses the current one if you opened it from a business group page);
automatically finds the cpu_usage_active metric + filters by env=prod label;
composes the PromQL;
configures threshold 90, duration 5m, severity P2;
picks the matching notification rule;
creates the alert rule directly for you.

The same one-sentence approach also works for: dashboards, mute rules, subscribe rules, notification rules, notification templates, self-healing scripts.

What You Save

Onboarding monitoring configs from 1-2 hours → 5 minutes;
“Can’t write PromQL” is no longer a blocker — AI knows which metrics are in your library and picks them automatically;
Newcomers can get hands-on with monitoring configs too.

Scenario 4: Daily Operations — Turn Trivial Tasks into Conversations

How You Handle It Today

Daily ops are interrupted by countless small requests:

“add hostname to the DingTalk alert template” — look up field tables + tweak style, 10 minutes;
“mute all alerts for web01 for 2 hours” — click through 7 form steps, 3 minutes;
“how do I integrate Slack notifications” — flip through docs, trial-and-error, half an hour;
“why does DingTalk fail with 9499” — packet capture, doc search, 1 hour.

What It Looks Like With Nightingale AI

What you want to do	What you say	What AI does
Modify the notification template	“Add hostname + trigger value to DingTalk template, color by severity”	Outputs a paste-ready snippet in Go template syntax; auto-handles both alert and recovery branches
Add a temporary mute	“Mute all alerts for host=web01 for 2 hours”	Parses labels + time window + business group, directly creates the mute rule
Integrate a new notification platform	“How to integrate Slack”	Provides full Webhook URL + body + Headers + field-level gotchas
Troubleshoot delivery failures	“DingTalk fails with 9499”	Explains the error code + 5-layer checklist (URL/signature/Headers/network/rate limit) + curl verification command
Query alert events	“Which P1 alerts in the last 1 hour”	Hits the API → summarizes as Markdown table
List resources	“Which business groups do I have” / “list alert rules”	Returns the list directly

What You Save

5-10 of these small actions per day, each from 3-30 minutes → 30 seconds;
No more pulling in a senior colleague at any time to answer “how do I configure this”.

Scenario 5: New Hires Take Over — Independent On-call in Three Days

How You Handle It Today

When a senior SRE leaves or a new hire joins, the painful realities surface:

Team SOPs are scattered across Confluence / Lark docs, no one maintains them, and new hires can’t tell which ones are still valid;
The senior’s “experience” — e.g. “when you see a MySQL primary-replica lag alert, first check pt-heartbeat, then the replication threads, finally the binlog size” — lives only in their head;
A new hire on-call who hits an unfamiliar scenario either makes a phone call or muscles through it, and the MTTR goes up.

What It Looks Like With Nightingale AI

Nightingale v9 provides a mechanism called Skill, essentially “digitized team SOPs”:

A senior SRE writes their troubleshooting methodology in Markdown as a Skill, spelling out “in what scenario, by what method, with what output”;
Uploads it to Nightingale;
AI automatically loads this Skill when a matching scenario shows up, and answers per your methodology.

A real example:

A senior DBA writes a “MySQL slow query troubleshooting four-step playbook (look at Top 10 slow logs → check pool hit rate → check locks → finally EXPLAIN)” as a Skill and uploads it.

When a new hire on-call receives a MySQL lag alert and asks AI “how do I handle this alert”, AI does not give a vague “try EXPLAIN” — it walks the new hire through the DBA’s four-step process — pull the slow log first, then look at the pool, then locks, and only then have them EXPLAIN.

Nightingale v9 also bundles 19 out-of-the-box Skills, covering alert creation/troubleshooting, host diagnostics, notification configuration, PromQL/SQL generation, and semi-self-healing recommendations — usable right after install.

See:

Skill Authoring Tutorial — end-to-end tutorial on writing your first team Skill
Built-in Skill Overview — what each of the 19 built-in Skills does

What You Save

Independent on-call cycle for new hires from 2-4 weeks → 3-7 days;
The experience a departing senior SRE used to take with them, now stays;
For the first time, team SOPs have a carrier that will actually be executed.

This is the biggest value Nightingale AI brings to the team — not that it can write PromQL, but that experience becomes deposit-able, reusable, and transferable.

Scenario 6: Alert Self-healing — Stop Letting “Mechanical Alerts” Disturb People

How You Handle It Today

Some alerts fire every week, and the handling action is fixed:

“Disk full” — clean up files older than 7 days under /var/log;
“Service down” — restart the corresponding process;
“Nginx config change not taking effect” — nginx -s reload.

Same thing every time, but you don’t dare fully automate — what if the script is buggy and takes down production? Who’s responsible?

What It Looks Like With Nightingale AI

Nightingale v9 offers semi-self-healing: AI recommends → human confirms → system executes.

Open Copilot from the alert event detail or notification card, and ask:

“Can this alert self-heal? Help me handle it”

AI does:

Pulls three layers of evidence: current event + alert rule + last 30 days of history;
Searches matching candidates in your existing self-healing script library;
Runs a safety review on each candidate:
- Does the intent match?
- Is the business group boundary correct?
- Are there enough labels for the script?
- Are any dangerous commands triggered (rm -rf / and the like)?
- Is the target host online right now?
For candidates that pass the review — gives you a “✅ One-click execute” button, with script preview + risk warning + false-positive risk.

That last confirmation must be clicked by a human. AI never runs commands itself.

If your self-healing script library has no match, AI switches to “Self-healing Script Generation Copilot” and drafts a new script per your needs (with stdin parsing, timeout recommendation, dangerous-command guardrails, risk and rollback notes) for you to review and store.

What You Save

90% of “mechanical alerts” — closed loop with AI recommendation + a human click to confirm;
Concerns about “scripts being buggy and breaking prod” — AI auto-filters dangerous commands during generation/recommendation, with dry-run recommendations;
The senior SRE no longer gets interrupted by the weekly disk-full alert.

Where These Scenarios Are Triggered

Entry points are very simple:

┌──────────────────────────────────────────────────────────────┐
│  Top-right [Nightingale AI] icon — site-wide chat entry      │
│      Open on any page, ask about any scenario, AI picks      │
├──────────────────────────────────────────────────────────────┤
│  Alert event detail — "can it self-heal" / "why fired"       │
│  Notification template editor — "AI Generate" button         │
│  PromQL input — AI Generate next to alert/dashboard PromQL   │
│  SQL / log query — AI Generate next to SQL / LogQL / ES DSL  │
│  Self-healing script editor — "AI Generate Script" button    │
└──────────────────────────────────────────────────────────────┘

There are many entry points, but configuration is one-time: integrate an LLM at AI Config → LLM Management once, and all AI capabilities across the platform use the same configuration.

What AI Won’t Do — Its Boundaries

As an ops lead, what you probably care about most is not “how much can it do”, but “will it overstep”. Nightingale AI honors these bottom lines by design:

Boundary	Description
Doesn’t run commands itself	All “execute” actions — running self-healing scripts, modifying production — the last click must be by a human. AI only recommends, never acts.
No privilege escalation when reading data	AI runs in the current logged-in user’s permission context. Business groups you can’t see, resources you can’t change — AI can’t access either.
No silent data exfiltration	The Nightingale server does not send your alert/metric/host list back to any “Nightingale official service” — there is no central AI service.
No dangerous command output	When writing self-healing scripts, refuses to output `rm -rf /` / `mkfs` / `shutdown` / `iptables -F` and similar dangerous operations, rewriting them as safe variants (dry-run / scoped).
Doesn’t take the blame for you	AI’s root cause analyses and self-heal recommendations always come with a “false-positive risk” note — telling you “this conclusion may be wrong in such-and-such cases”. The final judgment and signoff are yours.

Data Fully Under Your Control

Many ops leaders’ first reaction to “AI” is: “will my alert data be sent to OpenAI?”

Nightingale AI’s design on this point is: you decide where the LLM lives. Nightingale binds to no particular vendor.

Deployment	Data flow
Use OpenAI public API	Data goes over the public internet → OpenAI
Use Alibaba Tongyi / Volcengine Doubao	Data stays inside Alibaba Cloud / Volcengine Cloud
Use the company’s internal LLM gateway	Data does not leave the company
Use local Ollama / vLLM with open-source models	Data fully stays in-domain, completely offline

Strict data residency requirements? Just run Ollama with DeepSeek / Qwen / Llama — a single GPU server is enough, and it’s plenty effective for frontline SRE scenarios.

See the list of supported providers at LLM Management.

30-minute Hands-on

Integrate an LLM (5 min) — go to LLM Management and create an LLM configuration: fill in API URL / API Key / model name → click Test Connection → save once you see “Connection successful” → set it as default.
Run through the 6 scenarios (15 min) — find a recent alert and use each of the 6 scenarios above:
- On the event detail page, ask “is this alert real / can it self-heal”;
- In the global chat box, ask “which P1 alerts in the last 1 hour”;
- Next to any PromQL input, click the AI button and have it write a query;
- Try the “AI Generate” button in the notification template editor;
- …
After this lap, you’ll know which scenarios in your team are best to roll out first.
Write your first team Skill (10 min) — turn the most frequent “incident handling SOP” in the team into a Skill and upload it, so new hires benefit too. See Skill Authoring Tutorial.