Nightingale v9 hires an on-duty SRE co-pilot for your ops team — never off-shift, never quits, and remembers every rule. This article walks through 6 real-world ops scenarios: on-call true/false triage, fault localization, new business onboarding, daily operations, new-hire training, and alert self-healing — explaining where Nightingale AI plugs into your workflow and what problems it solves.
Nightingale AI in One Sentence
Nightingale v9 gives your team a senior SRE co-pilot who is online 7x24.
He never goes off shift, never quits, remembers every alert rule, every machine, every historical alert in this Nightingale instance; he knows which data sources the company uses, which metrics matter most, how notification templates are composed; most importantly — he learns the experience and methodology of your team’s most senior SRE, and the next time a newcomer takes over, he guides them through tasks the way that “old hand” would.
This co-pilot shows up in the 6 most common ops scenarios below.
Scenario 1: On-call Wakes You Up at Midnight — Triage True/False First
How You Handle It Today
At 3 AM, a “server-01 is down” alert blows up your phone. You open Nightingale:
target_up == 0, heartbeat has been gone for 2 minutes;- but
ping server-01is still reachable; - agent stuck? network glitch? or really down?
You SSH in, check processes, check dmesg, check switch logs… you wrestle with it for 20 minutes before daring to draw a conclusion. If you call ops out to the data center and it turns out to be a false alarm, that’s even more awkward.
What It Looks Like With Nightingale AI
Open Copilot right from the notification card or alert event detail page, and ask:
“Is this alert real? Should we act immediately?”
Nightingale AI pulls four layers of evidence to deliver a verdict:
- Heartbeat layer — last
BeatTime,lag_seconds, last reported CPU/memory values from Redis; - Metrics layer — last value and recent trend of
cpu_usage_active/mem_available_percent/system_load1/ network bytes over the last 10 minutes; - Neighbor layer — are other machines in the same business group also abnormal right now (individual fault vs cluster fault vs network partition);
- Mute layer — is this machine currently in a maintenance window.
Outputs a clear verdict: “real outage / agent zombie / network jitter / under maintenance”, with key evidence + suggested action + possible counter-examples (to avoid false confidence).
What You Save
- Probability of a midnight drive to the data center → significantly reduced;
- Single connectivity-loss triage from 20 minutes → 30 seconds;
- Even new on-call staff can render senior-SRE-level judgments.
“agent loss ≠ host outage” — a community-proven root cause of high-frequency false alerts. AI combines multi-layer evidence to drive that false-positive rate down.
Scenario 2: Fault Localization — Finding the Root Cause in a Mess
How You Handle It Today
The business reports: “the order API is slow”. You open Nightingale:
- look at which alerts are firing;
- switch to the dashboard to look at P99 trend;
- dig through database slow query logs;
- switch back to alerts to see if any related infrastructure alerts are firing;
- look up similar historical timestamps…
An experienced engineer can conclude in 15 minutes; a newcomer might not pinpoint anything in an hour.
What It Looks Like With Nightingale AI
“There was a problem with the order business just now — help me check if this MySQL primary-replica lag alert caused it”
AI runs through a senior SRE’s standard playbook:
- Run PromQL to pull the historical trend of the metrics involved in the lag alert;
- Read the alert rule definition (PromQL, threshold, duration);
- Check whether other alerts fired in the same business group within the same time window (data source side, network side, application side);
- Check whether the host’s recent CPU / IO / network metrics show anomalies;
- Output “timeline + key evidence + most likely cause + mitigation suggestions”.
What You Save
- A comprehensive triage from 30 minutes → 1 minute to get the evidence chain (you still make the final call, but the evidence is laid out);
- Cross-data-source / cross-rule queries no longer require manual switching back and forth;
- Newcomers can also get a senior SRE’s view of the evidence combination.
AI’s fault localization does not draw conclusions for you — it organizes the evidence chain for you to inspect. The final “is this the cause, do we take this action” call is still made by humans.
Scenario 3: New Business Online — Set Up the Monitoring System in 5 Minutes
How You Handle It Today
The order service is going to production. SRE standard actions:
- create 4 alert rules (CPU, memory, disk, request success rate);
- configure notification rules (DingTalk during the day, phone calls at night);
- create a monitoring dashboard;
- configure a few mute rules for change windows;
- add self-healing scripts (clean up when disk is full).
Manually filling forms takes 1-2 hours. If you have to tune PromQL thresholds back and forth, possibly half a day.
What It Looks Like With Nightingale AI
Open Nightingale AI and state your need in plain language:
“Add a P2 alert for all production hosts (label env=prod) when CPU usage > 90% for 5 minutes; DingTalk during business hours, phone call outside business hours”
AI does:
- automatically picks the business group (uses the current one if you opened it from a business group page);
- automatically finds the
cpu_usage_activemetric + filters byenv=prodlabel; - composes the PromQL;
- configures threshold 90, duration 5m, severity P2;
- picks the matching notification rule;
- creates the alert rule directly for you.
The same one-sentence approach also works for: dashboards, mute rules, subscribe rules, notification rules, notification templates, self-healing scripts.
What You Save
- Onboarding monitoring configs from 1-2 hours → 5 minutes;
- “Can’t write PromQL” is no longer a blocker — AI knows which metrics are in your library and picks them automatically;
- Newcomers can get hands-on with monitoring configs too.
Scenario 4: Daily Operations — Turn Trivial Tasks into Conversations
How You Handle It Today
Daily ops are interrupted by countless small requests:
- “add hostname to the DingTalk alert template” — look up field tables + tweak style, 10 minutes;
- “mute all alerts for web01 for 2 hours” — click through 7 form steps, 3 minutes;
- “how do I integrate Slack notifications” — flip through docs, trial-and-error, half an hour;
- “why does DingTalk fail with 9499” — packet capture, doc search, 1 hour.
What It Looks Like With Nightingale AI
| What you want to do | What you say | What AI does |
|---|---|---|
| Modify the notification template | “Add hostname + trigger value to DingTalk template, color by severity” | Outputs a paste-ready snippet in Go template syntax; auto-handles both alert and recovery branches |
| Add a temporary mute | “Mute all alerts for host=web01 for 2 hours” | Parses labels + time window + business group, directly creates the mute rule |
| Integrate a new notification platform | “How to integrate Slack” | Provides full Webhook URL + body + Headers + field-level gotchas |
| Troubleshoot delivery failures | “DingTalk fails with 9499” | Explains the error code + 5-layer checklist (URL/signature/Headers/network/rate limit) + curl verification command |
| Query alert events | “Which P1 alerts in the last 1 hour” | Hits the API → summarizes as Markdown table |
| List resources | “Which business groups do I have” / “list alert rules” | Returns the list directly |
What You Save
- 5-10 of these small actions per day, each from 3-30 minutes → 30 seconds;
- No more pulling in a senior colleague at any time to answer “how do I configure this”.
Scenario 5: New Hires Take Over — Independent On-call in Three Days
How You Handle It Today
When a senior SRE leaves or a new hire joins, the painful realities surface:
- Team SOPs are scattered across Confluence / Lark docs, no one maintains them, and new hires can’t tell which ones are still valid;
- The senior’s “experience” — e.g. “when you see a MySQL primary-replica lag alert, first check
pt-heartbeat, then the replication threads, finally the binlog size” — lives only in their head; - A new hire on-call who hits an unfamiliar scenario either makes a phone call or muscles through it, and the MTTR goes up.
What It Looks Like With Nightingale AI
Nightingale v9 provides a mechanism called Skill, essentially “digitized team SOPs”:
- A senior SRE writes their troubleshooting methodology in Markdown as a Skill, spelling out “in what scenario, by what method, with what output”;
- Uploads it to Nightingale;
- AI automatically loads this Skill when a matching scenario shows up, and answers per your methodology.
A real example:
A senior DBA writes a “MySQL slow query troubleshooting four-step playbook (look at Top 10 slow logs → check pool hit rate → check locks → finally EXPLAIN)” as a Skill and uploads it.
When a new hire on-call receives a MySQL lag alert and asks AI “how do I handle this alert”, AI does not give a vague “try EXPLAIN” — it walks the new hire through the DBA’s four-step process — pull the slow log first, then look at the pool, then locks, and only then have them EXPLAIN.
Nightingale v9 also bundles 19 out-of-the-box Skills, covering alert creation/troubleshooting, host diagnostics, notification configuration, PromQL/SQL generation, and semi-self-healing recommendations — usable right after install.
See:
- Skill Authoring Tutorial — end-to-end tutorial on writing your first team Skill
- Built-in Skill Overview — what each of the 19 built-in Skills does
What You Save
- Independent on-call cycle for new hires from 2-4 weeks → 3-7 days;
- The experience a departing senior SRE used to take with them, now stays;
- For the first time, team SOPs have a carrier that will actually be executed.
This is the biggest value Nightingale AI brings to the team — not that it can write PromQL, but that experience becomes deposit-able, reusable, and transferable.
Scenario 6: Alert Self-healing — Stop Letting “Mechanical Alerts” Disturb People
How You Handle It Today
Some alerts fire every week, and the handling action is fixed:
- “Disk full” — clean up files older than 7 days under
/var/log; - “Service down” — restart the corresponding process;
- “Nginx config change not taking effect” —
nginx -s reload.
Same thing every time, but you don’t dare fully automate — what if the script is buggy and takes down production? Who’s responsible?
What It Looks Like With Nightingale AI
Nightingale v9 offers semi-self-healing: AI recommends → human confirms → system executes.
Open Copilot from the alert event detail or notification card, and ask:
“Can this alert self-heal? Help me handle it”
AI does:
- Pulls three layers of evidence: current event + alert rule + last 30 days of history;
- Searches matching candidates in your existing self-healing script library;
- Runs a safety review on each candidate:
- Does the intent match?
- Is the business group boundary correct?
- Are there enough labels for the script?
- Are any dangerous commands triggered (
rm -rf /and the like)? - Is the target host online right now?
- For candidates that pass the review — gives you a “✅ One-click execute” button, with script preview + risk warning + false-positive risk.
That last confirmation must be clicked by a human. AI never runs commands itself.
If your self-healing script library has no match, AI switches to “Self-healing Script Generation Copilot” and drafts a new script per your needs (with stdin parsing, timeout recommendation, dangerous-command guardrails, risk and rollback notes) for you to review and store.
What You Save
- 90% of “mechanical alerts” — closed loop with AI recommendation + a human click to confirm;
- Concerns about “scripts being buggy and breaking prod” — AI auto-filters dangerous commands during generation/recommendation, with dry-run recommendations;
- The senior SRE no longer gets interrupted by the weekly disk-full alert.
Where These Scenarios Are Triggered
Entry points are very simple:
┌──────────────────────────────────────────────────────────────┐
│ Top-right [Nightingale AI] icon — site-wide chat entry │
│ Open on any page, ask about any scenario, AI picks │
├──────────────────────────────────────────────────────────────┤
│ Alert event detail — "can it self-heal" / "why fired" │
│ Notification template editor — "AI Generate" button │
│ PromQL input — AI Generate next to alert/dashboard PromQL │
│ SQL / log query — AI Generate next to SQL / LogQL / ES DSL │
│ Self-healing script editor — "AI Generate Script" button │
└──────────────────────────────────────────────────────────────┘
There are many entry points, but configuration is one-time: integrate an LLM at AI Config → LLM Management once, and all AI capabilities across the platform use the same configuration.
What AI Won’t Do — Its Boundaries
As an ops lead, what you probably care about most is not “how much can it do”, but “will it overstep”. Nightingale AI honors these bottom lines by design:
| Boundary | Description |
|---|---|
| Doesn’t run commands itself | All “execute” actions — running self-healing scripts, modifying production — the last click must be by a human. AI only recommends, never acts. |
| No privilege escalation when reading data | AI runs in the current logged-in user’s permission context. Business groups you can’t see, resources you can’t change — AI can’t access either. |
| No silent data exfiltration | The Nightingale server does not send your alert/metric/host list back to any “Nightingale official service” — there is no central AI service. |
| No dangerous command output | When writing self-healing scripts, refuses to output rm -rf / / mkfs / shutdown / iptables -F and similar dangerous operations, rewriting them as safe variants (dry-run / scoped). |
| Doesn’t take the blame for you | AI’s root cause analyses and self-heal recommendations always come with a “false-positive risk” note — telling you “this conclusion may be wrong in such-and-such cases”. The final judgment and signoff are yours. |
Data Fully Under Your Control
Many ops leaders’ first reaction to “AI” is: “will my alert data be sent to OpenAI?”
Nightingale AI’s design on this point is: you decide where the LLM lives. Nightingale binds to no particular vendor.
| Deployment | Data flow |
|---|---|
| Use OpenAI public API | Data goes over the public internet → OpenAI |
| Use Alibaba Tongyi / Volcengine Doubao | Data stays inside Alibaba Cloud / Volcengine Cloud |
| Use the company’s internal LLM gateway | Data does not leave the company |
| Use local Ollama / vLLM with open-source models | Data fully stays in-domain, completely offline |
Strict data residency requirements? Just run Ollama with DeepSeek / Qwen / Llama — a single GPU server is enough, and it’s plenty effective for frontline SRE scenarios.
See the list of supported providers at LLM Management.
30-minute Hands-on
-
Integrate an LLM (5 min) — go to LLM Management and create an LLM configuration: fill in API URL / API Key / model name → click Test Connection → save once you see “Connection successful” → set it as default.
-
Run through the 6 scenarios (15 min) — find a recent alert and use each of the 6 scenarios above:
- On the event detail page, ask “is this alert real / can it self-heal”;
- In the global chat box, ask “which P1 alerts in the last 1 hour”;
- Next to any PromQL input, click the AI button and have it write a query;
- Try the “AI Generate” button in the notification template editor;
- …
After this lap, you’ll know which scenarios in your team are best to roll out first.
-
Write your first team Skill (10 min) — turn the most frequent “incident handling SOP” in the team into a Skill and upload it, so new hires benefit too. See Skill Authoring Tutorial.
Related Docs
- LLM Management — detailed steps to integrate an LLM
- Skill Management — what a Skill is and how to manage one
- Skill Authoring Tutorial — write a team Skill from scratch
- Built-in Skill Overview — capability list of the 19 out-of-the-box Skills
- Notification Templates → AI Generate Template