List and use cases of the 19 Skills bundled in the Nightingale v9 binary: covering alert rule creation/troubleshooting, host diagnostics, notification configuration, data queries, PromQL/SQL generation, self-healing recommendations and other high-frequency ops chains.

Overview

Nightingale v9 embeds a set of out-of-the-box Skills in the binary. They appear in the AI Config → Skill Management list right after installation, with no manual upload required. These Skills are tuned for the highest-frequency scenarios Nightingale itself sees: categraf deployment, alert rule creation and troubleshooting, host diagnostics, notification configuration, data queries, PromQL/SQL generation, and semi-self-healing recommendations. Together they cover most daily tasks of frontline SREs.

How to identify them: the Skill detail page shows the author as system, and does not allow web-side modification/deletion (the replace/delete entries are hidden). If you want to override a built-in Skill’s behavior, just create a Skill with the same name — when names collide, the user Skill in the database wins.
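
The override rule can be pictured as a dictionary merge in which database entries replace same-name built-ins. This is a minimal sketch under assumptions: the in-memory layout, the field names, and the function `effective_skills` are hypothetical and are not Nightingale's actual storage model.

```python
# Hypothetical in-memory model; the real storage layout is not described in this document.
BUILTIN_SKILLS = {
    "promql-generator": {"author": "system", "description": "Generate PromQL from natural language"},
    "sql-generator": {"author": "system", "description": "Generate SQL from natural language"},
}


def effective_skills(builtin, user_defined):
    """Return the Skills in effect: on a name collision the user Skill wins."""
    merged = dict(builtin)       # start from the Skills shipped in the binary
    merged.update(user_defined)  # same-name entries from the database replace them
    return merged


user_skills = {
    "promql-generator": {"author": "alice", "description": "PromQL with team label conventions"},
}

skills = effective_skills(BUILTIN_SKILLS, user_skills)
print(skills["promql-generator"]["author"])  # alice
print(skills["sql-generator"]["author"])     # system
```

Built-ins that are not overridden stay in effect unchanged, which is why creating a single same-name Skill is enough to change one behavior.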

Whether a Skill is invoked depends on its description: the AI matches based on “user question + Skill description”. The “trigger scenarios” listed for each Skill below come directly from the keywords in its own description; a user question hitting those keywords causes the Skill to be auto-injected into the context. See Skill Management → Tips for writing prompts.
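
To see why the wording of a description matters, here is a deliberately naive stand-in for the matching step. In Nightingale the matching is done by the AI, not by keyword counting; the token-overlap scoring and the function names below are assumptions used only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_skills(question, skills):
    """Rank Skills by how many tokens the question shares with each description."""
    question_tokens = tokenize(question)
    scored = []
    for name, description in skills.items():
        overlap = len(question_tokens & tokenize(description))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by Skill name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


SKILLS = {
    "n9e-create-alert-mute": "mute / silence / suppress alerts",
    "categraf-deploy-guide": "how to install categraf, deploy categraf, run categraf with Docker",
    "sql-generator": "generate SQL from natural language",
}

print(rank_skills("How do I install categraf with Docker?", SKILLS))
# ['categraf-deploy-guide']
```

A description that lacks the words users actually type gives the matcher nothing to hit, whatever the matching mechanism is.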

Below, the 19 built-in Skills are grouped into five categories by purpose.


1. Deployment & Integration

categraf-deploy-guide — Categraf Deployment Guide

  • Trigger scenarios: how to install categraf / how to deploy categraf / run categraf with Docker / install categraf on Windows / register categraf as a system service / categraf reports to Nightingale / how to write config.toml / how to verify categraf is collecting data.
  • What it does: a tutorial / guidance Skill that calls no tools and directly outputs copy-paste-runnable commands and config snippets. Covers binary + systemd, Docker, Windows, K8s notes, key configs, common verification commands.
  • Not in scope: installed but integration failed → defer to n9e-host-onboard-diagnose; integrated but metrics abnormal → defer to n9e-host-health-diagnose.

2. Creation (Turning Natural Language into Configuration)

These Skills all call the corresponding write API (create_*), auto-assemble the payload to complete creation, and save you from clicking through fields in forms.

n9e-create-alert-rule — Create Alert Rules

  • Trigger scenarios: “create an alert rule” in Nightingale.
  • Supported data sources: Prometheus / Loki / Elasticsearch / OpenSearch / TDengine / ClickHouse / MySQL / PostgreSQL / Doris / VictoriaLogs / Host — all data source types.
  • Two invocation modes:
    • Prometheus simplified path (most common) — just provide PromQL + threshold + comparator;
    • General path — for other data sources, automatically reads datasources/<cate>.md template and fills values.
  • Related tools: create_alert_rule, list_busi_groups, list_datasources, list_metrics, list_notify_rules, etc.
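
As an illustration of the Prometheus simplified path, the three inputs (PromQL, comparator, threshold) are enough to form a complete alerting expression. The helper `build_alert_expr` and the sample metric are assumptions; the actual payload that create_alert_rule expects is not shown in this document and is not reproduced here.

```python
ALLOWED_COMPARATORS = {">", ">=", "<", "<=", "==", "!="}


def build_alert_expr(promql, comparator, threshold):
    """Combine a PromQL query, a comparator and a threshold into one expression."""
    if comparator not in ALLOWED_COMPARATORS:
        raise ValueError(f"unsupported comparator: {comparator!r}")
    # Parenthesize the query so the comparison applies to the whole expression.
    return f"({promql}) {comparator} {threshold:g}"


# Sample query for illustration only.
expr = build_alert_expr('avg by (ident) (cpu_usage_active{cpu="cpu-total"})', ">", 85)
print(expr)
# (avg by (ident) (cpu_usage_active{cpu="cpu-total"})) > 85
```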

n9e-create-alert-mute — Create Alert Mute Rules

  • Trigger scenarios: mute / silence / suppress alerts, e.g. “mute all alerts for host=web01 for 2 hours”.
  • What it does: log in to get a token → parse the mute conditions (labels / time window / business group) → call the create API.
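
The "parse mute conditions" step can be sketched as extracting label matchers and a time window from the sentence. The Skill relies on the AI for this parsing; the regular expressions, the function `parse_mute_request`, and the returned field names are hypothetical and do not reflect the real mute rule schema.

```python
import re
from datetime import datetime, timedelta, timezone

UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def parse_mute_request(text, now):
    """Extract key=value label matchers and a 'for N <unit>' duration from text."""
    labels = dict(re.findall(r"(\w+)=([\w.\-]+)", text))
    match = re.search(r"for\s+(\d+)\s+(minute|hour|day)s?\b", text)
    if match is None:
        raise ValueError("no duration found in request")
    seconds = int(match.group(1)) * UNIT_SECONDS[match.group(2)]
    # Field names below are illustrative, not the actual mute rule schema.
    return {"labels": labels, "begin": now, "end": now + timedelta(seconds=seconds)}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
mute = parse_mute_request("mute all alerts for host=web01 for 2 hours", now)
print(mute["labels"], mute["end"].isoformat())
# {'host': 'web01'} 2024-01-01T14:00:00+00:00
```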

n9e-create-alert-subscribe — Create Alert Subscribe Rules

  • Trigger scenarios: subscribe to alerts / add an alert subscription / configure alert event forwarding, e.g. “subscribe to all CPU-related alerts and notify the ops group”.
  • What it does: filters alert events by conditions and forwards events to designated recipients via notification rules.

n9e-create-notify-rule — Create Notification Rules (linear 4 steps)

  • Trigger scenarios: the user has already specified “what severity, what time window, what channel, send to whom” — create a rule step by step.
  • What it does: log in → look up user groups → look up channels → assemble payload → create.
  • For complex scenarios, use n9e-notify-rule-copilot (see Copilot section below).

n9e-create-dashboard — Create Monitoring Dashboard

  • Trigger scenarios: build a monitoring dashboard / Dashboard.
  • What it does: the user only provides panel titles, panel types (stat / timeseries / table, etc.) and PromQL; the Skill auto-generates the full dashboard config (layout, data source binding, styles).
  • Related tools: create_dashboard, list_files, read_file, grep_files.

n9e-modify-task-tpl — Generate / Modify Alert Self-healing Scripts

  • Trigger scenarios: write self-healing scripts (disk cleanup / restart service / log cleanup / dump process / reload nginx, etc.); or ask “how does a self-healing script receive parameters passed from an alert”, “what’s the stdin format”, “what should timeout be”, “why is is_recovered always false”, “what to do when the script stays running”.
  • Coverage layer: script body layer only (task_tpl table); if the user wants to change alert rules, recipients, or templates, it redirects to the corresponding Skill.

3. Notification Configuration Copilots (Three-Layer Division)

Nightingale’s notification chain has three layers: channel / template / rule. Each layer has one dedicated Skill that handles natural-language requests for that layer only, so pick the Skill by what the user is asking about:

User’s words | Which Skill
URL / Webhook address / signature / AppID / how to integrate platform X | n9e-notify-channel-copilot
Template / body / fields / card color / {{ ... }} variables | n9e-generate-message-template
Send to whom / tiered routing / business-hours / route by business group / filter by labels | n9e-notify-rule-copilot

n9e-notify-channel-copilot — Notification Channel Copilot

  • Trigger scenarios: modify DingTalk / Lark / WeCom / email / SMS / voice / Webhook etc. channel URL, body, signature, headers, proxy, TLS, @mention / recipient field; or ask “how to integrate platform X”, “why is sending failing / 9499 / Bad Request”.
  • What it does: based on the NotifyChannelConfig model, gives paste-ready config + field-level gotcha warnings.

n9e-generate-message-template — Generate / Modify Message Templates

  • Trigger scenarios: write notification templates / change message format / add hostname / recovery value / severity / DingTalk / Lark / email / SMS / phone templates.
  • What it does: outputs snippets in Go text/template / html/template syntax that can be pasted directly into the template editor; auto-injects common variables like $event / $labels / $value.

n9e-notify-rule-copilot — Notification Rule Copilot

  • Trigger scenarios: split natural-language needs like “P1 during business hours via DingTalk + phone, off-hours phone only”, “route by business group/label”, “tiered to different channels”, “no phone on recovery” into the correct NotifyConfig array; or edit / copy / fine-tune existing rules.
  • What it does: ① catch fuzzy/complex natural language needs and split into multiple NotifyConfig; ② edit / copy / tweak existing rules; ③ field-level gotcha warnings; ④ guide test sending and diffing against real alerts.
  • Simple creation → n9e-create-notify-rule, complex or editing scenarios → this Skill.
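
To make the "split into multiple NotifyConfig" idea concrete, the request "P1 during business hours via DingTalk + phone, off-hours phone only" decomposes into two entries, one per time window. The field names and the 09:00 to 18:00 business-hours window are assumptions for illustration; they are not the actual NotifyConfig schema.

```python
from datetime import time

# Illustrative entries; field names and the business-hours window are assumed.
notify_configs = [
    {"severities": [1], "start": time(9, 0), "end": time(18, 0), "channels": ["dingtalk", "phone"]},
    {"severities": [1], "start": time(18, 0), "end": time(9, 0), "channels": ["phone"]},
]


def in_window(start, end, t):
    """Return True if t falls in [start, end), allowing windows that wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def channels_for(severity, t, configs):
    """Collect the channels of every entry matching the severity and time of day."""
    selected = []
    for cfg in configs:
        if severity in cfg["severities"] and in_window(cfg["start"], cfg["end"], t):
            for channel in cfg["channels"]:
                if channel not in selected:
                    selected.append(channel)
    return selected


print(channels_for(1, time(10, 30), notify_configs))  # ['dingtalk', 'phone']
print(channels_for(1, time(23, 0), notify_configs))   # ['phone']
```

Writing the two windows as separate entries keeps each one simple to test, which mirrors the test-send and diff step the Skill guides you through.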

4. Query (View Data Without Changing Config)

n9e-query-alert-events — Query Alert Events

  • Trigger scenarios: view alerts / query active alerts / search historical alerts / view alert details / stats on alert events, e.g. “P1 alerts in the last 1 hour”, “details for alert ID 123”.
  • What it does: log in → call search_active_alerts / search_history_alerts / get_alert_event_detail etc.

n9e-query-datasource — Query Various Data Sources

  • Trigger scenarios: query metrics / view monitoring data / search logs / run PromQL or SQL queries.
  • Supported data sources: Prometheus / VictoriaMetrics / Elasticsearch / Loki / ClickHouse / MySQL / PostgreSQL / TDengine / Doris / OpenSearch / VictoriaLogs.
  • What it does: based on data source type, auto-reads the corresponding datasources/*.md to get the parameter format, then sends the query.

promql-generator — Natural Language → PromQL

  • Trigger scenarios: generate PromQL from a natural-language description.
  • Related tools: list_metrics (fuzzy search metric names by keyword), get_metric_labels (get label dimensions of a metric).
  • Workflow: understand intent → search metrics → get labels → assemble PromQL.
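
The workflow above can be sketched with stubbed tools. The functions below only share their names with list_metrics and get_metric_labels; their signatures, the sample catalog, and the helper `generate_promql` are assumptions, and the "understand intent" step is reduced to explicit arguments.

```python
# Stub catalog standing in for what the real tools would return.
CATALOG = {
    "cpu_usage_active": ["ident", "cpu"],
    "mem_used_percent": ["ident"],
    "disk_used_percent": ["ident", "path", "device"],
}


def list_metrics(keyword):
    """Stub: fuzzy search metric names by keyword."""
    return sorted(name for name in CATALOG if keyword in name)


def get_metric_labels(metric):
    """Stub: return the label dimensions of a metric."""
    return CATALOG[metric]


def generate_promql(keyword, group_by, agg="avg"):
    """Search metrics, check the label exists, then assemble the PromQL."""
    candidates = list_metrics(keyword)
    if not candidates:
        raise LookupError(f"no metric matches {keyword!r}")
    metric = candidates[0]
    if group_by not in get_metric_labels(metric):
        raise ValueError(f"{metric} has no label {group_by!r}")
    return f"{agg} by ({group_by}) ({metric})"


print(generate_promql("disk", "path", agg="max"))
# max by (path) (disk_used_percent)
```

Checking labels before assembling the query is what keeps the generated PromQL from grouping by a dimension the metric does not carry.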

sql-generator — Natural Language → SQL

  • Trigger scenarios: generate SQL from natural language; supports MySQL / Doris / ClickHouse / PostgreSQL.
  • Related tools: list_databases / list_tables / describe_table.
  • Workflow: understand intent → see databases → see tables → see fields → assemble SQL.
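
A comparable sketch for the SQL path walks databases, tables, and fields before any SQL is written. The stub schema, the stub signatures, and the helper `generate_count_sql` are hypothetical; only the tool names come from the list above.

```python
# Stub schema standing in for a real data source.
SCHEMA = {
    "ops": {
        "alert_events": ["id", "severity", "trigger_time"],
        "hosts": ["ident", "os"],
    },
}


def list_databases():
    return sorted(SCHEMA)


def list_tables(database):
    return sorted(SCHEMA[database])


def describe_table(database, table):
    return SCHEMA[database][table]


def generate_count_sql(group_column):
    """Find the first table that has the column, then build a grouped count."""
    for database in list_databases():
        for table in list_tables(database):
            if group_column in describe_table(database, table):
                return (
                    f"SELECT {group_column}, COUNT(*) AS n "
                    f"FROM {database}.{table} GROUP BY {group_column}"
                )
    raise LookupError(f"no table has column {group_column!r}")


print(generate_count_sql("severity"))
# SELECT severity, COUNT(*) AS n FROM ops.alert_events GROUP BY severity
```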

5. Troubleshooting / Diagnostics (From Symptom to Root Cause)

This group distills Nightingale's experience with high-frequency community issues. Each Skill builds the common experience-based misjudgments into its SOP as explicit checks, so that it avoids drawing a conclusion from a single piece of evidence.

ops-troubleshooting — Comprehensive Fault Localization (Alert → Root Cause)

  • Trigger scenarios: fault localization / alert triage / problem diagnosis / troubleshoot / root cause analysis / query metrics / query logs (the broadest entry).
  • What it does: multi-step diagnostics across alerts / rules / data sources / metrics / logs / dashboards / hosts / business groups; budget of 25 iterations, completing a full analysis in one round.
  • Difference from n9e-alert-rule-troubleshoot: this Skill starts from an alert that has fired and looks for the root cause; that one starts from an alert that should have fired but didn’t and traces the chain.

n9e-alert-rule-troubleshoot — Why Didn’t the Alert Fire

  • Trigger scenarios: “alert not fired”, “rule not triggered”, “rule not effective”, “should have alerted but didn’t”, “why didn’t I receive the alert”.
  • Chain covered: pull the rule → run PromQL → pull eval logs → find event hash → pull processing logs → cross-check mute rules → fall back to self-monitoring metrics.
  • Version requirement: Release 22 and above only.
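
The chain above is essentially an ordered list of checks in which the first stage that explains the missing alert ends the search. The sketch below shows that control flow only: the stage names follow the chain, while the check results are canned values and the function `first_broken_stage` is hypothetical.

```python
def first_broken_stage(stages):
    """Run checks in order and return the first stage that explains the missing alert."""
    for name, check in stages:
        ok, detail = check()
        if not ok:
            return name, detail
    return None, "all stages passed"


# Canned results for illustration; real checks would query rules, logs and metrics.
stages = [
    ("pull the rule", lambda: (True, "rule is enabled")),
    ("run PromQL", lambda: (True, "1 series above threshold")),
    ("pull eval logs", lambda: (True, "event generated")),
    ("cross-check mute rules", lambda: (False, "matched a mute rule for host=web01")),
    ("self-monitoring metrics", lambda: (True, "no eval errors")),
]

print(first_broken_stage(stages))
# ('cross-check mute rules', 'matched a mute rule for host=web01')
```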

n9e-host-health-diagnose — Comprehensive Host Loss Judgment

  • Trigger scenarios: why is this machine lost / is the host-loss alert a false positive / is categraf hung / why can I still ping the host when the heartbeat has stopped.
  • Core stance: agent loss ≠ host outage. Concluding “down” from just target_up==0 / BeatTime stopping is a high-frequency false positive source in the community (real agent crash / network partition / DNS / broken proxy / Redis write latency / active maintenance all look the same).
  • What it does: comprehensive multi-layer judgment of real outage / agent zombie / network jitter / under maintenance.
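
The "agent loss ≠ host outage" stance amounts to combining several signals before concluding anything. The sketch below is one possible decision order under assumptions: the signal names, the `HostSignals` type, and the rule order are hypothetical and are not the Skill's actual SOP.

```python
from dataclasses import dataclass


@dataclass
class HostSignals:
    heartbeat_stale: bool   # the agent heartbeat stopped updating
    ping_ok: bool           # the host still answers network probes
    in_maintenance: bool    # a maintenance window is active
    peers_also_stale: bool  # other hosts in the same segment lost heartbeat too


def judge(signals):
    """Classify a host-loss alert using several signals instead of a single one."""
    if not signals.heartbeat_stale:
        return "healthy"
    if signals.in_maintenance:
        return "under maintenance"
    if signals.peers_also_stale:
        return "network jitter or partition"
    if signals.ping_ok:
        return "agent zombie"
    return "likely real outage"


print(judge(HostSignals(heartbeat_stale=True, ping_ok=True, in_maintenance=False, peers_also_stale=False)))
# agent zombie
```

A stale heartbeat alone reaches "likely real outage" only after every other explanation has been ruled out.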

n9e-host-onboard-diagnose — Host Onboarding Failure Diagnosis

  • Trigger scenarios: a newly installed categraf is not visible in Nightingale / machine list OS shows unknown / Helm installed on 3 hosts but only 1 visible / agent fails to register / Windows agent installed but not showing.
  • Mutually exclusive with n9e-host-health-diagnose: this Skill handles “never onboarded in the first place”; that one handles “previously onboarded, currently lost”.
  • Core stance: a missing host is a symptom, not a single root cause: some segment of the onboarding chain is broken. Don’t just tell the user to set heartbeat.enable in categraf; common root causes also include omit_hostname / ident shell / TLS / token / edge redis / multi-cluster routing.

n9e-recommend-self-heal — Alert Semi-self-healing Recommendation

  • Trigger scenarios: open Copilot from the alert event detail or notification card, ask “can this alert self-heal”, “recommend a self-healing script”, “help me handle this”, “one-click fix”.
  • Product form: semi-self-healing — AI recommends → human confirms → system executes via ibex → write back task_record (linked by event_id), forming a closed loop.
  • This Skill only recommends: execution is via the frontend button calling the ibex API; for writing the script itself, use n9e-modify-task-tpl.
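
The closed loop described above (recommend → confirm → execute → write back) can be sketched as a small function in which nothing runs without human confirmation. The callables, the function `semi_self_heal`, and the record fields are assumptions; `execute` merely stands in for the frontend button calling the ibex API.

```python
def semi_self_heal(event_id, recommend, confirm, execute):
    """Recommend a script, run it only after confirmation, and return a task record."""
    script = recommend(event_id)
    if not confirm(script):
        # The human declined, so nothing is executed.
        return {"event_id": event_id, "script": script, "status": "rejected"}
    status = execute(script)
    # The record is linked to the alert by event_id (field names are illustrative).
    return {"event_id": event_id, "script": script, "status": status}


record = semi_self_heal(
    event_id=123,
    recommend=lambda event_id: "disk-cleanup",
    confirm=lambda script: True,
    execute=lambda script: "success",
)
print(record)
# {'event_id': 123, 'script': 'disk-cleanup', 'status': 'success'}
```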

Related reading

  • Skill Management: general Skill mechanism, creation methods, prompt writing
  • LLM Management: the underlying model configuration that Skill calls rely on
  • Anthropic Agent Skills spec: built-in Skills are fully compatible with this spec, so you can download/export them directly or upload community Skill packs