Complete tutorial for writing Nightingale v9 Skills: from frontmatter field reference, body structure templates, and supporting file usage, to packaging and uploading, debugging, plus an end-to-end practical case (writing a 'MySQL Slow Query Troubleshooting Runbook' Skill).
What This Doc Covers
Skill Management introduces the concept and operation flow; Built-in Skill Overview lists the 19 out-of-the-box Skills. This doc is a hands-on guide:
- when to use each frontmatter field and what to watch out for;
- how to structure the body, copying the mature templates from built-in Skills;
- how to use supporting files (scripts, reference data, sub-docs) to make a Skill more powerful;
- an end-to-end walkthrough of writing a “MySQL Slow Query Troubleshooting Runbook” Skill, from blank to live.
Prerequisite: first set up a default LLM in LLM Management, otherwise you won’t be able to test the Skill after writing it.
1. The Basic Form of a Skill
A Skill is a Markdown file with YAML frontmatter, with the fixed filename SKILL.md. Minimal form:
---
name: my-first-skill
description: Use when the user asks about XXX, covering YYY scenarios
---
# Body here
Your domain knowledge / operation guide / templates / examples ...
Adding supporting files makes it a Skill package:
my-first-skill/
├── SKILL.md # Main file (required)
├── runbooks/
│ ├── disk-full.md
│ └── oom.md
└── scripts/
└── collect-context.sh
Just package it as .zip or .tar.gz and upload.
2. Frontmatter Field Reference
Only name and description are required; every other field is optional.
Core Fields
| Field | Type | Description |
|---|---|---|
| name | string | Required. Unique identifier of the Skill. Use kebab-case English, e.g. mysql-slow-query-runbook, payment-incident-sop. Avoid accidental collisions with the 19 built-in Skills: a user Skill with the same name shadows the built-in, because Nightingale’s rule is “user Skill wins over built-in”. Reuse a built-in name only when you deliberately want to “override” that built-in. |
| description | string | Required, and the most important field. AI uses it to decide whether to load your Skill. The more specific and keyword-rich, the better; see the dedicated section below. Max 4096 chars. |
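The two required fields can be checked locally before uploading. The following is a minimal sketch, not Nightingale's own validation: the function name check_core_fields is hypothetical, kebab-case is treated as the recommended convention from the table above, and the 4096-character limit is the one stated there.

```python
import re

# Recommended naming convention from the table above (lowercase words joined by hyphens).
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
MAX_DESCRIPTION_CHARS = 4096  # limit stated in the table above


def check_core_fields(name, description):
    """Return a list of problems; an empty list means both fields look fine."""
    problems = []
    if not name:
        problems.append("name is required")
    elif not KEBAB_CASE.match(name):
        problems.append(f"name {name!r} is not kebab-case")
    if not description.strip():
        problems.append("description is required")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars, limit is {MAX_DESCRIPTION_CHARS}"
        )
    return problems


print(check_core_fields("mysql-slow-query-runbook", "Use when the user asks about MySQL slow queries"))
print(check_core_fields("My Skill", ""))
```

The server may apply different or additional checks; this only catches the obvious mistakes early.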
Trigger / Matching Enhancement Fields
| Field | Type | Use |
|---|---|---|
| tags | string array | Keyword tags, used by the A2A AgentCard discovery layer. Optional in SKILL.md; the discovery side falls back to a default. |
| examples | string array | Typical prompt examples. Shown to callers in the A2A AgentCard as a hint on “how to use this Skill”. |
Runtime Behavior Fields (Advanced)
| Field | Type | Use |
|---|---|---|
| builtin_tools | string array | Declares which built-in tools this Skill needs. The Agent attaches these tools when it loads the Skill, e.g. [list_metrics, get_metric_labels]. |
| recommended_tools | string array | Recommended tools (a soft suggestion). builtin_tools means “I need these”; recommended_tools means “having these would be better”. |
| max_iterations | int | Overrides the Agent’s default ReAct iteration cap (default 10). Multi-step Skills (creating a dashboard takes at least 7-10 steps) should set this to 20-25, otherwise they get truncated by the default cap. |
Metadata / Compliance Fields
| Field | Type | Use |
|---|---|---|
| license | string | License, e.g. MIT / Apache-2.0. Lets users understand the compliance boundary when sharing. |
| compatibility | string | Environment dependency description, e.g. "Requires git, docker" / "Requires internet". AI skips this Skill if the required capabilities aren’t available. |
| allowed-tools | string | Pre-authorized tool allowlist, separated by spaces, e.g. "Bash(git:*) Bash(docker:*) Read". Only tools listed here skip the secondary user confirmation; for any other tool, the Agent asks the user for consent on each call. |
| metadata | map | Custom key-value pairs for tagging the Skill with version, owner, etc., e.g. {owner: sre-team, version: 1.0}. |
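To make the allowed-tools format concrete, here is a small sketch that splits the space-separated string into (tool, pattern) pairs. The parser and its name parse_allowed_tools are illustrative assumptions; how Nightingale matches a pattern such as git:* against an actual tool call is not specified here and is deliberately left out.

```python
import re

# One entry is a tool name, optionally followed by a pattern in parentheses.
ENTRY = re.compile(r"^(?P<tool>[^()\s]+)(?:\((?P<pattern>[^()]*)\))?$")


def parse_allowed_tools(value):
    """Split an allowed-tools string into (tool, pattern) pairs; pattern is None if absent."""
    entries = []
    for token in value.split():
        match = ENTRY.match(token)
        if match is None:
            raise ValueError(f"unrecognized allowed-tools entry: {token!r}")
        entries.append((match.group("tool"), match.group("pattern")))
    return entries


print(parse_allowed_tools("Bash(git:*) Bash(docker:*) Read"))
```

Because entries are separated by spaces, a pattern that itself contains a space would be rejected by this sketch.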
Complete Example: Frontmatter of the Built-in n9e-create-alert-rule
---
name: n9e-create-alert-rule
description: Create alert rules on the Nightingale (n9e) platform. Supports all data source types, including Prometheus/Loki/ES/MySQL/TDengine/ClickHouse/Doris/Host.
max_iterations: 20
builtin_tools:
- create_alert_rule
- list_busi_groups
- list_datasources
- list_metrics
- list_notify_rules
- list_files
- read_file
- list_databases
- list_tables
- describe_table
---
Three key points:
- description states the topic in one sentence, then enumerates the scope: first the main scenario (create alert rule), then the supported data source types, so AI doesn’t misfire;
- max_iterations is set to 20 because creating an alert rule is multi-step (pick business group → pick data source → find metric → look at labels → read template → assemble payload → create), and 10 isn’t enough;
- builtin_tools separates “action tools” from “exploration tools”: create_alert_rule is the final write action; the others (list_* / read_file / describe_table etc.) are read-only auxiliary tools for AI to get metadata. Both kinds must be listed when writing the Skill, or the Agent can’t get the tools and has to guess.
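When reviewing a Skill, it can help to see the two kinds of tools side by side. The sketch below groups a builtin_tools list by name prefix. The prefix list is an assumption inferred from the tool names in the example above, not an official classification, so treat the result as a review aid only.

```python
# Assumed read-only prefixes, inferred from the tool names used in this doc.
READ_ONLY_PREFIXES = ("list_", "get_", "read_", "describe_", "search_", "query_")


def split_tools(builtin_tools):
    """Return (action_tools, exploration_tools) based on the name prefix."""
    action = [t for t in builtin_tools if not t.startswith(READ_ONLY_PREFIXES)]
    exploration = [t for t in builtin_tools if t.startswith(READ_ONLY_PREFIXES)]
    return action, exploration


tools = [
    "create_alert_rule", "list_busi_groups", "list_datasources", "list_metrics",
    "list_notify_rules", "list_files", "read_file", "list_databases",
    "list_tables", "describe_table",
]
action, exploration = split_tools(tools)
print("action:", action)
print("exploration:", exploration)
```

A Skill whose action list is non-empty but whose exploration list is empty is a hint that the Agent may have to guess the metadata it needs.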
3. Writing a Good description: Let AI Accurately Hit Your Skill
If you only remember one tip: the description field decides everything.
Why It Matters
Nightingale AI decides whether to use your Skill like this:
- user input → extract keywords;
- feed the keywords + the description of every enabled Skill to the LLM;
- the LLM picks the top 1-2 best-matching Skills to load into context.
If your description says “fault handling”, when the LLM sees “why is my machine slow”, the match confidence is extremely low, and your Skill won’t be loaded.
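The effect of a vague description can be illustrated with a toy model. The real selection is made by the LLM, not by counting words; the keyword-overlap score below is only a crude stand-in, and the stopword list and sample descriptions are made up for the illustration.

```python
import re

# Small, made-up stopword list so that filler words do not count as matches.
STOPWORDS = {"the", "a", "an", "is", "my", "why", "when", "or", "to", "of", "use", "user", "asks", "has"}


def keywords(text):
    """Lowercase the text and keep the words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if w not in STOPWORDS}


def overlap_score(question, description):
    """Number of keywords shared by the question and the description."""
    return len(keywords(question) & keywords(description))


question = "why is my machine slow"
vague = "Fault handling"
specific = "Use when the user asks why a machine or host is slow, has high load, or high CPU"

print(overlap_score(question, vague))
print(overlap_score(question, specific))
```

The vague description shares no keyword with the question, while the specific one shares "machine" and "slow". An LLM is far more forgiving than this, but the direction is the same: wording the user actually uses gives the matcher something to hold on to.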
4 description Writing Patterns
Pattern 1: Verb-driven + multi-language keywords
description: Helps the user determine whether a machine is truly down / agent zombie / network jitter / under maintenance.
Triggers when the user asks "why is this machine lost", "is the host loss alert a false positive",
"is categraf stuck", "why can I still ping when the heartbeat stopped", etc.
Key: verbs first (judge / lost / stuck), copy typical user wording directly (match what the user actually says).
Pattern 2: Scenario list + exclusion declaration
description: Create alert rules on the Nightingale platform. Supports Prometheus/Loki/ES/MySQL/TDengine/ClickHouse/
Doris/Host and all data source types.
Key: one sentence stating main scenario + enumerate coverage, to prevent AI from using it when it shouldn’t.
Pattern 3: English verbs + Chinese verbs full coverage
description: This skill should be used when the user asks to "troubleshoot",
"diagnose", "debug alert", "investigate incident", "故障定位", "告警排查",
"问题诊断", "排障", "查告警", "分析告警", "根因分析", or discusses
monitoring/alerting/observability issues...
Key: write both English and Chinese, team members may use either.
Pattern 4: Core stance + scope boundary
description: Helps users generate, modify, or troubleshoot Nightingale alert self-healing scripts. Use when the user asks
to "write a disk cleanup/restart service/clean log/dump process/reload nginx" type of self-healing script, or
asks "how does a self-healing script get parameters passed from the alert", "what's the stdin format",
"what should timeout be", etc. **This skill focuses on the script body layer** — if the user wants to change
alert rules, recipients, or notification templates, redirect to the corresponding skill.
Key: write the boundary explicitly (“focus on XX layer”, “doesn’t involve YY”) to avoid AI misfiring in inappropriate scenarios.
How Not to Write description
❌ description: Fault handling
❌ description: This is a useful skill
❌ description: Helps ops people
❌ description: SOP for ops team
These vague, abstract descriptions with no concrete keywords will almost never match.
4. Body Structure Template: Just Copy It
Across the 19 built-in Skills, almost every “heavy” Skill follows this template. Copy it and you’ll write a high-quality Skill.
---
name: <your skill name>
description: <per the four patterns above>
builtin_tools:
- <tools you'll use>
---
# Skill: <human-readable title of this Skill>
## Scope
**Use this skill for**:
- User wording scenario 1
- User wording scenario 2
- User wording scenario 3
**Don't use this skill for**:
- Confusable but should-not-route scenario 1 → defer to `xxx-skill`
- Confusable but should-not-route scenario 2 → defer to `yyy-skill`
## One-line Principle
<The single most important stance, e.g. "drawing conclusions from a single layer of evidence
is almost always wrong". The model reads this and is primed; all subsequent steps revolve around this principle.>
## Workflow
### Step 1: <action name>
<How to do it specifically, what tool to call, what fields to look at, how to judge>
### Step 2: <action name>
<How to do it specifically ...>
### Step 3: <action name>
<How to do it specifically ...>
## Output Format
<Explicitly tell the model what the Final Answer looks like, ideally provide a Markdown skeleton>
## Notes / Red Lines
1. **<What not to do>**: reason is ...
2. **<What you must do>**: reason is ...
## Example
### User Input
<a real user wording>
### Workflow
1. <what was done step by step>
2. <what tool was called, what was the result>
### Output
<what the final reply looks like>
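A quick way to see whether a draft follows the template is to list which of its second-level sections are missing. This is a sketch under the assumption that you keep the template's heading names; they are a convention rather than a requirement (the MySQL example later in this doc names its workflow section differently), and the helper name missing_sections is hypothetical.

```python
import re

# Second-level headings used by the template above.
EXPECTED_SECTIONS = [
    "Scope", "One-line Principle", "Workflow",
    "Output Format", "Notes / Red Lines", "Example",
]


def missing_sections(body):
    """Return the template sections that do not appear as '## ' headings in body."""
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^## +(.+)$", body, flags=re.MULTILINE)
    }
    return [section for section in EXPECTED_SECTIONS if section not in headings]


draft = """# Skill: Demo
## Scope
Use this skill for ...
## Workflow
### Step 1: collect context
"""
print(missing_sections(draft))
```

Third-level headings such as "### Step 1" are ignored on purpose, since the pattern requires a space right after the two hash marks.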
Five Writing Mindsets
- Lead with one sentence: use the most concise wording to explain what this Skill does and when to use it.
- Structured: use ## subtasks to split big tasks; under each subtask, use lists/code blocks for steps. The clearer the Markdown structure, the more consistently the model executes.
- Less “I suggest”, more imperative: the model mimics your tone. “Run top -c and screenshot the first 10 lines” is more deterministic than “you could try top”.
- Explicit boundary: state “what not to do” clearly, e.g. “do not modify production config, only output suggested commands”.
- Provide examples: a concrete “input → output” example beats a thousand words.
5. Supporting Files: Make Your Skill More Powerful
A Skill package isn’t just one SKILL.md. Any supporting file can be placed under the root, and the body references them via relative paths; AI reads them on demand via read_file / list_files / grep_files.
Use 1: Per-scenario Templates
The built-in n9e-create-alert-rule Skill is a good example:
n9e-create-alert-rule/
├── SKILL.md
└── datasources/
├── clickhouse.md
├── doris.md
├── elasticsearch.md
├── host.md
├── loki.md
├── mysql.md
├── pgsql.md
├── prometheus.md
├── tdengine.md
└── victorialogs.md
In SKILL.md:
General path — pass `cate` + `rule_config_json`. For `rule_config_json` structure,
**first use `read_file` to read `datasources/<cate>.md` for the template**, then fill in actual values.
This way SKILL.md doesn’t need to embed template details for all 10 data sources; the Agent reads the relevant sub-doc on demand. SKILL.md is limited to 64KB and any single file to 16MB, so move large content into supporting files.
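The on-demand pattern amounts to mapping a data source category to one relative path and reading only that file. The sketch below models the package as a plain dictionary; the helper names template_path and load_template are hypothetical, and the real read happens through the Agent's read_file tool rather than through code like this.

```python
from pathlib import PurePosixPath


def template_path(cate):
    """Map a data source category to its sub-doc path inside the Skill package."""
    if not cate or "/" in cate or "\\" in cate or cate in {".", ".."}:
        raise ValueError(f"invalid data source category: {cate!r}")
    return str(PurePosixPath("datasources") / f"{cate}.md")


def load_template(package_files, cate):
    """Return the content of the one sub-doc that matches cate."""
    path = template_path(cate)
    if path not in package_files:
        raise FileNotFoundError(path)
    return package_files[path]


# A made-up, in-memory stand-in for the package contents.
package_files = {
    "SKILL.md": "General path ...",
    "datasources/mysql.md": "rule_config_json template for MySQL",
    "datasources/loki.md": "rule_config_json template for Loki",
}
print(template_path("mysql"))
print(load_template(package_files, "mysql"))
```

Only the MySQL template enters the context in this example; the Loki one stays on disk until a Loki rule is requested.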
Use 2: Script Snippets
If your Skill is “write self-healing scripts”, store common script snippets as separate files:
disk-cleanup-skill/
├── SKILL.md
└── scripts/
├── clean-logs.sh
├── clean-docker.sh
└── clean-yum-cache.sh
SKILL.md guidance: “Based on user scenario, first read_file scripts/<scenario>.sh as the skeleton, then modify as needed.”
Use 3: Reference Data / Cheat Sheet
promql-skill/
├── SKILL.md
└── cheatsheet/
├── common-queries.md # Common PromQL templates
└── label-conventions.md # Internal label naming conventions
Packaging Limits
- Up to 100 files per archive
- Total uncompressed size ≤ 50MB
- Single file ≤ 16MB
- SKILL.md itself ≤ 64KB
- Not allowed: symlinks, absolute paths, .. path traversal
- Auto-filtered: .DS_Store, ._*, __MACOSX/ and other system noise
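These limits can be checked locally before uploading. The following is a rough pre-flight sketch for .zip archives, not the server's actual validation: the function names are hypothetical, and treating 1MB as 1024 × 1024 bytes is an assumption.

```python
import io
import stat
import zipfile
from pathlib import PurePosixPath

# Limits from the list above; treating 1MB as 1024*1024 bytes is an assumption.
MAX_FILES = 100
MAX_TOTAL_BYTES = 50 * 1024 * 1024
MAX_FILE_BYTES = 16 * 1024 * 1024
MAX_SKILL_MD_BYTES = 64 * 1024


def is_noise(name):
    """True for system noise that gets filtered out anyway."""
    parts = PurePosixPath(name).parts
    return "__MACOSX" in parts or parts[-1] == ".DS_Store" or parts[-1].startswith("._")


def check_archive(zip_bytes):
    """Return a list of limit violations; an empty list means the archive looks fine."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        files = [i for i in archive.infolist() if not i.is_dir() and not is_noise(i.filename)]

    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files, limit is {MAX_FILES}")
    if sum(i.file_size for i in files) > MAX_TOTAL_BYTES:
        problems.append("total uncompressed size exceeds 50MB")
    for info in files:
        path = PurePosixPath(info.filename)
        if path.is_absolute() or ".." in path.parts:
            problems.append(f"{info.filename}: absolute path or path traversal")
        if stat.S_ISLNK(info.external_attr >> 16):
            problems.append(f"{info.filename}: symlink")
        if info.file_size > MAX_FILE_BYTES:
            problems.append(f"{info.filename}: larger than 16MB")
        if path.name == "SKILL.md" and info.file_size > MAX_SKILL_MD_BYTES:
            problems.append(f"{info.filename}: SKILL.md larger than 64KB")
    if not any(PurePosixPath(i.filename).name == "SKILL.md" for i in files):
        problems.append("SKILL.md not found")
    return problems


def make_zip(files):
    """Build an in-memory zip from {name: text}; used here only for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


good = make_zip({"SKILL.md": "---\nname: demo\n---\n", "runbooks/oom.md": "steps", ".DS_Store": "noise"})
bad = make_zip({"../evil.txt": "x"})
print(check_archive(good))
print(check_archive(bad))
```

The first archive passes because .DS_Store is ignored as noise; the second is reported for path traversal and for the missing SKILL.md.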
6. Packaging & Upload
Method 1: Online Create (Simple Skills)
Use this when the Skill is a single, short SKILL.md with no supporting files.
Steps:
- Go to AI Config → Skill Management, click + → Online Create Skill;
- Fill in the form: name, description, prompt instructions (the body of SKILL.md), and, under advanced settings, license / compatibility / pre-authorized tools / metadata;
- Save.
See Skill Management → Online Create.
Method 2: Local Upload (Recommended)
Use this when the Skill has supporting files, is version-managed, or is distributed across teams/instances.
Steps:
- Organize the local directory per the “basic form”:
my-skill/
├── SKILL.md
└── (other supporting files)
- Package as .zip or .tar.gz:
# zip
cd my-skill && zip -r ../my-skill.zip . -x "*.DS_Store" "__MACOSX/*"
# or tar.gz
tar czf my-skill.tar.gz -C parent_dir my-skill
A top-level directory is allowed (e.g. my-skill/SKILL.md), or SKILL.md can sit directly at the package root. Both work.
- Go to AI Config → Skill Management, click + → Local Upload Skill, and select the file to upload.
- After upload, the Skill appears in the list, enabled by default.
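If zip or tar is not at hand, the packaging step can also be scripted. The sketch below is one possible way to do it with Python's standard zipfile module; the function name package_skill is hypothetical. It puts SKILL.md at the archive root and skips the same system noise the zip command above excludes.

```python
import tempfile
import zipfile
from pathlib import Path


def is_noise(rel):
    """System noise that should not be shipped in the archive."""
    return rel.name == ".DS_Store" or rel.name.startswith("._") or "__MACOSX" in rel.parts


def package_skill(skill_dir, archive_path):
    """Zip skill_dir with SKILL.md at the archive root; return the packaged names."""
    skill_dir = Path(skill_dir)
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"{skill_dir} has no SKILL.md")
    included = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(skill_dir.rglob("*")):
            rel = path.relative_to(skill_dir)
            # Symlinks are not allowed in a Skill package, so they are skipped too.
            if path.is_symlink() or not path.is_file() or is_noise(rel):
                continue
            archive.write(path, rel.as_posix())
            included.append(rel.as_posix())
    return included


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "my-skill"
    (skill / "scripts").mkdir(parents=True)
    (skill / "SKILL.md").write_text("---\nname: my-skill\ndescription: demo\n---\n", encoding="utf-8")
    (skill / "scripts" / "collect-context.sh").write_text("#!/bin/sh\n", encoding="utf-8")
    (skill / ".DS_Store").write_text("noise", encoding="utf-8")
    included = package_skill(skill, Path(tmp) / "my-skill.zip")
    print(included)
```

Write the archive outside the Skill directory, as the demo does, so the archive never ends up inside itself.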
Replace / Download / Delete
- Replace: in the detail page three-dot menu, select “Replace” to upload a new archive; fully overwrites the current Skill (including all supporting files).
- Download: export the current Skill as a zip. This is the standard action for intra-team distribution and version backup — polish a Skill on one instance, download it, upload to another.
- Delete: after secondary confirmation, all associated files are cascade-deleted, unrecoverable. Recommend “downloading” a backup before deleting.
7. Debugging a Skill: How Do I Know It’s Working?
The most common awkwardness after writing a Skill is “I enabled it, but AI doesn’t seem to be using it”. Troubleshooting flow:
Step 1: description Match Check
Ask something that should hit your Skill, and see if AI’s opening references your domain knowledge / uses your defined terms / follows your workflow.
- No reference, vague answer → description didn’t match, go back and make description more specific;
- Referenced but didn’t follow your method → body isn’t explicit enough, write the workflow more imperatively;
- Answering something completely different → user input deviates too much from description, possibly description keywords are mismatched.
Step 2: Manually Specify the Skill to Verify
In the chat box, say directly “use the <your skill name> skill to help me handle XXX”. If results are good when manually specified but auto-match doesn’t work — the problem is in description, not the body.
Step 3: See If Other Skills Grabbed Your Work
If the user’s question was routed to another Skill (you can usually tell which built-in Skill from the tone of the answer), that Skill’s description is more specific than yours. Two ways out:
- Write the boundary: add “does not involve XX” in your Skill’s description, kicking out confusing scenarios;
- Rename + disable conflicting Skill: in extreme cases, name your Skill the same as that built-in Skill, letting your user version override the built-in.
Step 4: max_iterations Not Enough
If AI stops midway without giving a complete answer, check whether the run hit the iteration cap on tool calls. If so, raise max_iterations to 20-25.
Step 5: Missing builtin_tools
AI replies “I can’t query this” — most likely builtin_tools doesn’t list the corresponding tool. Add it back in frontmatter.
8. End-to-End Hands-on: Writing a “MySQL Slow Query Troubleshooting Runbook” Skill
Below is a real Skill walked through from blank to live.
Scenario
Your team receives MySQL slow query alerts every week. The senior DBA’s troubleshooting actions are fixed: pull Top 10 slow logs → look at EXPLAIN → look at SHOW PROCESSLIST for locks → look at buffer pool hit rate. But new hires receiving the alert only say “try restarting mysql”.
Write the DBA’s troubleshooting actions as a Skill.
Step 1: Make a Directory
mkdir -p mysql-slow-query-runbook/queries
cd mysql-slow-query-runbook
Step 2: Write SKILL.md
---
name: mysql-slow-query-runbook
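description: MySQL slow query alert troubleshooting Runbook. Use when the user asks "MySQL is slow /
  how to handle a slow query alert / DB response is slow / qps up and latency higher / lock waiting /
  buffer pool hit rate", etc. Covers slow log Top 10, EXPLAIN interpretation, ProcessList lock analysis,
  and innodb_buffer_pool hit rate checks. **Does not cover** MySQL deployment/initialization (defer to
  `agent_deploy_guide`) or writing new alert rules (defer to `n9e-create-alert-rule`).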
max_iterations: 20
builtin_tools:
- list_databases
- list_tables
- describe_table
- query_log
- search_active_alerts
- get_alert_event_detail
- read_file
metadata:
owner: dba-team
version: 1.0
---
# MySQL Slow Query Troubleshooting Runbook
## Scope
**Use this skill for**:
- User alert labels include `service=mysql` or `category=db` slow query/latency alerts
- User wording: "MySQL is slow", "DB response is slow", "more slow queries", "lock waiting", "high qps and high latency"
**Don't use this skill for**:
- MySQL install / initialization → defer to `agent_deploy_guide`
- Want to add a new alert rule → defer to `n9e-create-alert-rule`
- Non-MySQL databases (PostgreSQL / TDengine etc.) → reply that the user's current DB type is not in this Runbook's scope
## One-line Principle
**Look at phenomena first, then resources (pool and locks), and code-layer SQL last.** Don't ask the user to EXPLAIN right away: 80% of "slow" comes down to pool or locks.
## Four-step Troubleshooting
### Step 1: Phenomena — use `query_log` to pull Top 10 slow logs
Read `queries/top10_slow_queries.sql` as the skeleton, replacing the `WHERE` clause based on the user event's time window.
Focus on queries with `query_time > 1s`, sort by `count * avg_query_time` to get Top 10.
If Top 10 cluster on the same table → likely an index/lock issue on that table;
Scattered across tables → leans toward resource bottleneck (IO / CPU / pool).
### Step 2: Pool — `innodb_buffer_pool` hit rate
Read `queries/buffer_pool_hit_rate.sql`, focus on:
- Hit rate = 1 - `Innodb_buffer_pool_reads` / `Innodb_buffer_pool_read_requests`
- Hit rate < 99% → pool config too small or cold data access ratio high
- Hit rate ≥ 99% but slow queries still many → go to step 3
### Step 3: Locks — `SHOW PROCESSLIST`
Read `queries/processlist.sql`, focus on:
- `State` column contains "Waiting for table metadata lock" / "Waiting for row lock" → DDL or long transaction stuck
- `Command` is `Sleep` and `Time` > 30s → application connection pool leak
### Step 4: Still not localized — only then ask user to EXPLAIN
Run `EXPLAIN` on the Top 1 slow query from step 1, look at:
- `type=ALL` → full table scan, missing index;
- `type=index` but `rows` is large → low index selectivity, need composite index;
- `Extra=Using filesort/temporary` → ORDER BY / GROUP BY can't use index.
## Output Format
Final reply in Markdown, structure:
## Conclusion
<One-line: pool small / lock waiting / missing index / full table scan / other>
## Key Evidence
- Slow log Top 10: ...
- Pool hit rate: ...
- Locks / long transactions: ...
- EXPLAIN key fields: ...
## Recommended Actions
1-3 concrete actionable items, each with impact scope and rollback method
## False-positive Risk
Note in what cases this conclusion may be wrong
## Notes / Red Lines
1. **Don't proactively modify production my.cnf** — only output suggested changes + backup commands, let the user execute.
2. **Don't suggest `kill <thread_id>` directly** — first explain what this connection is doing, let the user judge.
3. **Caution with `OPTIMIZE TABLE`** — large tables will lock; evaluate if low-peak window first.
## Example
### User Input
"Just got a MySQL slow query alert, order_db P99 latency jumped from 20ms to 800ms"
### Workflow
1. Step 1: `query_log` pulls order_db slow log Top 10 in last 10 minutes
2. Find Top 1 is `SELECT * FROM orders WHERE user_id=?` ran 9000 times, avg 600ms
3. Step 2: Pool hit rate 99.7%, rule out pool issue
4. Step 4: `EXPLAIN` shows `type=ALL`, `rows=12000000`, missing `user_id` index
5. Give conclusion + index DDL + online DDL risk assessment
Step 3: Write the Supporting SQL Files
queries/top10_slow_queries.sql:
SELECT
DIGEST_TEXT AS query,
COUNT_STAR AS exec_count,
AVG_TIMER_WAIT/1e9 AS avg_ms,
SUM_TIMER_WAIT/1e9 AS total_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE LAST_SEEN > NOW() - INTERVAL 10 MINUTE
ORDER BY total_ms DESC
LIMIT 10;
queries/buffer_pool_hit_rate.sql:
SHOW STATUS WHERE Variable_name IN (
'Innodb_buffer_pool_read_requests',
'Innodb_buffer_pool_reads',
'Innodb_buffer_pool_pages_total',
'Innodb_buffer_pool_pages_free'
);
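The two counters returned by this query are turned into a hit rate as follows. Innodb_buffer_pool_read_requests counts logical read requests and Innodb_buffer_pool_reads counts the reads that had to go to disk, so the hit rate is one minus their quotient. The counter values below are made up for illustration, and the function name is hypothetical.

```python
def buffer_pool_hit_rate(read_requests, disk_reads):
    """Hit rate = 1 - disk_reads / read_requests, as a fraction between 0 and 1."""
    if read_requests <= 0:
        raise ValueError("read_requests must be positive")
    return 1.0 - disk_reads / read_requests


# Made-up sample values, shaped like the SHOW STATUS output.
status = {
    "Innodb_buffer_pool_read_requests": 1_000_000,
    "Innodb_buffer_pool_reads": 3_000,
}
rate = buffer_pool_hit_rate(
    status["Innodb_buffer_pool_read_requests"],
    status["Innodb_buffer_pool_reads"],
)
print(f"{rate:.1%}")
```

Both counters accumulate since server start, so on a long-running instance it is more telling to take two samples and compute the rate from the differences.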
queries/processlist.sql:
-- Keep Sleep rows: Step 3 checks long-idle connections as well as long-running statements
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE time > 30
ORDER BY time DESC;
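The Step 3 rules of the Runbook can be read as a small classifier over processlist rows. The sketch below is only an illustration of those rules: the row dictionaries are made-up samples using the column names selected above, the 30-second threshold and the lock-wait strings are taken from Step 3, and the labels are hypothetical.

```python
# State strings that Step 3 of the Runbook treats as lock waits.
LOCK_WAIT_MARKERS = ("Waiting for table metadata lock", "Waiting for row lock")


def classify(row, threshold_s=30):
    """Label one processlist row according to the Step 3 rules."""
    state = row.get("state") or ""
    if any(marker in state for marker in LOCK_WAIT_MARKERS):
        return "lock-wait"
    if row["command"] == "Sleep" and row["time"] > threshold_s:
        return "idle-connection"
    if row["time"] > threshold_s:
        return "long-running"
    return "ok"


# Made-up sample rows.
rows = [
    {"id": 1, "command": "Query", "time": 45, "state": "Waiting for table metadata lock"},
    {"id": 2, "command": "Sleep", "time": 600, "state": ""},
    {"id": 3, "command": "Query", "time": 2, "state": "executing"},
    {"id": 4, "command": "Query", "time": 120, "state": "Sending data"},
]
labels = [classify(row) for row in rows]
print(labels)
```

Lock waits are checked first, so a statement stuck on a metadata lock is reported as such even when it has also been running for a long time.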
Step 4: Package
cd mysql-slow-query-runbook
zip -r ../mysql-slow-query-runbook.zip . -x "*.DS_Store"
Or, from the parent directory (run cd .. first):
tar czf mysql-slow-query-runbook.tar.gz mysql-slow-query-runbook
Step 5: Upload + Test
- Go to AI Config → Skill Management, click + → Local Upload Skill, select mysql-slow-query-runbook.zip;
- The Skill is enabled automatically after upload;
- Open Nightingale AI at the top right, ask something that should hit:
Just got a MySQL slow query alert, order_db latency went up, help me check
- AI should follow the 4-step Runbook: pull slow log, look at pool, look at locks, only then EXPLAIN.
If it didn’t follow the Runbook, refer to “Section 7: Debugging a Skill”.
Step 6: Distribute to the Team
Once you are satisfied with the Skill, download the archive from the Skill detail page and commit it to the team’s Git repository under nightingale-skills/, so the team’s Runbooks are versioned as a shared asset. Other Nightingale instances can upload the same package and use it directly.
9. Skill Sources: Where Else Can You Get Skills Besides Writing Your Own
Nightingale’s Skill package format is fully compatible with the Anthropic Agent Skills spec — same root SKILL.md + YAML frontmatter. This means:
- Use the community Skill library directly — many open-source Skills in the anthropics/skills repo can be packaged and uploaded as-is.
- Reuse Skills from Claude / Cursor and other AI clients — as long as it’s in Anthropic Skills format, Nightingale can recognize it.
- Migrate across instances — polish on the dev instance, export it, upload to prod instance.
10. FAQ
Q1: What happens with too many Skills?
The model picks the top 1-2 most relevant Skills from the enabled list to load. There is no hard limit on the number of Skills, but the model has to read every enabled Skill’s description to do the matching. With up to about 30 enabled Skills there is generally no noticeable overhead. Above that:
- merge Skills on the same topic, use ## sub-sections to divide;
- use the compatibility field to express prerequisites, letting AI skip when not met;
- disable rarely-used Skills (enable switch).
Q2: Can a Skill be “always” loaded?
Yes: list it explicitly in SkillConfig.SkillNames in the Agent config. This is not recommended, though, because it makes the context heavier and affects every Q&A. The better way is still to write the description well and let AI load the Skill when appropriate.
Q3: Can supporting files make network calls / execute scripts?
No. Supporting files are materials AI reads via read_file as context — they have no execution rights. If you want the Skill to actually “do things”, rely on the tools attached via builtin_tools (these are Nightingale-controlled built-in tools).
Q4: Can I put encrypted sensitive info (like DB passwords) in a Skill?
Don’t. The full Skill text enters the model context, equivalent to uploading to your connected LLM. All keys / passwords / tokens must be injected from the tool layer (Nightingale’s tools themselves use the current user’s session credentials), and SKILL.md should only have parameter name placeholders.
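A simple scan before packaging can catch the most obvious slips. The following is a heuristic sketch only: the key names, the placeholder styles it accepts, and the function name find_hardcoded_secrets are all assumptions, and a clean result does not prove that a file is free of secrets.

```python
import re

# Key names that usually precede a credential (an assumed, non-exhaustive list).
SECRET_PATTERN = re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b\s*[:=]\s*(\S+)")
# Values that look like placeholders: <NAME>, ${NAME} or {{NAME}}.
PLACEHOLDER = re.compile(r"^(<[^>]+>|\$\{[^}]+\}|\{\{[^}]+\}\})$")


def find_hardcoded_secrets(text):
    """Return (line_number, key) for each key whose value is not a placeholder."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SECRET_PATTERN.finditer(line):
            value = match.group(2).strip("\"'`")
            if value and not PLACEHOLDER.match(value):
                hits.append((lineno, match.group(1).lower()))
    return hits


sample = "mysql -u app --password=<DB_PASSWORD>\ntoken: abc123\n"
print(find_hardcoded_secrets(sample))
```

The first line uses a placeholder and passes; the second line is reported because its value looks like a real token.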
Q5: What IDE has the best experience for writing Skills?
Any editor that handles Markdown + YAML works. Recommended: VSCode + YAML plugin, with frontmatter syntax validation.
Related Docs
- Skill Management — basic concepts and product operations
- Built-in Skill Overview — the 19 built-in Skills are the best writing reference
- LLM Management — the underlying model Skills need to run
- Nightingale AI Overview — understanding Skill’s position in the whole AI stack from a product view
- Anthropic Agent Skills spec — Skill package format reference
- anthropics/skills — community Skill repo