- 快猫星云Flashcat

This article explains the architectural design of Nightingale, including the centralized cluster design and the edge data center alert-engine deployment mode.

Nightingale has a simple architecture. To try it out, a single binary is enough. Production deployments additionally depend on MySQL and Redis. Some companies run multiple data centers, and some edge data centers have a poor network link to the central one; Nightingale has a dedicated deployment mode for that scenario as well.

Architecture Diagram

Leaving edge mode aside, Nightingale has only one main process, n9e. It relies on MySQL and Redis to store management data and can connect to multiple data sources. The technical architecture diagram is shown below:

Nightingale Architecture

The data sources supported from the start (Prometheus, VictoriaMetrics, and ElasticSearch) work for both visualization and alerting. Nightingale later shifted its focus to the alert engine, so data sources added since then support alerting only.

Depending on whether monitoring data flows through Nightingale, there are two modes:

Mode 1: monitoring data does not flow through Nightingale. The user handles data collection independently and only registers the TSDB in Nightingale as a data source for visualization and alert configuration. Because Categraf is not deployed, the machine list is empty and machine grouping and self-healing are unavailable, but Prometheus-like alert rule configuration still works. The architecture diagram above is the typical Mode 1.
Mode 2: monitoring data flows through Nightingale. Categraf pushes data to Nightingale via the remote write protocol. Nightingale does not store the data itself but forwards it to a TSDB. Which TSDBs it forwards to is determined by Pushgw.Writers in the Nightingale config.toml. The Mode 2 architecture diagram is:

Nightingale Data-Flow Architecture

In the diagram above, Nightingale receives monitoring data and forwards it to VictoriaMetrics. It can also forward to Prometheus. In that case, start Prometheus with its remote receiver feature enabled so that the /api/v1/write interface is available (./prometheus --help | grep receiver shows the specific control parameter).

🟢 For new users, we recommend using VictoriaMetrics directly. VictoriaMetrics has better performance, supports cluster mode, and is Prometheus-API-compatible. However, Chinese documentation for VictoriaMetrics is somewhat sparser than for Prometheus.

Single-Node Test Mode

Download the release package from GitHub Releases. After extracting it, you will find an n9e binary; run it with ./n9e. The default port is 17000, the default username is root, and the default password is root.2020.

The n9e process only depends on the etc and integrations directories at the same level as the binary, with no other service dependencies.

This single-node mode is good for quick testing but not recommended for production. In this mode, Nightingale stores configuration data (user info, alert rules, dashboards, etc.) in a local SQLite database file. After n9e starts, an n9e.db SQLite file is created in the same directory.

Single-Node Production Mode

Production deployments depend on MySQL and Redis, so configure the MySQL and Redis connection info in etc/config.toml.

Key MySQL config example:

[DB]
DBType = "mysql"
DSN = "root:YourPa55word@tcp(localhost:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local"

The DSN (connection string) has the format username:password@tcp(host:port)/database?params. n9e_v6 is the Nightingale database name. The name dates from V6, and the table-creation statements have kept using it even though Nightingale is now at V8+.
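
To make the DSN layout concrete, here is a minimal sketch that splits a DSN of the form shown above into its parts. It is only a sanity-check helper for the tcp form used in this example. The names parse_dsn and DSN_PATTERN are made up for illustration, and this is not how n9e or its database driver parses the string.

```python
import re
from urllib.parse import parse_qs

# Pattern for username:password@tcp(host:port)/database?params
# (hypothetical helper; covers only the tcp form used in this article)
DSN_PATTERN = re.compile(
    r"^(?P<user>[^:@]+):(?P<password>.*)@tcp\((?P<host>[^:()]+):(?P<port>\d+)\)"
    r"/(?P<database>[^?/]+)(?:\?(?P<params>.*))?$"
)


def parse_dsn(dsn):
    """Split a DSN into its named parts; raise ValueError if it is malformed."""
    match = DSN_PATTERN.match(dsn)
    if match is None:
        raise ValueError("expected username:password@tcp(host:port)/database?params")
    parts = match.groupdict()
    parts["port"] = int(parts["port"])
    # parse_qs returns a list per key; keep only the first value of each.
    query = parts["params"] or ""
    parts["params"] = {key: values[0] for key, values in parse_qs(query).items()}
    return parts


if __name__ == "__main__":
    example = "root:YourPa55word@tcp(localhost:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local"
    for key, value in parse_dsn(example).items():
        print(key, "=", value)
```

Running a check like this before starting n9e can catch a misplaced parenthesis or a missing database name early.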

Key Redis config example:

[Redis]
Address = "127.0.0.1:6379"
RedisType = "standalone"

The above is just a basic example. The file contains many other settings; see the comments in the file or refer to the configuration reference.

Nightingale Cluster

Cluster mode is simple. Set up multiple machines and deploy the n9e process on each (the process needs its etc and integrations directories to work normally). Make sure all n9e config files are identical and that every instance shares the same MySQL and Redis.

Multiple n9e processes distribute alert rules among themselves automatically. For example, with 2 n9e processes and 100 configured alert rules, Nightingale assigns roughly 50 rules to each process. Each rule runs on only one n9e instance, so there are no duplicate alerts. If one machine dies, the other takes over its alert rules and continues working.
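
The article does not describe the algorithm n9e uses for this distribution. As an illustration only, the sketch below uses rendezvous (highest-random-weight) hashing, one simple technique with the same observable properties: each rule has exactly one owner, and when an instance disappears only its rules move. The instance names and the functions owner and assign are hypothetical.

```python
import hashlib


def owner(rule_id, instances):
    """Pick the instance with the highest hash score for this rule."""
    def score(instance):
        digest = hashlib.sha256(f"{instance}/{rule_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(instances, key=score)


def assign(rule_ids, instances):
    """Map each instance name to the list of rule IDs it evaluates."""
    assignment = {instance: [] for instance in instances}
    for rule_id in rule_ids:
        assignment[owner(rule_id, instances)].append(rule_id)
    return assignment


if __name__ == "__main__":
    rules = list(range(1, 101))  # 100 hypothetical rule IDs
    both = assign(rules, ["n9e-01", "n9e-02"])
    print({name: len(ids) for name, ids in both.items()})
    # Simulate n9e-01 dying: the survivor now owns every rule.
    survivor = assign(rules, ["n9e-02"])
    print({name: len(ids) for name, ids in survivor.items()})
```

Because each instance can compute the owner of a rule from the shared list of live instances, no rule is evaluated twice and no central scheduler is needed in this sketch.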

Edge Mode

The modes above are all centralized, but real production environments may span multiple data centers, and the network between the central data center and an edge data center may be poor. If the central n9e is responsible for alerting on a TSDB in an edge data center, alerting becomes unstable, and sometimes n9e cannot connect to the edge TSDB at all. For this case, use the Nightingale edge data center alert-engine deployment mode.

Nightingale Edge Architecture

Suppose your company has 3 data centers: the central main DC, edge DC A, and edge DC B. Edge DC A is connected to the central DC by a dedicated line with good network quality. Edge DC B has no dedicated line to the central DC, only the public internet, so that link is unreliable.

The n9e process is deployed in the central main DC. n9e depends on MySQL and Redis, so they are also deployed in the central main DC. For high availability, you can deploy multiple n9e instances in the central DC with identical config files, all connecting to the same MySQL and Redis.

In the diagram above, we have 5 data sources:

Central DC has one Loki and one ElasticSearch
Edge DC A has one ElasticSearch and one Prometheus
Edge DC B has one VictoriaMetrics

We want to view data from all 5 data sources in the central n9e, so we configure all 5 data source URLs into Nightingale, in the menu: Integration Center - Data Sources.

The central n9e can directly reach the data sources in the central DC and edge DC A via intranet addresses, but cannot directly reach the data source in edge DC B (no dedicated line). So we must expose edge DC B’s VictoriaMetrics via a public address, and the central n9e accesses it via that public address. That is:

VictoriaMetrics in DC B exposes a public address, e.g., https://ex.a.com
When configuring the data source in Nightingale’s WebUI, set the VictoriaMetrics URL to https://ex.a.com

Lines 1, 2, 3, 4, and 5 in the diagram represent the connections from the central n9e to the 5 data sources. When a user queries data, the request goes to n9e’s web UI, then to the n9e process. n9e acts as a proxy and forwards the request to the backend data sources, then returns the data to the user.

n9e-edge is deployed in edge DC B to handle alert evaluation for DC B’s VictoriaMetrics. n9e-edge syncs alert rules from the central n9e (line A in the diagram), caches them in memory, and performs alert evaluation on the local DC’s VictoriaMetrics. With this architecture, n9e-edge and VictoriaMetrics are connected via intranet, so alerting is reliable. Even if n9e-edge cannot reach the central n9e, it doesn’t affect alerting in DC B because the alert rules are already cached in memory.
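
The caching behaviour described above can be sketched as follows. This is an assumption-laden illustration, not n9e-edge's code: EdgeRuleCache and the fetch_rules callable are hypothetical, and the real sync protocol is not shown in this article.

```python
class EdgeRuleCache:
    """Keeps the last successfully synced alert rules in memory."""

    def __init__(self, fetch_rules):
        # fetch_rules is a hypothetical callable that asks the central n9e
        # for the current rules and raises OSError when it is unreachable.
        self._fetch_rules = fetch_rules
        self._rules = []

    def sync(self):
        """Try to refresh the cache; keep the old rules if the center is down."""
        try:
            self._rules = list(self._fetch_rules())
            return True
        except OSError:
            return False

    def rules(self):
        """Rules to evaluate against the local TSDB."""
        return list(self._rules)


if __name__ == "__main__":
    center_up = {"value": True}

    def fetch_rules():
        if not center_up["value"]:
            raise ConnectionError("central n9e unreachable")
        return [{"id": 1, "name": "cpu_high"}]

    cache = EdgeRuleCache(fetch_rules)
    cache.sync()
    center_up["value"] = False  # simulate the link to the center going down
    cache.sync()
    print(cache.rules())  # the previously synced rules are still available
```

The point of the design is that a failed sync never clears the cache, so evaluation against the local TSDB continues with the last known rules.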

Alert events produced by n9e-edge are written back to the central MySQL by calling the n9e API, and notifications are sent via DingTalk, Feishu, Flashduty, etc. If the network between n9e-edge and n9e is broken, alert events cannot be written to MySQL, but as long as the DC hosting n9e-edge has working outbound internet access, alert notifications can still be sent.
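
The behaviour above implies that writing an event back and sending its notification are independent steps. A minimal sketch of that idea, where handle_event, persist, and notify are hypothetical names rather than Nightingale internals:

```python
def handle_event(event, persist, notify):
    """Try to write the event to the center, then notify regardless.

    persist and notify are hypothetical callables: persist stands in for the
    call to the central n9e API, notify for a DingTalk/Feishu/Flashduty sender.
    Returns True if the event was persisted, False otherwise.
    """
    persisted = True
    try:
        persist(event)
    except OSError:
        # The link to the central n9e is down; the event is not stored.
        persisted = False
    # Notification uses the local DC's outbound internet, so it still runs.
    notify(event)
    return persisted


if __name__ == "__main__":
    sent = []

    def broken_persist(event):
        raise ConnectionError("central n9e unreachable")

    stored = handle_event({"rule": "cpu_high"}, broken_persist, sent.append)
    print("persisted:", stored, "notified:", len(sent))
```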

In the diagram:

The central n9e handles alert evaluation for the central DC’s Loki and ElasticSearch, plus DC A’s ElasticSearch and Prometheus.
The n9e-edge in edge DC B handles alert evaluation for DC B’s VictoriaMetrics.

So how do you specify the association between data sources and alert engines? On the data source management page:

Nightingale Data Source Management Page

In the diagram above:

URL is the address from which the central n9e reads data. In this example, it should be the public address of the VictoriaMetrics in DC B.
TSDB Intranet URL is the address n9e-edge uses to connect to VictoriaMetrics. If URL is already an intranet address, this field can be left blank and n9e-edge will use URL instead. In the example above, n9e-edge and VictoriaMetrics are in the same DC, so set this to the intranet address for more reliable alerting.
Remote Write URL is the remote write address of VictoriaMetrics, used for recording rules. n9e-edge processes recording rules and writes the results back to the TSDB, so it needs the remote write address. Because n9e-edge is the component that uses it, enter the intranet address. If you do not use Nightingale recording rules, this can be left blank.
Associated Alert Engine Cluster: in the picture, edge-b is selected, which is the name of the n9e-edge in DC B (specified by the EngineName field in edge.toml). This associates the n9e-edge in DC B with the VictoriaMetrics in DC B, so that this n9e-edge handles the alert rules and recording rules for that VictoriaMetrics.
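
The relationship between these four fields can be sketched as a small data model. Everything here is illustrative: the DataSource class, the function names, the 10.x addresses, and the engine name "center" are assumptions made for the example; only https://ex.a.com and edge-b come from the text above.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    url: str               # "URL": address the central n9e reads from
    intranet_url: str      # "TSDB Intranet URL": may be blank
    remote_write_url: str  # "Remote Write URL": may be blank
    engine_cluster: str    # "Associated Alert Engine Cluster"


def edge_query_url(source):
    """Address n9e-edge queries: the intranet URL, or URL when it is blank."""
    return source.intranet_url or source.url


def sources_for_engine(sources, engine_name):
    """Data sources whose rules the named alert engine evaluates."""
    return [s for s in sources if s.engine_cluster == engine_name]


if __name__ == "__main__":
    sources = [
        DataSource("vm-dc-b", "https://ex.a.com", "http://10.0.0.5:8428",
                   "http://10.0.0.5:8428/api/v1/write", "edge-b"),
        DataSource("es-center", "http://10.1.0.7:9200", "", "", "center"),
    ]
    for source in sources_for_engine(sources, "edge-b"):
        print(source.name, "->", edge_query_url(source))
```

In this model the central n9e always proxies user queries through url, while the engine named in engine_cluster is the only one that evaluates rules for that data source.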

Newer versions of n9e-edge depend on Redis, so you need to deploy a Redis instance in DC B for n9e-edge. Note that the Redis used by n9e-edge is not the one used by the central n9e. In the diagram I deliberately labeled them R1 and R2 to indicate two separate Redis instances, used by n9e and n9e-edge respectively.

Finally, about categraf. Where the network is good, categraf can report data directly to the central n9e. For example, categraf in both the central DC and DC A can connect directly to the central n9e. DC B, however, has n9e-edge deployed, so categraf in DC B should connect to the n9e-edge in DC B.

Configuration Example

To achieve the above architecture, how should each component be configured? Here is an example.

Central DC n9e Config

The default config file of the central n9e is etc/config.toml:

[HTTP.APIForService]
Enable = true
[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
user002 = "ccc26da7b9aba533cbb263a36c07dcc6"

The key part is HTTP.APIForService. Enable defaults to false for security, which means the n9e-edge architecture is not supported by default; set it to true to enable it. When n9e-edge calls the n9e API it can authenticate with BasicAuth, which is configured in the HTTP.APIForService.BasicAuth section. The example above configures two users, user001 and user002, with passwords ccc26da7b9aba533cbb263a36c07dcc5 and ccc26da7b9aba533cbb263a36c07dcc6. One user would be enough; I configured two just for demonstration. If your n9e is exposed to the public internet, you must change the default BasicAuth password, otherwise it is an easy target for attack.

Edge DC n9e-edge Config

The default config of edge DC n9e-edge is etc/edge/edge.toml. First, n9e-edge needs to call the central n9e’s API, so configure the central n9e’s address:

[CenterApi]
Addrs = ["http://N9E-CENTER-SERVER:17000"]
BasicAuthUser = "user001"
BasicAuthPass = "ccc26da7b9aba533cbb263a36c07dcc5"
# unit: ms
Timeout = 9000

N9E-CENTER-SERVER:17000 is the central n9e address; adjust it to your environment. BasicAuthUser and BasicAuthPass are the BasicAuth credentials configured on the central n9e. If the central n9e does not have BasicAuth enabled, you can leave these blank. Again, you must change the default BasicAuth password, otherwise the central n9e is an easy target for attack.
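
For readers unfamiliar with BasicAuth, the sketch below shows how a user and password pair is turned into an HTTP Authorization header under the standard Basic scheme. The function name basic_auth_header is made up for illustration, and this is not n9e-edge's code. Note that the credentials are only base64-encoded, not encrypted, which is one more reason not to keep the default password.

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Values taken from the example config; replace them in real deployments.
    header = basic_auth_header("user001", "ccc26da7b9aba533cbb263a36c07dcc5")
    print(header["Authorization"])
    # Anyone who sees the header can recover the credentials:
    encoded = header["Authorization"].split(" ", 1)[1]
    print(base64.b64decode(encoded).decode("utf-8"))
```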

Newer versions of n9e-edge depend on Redis, so configure the Redis address. By default the Redis section should be at the bottom of edge.toml; adjust it for your environment. Older versions that do not depend on Redis need no such configuration. To tell which you have, check whether the default edge.toml you downloaded contains a Redis section: if it does, that version depends on Redis.

Edge DC categraf Config

Two settings matter here: the writer address and the heartbeat address. Both should point to n9e-edge:

...
[[writers]]
url = "http://N9E-EDGE:19000/prometheus/v1/write"

...
[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://N9E-EDGE:19000/v1/n9e/heartbeat"
...

N9E-EDGE:19000 is the n9e-edge address. n9e-edge listens on port 19000 by default; this can be changed in edge.toml.

ibex Config

ibex provides the self-healing feature. Some companies leave it disabled for security reasons. To enable it, configure it in edge.toml as well:

[Ibex]
Enable = true
RPCListen = "0.0.0.0:20090"

Then have the edge DC’s categraf connect to the edge DC’s n9e-edge port 20090. That is, the categraf’s config.toml needs:

[ibex]
enable = true
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["N9E-EDGE-IP:20090"]
## temp script dir
meta_dir = "./meta"

N9E-EDGE-IP:20090 is the n9e-edge RPC address. This is an RPC address, not an HTTP address, so do not prefix it with http://.
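
If you generate categraf configs with a script, a small check can catch the http:// mistake before deployment. This is a hypothetical helper, not part of categraf or n9e, and the name validate_rpc_addr is made up for illustration.

```python
def validate_rpc_addr(addr):
    """Return (host, port) for a bare host:port address; reject URL schemes."""
    if "://" in addr:
        raise ValueError("RPC address must be host:port without a scheme: " + addr)
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected host:port, got: " + addr)
    return host, int(port)


if __name__ == "__main__":
    print(validate_rpc_addr("N9E-EDGE-IP:20090"))
    try:
        validate_rpc_addr("http://N9E-EDGE-IP:20090")
    except ValueError as err:
        print("rejected:", err)
```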

Other Applicable Scenarios

Besides poor networks, network partitioning for security reasons is another use case. For example, in a particular network zone only a single jump host may be able to reach the central n9e, while other machines cannot. In this case, deploy n9e-edge on the jump host and have categraf on the other machines connect to the n9e-edge on the jump host.