
Detailed explanation of the Nightingale configuration file.

The configuration file for the central n9e is etc/config.toml, and the configuration file for the edge alert engine n9e-edge is etc/edge/edge.toml. Here we walk through the n9e configuration file in sections.

Global

[Global]
RunMode = "release"

This is a configuration item for Nightingale developers. Regular users do not need to worry about it; always keep it as release.

Log

[Log]
# stdout, stderr, file
Output = "stdout"
# log write dir
Dir = "logs"
# log level: DEBUG INFO WARNING ERROR
Level = "DEBUG"
# # rotate by time
# KeepHours = 4
# # rotate by size
# RotateNum = 3
# # unit: MB
# RotateSize = 256

Output: log output destination; supports stdout, stderr, and file. Logs are written to a file only in file mode, and only then do the other configuration items below take effect.
Dir: the directory where log files are stored.
Level: log level, supports DEBUG, INFO, WARNING, ERROR.
KeepHours: log file retention time in hours. Log files can be rotated by time or by size. For time-based rotation, set this item; one log file is produced per hour. For size-based rotation, use the two items below instead.
RotateNum: number of log files to retain.
RotateSize: log file size in MB.

HTTP

[HTTP]
# http listening address
Host = "0.0.0.0"
# http listening port
Port = 17000
# https cert file path
CertFile = ""
# https key file path
KeyFile = ""
# whether print access log
PrintAccessLog = false
# whether enable pprof
PProf = true
# expose prometheus /metrics?
ExposeMetrics = true
# http graceful shutdown timeout, unit: s
ShutdownTimeout = 30
# max content length: 64M
MaxContentLength = 67108864
# http server read timeout, unit: s
ReadTimeout = 20
# http server write timeout, unit: s
WriteTimeout = 40
# http server idle timeout, unit: s
IdleTimeout = 120

Host: HTTP listening address, usually 0.0.0.0, meaning listening on all NICs.
Port: HTTP listening port.
CertFile: HTTPS cert file path.
KeyFile: HTTPS key file path.
PrintAccessLog: whether to print access logs.
PProf: whether to enable pprof. If enabled, pprof information is available under /api/debug/pprof/.
ExposeMetrics: whether to expose a Prometheus-format /metrics endpoint that publishes Nightingale’s own monitoring metrics.
ShutdownTimeout: HTTP server graceful shutdown timeout, in seconds.
MaxContentLength: maximum HTTP request length, in bytes.
ReadTimeout: HTTP read timeout, in seconds.
WriteTimeout: HTTP write timeout, in seconds.
IdleTimeout: HTTP idle timeout, in seconds.
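
The default MaxContentLength is 64 MiB expressed in bytes, as the config comment says. A quick check of the arithmetic in Python (the helper name is only for illustration and is not part of Nightingale):

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


print(mib_to_bytes(64))  # 67108864, the default MaxContentLength
```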

HTTP.ShowCaptcha

[HTTP.ShowCaptcha]
Enable = false

Enable: whether to enable the captcha feature.

HTTP.APIForAgent

[HTTP.APIForAgent]
Enable = true
# [HTTP.APIForAgent.BasicAuth]
# user001 = "ccc26da7b9aba533cbb263a36c07dcc5"

Enable: whether to enable the Agent-facing API. Normally this must be enabled, so this is typically true.
BasicAuth: the Agent-facing API supports BasicAuth; configure the credentials here. For intranet communication BasicAuth is usually not needed. For communication over the public network it is recommended, and you must replace the default password shown here to avoid attacks.
In the example above, user001 is the BasicAuth username and ccc26da7b9aba533cbb263a36c07dcc5 is the BasicAuth password. To configure multiple users, add more entries, e.g.:

[HTTP.APIForAgent.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
user002 = "d4f5e6a7b8c9d0e1f2g3h4i5j6k7l8m9"

Note: if BasicAuth is configured, the Agent’s n9e configuration file must also include the corresponding credentials, otherwise the Agent will not be able to connect to the central n9e.

In the default config, Enable is set to true and HTTP.APIForAgent.BasicAuth is empty, meaning the Agent-facing APIs are enabled and BasicAuth is not.
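
For reference, HTTP BasicAuth sends the username and password as a Base64-encoded "username:password" pair in the Authorization header. The sketch below shows how a client could build that header; the function name and the sample credentials are hypothetical, and the Agent's real configuration keys are not shown here.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the standard HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials; use the values from your own BasicAuth section.
print(basic_auth_header("user001", "a-strong-password"))
```

Note that Base64 is an encoding, not encryption, so over the public network BasicAuth should be combined with HTTPS.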

HTTP.APIForService

[HTTP.APIForService]
Enable = false
[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"

Enable: whether to enable the Service-facing API. Communication between the edge alert engine n9e-edge and the central n9e relies on this set of APIs on the central side, so if you use n9e-edge, you need to enable it (set to true).
BasicAuth: the Service-facing API supports BasicAuth; configure the credentials here. For intranet communication BasicAuth is usually not needed. For communication over the public network it is recommended, and you must replace the default password shown here to avoid attacks.
In the example above, user001 is the BasicAuth username and ccc26da7b9aba533cbb263a36c07dcc5 is the BasicAuth password. To configure multiple users, add more entries, e.g.:

[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
user002 = "d4f5e6a7b8c9d0e1f2g3h4i5j6k7l8m9"

Note: if BasicAuth is configured, the edge alert engine n9e-edge configuration must also include the corresponding credentials, otherwise n9e-edge will not be able to connect to the central n9e. In the default config, Enable is set to false, meaning the Service-facing APIs are disabled. In this case n9e-edge cannot connect to the central n9e either.

HTTP.JWTAuth

[HTTP.JWTAuth]
# unit: min
AccessExpired = 1500
# unit: min
RefreshExpired = 10080
RedisKeyPrefix = "/jwt/"

Nightingale authentication uses JWT. Both expiration times are in minutes: AccessExpired is the access token lifetime and RefreshExpired is the refresh token lifetime. How access and refresh tokens work in JWT is widely documented, so it is not repeated here. Nightingale stores some JWT-related information in Redis, and RedisKeyPrefix is the prefix for those Redis keys; it generally does not need to be changed.
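
Because both values are expressed in minutes, it can help to convert the defaults into more familiar units. A small sketch using Python's standard library (the variable names are illustrative only):

```python
from datetime import timedelta

# Default values from the [HTTP.JWTAuth] section, both in minutes.
access_expired = timedelta(minutes=1500)
refresh_expired = timedelta(minutes=10080)

print(access_expired)   # 1 day, 1:00:00  (25 hours)
print(refresh_expired)  # 7 days, 0:00:00
```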

HTTP.ProxyAuth

[HTTP.ProxyAuth]
# if proxy auth enabled, jwt auth is disabled
Enable = false
# username key in http proxy header
HeaderUserNameKey = "X-User-Name"
DefaultRoles = ["Standard"]

If you want to embed Nightingale into your own system, consider ProxyAuth, which is similar to Grafana’s ProxyAuth. The user logs in to your own system, and your system then forwards the username to Nightingale in the X-User-Name header (the header name is set by HeaderUserNameKey); Nightingale treats that user as logged in. As the comment in the config notes, enabling ProxyAuth disables JWT authentication. DefaultRoles is the default role: if you do not pass a role, Nightingale treats the user as having the Standard role.

In practice, as far as I have observed, there are no community users using this feature, so use it with caution.
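
To make the ProxyAuth idea concrete, here is a minimal sketch of how a receiving service could resolve a user from the forwarded header. It is not Nightingale's implementation; the function name, the returned structure, and the handling of a missing header are assumptions made for illustration.

```python
def resolve_proxy_user(headers, header_key="X-User-Name", default_roles=("Standard",)):
    """Return the user described by the proxy header, or None if it is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    username = normalized.get(header_key.lower(), "").strip()
    if not username:
        return None  # no forwarded username: treat the request as unauthenticated
    # No role information was forwarded, so fall back to the default roles.
    return {"username": username, "roles": list(default_roles)}


print(resolve_proxy_user({"x-user-name": "alice"}))
```

Because the header is trusted as-is, a setup like this is only safe when Nightingale is reachable exclusively through the proxy that sets the header.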

HTTP.RSA

[HTTP.RSA]
OpenRSA = false

By default, the user password is transmitted in plaintext when logging in to Nightingale. If the Nightingale site is served over HTTPS this is fine, but if it is served over plain HTTP, it is recommended to enable RSA encryption (OpenRSA = true) so that the password is not sent in plaintext.

DB

[DB]
# mysql postgres sqlite
DBType = "sqlite"
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
# postgres: DSN="host=127.0.0.1 port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
# mysql: DSN="root:1234@tcp(localhost:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local"
DSN = "n9e.db"
# enable debug mode or not
Debug = false
# unit: s
MaxLifetime = 7200
# max open connections
MaxOpenConns = 32
# max idle connections
MaxIdleConns = 8

DBType and DSN are the most critical configurations and work together. DBType supports mysql, postgres, and sqlite. DSN is the database connection information: for sqlite it is the database file path; for mysql or postgres it is the database connection string.

Starting from v8, Nightingale defaults DBType to sqlite so users can experience it quickly without installing a database. However, in production, please use mysql or postgres.

For Postgres and MySQL DSN configuration, refer to the commented examples. Other settings are database-connection-related; adjust according to your environment. For small-to-medium environments, setting MaxOpenConns to 32 and MaxIdleConns to 8 is sufficient.
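
The DSN formats in the commented examples can be assembled from their parts as shown below. This is only a sketch for generating the strings; the helper names are hypothetical, and the query parameters on the MySQL DSN are copied from the commented example rather than being required values.

```python
def postgres_dsn(host, port, user, dbname, password, sslmode="disable"):
    """Build a DSN in the key=value format shown in the postgres comment."""
    return (f"host={host} port={port} user={user} "
            f"dbname={dbname} password={password} sslmode={sslmode}")


def mysql_dsn(user, password, host, port, dbname):
    """Build a DSN in the user:pass@tcp(host:port)/db format shown in the mysql comment."""
    return (f"{user}:{password}@tcp({host}:{port})/{dbname}"
            "?charset=utf8mb4&parseTime=True&loc=Local")


print(postgres_dsn("127.0.0.1", 5432, "root", "n9e_v6", "1234"))
print(mysql_dsn("root", "1234", "localhost", 3306, "n9e_v6"))
```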

Redis

[Redis]
# standalone cluster sentinel miniredis
RedisType = "miniredis"
# address, ip:port or ip1:port,ip2:port for cluster and sentinel(SentinelAddrs)
Address = "127.0.0.1:6379"
# Username = ""
# Password = ""
# DB = 0
# UseTLS = false
# TLSMinVersion = "1.2"
# Mastername for sentinel type
# MasterName = "mymaster"
# SentinelUsername = ""
# SentinelPassword = ""

Besides storing JWT login authentication information, Redis is also used to store heartbeat metadata reported by machines. The machine-disconnect alert rule supported by Nightingale is judged based on the heartbeat times in Redis. If there is no heartbeat for a long time, the machine is considered disconnected.

If Redis responds slowly, it may cause false positives in disconnect alerts. That is, the machine is actually alive, but the heartbeat info in Redis is not updated in time, so Nightingale incorrectly judges it as disconnected. Starting from V8.beta11, Redis-operation-related monitoring metrics have been added. Pay attention to these metrics to detect slow Redis responses in time.
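
The disconnect check and the false-positive scenario can be sketched as follows. This is an illustration of the reasoning only, not Nightingale's code; the function name and the 60-second threshold are assumptions.

```python
def is_disconnected(last_heartbeat_ts: float, now_ts: float, max_silence_s: float) -> bool:
    """A machine counts as disconnected if its last stored heartbeat is too old."""
    return (now_ts - last_heartbeat_ts) > max_silence_s


now = 1_000_000.0
threshold = 60.0  # hypothetical: alert after 60 seconds without a heartbeat

# Healthy case: the heartbeat stored in Redis is 5 seconds old.
print(is_disconnected(now - 5, now, threshold))    # False

# Slow Redis: the machine is alive, but the stored timestamp was last
# updated 90 seconds ago, so the check reports a false disconnect.
print(is_disconnected(now - 90, now, threshold))   # True
```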

RedisType supports standalone, cluster, sentinel, and miniredis. Since v8, Nightingale defaults to miniredis for quick experience without installing Redis. However, in production, please use one of the other modes.

Address is the Redis connection address. It is configured differently depending on RedisType:

standalone: when RedisType is standalone, Address is the address of the Redis instance in the format ip:port.
cluster: when RedisType is cluster, Address is the addresses of the Redis cluster in the format ip1:port,ip2:port.
sentinel: when RedisType is sentinel, Address is the Sentinel addresses in the format ip1:port,ip2:port. In sentinel mode you also need to configure MasterName, SentinelUsername, and SentinelPassword.
UseTLS: whether to use TLS.
TLSMinVersion: TLS minimum version, only effective when UseTLS is true.
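
The Address formats described above can be summarized with a small parser. It is a sketch for the three documented types only (miniredis is left out because its Address handling is not described here), and the function name is hypothetical.

```python
from typing import List


def parse_redis_address(redis_type: str, address: str) -> List[str]:
    """Split Address into individual ip:port entries according to RedisType."""
    if redis_type not in ("standalone", "cluster", "sentinel"):
        raise ValueError(f"unsupported RedisType in this sketch: {redis_type}")
    addrs = [a.strip() for a in address.split(",") if a.strip()]
    if not addrs:
        raise ValueError("Address must not be empty")
    if redis_type == "standalone" and len(addrs) != 1:
        raise ValueError("standalone expects exactly one ip:port")
    return addrs


print(parse_redis_address("standalone", "127.0.0.1:6379"))
print(parse_redis_address("cluster", "10.0.0.1:6379,10.0.0.2:6379"))
```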

Alert

Starting from a certain version, Nightingale merged the webapi and alert engine modules to reduce deployment complexity. The Alert section here is the alert engine configuration.

Alert.Heartbeat

[Alert.Heartbeat]
# auto detect if blank
IP = ""
# unit ms
Interval = 1000
EngineName = "default"

IP: the IP address of the alert engine. If empty, Nightingale will auto-detect. Each alert engine writes heartbeat info to MySQL, so every alert engine knows the list of all live alert engines, and can then perform sharding of alert rules. For example, if there are 100 alert rules and two n9e instances form a cluster, each will roughly handle 50 rules. When one alert engine dies, the other will take over all 100 rules.
Interval: heartbeat interval, in milliseconds.
EngineName: alert engine name. The central side normally keeps default, while edge alert engines (n9e-edge) can use custom EngineNames like edge1, edge2, etc. Engines with the same EngineName are treated as one cluster.
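
The sharding behavior described for IP (100 rules split across two engines, with failover when one dies) can be illustrated with a simple sketch. It uses plain modulo assignment over the sorted list of live engines, which is an assumption chosen for clarity; Nightingale's actual sharding algorithm may differ, and the engine addresses are made up.

```python
def assign_rules(rule_ids, live_engines):
    """Distribute rule ids across live engines using modulo sharding."""
    engines = sorted(live_engines)  # every engine must see the same order
    if not engines:
        raise ValueError("no live alert engines")
    assignment = {engine: [] for engine in engines}
    for rule_id in rule_ids:
        owner = engines[rule_id % len(engines)]
        assignment[owner].append(rule_id)
    return assignment


rules = range(1, 101)  # 100 alert rules with ids 1..100

both = assign_rules(rules, ["10.0.0.1", "10.0.0.2"])
print({engine: len(ids) for engine, ids in both.items()})   # 50 rules each

# One engine stops sending heartbeats, so the survivor takes over everything.
survivor = assign_rules(rules, ["10.0.0.2"])
print(len(survivor["10.0.0.2"]))  # 100
```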

Center

Configurations specific to the central n9e; the edge alert engine n9e-edge does not have these. They correspond to the configurations that used to be specific to the old n9e-webapi.

[Center]
MetricsYamlFile = "./etc/metrics.yaml"
I18NHeaderKey = "X-Language"

[Center.AnonymousAccess]
PromQuerier = true
AlertDetail = true

MetricsYamlFile: path to the metrics configuration file. The metric descriptions you see in the Quick View come from this file. After the Metrics View was released, this file became less important, and there are even plans to remove the Quick View feature.
I18NHeaderKey: this is a developer-only configuration item; regular users need not worry about it.
Center.AnonymousAccess: anonymous-access-related configuration. PromQuerier controls whether anonymous queries to data sources are allowed; AlertDetail controls whether anonymous viewing of alert details is allowed. Both can be enabled on an intranet but must be disabled on the public internet.

Dashboards have a public-access feature that can even be configured to require no login. However, this requires PromQuerier to be set to true. If PromQuerier = false, even if the dashboard is set to public access, login is still required.

Pushgw

Although Nightingale does not store monitoring data directly, it provides multiple interfaces to receive monitoring data, such as the Prometheus remote write protocol interface and the OpenTSDB protocol interface. After receiving the data, Nightingale forwards it to the backend time-series database, so Nightingale acts like a Pushgateway. Pushgateway-related configurations are under Pushgw.

[Pushgw]
# use target labels in database instead of in series
LabelRewrite = true
ForceUseServerTS = true

LabelRewrite: Nightingale has a machine management menu where you can tag machines, and these tags are appended to time-series data related to the machine. But if a tag in the reported data conflicts with a tag from machine management, which one wins? If LabelRewrite is true, the machine-management tag wins; otherwise the reported tag wins.
ForceUseServerTS: whether to force the server’s timestamp to override the timestamp in the received monitoring data. In many companies machine clocks are not synchronized, so the timestamps carried in reported data can be inconsistent and confusing. We recommend enabling this option so that the server’s timestamp is used uniformly.
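
The two options can be pictured with the following sketch. It mirrors the rules stated above (which tag wins on conflict, and which timestamp is kept) but is not Nightingale's code; the function names and sample labels are illustrative assumptions.

```python
def merge_labels(reported: dict, machine_tags: dict, label_rewrite: bool) -> dict:
    """Append machine-management tags to a series, resolving conflicts."""
    merged = dict(reported)
    for key, value in machine_tags.items():
        # On conflict the machine-management tag wins only if LabelRewrite is true.
        if label_rewrite or key not in merged:
            merged[key] = value
    return merged


def effective_timestamp(sample_ts: int, server_ts: int, force_use_server_ts: bool) -> int:
    """Pick the timestamp that is written to the time-series database."""
    return server_ts if force_use_server_ts else sample_ts


reported = {"ident": "host01", "region": "from-agent"}
machine_tags = {"region": "from-ui", "team": "infra"}

print(merge_labels(reported, machine_tags, label_rewrite=True))
print(merge_labels(reported, machine_tags, label_rewrite=False))
```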

Pushgw.DebugSample

[Pushgw.DebugSample]
ident = "xx"
__name__ = "cpu_usage_active"

This is for debugging and troubleshooting. It is essentially a filter on monitoring metrics: if a reported metric matches the filter, it is printed to the logs. Normally there is no need to configure this, so leave the section commented out.
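
A minimal sketch of such a filter is shown below. It assumes that a sample matches only when every key/value pair in the filter equals the corresponding label, and that an empty filter matches nothing; both are assumptions about the semantics, not confirmed behavior.

```python
def matches_debug_filter(labels: dict, debug_filter: dict) -> bool:
    """Return True if the sample's labels satisfy every filter entry."""
    if not debug_filter:
        return False  # assumption: nothing is printed when no filter is configured
    return all(labels.get(key) == value for key, value in debug_filter.items())


debug_filter = {"ident": "xx", "__name__": "cpu_usage_active"}

print(matches_debug_filter({"ident": "xx", "__name__": "cpu_usage_active", "cpu": "cpu-total"}, debug_filter))
print(matches_debug_filter({"ident": "yy", "__name__": "cpu_usage_active"}, debug_filter))
```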

Pushgw.WriterOpt

[Pushgw.WriterOpt]
QueueMaxSize = 1000000
QueuePopSize = 1000
QueueNumber = 0

This section is commented out by default because users normally do not need to care. If Nightingale receives too much data, gets congested in memory, and eventually drops metrics, then consider tuning this section.

Nightingale creates QueueNumber queues in memory. When monitoring data is received, it is placed into these queues. The default QueueNumber is 0, which means no fixed number is specified and queues are created based on the number of CPU cores. Each queue’s maximum capacity is QueueMaxSize (default 1,000,000), so each queue can hold up to 1 million entries.

Each queue corresponds to a goroutine, which pops up to QueuePopSize metrics from the queue at a time (default 1000) and writes them as a batch to the backend time-series database. This makes good use of multi-core CPUs. So QueueNumber essentially equals the concurrency of writing to the backend.
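
The interaction of the three options can be sketched with a bounded queue that is drained in batches. This is a simplified, single-threaded illustration in Python rather than Nightingale's goroutine-based implementation; the class and function names are hypothetical.

```python
import os
from collections import deque


def queue_count(queue_number: int) -> int:
    """QueueNumber = 0 means: derive the number of queues from the CPU cores."""
    return queue_number if queue_number > 0 else (os.cpu_count() or 1)


class SampleQueue:
    """A bounded in-memory queue drained in batches."""

    def __init__(self, max_size: int):
        self.max_size = max_size      # corresponds to QueueMaxSize
        self.items = deque()

    def push(self, sample) -> bool:
        if len(self.items) >= self.max_size:
            return False              # queue is full: the sample is dropped
        self.items.append(sample)
        return True

    def pop_batch(self, pop_size: int) -> list:
        """Pop up to pop_size samples (QueuePopSize) for one remote write."""
        batch = []
        while self.items and len(batch) < pop_size:
            batch.append(self.items.popleft())
        return batch


q = SampleQueue(max_size=5)
accepted = [q.push(i) for i in range(7)]
print(accepted)          # the last two pushes are rejected
print(q.pop_batch(3))    # first batch
print(q.pop_batch(3))    # remaining samples
```

Raising QueueNumber increases write concurrency, while raising QueueMaxSize only buys buffer time; if the backend is persistently slower than the ingest rate, the queues still fill up eventually.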

Pushgw.Writers

This section configures the remote write addresses of the backend time-series databases. Any time-series database that supports the remote write protocol can be configured here. Usually configuring one is sufficient; if you want to write to multiple time-series databases simultaneously, you can configure multiple.

[[Pushgw.Writers]]
Url = "http://127.0.0.1:9090/api/v1/write"
BasicAuthUser = "xx"
BasicAuthPass = "xx"

[[Pushgw.Writers]]
Url = "http://127.0.0.1:8482/api/v1/write"
BasicAuthUser = "xx"
BasicAuthPass = "xx"

Url: the remote write address of the time-series database.
BasicAuthUser, BasicAuthPass: if the time-series database requires BasicAuth, configure the username and password here.
Headers: if the time-series database requires additional headers, configure them here.
Timeout: write timeout in milliseconds.
DialTimeout: connection timeout in milliseconds.

Pushgw.Writers.WriteRelabels

Before writing data to the time-series database, relabel operations can be performed. This configuration is for those relabel operations. It’s similar to Prometheus’s relabel configuration, except Prometheus uses YAML and Nightingale uses TOML.
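
As an illustration of what a relabel step does, the sketch below mimics the Prometheus labeldrop action, which removes every label whose name fully matches a regular expression. It assumes Nightingale's relabeling follows the Prometheus semantics for this action; the function name is hypothetical, and Python's regex syntax differs slightly from the RE2 syntax used by Prometheus.

```python
import re


def labeldrop(labels: dict, regex: str) -> dict:
    """Drop labels whose name fully matches regex (Prometheus anchors the pattern)."""
    pattern = re.compile(regex)
    return {name: value for name, value in labels.items()
            if not pattern.fullmatch(name)}


series = {"__name__": "cpu_usage_active", "ident": "host01", "tmp_debug": "1", "tmp_trace": "abc"}
print(labeldrop(series, "tmp_.*"))
```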

Ibex

Configuration for the self-healing engine Ibex, i.e., the remote-script-execution feature. Originally this was a separate module called ibex, later merged into n9e, so its configuration is now also in n9e.

[Ibex]
Enable = true
RPCListen = "0.0.0.0:20090"

Enable: whether to enable the Ibex server feature.
RPCListen: the RPC listening address of Ibex.

n9e-edge Configuration

The configuration file for the edge alert engine n9e-edge is etc/edge/edge.toml. Most configurations are the same as the central n9e. For more information see: “Edge Mode in Nightingale Architecture”.