- 快猫星云Flashcat

Nightingale is an open-source cloud-native monitoring system. This article explains how to monitor processes with Nightingale.

Process monitoring is divided into two parts: one is overall process count statistics on the operating system, and the other is metric collection for a single process.

Overall Process Count

Categraf, for example, provides the processes plugin, which counts the processes on a host: the total, the number in the Running state, the number in the Sleeping state, and so on. A dedicated dashboard is available for the data this plugin collects:

https://github.com/ccfos/nightingale/blob/main/integrations/Linux/dashboards/categraf-processes.json
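To get a feel for what these counts represent, the sketch below tallies processes by state by hand. It assumes a Linux host with /proc mounted, and the function names are made up for this illustration. It is not how the processes plugin is implemented, only a rough manual equivalent.

```shell
#!/bin/sh
# Rough manual tally of process states on Linux (illustrative only).

# Print one line per state letter (R = running, S = sleeping, ...) with its count.
count_by_state() {
    # cat tolerates processes that exit between glob expansion and read.
    cat /proc/[0-9]*/status 2>/dev/null |
        awk '/^State:/ { n[$2]++ } END { for (s in n) print s, n[s] }' |
        sort
}

# Print the number of numeric entries under /proc, one per process.
total_processes() {
    ls -d /proc/[0-9]* 2>/dev/null | wc -l | tr -d ' '
}

count_by_state
echo "total: $(total_processes)"
```

The numbers will differ slightly from run to run because short-lived processes come and go while the script reads /proc.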

What are these metrics useful for? Typically, for catching processes that are launched in large numbers unexpectedly. The author once hit exactly this: a poorly written cron script hung, and because the script did not check whether its previous run had exited, every scheduled run started another process. Eventually a large number of identically named processes were running on the host, which caused an incident. The metrics collected by the processes plugin help spot this kind of problem.
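When the total process count climbs, a natural next step is to find out which process name is piling up. The following is a minimal sketch for a Linux host. The function name and the threshold of 20 are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# List process names that have at least a given number of running instances.
# Linux-only sketch: reads the Name: field from /proc/<pid>/status.

duplicated_names() {
    min_count="${1:-20}"   # hypothetical threshold, tune for your hosts
    cat /proc/[0-9]*/status 2>/dev/null |
        awk -F'\t' -v min="$min_count" '
            /^Name:/ { n[$2]++ }
            END { for (k in n) if (n[k] >= min) print n[k], k }' |
        sort -rn
}

duplicated_names 20
```

On a healthy host this usually prints little or nothing. A runaway cron script like the one described above would show up as a single name with a large count.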

Single-Process Metrics

Single-process metrics refer to the CPU, memory, file handles, and other metrics consumed by a process. There are several ways to collect them.

Instrument inside the process. For example, Java programs can use Micrometer or Spring Boot Actuator, while Go programs can use the Prometheus Go client library.
Collect from outside the process. For example, use Process Exporter or Categraf’s procstat plugin.

In general, in-process instrumentation is the recommended approach. Beyond general metrics such as CPU and memory, it can expose runtime metrics: JVM metrics for Java programs, goroutine and GC metrics for Go programs. Well-maintained open-source software typically exposes its own monitoring metrics. Application developers vary in experience, however, and some may not appreciate the importance of instrumentation; in those cases, external collection can fill the gap.

Spring Boot Actuator can be configured to expose Prometheus-formatted metrics directly, so no additional plugin is needed. Scrape them with Categraf’s prometheus plugin, or configure scrape rules in Prometheus or vmagent.

Take Categraf’s procstat plugin as an example; it is described in the plugin’s own documentation. The key metrics to watch are:

procstat_lookup_count: number of matching processes; 0 means the corresponding process is down
procstat_rlimit_num_fds_soft: soft limit on the process’s file handles; a value of 1024 usually means the system limits were left at their defaults and not tuned
procstat_cpu_usage_total: process CPU usage
procstat_mem_usage_total: process memory usage
procstat_num_fds_total: number of file handles opened by the process
procstat_read_bytes_total: total bytes read by the process
procstat_write_bytes_total: total bytes written by the process
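To cross-check the two file-handle metrics for a single process by hand, the sketch below reads the same kind of information from /proc. It assumes a Linux host; the function names and the TARGET_PID variable are hypothetical, and this is not how the plugin gathers its data.

```shell
#!/bin/sh
# Manual cross-check of open file handles and the soft limit for one PID (Linux).

# Count entries in /proc/<pid>/fd, i.e. currently open file descriptors.
open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Read the soft limit from the "Max open files" row of /proc/<pid>/limits.
# Row layout: Max open files  <soft>  <hard>  files
soft_fd_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

pid="${TARGET_PID:-$$}"   # defaults to this shell; set TARGET_PID to inspect another process
echo "pid=$pid open_fds=$(open_fds "$pid") soft_limit=$(soft_fd_limit "$pid")"
```

Inspecting another user’s process generally requires root, because /proc/<pid>/fd is only readable by the process owner.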

A reference dashboard for single-process metrics:

https://github.com/ccfos/nightingale/blob/main/integrations/Procstat/dashboards/categraf-procstat.json

FAQ

1. How do I monitor multiple processes with the procstat plugin?

A sample configuration:

[[instances]]
search_exec_substring = "mysqld"
gather_total = true
gather_per_pid = true
gather_more_metrics = [
    "threads",
    "fd",
    "io",
    "uptime",
    "cpu",
    "mem",
    "limit",
]

[[instances]]
search_exec_substring = "n9e-plus"
gather_total = true
gather_per_pid = true
gather_more_metrics = [
    "threads",
    "fd",
    "io",
    "uptime",
    "cpu",
    "mem",
    "limit",
]

2. What does the jvm parameter in gather_more_metrics of the procstat configuration do?

If gather_more_metrics contains jvm, the target process is treated as a Java process, and the system’s jstat command is invoked to collect basic JVM metrics. jstat ships with the JDK and lives in the JDK’s bin directory. A common pitfall: jvm is configured in gather_more_metrics, jstat is available on the host, and a test run collects data successfully:

./categraf --test --inputs procstat

After Categraf is restarted for normal collection, however, the JVM data no longer arrives. The usual cause is that Categraf is managed by systemd, and systemd does not inherit the JDK’s environment variables from your login shell, so the jstat command cannot be found. The fix is to add the JDK’s bin directory to PATH in Categraf’s service file. For example:

Environment="PATH=/usr/lib/jvm/java-11-openjdk-amd64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
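The difference between the test run and the service run comes down to which PATH is in effect. The sketch below checks whether jstat can be resolved under your interactive PATH and under a service-style PATH. The SERVICE_PATH value is an assumption about a typical default without the JDK directory, and resolvable_under is a helper invented for this example.

```shell
#!/bin/sh
# Check whether a command can be found under a given PATH value.

resolvable_under() {
    cmd="$1"
    search_path="$2"
    # Do the lookup in a subshell so the caller's PATH is left untouched.
    ( PATH="$search_path"; command -v "$cmd" >/dev/null 2>&1 )
}

# Assumed PATH for a service that has no JDK directory configured.
SERVICE_PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

for p in "$PATH" "$SERVICE_PATH"; do
    if resolvable_under jstat "$p"; then
        echo "jstat found with PATH=$p"
    else
        echo "jstat NOT found with PATH=$p"
    fi
done
```

If jstat is found only under the first PATH, the Environment line above is the missing piece. After editing a service file, systemd needs `systemctl daemon-reload` followed by a restart of the service before the change takes effect.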

Common Questions

Q1: How do I monitor whether a process is running?

A: Use Categraf’s procstat plugin and select the process in an [[instances]] block, for example search_exec_substring = "nginx". The plugin reports procstat_lookup_count for that instance; configure an alert on procstat_lookup_count < 1.

Q2: Can I monitor a process’s memory / CPU usage?

A: Yes. procstat reports procstat_mem_usage_total and procstat_cpu_usage_total, as listed under the key metrics above. Configure alerts on these metrics per process, with a threshold that suits the workload.