Nightingale is an open-source cloud-native monitoring system. This article explains how to monitor processes with Nightingale.
Process monitoring is divided into two parts: one is overall process count statistics on the operating system, and the other is metric collection for a single process.
Overall Process Count
Categraf, for example, provides the processes plugin to count the processes on a host: the total, the number in Running state, the number in Sleeping state, and so on. We have prepared a dedicated dashboard for the data this plugin collects:
https://github.com/ccfos/nightingale/blob/main/integrations/Linux/dashboards/categraf-processes.json
What are these metrics useful for? Typically, they catch unexpected mass launches of processes. The author once hit such a case: a poorly written crontab entry ran a script that hung, and the script did not check whether the previous run had exited. Each crontab execution therefore started a new process, until a large number of identically named processes were running on the host and caused an incident. In such cases, the metrics collected by the processes plugin help spot the issue.
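To make the idea concrete, here is a minimal Python sketch that counts processes by state and by name from Linux's /proc filesystem. It is not how Categraf's processes plugin is implemented; it only reads the same kind of kernel data, and the function names parse_stat and scan_processes are made up for this illustration.

```python
import os
from collections import Counter


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so use the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]  # e.g. R, S, D, Z, T
    return comm, state


def scan_processes(proc_root="/proc"):
    """Count processes by state and by command name."""
    by_state = Counter()
    by_name = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        by_state[state] += 1
        by_name[comm] += 1
    return by_state, by_name


if __name__ == "__main__" and os.path.isdir("/proc"):
    states, names = scan_processes()
    print("by state:", dict(states))
    print("most common names:", names.most_common(5))
```

In the crontab incident described above, the hung script's name would dominate the "most common names" output, and the total count would keep climbing with every cron run.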
Single-Process Metrics
Single-process metrics refer to the CPU, memory, file handles, and other metrics consumed by a process. There are several ways to collect them.
- Instrument inside the process. For example, Java programs can use Micrometer or Spring Boot Actuator to collect metrics, while Go programs can use the Prometheus Go client library to collect metrics.
- Collect from outside the process. For example, use Process Exporter or Categraf’s procstat plugin.
Generally speaking, in-process instrumentation is the recommended approach. It can collect not only general metrics such as CPU and memory but also runtime metrics: Java programs can expose JVM metrics, and Go programs can expose goroutine and GC metrics. Well-maintained open-source software typically exposes its own monitoring metrics. In-house development teams vary in experience, however, and some may not appreciate the importance of instrumentation. In such cases, external collection can serve as a supplement.
Spring Boot Actuator can be configured to expose Prometheus-formatted metrics directly, so no additional plugin is needed. Just use Categraf’s prometheus plugin, or configure scrape rules in Prometheus or vmagent.
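As a rough illustration of what in-process instrumentation can see that an external collector cannot, the sketch below renders a few interpreter-level values in Prometheus text format. Python is used only for illustration (the article's examples are Java and Go), it uses the standard library and no metrics client, and the metric names app_threads and app_gc_collections_total are hypothetical.

```python
import gc
import threading


def render_runtime_metrics():
    """Render a few interpreter-level metrics in Prometheus text format."""
    lines = [
        "# TYPE app_threads gauge",
        f"app_threads {threading.active_count()}",
        "# TYPE app_gc_collections_total counter",
    ]
    # gc.get_stats() returns one dict per garbage-collector generation.
    for generation, stats in enumerate(gc.get_stats()):
        lines.append(
            f'app_gc_collections_total{{generation="{generation}"}} '
            f'{stats["collections"]}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_runtime_metrics(), end="")
```

A real service would serve this text on an HTTP endpoint so that a scraper can collect it; thread and garbage-collection counts like these are only visible from inside the process.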
Take Categraf’s procstat plugin as an example; its documentation is available here. Key metrics to focus on:
- procstat_lookup_count — process count; if 0, the corresponding process is down
- procstat_rlimit_num_fds_soft — soft limit for process file handles; if it is 1024, the system parameters are usually not well tuned
- procstat_cpu_usage_total — process CPU usage
- procstat_mem_usage_total — process memory usage
- procstat_num_fds_total — number of file handles opened by the process
- procstat_read_bytes_total — total bytes read by the process
- procstat_write_bytes_total — total bytes written by the process
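The checks suggested by the first few metrics can be sketched as a small Python function. This is only an illustration of the reasoning, not an alert rule format used by Nightingale: the function name check_process, the dict-shaped sample, and the 0.8 ratio threshold are all assumptions, and the ratio check assumes the sample describes a single matched process.

```python
def check_process(sample, fd_ratio_threshold=0.8):
    """Return warnings for one process's procstat sample (a plain dict)."""
    # No matched process: the other metrics are not meaningful.
    if sample.get("procstat_lookup_count", 0) < 1:
        return ["process is down (procstat_lookup_count < 1)"]

    warnings = []
    soft_limit = sample.get("procstat_rlimit_num_fds_soft")
    open_fds = sample.get("procstat_num_fds_total")

    # 1024 is the common untuned default for the soft fd limit.
    if soft_limit == 1024:
        warnings.append("fd soft limit is 1024: limits are probably untuned")

    # Warn when open file handles approach the soft limit.
    if soft_limit and open_fds is not None:
        ratio = open_fds / soft_limit
        if ratio >= fd_ratio_threshold:
            warnings.append(f"open fds at {ratio:.0%} of the soft limit")
    return warnings


if __name__ == "__main__":
    sample = {
        "procstat_lookup_count": 1,
        "procstat_rlimit_num_fds_soft": 1024,
        "procstat_num_fds_total": 900,
    }
    for warning in check_process(sample):
        print(warning)
```

In practice the same conditions would be written as alert rules over the collected time series rather than evaluated in a script.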
A reference dashboard for single-process metrics:
FAQ
1. How do I monitor multiple processes with the procstat plugin?
Add one [[instances]] block per process. A sample configuration:
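[[instances]]
# Match processes whose executable name contains this substring
search_exec_substring = "mysqld"
# Report metrics aggregated over all matched processes
gather_total = true
# Also report metrics for each matched PID
gather_per_pid = true
# Extra metric groups to collect
gather_more_metrics = [
  "threads",
  "fd",
  "io",
  "uptime",
  "cpu",
  "mem",
  "limit",
]

[[instances]]
search_exec_substring = "n9e-plus"
gather_total = true
gather_per_pid = true
gather_more_metrics = [
  "threads",
  "fd",
  "io",
  "uptime",
  "cpu",
  "mem",
  "limit",
]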
2. What does the jvm parameter in gather_more_metrics of the procstat configuration do?
If gather_more_metrics contains jvm, the target process is treated as a Java process, and the system’s jstat command is invoked to collect basic JVM metrics. jstat is a tool bundled with the JDK installation, located in the JDK’s bin directory. A common pitfall here is: users configure jvm in gather_more_metrics, jstat is available on the host, and a test command can collect data successfully:
./categraf --test --inputs procstat
But after Categraf is restarted for normal collection, the data no longer comes in. The usual reason: Categraf is managed by systemd, and a systemd service does not inherit the environment variables of your login shell, so the JDK’s bin directory is not on its PATH and the jstat command cannot be found. The fix is to add the JDK’s bin directory to PATH in Categraf’s service file. For example:
Environment="PATH=/usr/lib/jvm/java-11-openjdk-amd64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
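To see why the lookup fails, the following Python sketch resolves a command against an explicit PATH string, the way a service with a minimal environment would. The helper name find_tool and the SERVICE_PATH value are assumptions for illustration; the PATH a systemd service really gets depends on the distribution and the unit file.

```python
import shutil

# Assumed minimal PATH for a service; the real value varies by system.
SERVICE_PATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# JDK location taken from the example Environment line in this article.
JDK_BIN = "/usr/lib/jvm/java-11-openjdk-amd64/bin"


def find_tool(name, path):
    """Return the full path of an executable found on `path`, or None."""
    return shutil.which(name, path=path)


if __name__ == "__main__":
    # Without the JDK directory, jstat is usually not found.
    print("service PATH only:", find_tool("jstat", SERVICE_PATH))
    # With the JDK directory prepended, it resolves if the JDK is installed there.
    print("with JDK bin:", find_tool("jstat", JDK_BIN + ":" + SERVICE_PATH))
```

If the first lookup prints None while an interactive shell finds jstat, the difference in PATH is the likely cause of the missing JVM metrics.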
Common Questions
Q1: How do I monitor whether a process is running?
A: Use Categraf’s procstat plugin and configure search_exec_substring = "nginx" (or "supervisord"). It reports the procstat_lookup_count metric described above. Configure an alert on procstat_lookup_count < 1.
Q2: Can I monitor a process’s memory / CPU usage?
A: Yes. procstat reports procstat_mem_usage_total and procstat_cpu_usage_total. Configure alerts by process name and threshold.