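The conntrack plugin: easy to overlook, but you must monitor it
This is the 8th article in the series, covering node-exporter's conntrack plugin. It tends to get little attention, but on hosts such as firewalls and NAT gateways you need to monitor how full the conntrack table is. I once hit a production incident caused by a full conntrack table: new connections could not be established. So this plugin is well worth having.
Which metrics does the conntrack plugin collect
An ordinary machine does not necessarily have conntrack enabled by default, so you may not see this plugin's metrics. Here I started the firewall with systemctl start firewalld, and after that the conntrack metrics appeared.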
[root@aliyun-2c2g40g3m tarball]# curl -s localhost:9100/metrics | grep "node_nf_conntrack_"
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 44
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 65536
# HELP node_nf_conntrack_stat_drop Number of packets dropped due to conntrack failure.
# TYPE node_nf_conntrack_stat_drop gauge
node_nf_conntrack_stat_drop 0
# HELP node_nf_conntrack_stat_early_drop Number of dropped conntrack entries to make room for new ones, if maximum table size was reached.
# TYPE node_nf_conntrack_stat_early_drop gauge
node_nf_conntrack_stat_early_drop 0
# HELP node_nf_conntrack_stat_found Number of searched entries which were successful.
# TYPE node_nf_conntrack_stat_found gauge
node_nf_conntrack_stat_found 0
# HELP node_nf_conntrack_stat_ignore Number of packets seen which are already connected to a conntrack entry.
# TYPE node_nf_conntrack_stat_ignore gauge
node_nf_conntrack_stat_ignore 0
# HELP node_nf_conntrack_stat_insert Number of entries inserted into the list.
# TYPE node_nf_conntrack_stat_insert gauge
node_nf_conntrack_stat_insert 0
# HELP node_nf_conntrack_stat_insert_failed Number of entries for which list insertion was attempted but failed.
# TYPE node_nf_conntrack_stat_insert_failed gauge
node_nf_conntrack_stat_insert_failed 0
# HELP node_nf_conntrack_stat_invalid Number of packets seen which can not be tracked.
# TYPE node_nf_conntrack_stat_invalid gauge
node_nf_conntrack_stat_invalid 2751
# HELP node_nf_conntrack_stat_search_restart Number of conntrack table lookups which had to be restarted due to hashtable resizes.
# TYPE node_nf_conntrack_stat_search_restart gauge
node_nf_conntrack_stat_search_restart 6261
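What is conntrack
To understand these metrics, you first need to know what conntrack is. Conntrack is a Linux kernel module that tracks connection state. For example, if your machine is a NAT gateway, conntrack records the mapping from internal IP and port to external IP and port. When a reply packet comes back from outside, the kernel looks up the conntrack table to find the matching internal IP and port and forwards the packet to the internal machine. You can view the contents of the table with the conntrack -L command.
The conntrack table has a finite size, so once it is full, new connections cannot be established. The kernel then reports an nf_conntrack: table full error, and you have a production outage.
Common alerting rules
We typically configure the following alerting rule: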
100 * node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 85
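This fires once conntrack entry usage exceeds 85%, which leaves time to respond. The usual responses are to enlarge the conntrack table (nf_conntrack_max), to tune the conntrack timeouts, or to exempt certain connections from conntrack altogether.
Collection logic of the conntrack plugin
The logic lives in conntrack_linux.go; the plugin exists only on Linux and not on other systems. Like the other node-exporter collection plugins, it provides the init, NewConntrackCollector, and Update functions. The collection logic is in Update: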
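For a quick spot check on a single host, the same ratio can be computed directly from the two /proc files that hold the current and maximum entry counts. This is only a sketch under stated assumptions: readUint and usagePercent are hypothetical helper names (not node-exporter functions), and the 85 threshold simply mirrors the rule above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint reads a file that contains a single decimal integer.
// Hypothetical helper for this sketch, not part of node-exporter.
func readUint(path string) (uint64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

// usagePercent mirrors the alert expression:
// 100 * entries / entries_limit. A zero limit yields 0 to avoid dividing by zero.
func usagePercent(entries, limit uint64) float64 {
	if limit == 0 {
		return 0
	}
	return 100 * float64(entries) / float64(limit)
}

func main() {
	entries, err := readUint("/proc/sys/net/netfilter/nf_conntrack_count")
	if err != nil {
		// The files are absent when conntrack is not loaded on this host.
		fmt.Println("conntrack not available:", err)
		return
	}
	limit, err := readUint("/proc/sys/net/netfilter/nf_conntrack_max")
	if err != nil {
		fmt.Println("conntrack not available:", err)
		return
	}
	pct := usagePercent(entries, limit)
	fmt.Printf("conntrack usage: %.2f%% (%d/%d)\n", pct, entries, limit)
	if pct > 85 {
		fmt.Println("WARNING: conntrack usage is above the alert threshold")
	}
}
```

On a host where conntrack is not loaded, the program just reports that the files are missing, which matches the case where the exporter shows no conntrack metrics.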
func (c *conntrackCollector) Update(ch chan<- prometheus.Metric) error {
	value, err := readUintFromFile(procFilePath("sys/net/netfilter/nf_conntrack_count"))
	if err != nil {
		return c.handleErr(err)
	}
	ch <- prometheus.MustNewConstMetric(
		c.current, prometheus.GaugeValue, float64(value))
	value, err = readUintFromFile(procFilePath("sys/net/netfilter/nf_conntrack_max"))
	if err != nil {
		return c.handleErr(err)
	}
	ch <- prometheus.MustNewConstMetric(
		c.limit, prometheus.GaugeValue, float64(value))
	conntrackStats, err := getConntrackStatistics()
	if err != nil {
		return c.handleErr(err)
	}
	ch <- prometheus.MustNewConstMetric(
		c.found, prometheus.GaugeValue, float64(conntrackStats.found))
	ch <- prometheus.MustNewConstMetric(
		c.invalid, prometheus.GaugeValue, float64(conntrackStats.invalid))
	ch <- prometheus.MustNewConstMetric(
		c.ignore, prometheus.GaugeValue, float64(conntrackStats.ignore))
	ch <- prometheus.MustNewConstMetric(
		c.insert, prometheus.GaugeValue, float64(conntrackStats.insert))
	ch <- prometheus.MustNewConstMetric(
		c.insertFailed, prometheus.GaugeValue, float64(conntrackStats.insertFailed))
	ch <- prometheus.MustNewConstMetric(
		c.drop, prometheus.GaugeValue, float64(conntrackStats.drop))
	ch <- prometheus.MustNewConstMetric(
		c.earlyDrop, prometheus.GaugeValue, float64(conntrackStats.earlyDrop))
	ch <- prometheus.MustNewConstMetric(
		c.searchRestart, prometheus.GaugeValue, float64(conntrackStats.searchRestart))
	return nil
}
It first reads /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to get the current and maximum number of entries in the conntrack table. It then calls getConntrackStatistics to fetch the conntrack statistics: found, invalid, ignore, insert, insertFailed, drop, earlyDrop, and searchRestart. Each value is sent to the ch channel as a metric.
getConntrackStatistics reads and parses the file /proc/net/stat/nf_conntrack. Here is what that file looks like:
[root@aliyun-2c2g40g3m tarball]# cat /proc/net/stat/nf_conntrack
entries clashres found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart
00000022 00000000 00000000 00000000 00000b0c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000012 00000000 00000000 00000000 00001591
00000022 00000000 00000000 00000000 0000008b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000034f
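For the meaning of each field, let's consult the authoritative documentation: https://man7.org/linux/man-pages/man8/rtstat.8.html. Note that the man page lists a searched column, whereas the header on my system shows clashres in that position.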
/proc/net/stat/ip_conntrack, /proc/net/stat/nf_conntrack
Conntrack related counters. ip_conntrack is for backwards
compatibility with older userspace only and shows the same
data as nf_conntrack.
entries Number of entries in conntrack table.
searched Number of conntrack table lookups performed.
found Number of searched entries which were successful.
new Number of conntrack entries added which were not
expected before.
invalid Number of packets seen which can not be tracked.
ignore Number of packets seen which are already connected
to a conntrack entry.
delete Number of conntrack entries which were removed.
delete_list Number of conntrack entries which were put to
dying list.
insert Number of entries inserted into the list.
insert_failed Number of entries for which list insertion
was attempted but failed (happens if the same entry is
already present).
drop Number of packets dropped due to conntrack failure.
Either new conntrack entry allocation failed, or protocol
helper dropped the packet.
early_drop Number of dropped conntrack entries to make
room for new ones, if maximum table size was reached.
icmp_error Number of packets which could not be tracked
due to error situation. This is a subset of invalid.
expect_new Number of conntrack entries added after an
expectation for them was already present.
expect_create Number of expectations added.
expect_delete Number of expectations deleted.
search_restart Number of conntrack table lookups which had
to be restarted due to hashtable resizes.
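The values in this file are hexadecimal, so node-exporter converts them to decimal. Also, on my system the file contains several rows of values (the kernel prints one row per CPU), and the node-exporter code sums them up.
Summary
This article covered node-exporter's conntrack plugin, which monitors how full the conntrack table is and matters most on firewalls, NAT gateways, and similar hosts. The conntrack table has a finite size, and once it fills up new connections cannot be established, so you need to monitor its usage and respond in time. If you have questions, feel free to leave a comment and discuss.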