Prometheus gets an advanced function, info: relief after years of painful joins

巴辉特 2025-12-25 19:06:10

Background: the pain point

Heavy Prometheus users tend to run into one typical pain point: label enrichment.

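Suppose the metric used to compute HTTP QPS lacks the k8s_cluster_name label, and that label lives in target_info. You then need join logic to combine the two metrics and carry the k8s_cluster_name label from target_info into the final result, much like a join between two database tables.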

For example:

sum by (http_status_code, k8s_cluster_name) (
    rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name)
    target_info
)

This approach has two problems:

1. Complexity

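Most users simply cannot read it; after all, few people have studied PromQL properly. I have written a PromQL tutorial series: https://flashcat.cloud/tags/promql/ If you are just getting started, I strongly recommend reading it.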

2. Label changes

This is the well-known churn problem. Extending the example above, target_info may carry other labels besides k8s_cluster_name, such as k8s_pod_labels_app_kubernetes_io_version.

If the version changes, the value of the k8s_pod_labels_app_kubernetes_io_version label changes, and in the Prometheus ecosystem a changed label value means a new time series.

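Because of the lookback delta (5 minutes by default), the old and new series can both exist for up to 5 minutes, which causes a many-to-many matching problem in the join. That in turn breaks your alerting rules and dashboard panels.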
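The failure mode can be reproduced outside Prometheus. Below is a minimal Python sketch, assuming simplified in-memory series (a label dict plus a value); the function name join_group_left and the sample data are hypothetical, and this is not how Prometheus implements vector matching internally. It only illustrates why two overlapping target_info series in the same (job, instance) group make the join fail.

```python
from collections import defaultdict

def join_group_left(left, right, on, include):
    """Mimic `left * on(...) group_left(...) right` on in-memory samples.

    Each match group may contain at most one right-hand series; otherwise
    the join is ambiguous and is rejected.
    """
    groups = defaultdict(list)
    for labels, value in right:
        key = tuple(labels[name] for name in on)
        groups[key].append((labels, value))

    result = []
    for labels, value in left:
        key = tuple(labels[name] for name in on)
        matches = groups.get(key, [])
        if len(matches) > 1:
            raise ValueError(f"many-to-many matching not allowed for group {key}")
        for r_labels, r_value in matches:
            enriched = dict(labels)
            for name in include:
                enriched[name] = r_labels[name]  # copy only the requested labels
            result.append((enriched, value * r_value))
    return result

# Hypothetical sample data.
http_rate = [
    ({"job": "api", "instance": "10.0.0.1:8080", "http_status_code": "200"}, 12.5),
]
info_v1 = ({"job": "api", "instance": "10.0.0.1:8080",
            "k8s_cluster_name": "us-east-0",
            "k8s_pod_labels_app_kubernetes_io_version": "1.0"}, 1.0)
info_v2 = ({"job": "api", "instance": "10.0.0.1:8080",
            "k8s_cluster_name": "us-east-0",
            "k8s_pod_labels_app_kubernetes_io_version": "1.1"}, 1.0)

info_before = [info_v1]                  # steady state: one info series per target
info_during_rollout = [info_v1, info_v2]  # both visible inside the lookback window
```

With info_before the join returns one enriched series; with info_during_rollout it raises, which is the situation that breaks alerts and dashboards.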

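This is an obvious design flaw in Prometheus, but at the time people felt the scenario was rare and there was no better option, so everyone just lived with it. Until...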

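The community started integrating OpenTelemetry with Prometheus, using Prometheus as a storage backend for OpenTelemetry. OpenTelemetry attributes change fairly often, so the pain got even worse, and that is how the info function came about.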

The info function

Start with a concrete example. Rewritten with info, the PromQL above looks like this:

sum by (http_status_code, k8s_cluster_name) (
  info(rate(http_server_request_duration_seconds_count[2m]))
)

Much simpler, right?

info syntax

info(v instant-vector, [data-label-selector instant-vector])

The info function takes two arguments:

v: an instant vector; info enriches this vector with labels
data-label-selector: optional; filters the metadata (target_info is one such metadata metric)

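Note that xx_info metrics usually carry many labels, and info attaches all of them to v. If you do not want that many labels, aggregate in an outer layer, as with sum by (http_status_code, k8s_cluster_name) in the example above, which leaves only the http_status_code and k8s_cluster_name labels.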

Is data-label-selector hard to picture? Here are two examples:

## example1:
info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_cluster_name=~".+"}
)

## example2:
info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_cluster_name="us-east-0"}
)

Selecting a different info metric

The examples above call info without saying whether the labels come from target_info, build_info, or some other metadata metric. In fact, info by default is hardwired to take labels from target_info.

To take labels from another xx_info metric, use the second argument as a filter: put the metric name inside the curly braces and match on __name__:

# Use build_info instead of target_info
info(up, {__name__="build_info"})

# Use multiple info metrics (combines labels from both)
info(up, {__name__=~"(target|build)_info"})

# Select build_info and only include the version label
info(up, {__name__="build_info", version=~".+"})
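One detail worth keeping in mind when writing the __name__ matcher: PromQL regex matchers are fully anchored, so (target|build)_info selects exactly those two metric names and nothing that merely contains them. A minimal Python sketch of that selection, assuming a hypothetical helper select_info_metrics and made-up metric names (Python's re module stands in for Prometheus's RE2 engine here):

```python
import re

def select_info_metrics(metric_names, name_regex="target_info"):
    """Return the names a `__name__=~"<name_regex>"` matcher would select.

    fullmatch reproduces the full anchoring of PromQL regex matchers.
    The default mirrors info's default of using target_info.
    """
    pattern = re.compile(name_regex)
    return [name for name in metric_names if pattern.fullmatch(name)]

# Hypothetical metric names, for illustration only.
available = ["target_info", "build_info", "my_target_info_total", "up"]

default_choice = select_info_metrics(available)
both_infos = select_info_metrics(available, "(target|build)_info")
```

Here default_choice contains only target_info, and both_infos contains target_info and build_info; my_target_info_total is not selected because the match must cover the whole name.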

Most important of all

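Most importantly, info solves the churn problem. When several info series match at the same time, info internally keeps only the one with the latest timestamp, which avoids the many-to-many error.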
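The "latest timestamp wins" rule described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Prometheus's actual code: series are plain label dicts, the function name info_enrich and the sample data are hypothetical, and the identifying labels are fixed to job and instance.

```python
def info_enrich(series, info_series, identifying=("job", "instance")):
    """Attach data labels from the newest matching info series to each series.

    series:      list of (labels, value)
    info_series: list of (labels, timestamp)
    """
    latest = {}
    for labels, timestamp in info_series:
        key = tuple(labels[name] for name in identifying)
        # Keep only the info series with the newest sample per target.
        if key not in latest or timestamp > latest[key][1]:
            latest[key] = (labels, timestamp)

    enriched = []
    for labels, value in series:
        key = tuple(labels[name] for name in identifying)
        out = dict(labels)
        if key in latest:
            info_labels, _ = latest[key]
            for name, label_value in info_labels.items():
                if name not in identifying:
                    out.setdefault(name, label_value)  # never overwrite existing labels
        enriched.append((out, value))
    return enriched

# Hypothetical data: two info series overlap during a version rollout.
http_rate = [
    ({"job": "api", "instance": "10.0.0.1:8080", "http_status_code": "200"}, 12.5),
]
overlapping_info = [
    ({"job": "api", "instance": "10.0.0.1:8080", "k8s_cluster_name": "us-east-0",
      "k8s_pod_labels_app_kubernetes_io_version": "1.0"}, 1000),
    ({"job": "api", "instance": "10.0.0.1:8080", "k8s_cluster_name": "us-east-0",
      "k8s_pod_labels_app_kubernetes_io_version": "1.1"}, 1060),
]

result = info_enrich(http_rate, overlapping_info)
```

Instead of failing on the duplicate, the sketch returns a single enriched series carrying the labels of the newer info series (version 1.1), and the sample value is left unchanged.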

Enabling the info feature

Pass a flag when starting the Prometheus process:

prometheus --enable-feature=promql-experimental-functions

Limitations

Recall the original PromQL:

sum by (http_status_code, k8s_cluster_name) (
    rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name)
    target_info
)

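Note the on (job, instance) part: job and instance act as the identifying labels that link the two metrics. Yet nothing above says where info lets you specify identifying labels. That is right: info is still experimental, and this is hardcoded. It always uses job and instance as the identifying labels. Want to use other labels? Sorry, that is not supported yet. After all, the feature is still experimental.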