An Alternative to ELK: OpenTelemetry + OpenSearch

Translated article, 2025-06-04 11:51:28

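As applications grow, log data arrives in a flood, and managing and scaling the logging infrastructure can quickly become a major challenge. For years the ELK stack (Elasticsearch, Logstash, Kibana) has been the go-to solution. ELK has shown its limits, however, Logstash in particular: compared with modern alternatives it is resource-hungry and inflexible. On top of that, Elasticsearch effectively became a commercial product after its license change, and many companies decline to use it because of the license terms.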

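This article presents a powerful, truly open-source, and resource-efficient alternative to ELK: OpenTelemetry and OpenSearch. You will see what this modern approach looks like, how to set it up, and what it brings to your observability strategy.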

Why the new stack (OpenTelemetry & OpenSearch) is the better choice

OpenTelemetry

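OpenTelemetry is more than a logging tool. It is a Cloud Native Computing Foundation (CNCF) project that standardizes how telemetry data (logs, metrics, and traces) is generated and collected. This article focuses on its logging capabilities, but adopting OTel for logs paves the way for a unified observability strategy across all three pillars.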

The OpenTelemetry Collector is the key component, and it is highly flexible. It provides:

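Receivers: define how data gets into the Collector. Examples include OTLP (OpenTelemetry Protocol), Fluent Forward, and filelog (for tailing files).
Processors: let you transform, filter, batch, or route data. You can enrich logs with attributes, drop sensitive information, or ensure efficient batching before export.
Exporters: define how data is sent from the Collector to one or more backends, such as OpenSearch, Kafka, or any OTLP-compatible system. The OpenTelemetry Collector Contrib GitHub repository has the full list of available components and integrations.
Agent mode: the OTel Collector can also run as an agent. You can deploy it to collect logs from sources that cannot be instrumented directly with OTel libraries, for example by tailing Nginx access logs or Docker container log files with the filelog receiver.
Performance and efficiency: OpenTelemetry was designed with performance and efficiency in mind. It is generally lighter and faster than Logstash, so you can process more data with fewer resources.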
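
To make the receiver, processor, and exporter roles concrete, here is a minimal Python sketch of how a pipeline wires them together. It is not the Collector's own code: the `config` dictionary is a simplified, hypothetical stand-in for a Collector configuration, and `undefined_components` is a helper invented for this illustration. It only checks that every component a pipeline references is actually declared.

```python
# Simplified, hypothetical model of a Collector configuration (illustration only).
config = {
    "receivers": {"otlp": {}, "filelog": {}},
    "processors": {"batch": {}},
    "exporters": {"opensearch/logs": {}},
    "service": {
        "pipelines": {
            "logs": {
                "receivers": ["otlp", "filelog"],
                "processors": ["batch"],
                "exporters": ["opensearch/logs"],
            }
        }
    },
}


def undefined_components(cfg):
    """Return (pipeline, kind, name) for each pipeline reference that is not declared."""
    missing = []
    for pipeline, stages in cfg["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for name in stages.get(kind, []):
                if name not in cfg.get(kind, {}):
                    missing.append((pipeline, kind, name))
    return missing


print(undefined_components(config))  # prints []
```

Data flows through a pipeline in that order: receivers accept it, processors transform it, exporters ship it. A component that is declared but not listed in any pipeline is simply unused.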

OpenSearch

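Elasticsearch, the core of the ELK stack, went through a major licensing change when Elastic moved its core components to a source-available license. The shift raised concerns in the open-source community about vendor lock-in and the project's long-term openness, and the change was poorly received.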

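OpenSearch emerged as a direct response: a community-driven fork under the Apache 2.0 license, intended to guarantee a truly open-source path forward. It has an active community and is backed by Amazon Web Services (AWS), which gives it a solid foundation for future development.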

Beyond being open source, OpenSearch offers a set of features that matter for robust log management:

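Integrated security: comprehensive security features out of the box, including authentication, authorization, encryption, and fine-grained access control.
Alerting and anomaly detection: built-in capabilities to monitor log data and trigger alerts on defined conditions or detected anomalies, helping you catch problems proactively.
Application-level access control: OpenSearch security can restrict log access per team or per user, which matters in large organizations where different teams need access to specific subsets of logs.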

Implementation guide: building a modern logging pipeline

Here is how to start building your logging pipeline with OpenTelemetry and OpenSearch.

Prerequisites

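A Kubernetes cluster: you can run these components in other environments, but Kubernetes is strongly recommended, especially for observability systems. These systems tend to grow faster than expected, and Kubernetes provides the scalability and manageability you will need.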

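Translator's note: an observability system is a P1-level system. It must stay up even when the business systems go down; otherwise you cannot troubleshoot the outage. Recommendation: either run the observability system on a Kubernetes cluster separate from the business workloads, or deploy it directly on hosts to reduce dependencies.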

Setting up OpenSearch

The recommended way to deploy and manage OpenSearch on Kubernetes is the OpenSearch Operator.

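Why an Operator? It greatly simplifies deployment, ongoing management (version upgrades, configuration changes, and scaling), and day-to-day operations. In our experience managing several large OpenSearch clusters that ingest billions of spans and log records per day, the Operator has substantially reduced operational cost.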

Follow the instructions at OpenSearch-k8s-operator to install the OpenSearch Operator with Helm.

Once the Operator is running, you can deploy an OpenSearch cluster. Below is an example of a 3-node OpenSearch cluster. For more details, see the OpenSearch Operator user guide.

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: observability-opensearch
  namespace: observability
spec:
  general:
    serviceName: observability-opensearch
    version: 2.17.0
    additionalConfig:
      plugins.query.datasources.encryption.masterkey: "cbdda1e0ab9e45c44f9b56a3" # Change this

  security:
    config:
      adminSecret:
        # This secret contains the admin certificate using common name "admin". Use cert-manager to generate it.
        name: observability-opensearch-admin-cert 
      adminCredentialsSecret:
        # This secret contains the admin credentials. The keys are "username" and "password".
        name: observability-opensearch-admin-credentials
      securityConfigSecret:
        # This secret contains the security config files.
        # Each key is a filename and each value is that file's content, e.g. keys "config.yml" and "internal_users.yml".
        name: observability-opensearch-security-config-files
    tls:
      transport:
        generate: false
        perNode: false
        secret:
          # Generate this the same way as the admin certificate, but with common name "opensearch".
          name: observability-opensearch-node-cert
        caSecret:
          # This is the secret that contains the CA certificate used to create the admin and node certificates.
          name: observability-ca-secret
        nodesDn: ["CN=opensearch"]
        adminDn: ["CN=admin"]
      http:
        generate: false
        secret:
          name: observability-opensearch-node-cert
        caSecret:
          name: observability-ca-secret
  dashboards:
    enable: true
    version: 2.17.0
    replicas: 1
    resources:
      requests:
         memory: "512Mi"
         cpu: "200m"
      limits:
         memory: "512Mi"
         cpu: "200m"
    opensearchCredentialsSecret:
      name: observability-opensearch-admin-credentials

  nodePools:
  - component: master 
    replicas: 3
    diskSize: "30Gi"
    nodeSelector:
    resources:
        requests:
          memory: "1.5Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "1500m"
    roles:
      - "cluster_manager"
      - "data"
    persistence:
      pvc:
        storageClass: "storage-class-name" # Change this to your storage class
        accessModes:
          - "ReadWriteOnce"
    env: 
      - name: DISABLE_INSTALL_DEMO_CONFIG
        value: "false"

Below are the basic configuration files you need to put into the observability-opensearch-security-config-files secret:

internal_users.yml

_meta:
  type: "internalusers"
  config_version: 2
admin:
  #  Change this to your hashed admin password: Use https://bcrypt-generator.com/
  hash: "$2y$12$eW5Z1z3a8b7c9d8e7f8g9u0h1i2j3k4l5m6n7o8p9q0r1s2t3u4v5w6x7y8z" #
  reserved: true
  description: "Cluster super user"
  backend_roles:
  - "admin"
---

config.yml

_meta:
  type: "config"
  config_version: 2

config:
  dynamic:
    http:
      anonymous_auth_enabled: false
    authc:
      basic_internal_auth_domain:
        description: "Authenticate via HTTP Basic against internal users database"
        http_enabled: true
        transport_enabled: true
        order: 4
        http_authenticator:
          type: basic
          challenge: true
        authentication_backend:
          type: intern
      clientcert_auth_domain:
        description: "Authenticate via SSL client certificates"
        http_enabled: false
        transport_enabled: false
        order: 2
        http_authenticator:
          type: clientcert
          config:
            username_attribute: cn #optional, if omitted DN becomes username
          challenge: false
        authentication_backend:
          type: noop
    authz: {}

Configuring the log stream

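Data streams are designed for continuously generated, append-only time-series data such as logs. We will set up a data stream named logs-stream that rolls over to a new backing index every day, with indices expiring after 30 days. This is a common pattern for log data and lets you manage storage efficiently.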

📚 Note: you can run the following commands in Dev Tools in OpenSearch Dashboards.

Creating the data stream

PUT _index_template/logs-stream-template
{
  "index_patterns": ["logs-stream"],
  "data_stream": {},
  "priority": 100
}

PUT _data_stream/logs-stream

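Next, create an Index State Management (ISM) policy that rolls the data stream over daily and deletes indices after 30 days, then attach it to logs-stream. As before, you can run these commands in Dev Tools in OpenSearch Dashboards.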

PUT _plugins/_ism/policies/logs-lifecycle-policy
{
  "policy": {
    "description": "Rollover indices daily and delete after 30 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ]
  }
}

POST _plugins/_ism/add/logs-stream
{
  "policy_id": "logs-lifecycle-policy"
}
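
To show what this policy means for an individual backing index, here is a simplified Python model of its two conditions. This is a sketch, not how ISM is implemented: `lifecycle_step` is a hypothetical function, and real ISM evaluates policies periodically in the background, so actions happen some time after a threshold is crossed rather than at the exact moment.

```python
from datetime import timedelta

ROLLOVER_AGE = timedelta(days=1)   # "rollover": {"min_index_age": "1d"}
DELETE_AGE = timedelta(days=30)    # transition to "delete" at "min_index_age": "30d"


def lifecycle_step(index_age, is_write_index):
    """Return the action the policy would take for a backing index of the given age."""
    if index_age >= DELETE_AGE:
        return "delete"
    if is_write_index and index_age >= ROLLOVER_AGE:
        return "rollover"
    return "keep"


print(lifecycle_step(timedelta(hours=6), True))    # prints keep
print(lifecycle_step(timedelta(days=1), True))     # prints rollover
print(lifecycle_step(timedelta(days=12), False))   # prints keep
print(lifecycle_step(timedelta(days=30), False))   # prints delete
```

With a daily rollover and 30-day retention, roughly 30 backing indices exist at steady state, which gives a rough basis for estimating storage.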

Configuring the OpenTelemetry Collector

Likewise, for the OpenTelemetry Collector we recommend the OTel Operator. Follow the instructions at OpenTelemetry Operator to set it up.

Once the Operator is running, you can deploy an OpenTelemetry Collector. The Collector is the gateway for your observability data: your applications send their logs to it, and it processes them and exports them to OpenSearch.

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: observability-otel-workers
  namespace: observability
spec:
  mode: deployment
  image: otel/opentelemetry-collector-contrib:0.118.0
  resources:
    requests:
      memory: "100Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "1000m"
  autoscaler:
    maxReplicas: 2
    minReplicas: 1
    targetCPUUtilization: 90
    targetMemoryUtilization: 90
  volumeMounts:
    - name: ca-cert
      mountPath: /tls/ca.crt
      subPath: ca.crt
  volumes:
    - name: ca-cert
      secret:
        #  This is the secret that contains the CA certificate used to create the admin and node certificates.
        secretName: observability-ca-secret 
        items:
          - key: ca.crt
            path: ca.crt
  config:
    extensions:
      basicauth/os:
        client_auth:
          username: admin
          password: opensearch-admin-password # Change this to your OpenSearch admin password

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: ":4318"
            cors:
              allowed_origins:
                - "http://*"
                - "https://*"

    processors:
      batch:
        send_batch_max_size: 3000
        send_batch_size: 1000
        timeout: 5s

    exporters:
      opensearch/logs:
        # This is the OpenSearch data stream we created earlier.
        logs_index: "logs-stream"
        http:
          endpoint: "https://observability-opensearch.observability.svc.cluster.local:9200"
          auth:
            authenticator: basicauth/os
          tls:
            insecure: false
            ca_file: /tls/ca.crt

    service:
      extensions: [basicauth/os]
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [opensearch/logs]
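
The batch processor settings in this configuration interact in a way that is easy to misread: send_batch_size is the threshold that triggers a send, send_batch_max_size caps the size of any single batch, and timeout flushes whatever is pending even if the threshold was never reached. The Python sketch below models only the size-based part of that behaviour. It is a simplified illustration, not the Collector's implementation, and `flush_on_size` is a hypothetical name.

```python
def flush_on_size(pending, send_batch_size=1000, send_batch_max_size=3000):
    """Model the size-triggered flush.

    `pending` is the number of buffered log records. Returns a tuple of
    (sizes of the batches sent, records still buffered). Records left in the
    buffer would be sent later by the timeout (5s in the config above).
    """
    batches = []
    while pending >= send_batch_size:
        size = min(pending, send_batch_max_size)  # never exceed the hard cap
        batches.append(size)
        pending -= size
    return batches, pending


print(flush_on_size(999))    # prints ([], 999)
print(flush_on_size(1000))   # prints ([1000], 0)
print(flush_on_size(7500))   # prints ([3000, 3000, 1500], 0)
```

Larger batches mean fewer bulk requests to OpenSearch at the cost of more memory in the Collector and slightly higher latency before a log line becomes searchable.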

Instrumentation: sending logs to OpenTelemetry

Exposing the Collector service

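If your application runs in the same Kubernetes cluster, it can send logs directly to the Collector Service at observability-otel-workers-collector.observability.svc.cluster.local:4317 (the Operator appends -collector to the resource name when it creates the Service). If your application runs outside the cluster, expose the Collector Service with a LoadBalancer or NodePort Service. If your application runs in the same cloud provider as the Kubernetes cluster, we recommend an internal load balancer.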

Here is an example of how to create such a load balancer Service:

apiVersion: v1
kind: Service
metadata:
  name: observability-otel-collector-alb
  namespace: observability
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"  # Annotation for internal load balancer
spec:
  type: LoadBalancer
  ports:
    - name: otlp-grpc
      port: 4317
      nodePort: 32007
      protocol: TCP
    - name: otlp-http
      port: 4318
      nodePort: 32008
      protocol: TCP
  selector:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: observability.observability-otel-workers
    app.kubernetes.io/part-of: opentelemetry

You can get the load balancer endpoint with the kubectl get svc command.

Instrumenting your application

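To send logs to the OpenTelemetry Collector, you instrument your application with the OpenTelemetry libraries. Several languages support zero-code instrumentation, which means logs can be collected automatically with minimal changes. See the OpenTelemetry instrumentation documentation for details.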
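
In practice the OpenTelemetry SDK for your language builds and sends log records for you. To show what actually travels over the wire, the sketch below constructs a single log record in the OTLP/HTTP JSON encoding using only the Python standard library. The function names `build_otlp_log_payload` and `send_logs`, the service name, and the scope name are all hypothetical; the payload shape and the /v1/logs path on port 4318 follow the OTLP specification.

```python
import json
import time
import urllib.request

# Subset of the OTLP severity numbers defined by the OpenTelemetry log data model.
SEVERITY_NUMBERS = {"DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17}


def build_otlp_log_payload(service_name, severity, message, time_unix_nano=None):
    """Build an OTLP/HTTP JSON body containing one log record."""
    if time_unix_nano is None:
        time_unix_nano = time.time_ns()
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "logRecords": [{
                    "timeUnixNano": str(time_unix_nano),  # 64-bit ints are strings in OTLP JSON
                    "severityNumber": SEVERITY_NUMBERS[severity],
                    "severityText": severity,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }


def send_logs(endpoint, payload):
    """POST the payload to the Collector's OTLP/HTTP receiver and return the HTTP status."""
    request = urllib.request.Request(
        f"{endpoint}/v1/logs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_otlp_log_payload("checkout", "ERROR", "payment failed")
print(json.dumps(payload, indent=2))
```

To actually deliver the record, call send_logs with your Collector's HTTP endpoint, for example send_logs("http://<your-otel-collector-endpoint>:4318", payload). For real applications, prefer the SDK or zero-code instrumentation, which add batching, retries, and resource detection.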

Collecting logs from files

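Instrumenting your application with the OTel libraries is the recommended approach. If you cannot modify the application code, you can still forward its logs to the OTel Collector. To do so, run an OTel Collector in agent mode on the same host as the application. The agent tails the log files and sends the entries to the central OTel Collector, which writes them to OpenSearch.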

Here is an example of how to configure the OpenTelemetry Collector to tail log files:

config.yaml

receivers:
  filelog:
    include: 
     - /var/log/myapp/*.log
    operators:
      - type: regex_parser
        regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$'
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%d %H:%M:%S'
        severity:
          parse_from: attributes.sev
exporters:
  otlp:
    endpoint: "<your-otel-collector-endpoint>:4317"
    tls:
      insecure: true
    sending_queue:
      num_consumers: 4
      queue_size: 100
    retry_on_failure:
      enabled: true
processors:
  batch:
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [otlp]
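
Before deploying the agent, it is worth checking the regex_parser pattern against a real line from your log file. The Python sketch below applies the same pattern and timestamp layout as the configuration above. It is a local sanity check only: the Collector uses Go's regular expression engine rather than Python's, although this particular pattern behaves the same in both, and the sample log line and the `parse_line` helper are made up for the illustration.

```python
import re
from datetime import datetime

# Same pattern and layout as in the filelog receiver configuration.
LINE_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$"
)
TIME_LAYOUT = "%Y-%m-%d %H:%M:%S"


def parse_line(line):
    """Return the parsed fields of one log line, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], TIME_LAYOUT)
    return fields


sample = "2025-06-04 11:51:28 ERROR database connection lost"  # hypothetical log line
print(parse_line(sample))
print(parse_line("not a structured line"))  # prints None
```

Lines that do not match the pattern, such as stack-trace continuation lines, will not be parsed into time, severity, and message, so check how your application formats multi-line output.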