
Log Collection & Storage

For ES storage deployment and log collection reporting, please refer to this document.

Elasticsearch Data Source Configuration

[screenshot: es001]

Data Source Name - Name: The identifier name of the Elasticsearch data source;

HTTP - URL: Elasticsearch service address; Example: http://server-ip:9200

HTTP - Timeout (unit: ms): Timeout for requests to the Elasticsearch service;

Authorization - Username & Password: Elasticsearch authorization username and password;

[screenshot: es002]

Elasticsearch users can be created and listed with your_path/bin/elasticsearch-users list|useradd. You can also view and manage users via the GET /_security/user API or in Kibana's management interface.
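
However the user was created, clients authenticate with HTTP Basic auth, which is what the Username & Password fields above configure. A minimal Python sketch (the address and credentials below are placeholders, not values from this document):

```python
import base64
import urllib.request

def basic_auth_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build a request carrying the HTTP Basic Authorization header that
    corresponds to the data source's Username & Password fields."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req

# Hypothetical address and credentials; nothing is sent until urlopen().
req = basic_auth_request("http://server-ip:9200/_security/user", "es_reader", "s3cret")
print(req.get_header("Authorization"))
```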

Custom HTTP Headers - Header & Value: Configure the validation parameters for requesting the Elasticsearch interface;

ES Details - Version & Maximum Concurrent Shard Requests

If the cluster has a very large number of shards, leaving the maximum concurrent shard requests unconfigured can cause high memory usage, increased CPU load, higher latency, and so on; in that case, setting a limit is necessary. You can list the cluster's shards (and thus check how many there are) with curl -X GET http://server_ip:port/_cat/shards?v.
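
For reference, Elasticsearch also accepts `max_concurrent_shard_requests` as a per-request query-string parameter on the search API (its default is 5). A sketch of building such a request URL, assuming a placeholder address and index pattern:

```python
from urllib.parse import urlencode

def search_url(base: str, index: str, max_concurrent_shard_requests: int = 5) -> str:
    """Append the per-request shard concurrency cap to a _search URL.
    5 is Elasticsearch's default for this parameter."""
    params = urlencode({"max_concurrent_shard_requests": max_concurrent_shard_requests})
    return f"{base.rstrip('/')}/{index}/_search?{params}"

# Hypothetical address and index pattern.
print(search_url("http://server_ip:9200", "app-logs-*", max_concurrent_shard_requests=3))
# -> http://server_ip:9200/app-logs-*/_search?max_concurrent_shard_requests=3
```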

Minimum Time Interval (s): The lower limit for automatic grouping by time interval. It is recommended to set it to the write frequency; for example, if data is written once per minute, set it to 1m.

Write Configuration - Allow Writing (Commercial Feature)

This feature is similar to the Remote Write in the Prometheus data source. Once enabled, log extraction data can be written back to the ES node;

[screenshot: es003]

After enabling write permission, data can be written normally.

[screenshot: es004]

Bind to Alarm Engine Cluster: Configure the data source to bind to the specified alarm engine;

How to Query Data

  1. Log Query - Instant Query —> 2. Select the corresponding ES data source —> 3. Select the query syntax —> 4. Enter the query statement

[screenshot: es004]

  • KQL: A simple text-based query language for filtering data. KQL only filters data and does not aggregate, transform, or sort data. KQL Official Syntax Introduction
  • Lucene: The main reason for using Lucene syntax is to use advanced Lucene features such as regular expressions or fuzzy term matching. However, Lucene syntax cannot search nested objects or script fields. Lucene Official Syntax Introduction
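
To illustrate the difference, here are a few hypothetical query strings (field names are made up; the exact syntax is defined by the references above):

```python
# KQL: filtering only -- no aggregation, transformation, or sorting.
kql_query = 'http.response.status_code >= 400 and host.hostname : "web-01"'

# Lucene: adds features KQL lacks, e.g. regular expressions and fuzzy matching.
lucene_regex = "request_uri:/api\\/v[0-9]+\\/.*/"  # regex between slashes
lucene_fuzzy = "message:timeout~1"                 # fuzzy match, edit distance 1

print(kql_query)
print(lucene_regex)
print(lucene_fuzzy)
```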

How to Configure ES Log Alarms

[screenshot: es005]

The log type and the general metric alarm rules are very similar, with the only difference being the setting of the alarm conditions. Metric alarm rules use PromQL as the query condition, while log type alarm rules use boolean expressions as the query condition. These alarm conditions (such as A, B, etc.) need to be obtained through query statistics.

Configuring query statistics is similar to the ES log instant query: first select the index, the query conditions, and the date field. There are also two additional field groups: Numerical Extraction and Group By.

[screenshot: es006]

To obtain numeric results, use Numerical Extraction and select an appropriate statistical function. Besides common functions such as count, sum, avg, min, and max, percentile functions such as p90, p95, and p99 are also available.
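
As an illustration of what the percentile functions compute, here is a local sketch using linearly interpolated percentiles on made-up data (Elasticsearch itself computes percentiles with an approximate t-digest algorithm, so results on real data may differ slightly):

```python
import statistics

def percentile(values, p):
    """p-th percentile with linear interpolation between data points."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[p - 1]

# Made-up request times in milliseconds.
request_times = list(range(1, 101))
print(percentile(request_times, 90))  # 90.1
print(percentile(request_times, 95))  # 95.05
print(percentile(request_times, 99))  # 99.01
```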

Additionally, by configuring Group By, results can be grouped based on specific fields. This will generate multiple time series and, if the alarm conditions are met, produce multiple anomaly points, generating multiple alarm events.

Example 1: Alarm Condition for HTTP CODE 4xx

Explanation: Within a 10-minute window, check the message field of the logs; if the number of 4xx logs exceeds 2, an alarm is generated, with statistics grouped by the host.hostname field. The configuration is as follows:

[screenshot: es007]
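
Nightingale builds the query itself from the form above; purely as an illustration, the statistics in this example correspond roughly to Elasticsearch DSL like the following (the timestamp field name and the 4xx match expression are assumptions that depend on your log format):

```python
import json

# Illustrative sketch only: count logs whose message contains a 4xx code
# in the last 10 minutes, grouped by host.hostname.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-10m"}}},  # assumed date field
                {"query_string": {"query": "message:4*", "analyze_wildcard": True}},
            ]
        }
    },
    "aggs": {"by_host": {"terms": {"field": "host.hostname"}}},
}
print(json.dumps(query, indent=2))
```

The alarm condition then compares each bucket's doc_count against the threshold (> 2).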

Example 2: Alarm when the 95th Percentile of Request Time Exceeds 1700ms

Explanation: Within a 2-minute window, use the p95 function on the request_time field in the logs, grouped by the remote_addr dimension, and check whether any group's p95 exceeds 1700ms. The configuration is as follows:

[screenshot: es008]
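
Sketched the same way, the p95-by-remote_addr statistic corresponds roughly to a percentiles sub-aggregation (field names follow the example; the timestamp field is an assumption):

```python
# Illustrative sketch: p95 of request_time over the last 2 minutes,
# grouped by remote_addr; the alarm fires when a bucket's p95 > 1700.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-2m"}}},  # assumed date field
    "aggs": {
        "by_addr": {
            "terms": {"field": "remote_addr"},
            "aggs": {
                "p95_request_time": {
                    "percentiles": {"field": "request_time", "percents": [95]}
                }
            },
        }
    },
}
print(query["aggs"]["by_addr"]["aggs"]["p95_request_time"])
```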

Example 3: Alarm When Request Time Greater Than 1900ms and More Than 10 Logs Match

Explanation: Within a 2-minute window, filter logs whose request_time exceeds 1900ms, group by the request_uri dimension, and check whether the number of matching logs exceeds 10. The configuration is as follows:

[screenshot: es009]
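
And a sketch for the third example, under the same assumptions (timestamp field name assumed; request_time in milliseconds as in the example):

```python
# Illustrative sketch: count logs with request_time > 1900 in the last
# 2 minutes, grouped by request_uri; alarm when a bucket's doc_count > 10.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-2m"}}},  # assumed date field
                {"range": {"request_time": {"gt": 1900}}},
            ]
        }
    },
    "aggs": {"by_uri": {"terms": {"field": "request_uri"}}},
}
print(query["query"]["bool"]["filter"])
```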

After configuring the required data fields, you can also preview the query results using the data preview button.

[screenshot: es010]
