Nightingale ES Log Alert Rules

ES log alerting allows you to detect abnormal logs through query analysis and trigger alerts accordingly.
First, select the ES data source, then configure query conditions and alert rules. Below is a detailed explanation of each numbered function.
1 Select Index
Supports multiple configuration methods:
- Specify a single index: gb searches all documents in the gb index
- Specify multiple indices: gb,us searches all documents in both the gb and us indices
- Specify an index prefix: g*,u* searches all documents in any index starting with g or u
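The three forms above can be illustrated with a small sketch. It mimics how a comma-separated expression with * wildcards selects index names, using Python's fnmatch as a stand-in for Elasticsearch's own index resolution. The function name resolve_indices and the sample index names are hypothetical, not part of Nightingale or Elasticsearch.

```python
from fnmatch import fnmatchcase


def resolve_indices(expression, existing_indices):
    """Return the existing index names selected by a comma-separated expression.

    Illustration only: Elasticsearch resolves index expressions itself;
    fnmatchcase is used here as an approximation of its `*` wildcard.
    """
    patterns = [p.strip() for p in expression.split(",") if p.strip()]
    return [
        name
        for name in existing_indices
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


indices = ["gb", "gb-2024", "us", "uk", "fr"]  # hypothetical index names

print(resolve_indices("gb", indices))     # single index
print(resolve_indices("gb,us", indices))  # multiple indices
print(resolve_indices("g*,u*", indices))  # prefix patterns
```

Note that a bare name such as gb selects only the index with exactly that name; gb-2024 is included only once a wildcard such as g* is used.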
2 Set Filter Conditions
Currently supports query string syntax (Lucene syntax)
Basic Query Syntax
| Syntax | Description | Example |
|---|---|---|
| field:value | Query records where field contains the value | status:active |
| field:(value1 OR value2) | Query records where field contains either value | title:(quick OR brown) |
| field:"exact phrase" | Query records containing the exact phrase (no tokenization) | author:"John Smith" |
Important Notes on Tokenization (Analyzer)
Elasticsearch performs tokenization on text type fields - this is where most issues occur:
What is tokenization?
- When a field type is text, ES splits the content into multiple tokens for indexing
- Text analyzers split “connection timeout” into “connection” and “timeout”
- English text is split on spaces and punctuation and lowercased, so “John Smith” becomes “john” and “smith”
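As a rough illustration of the splitting described above, the sketch below lowercases text and splits it on non-word characters. It is only an approximation of a default text analyzer: real analyzers follow Unicode segmentation rules and can be configured per field. The helper name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it on anything that is not a letter, digit, or underscore."""
    return re.findall(r"\w+", text.lower())


print(simple_tokenize("connection timeout"))  # ['connection', 'timeout']
print(simple_tokenize("John Smith"))          # ['john', 'smith']
```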
How tokenization affects queries:
| Query Method | Behavior | Tokenized? |
|---|---|---|
| message:connection timeout | Query terms are also tokenized, default OR logic | ✅ Yes |
| message:"connection timeout" | Phrase query, requires exact phrase in order | ❌ No |
| message.keyword:"connection timeout" | Uses keyword subfield, exact match on the whole field value (quotes keep the space inside the term) | ❌ No |
Common Issue: Search results don’t contain the search keyword
If you search message:connection timeout but the returned logs don’t contain “connection timeout”, here’s why:
- The query “connection timeout” is tokenized into “connection” and “timeout”
- ES uses OR logic by default, returning documents containing ANY of the tokens
- So logs containing only “connection” or only “timeout” will also be returned
Solutions:
# Solution 1: Use quotes for phrase query (Recommended)
message:"connection timeout"
# Solution 2: Use AND to require all terms
message:(connection AND timeout)
# Solution 3: Use keyword subfield for substring match (requires index support; the space must be escaped)
message.keyword:*connection\ timeout*
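The difference between the default OR behavior, the AND form, and the phrase form can be simulated in a few lines. This is a simplified model, assuming a tokenizer that lowercases and splits on non-word characters; it does not reproduce Elasticsearch scoring or analysis, and the sample log lines are made up for the illustration.

```python
import re


def tokenize(text):
    # Rough stand-in for a text analyzer: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())


def match_any(query, log):
    """OR logic: at least one query token appears in the log."""
    log_tokens = set(tokenize(log))
    return any(tok in log_tokens for tok in tokenize(query))


def match_all(query, log):
    """AND logic: every query token appears somewhere in the log."""
    log_tokens = set(tokenize(log))
    return all(tok in log_tokens for tok in tokenize(query))


def match_phrase(query, log):
    """Phrase logic: query tokens appear adjacent and in the same order."""
    q, tokens = tokenize(query), tokenize(log)
    return any(tokens[i:i + len(q)] == q for i in range(len(tokens) - len(q) + 1))


logs = [  # hypothetical log messages
    "connection timeout after 30s",
    "connection reset by peer",
    "timeout while opening connection",
]

for log in logs:
    print(
        f"{log!r}: any={match_any('connection timeout', log)}, "
        f"all={match_all('connection timeout', log)}, "
        f"phrase={match_phrase('connection timeout', log)}"
    )
```

All three lines match under OR logic, the first and third match under AND, and only the first matches as a phrase, which is why the quoted form is the recommended one.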
Supports ? and * wildcards:
- qu?ck - ? matches any single character
- bro* - * matches zero or more characters
Use ~ operator for fuzzy matching:
- quikc~ - Matches words similar to “quick”
- "fox quick"~5 - Phrase query where words can be up to 5 positions apart
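Elasticsearch documents the ~ operator on a single term as being based on Damerau-Levenshtein edit distance, with a default maximum of 2 edits. The sketch below computes a restricted variant of that distance (optimal string alignment) to show why quikc is considered close to quick. It illustrates the idea only and is not the implementation Elasticsearch uses.

```python
def edit_distance(a, b):
    """Optimal string alignment distance.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each as a single edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("quikc", "quick"))  # 1: the letters k and c are swapped
print(edit_distance("brwn", "brown"))   # 1: one missing letter
```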
Supports numeric and date ranges:
- count:[1 TO 5] - Closed interval, includes 1 and 5
- date:[2022-01-01 TO 2022-12-31]
- age:>=10 - Greater than or equal to 10
Can use boolean operators like AND, OR, NOT:
- quick AND brown - Contains both words
- quick OR brown - Contains either word
- quick NOT fox - Contains quick but not fox
For detailed syntax, refer to the ES documentation
3 Set Date Field
Click to select the date field in logs, which will be used as the basis for querying log time ranges
4 Set Log Query Time Range
If set to 5 minutes, it will query logs from the past 5 minutes when performing alert queries
5 Value Extraction
Select the statistical function applied to the matched logs, such as count, sum, avg, min, or max.
6 Group By
Group logs by one or more fields. For example, grouping by the host field with count produces one count per host.
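To make steps 2 through 6 more concrete, here is a sketch of an Elasticsearch search body that combines a query string filter, a time range on the date field, and a grouped statistic. The source does not show the request Nightingale actually sends, so this mapping is an assumption. The function build_search_body, the aggregation names groups and value, and the @timestamp field name are all hypothetical.

```python
import json


def build_search_body(query, date_field, window, group_by, metric_field=None):
    """Build a search body with a filter, a time window, and a grouped statistic.

    With metric_field=None the statistic is a count, which each terms
    bucket already reports as doc_count. Otherwise an avg sub-aggregation
    on metric_field is added.
    """
    terms_agg = {"terms": {"field": group_by}}
    if metric_field is not None:
        terms_agg["aggs"] = {"value": {"avg": {"field": metric_field}}}

    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"query_string": {"query": query}},
                    {"range": {date_field: {"gte": f"now-{window}", "lt": "now"}}},
                ]
            }
        },
        "aggs": {"groups": terms_agg},
    }


body = build_search_body(
    query="level:ERROR AND service:payment",
    date_field="@timestamp",  # assumed name of the date field
    window="5m",
    group_by="host",
)
print(json.dumps(body, indent=2))
```

Passing metric_field="response_time" would instead produce one average per group, which corresponds to choosing avg in the value extraction step.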
7 Alert Conditions
Statistical values are assigned to variables A, B, C, etc. in alert conditions, then alerts are triggered based on these variables. For example, $A > 10 triggers an alert when log count exceeds 10
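The way a condition such as $A > 10 is applied to a statistical value can be sketched as follows. This is a minimal evaluator written for illustration, assuming a single comparison between one variable and a number; Nightingale's own expression handling is not described here and may support more. The names evaluate, OPERATORS, and CONDITION are hypothetical.

```python
import operator
import re

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

# Matches "$<VAR> <op> <number>", for example "$A > 10".
CONDITION = re.compile(r"^\s*\$([A-Z])\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def evaluate(condition, values):
    """Return True when the condition holds for the given variable values."""
    m = CONDITION.match(condition)
    if m is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, op, threshold = m.groups()
    return OPERATORS[op](values[name], float(threshold))


print(evaluate("$A > 10", {"A": 12}))  # True: 12 exceeds the threshold
print(evaluate("$A > 10", {"A": 10}))  # False: 10 does not exceed 10
```

When results are grouped, the same check would be applied to each group's value separately.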
8 Advanced Configuration
In some scenarios where logs are delayed (e.g., 3-minute delay), querying the last 3 minutes may return no data. In advanced settings, you can set a delay query time, such as 180s, which shifts both start and end times backward by 180s
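The effect of the delay setting on the query window can be worked through with a short calculation. The function name query_window and the fixed timestamp are hypothetical and chosen only to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone


def query_window(now, range_seconds, delay_seconds=0):
    """Return (start, end) of the query window, shifted back by the delay."""
    end = now - timedelta(seconds=delay_seconds)
    start = end - timedelta(seconds=range_seconds)
    return start, end


# Fixed evaluation time so the output is reproducible.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

start, end = query_window(now, range_seconds=300, delay_seconds=180)
print(start.isoformat(), end.isoformat())
```

With a 5-minute range and a 180s delay, a rule evaluated at 12:00:00 queries 11:52:00 to 11:57:00 instead of 11:55:00 to 12:00:00, so logs that arrive up to 3 minutes late are still counted.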
Usage Examples
Example 1: Error Log Monitoring
- Index: app-logs-*
- Query condition: level:ERROR AND service:payment
- Time range: 5 minutes
- Value extraction: count()
- Alert condition: $A > 10
Description: Monitor if payment service error logs exceed 10 entries within 5 minutes
Example 2: API Response Time Monitoring
- Index: nginx-access-*
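- Query condition: path:\/api\/v1\/order* AND response_time:>500 (wildcards are not expanded inside quotes, and / must be escaped; assumes path is a keyword field)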
- Time range: 10 minutes
- Value extraction: avg(response_time)
- Group By: path
- Alert condition: $A > 1000
Description: Monitor if order-related API average response time exceeds 1 second
Example 3: Error Status Code Monitoring
- Index: nginx-*
- Query condition: status:[500 TO 599]
- Time range: 15 minutes
- Value extraction: count()
- Group By: host, status
- Alert condition: $A > 50
Description: Group 5xx errors by host and status code, alert if any host’s specific status code occurs more than 50 times
Example 4: Business Exception Keyword Monitoring
- Index: business-logs-*
- Query condition: message:("timeout" OR "connection refused" OR "out of memory")
- Time range: 30 minutes
- Value extraction: count()
- Alert condition: $A > 5
Description: Monitor log count containing specific error keywords