Compare commits

2 Commits

| SHA1 |
|---|
| 02b01e5c23 |
| fce8ea5889 |
.env.sample (16 changed lines)

```diff
@@ -6,7 +6,8 @@ DOMAIN=monitoring-ng.example.com
 ENABLE_BACKUPS=true
 
 ## Enable this secret for Promtail / Prometheus
-# SECRET_BASIC_AUTH_VERSION=v1
+#COMPOSE_FILE="$COMPOSE_FILE:compose.basic-auth.yml"
+#SECRET_BASIC_AUTH_VERSION=v1
 #
 # Promtail (Gathering Logs)
 # COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
@@ -79,9 +80,10 @@ ENABLE_BACKUPS=true
 #GF_MATRIX_ROOM_ID="<room-id>"
 #GF_MATRIX_HOMESERVER_URL="<homeserver-url>"
 
-# ALerts
-#ALERT_BACKUP_FAILED_ENABLED=true
-#ALERT_BACKUP_MISSING_ENABLED=true
-#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
-#ALERT_NODE_DISK_SPACE_ENABLED=true
-#ALERT_NODE_MEMORY_USAGE_ENABLED=true
+## ALerts
+
+# Node disk space alert will trigger when free disk space left is below the given number in percent
+#ALERT_NODE_DISK_SPACE_LEFT=10
+
+# Node memory usage alert will trigger when memory usage is above the given number in percent
+#ALERT_NODE_MEMORY_USAGE=85
```
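The `.env.sample` comments define the new semantics:

- An alert is enabled by uncommenting its variable.
- The variable's value is a percentage threshold. The disk alert fires when free space is below it, and the memory alert fires when usage is above it.

The Go sketch below models only that decision logic. The function names are hypothetical, and it assumes the compared readings are already percentages. In the stack itself, the comparison is done by Grafana's threshold expression (`lt` / `gt`), not by code like this.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseThreshold interprets an alert variable as this sketch assumes the
// stack does. It returns (threshold, enabled, error): an empty value means
// the alert is disabled, anything else must parse as a percentage.
func parseThreshold(raw string) (float64, bool, error) {
	if raw == "" {
		return 0, false, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, false, fmt.Errorf("invalid threshold %q: %w", raw, err)
	}
	return v, true, nil
}

// diskAlertFires mirrors the "lt" evaluator: fire when the free disk space
// percentage drops below ALERT_NODE_DISK_SPACE_LEFT.
func diskAlertFires(freePercent float64, raw string) (bool, error) {
	limit, enabled, err := parseThreshold(raw)
	if err != nil || !enabled {
		return false, err
	}
	return freePercent < limit, nil
}

// memoryAlertFires mirrors the "gt" evaluator: fire when the memory usage
// percentage rises above ALERT_NODE_MEMORY_USAGE.
func memoryAlertFires(usagePercent float64, raw string) (bool, error) {
	limit, enabled, err := parseThreshold(raw)
	if err != nil || !enabled {
		return false, err
	}
	return usagePercent > limit, nil
}

func main() {
	// Thresholds taken from the commented defaults in .env.sample.
	disk, _ := diskAlertFires(7.5, "10")
	mem, _ := memoryAlertFires(80, "85")
	off, _ := diskAlertFires(1, "") // variable left commented out
	fmt.Println("disk 7.5% free vs 10:", disk)
	fmt.Println("memory 80% used vs 85:", mem)
	fmt.Println("disk alert with variable unset:", off)
}
```

Keeping "unset" and "threshold" in a single variable is what lets this commit drop the separate `*_ENABLED` flags.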
README.md (47 changed lines)

````diff
@@ -18,32 +18,18 @@ It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-clou
 
 <!-- endmetadata -->
 
-## Setup a Metrics Gathering
+## Setup Metrics Gathering
 
 Where gathering.org is the node you want to gather metrics from.
 
 1. Configure DNS
-- monitoring.gathering.org
 - cadvisor.monitoring.gathering.org
 - node.monitoring.gathering.org
-1. Configure Traefik to use BasicAuth
-* `abra app config traefik.gathering.org`
-uncomment
-```
-# BASIC_AUTH
-COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
-BASIC_AUTH=1
-SECRET_USERSFILE_VERSION=v1
-```
-- Generate userslist with httpasswd hashed password
-`abra app secret insert traefik.gathering.org usersfile v1 'admin:<hashed-secret>'`
-make sure there is no whitespace in between `admin:<hashed-secret>`, it seems to break stuff...
-- `abra app deploy -f traefik`
-1. `abra app new monitoring-ng`
-1. `abra app config monitoring.gathering.org`
-for gathering only the main `compose.yml` is needed, nothing more.
-1. `abra app deploy monitoring.gathering.org`
-1. check that endpoints are up and basic-auth works
+2. [Configure Traefik to use BasicAuth](https://git.coopcloud.tech/coop-cloud/traefik#configuring-wildcard-ssl-using-dns)
+3. `abra app new monitoring-ng`
+4. `abra app config monitoring.gathering.org` (for gathering only the main `compose.yml` is needed, nothing more.)
+5. `abra app deploy monitoring.gathering.org`
+6. check that endpoints are up and basic-auth works
 - cadvisor.monitoring.gathering.org
 - node.monitoring.gathering.org
 
@@ -56,16 +42,13 @@ In case you have no traefik running on the machine, you can expose the ports dir
 
 ## Setup Metrics Browser
 
+This builds upon [Setup Metrics Gathering](#setup-metrics-grathering) so make sure you did that first.
 
 1. Configure DNS
 - monitoring.example.org
-- prometheus.monitoring.example.org
-- loki.monitoring.example.org
 2. Setup monitoring stack
-- `abra app new monitoring-ng`
-- `abra app config monitoring.example.org`
-Uncomment all the stuff
-- `abra app secret insert monitoring.example.org basic_auth v1 <secret>`
+- `abra app config monitoring.example.org` Uncomment prometheus, loki and grafana
+- `abra app secret insert monitoring.example.org basic_auth v1 <password>`
 this needs the plaintext traefik basic-auth secret, not the hashed one!
 - `abra app secret ls monitoring.example.org`
 - `abra app deploy monitoring.example.org`
@@ -156,13 +139,9 @@ GF_MATRIX_HOME_SERVER_URL=
 ```
 4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`
 
-## alerts
+## Alerts
 
-It is possible to enable the following alerts, by setting the corresponding env variable to `true`:
-- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
-- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
-- backupbot not successfull: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
-- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
-- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`
-
+It is possible to enable the following alerts, by uncommenting the corresponding env variable:
 
+- node disk space: `ALERT_NODE_DISK_SPACE_LEFT`
+- node memory usage: `ALERT_NODE_MEMORY_USAGE`
````
abra.sh (6 changed lines)

```diff
@@ -9,9 +9,9 @@ export GF_CUSTOM_INI_VERSION=v4
 export PROMTAIL_YML_VERSION=v3
 export LOKI_YML_VERSION=v3
 export PROMETHEUS_YML_VERSION=v2
-export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
-export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
-export GRAFANA_ALERTS_NODE_VERSION=v1c
+export MATRIX_ALERTMANAGER_CONFIG_VERSION=v1
+export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=v1
+export GRAFANA_ALERTS_NODE_VERSION=v2
 
 # creates a default prometheus scrape config for a given node
 add_node(){
```
```diff
@@ -2,13 +2,13 @@ apiVersion: 1
 
 # List of alert rule UIDs that should be deleted
 deleteRules:
-{{ if ne (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
+{{ if not (env "ALERT_NODE_DISK_SPACE_LEFT") }}
 - orgId: 1
-uid: bds8bhxu97pxca
+uid: coopcloud_node_disk_space_left
 {{ end }}
-{{ if ne (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
+{{ if not (env "ALERT_NODE_MEMORY_USAGE") }}
 - orgId: 1
-uid: ads8cswmly96oa
+uid: coopcloud_node_memory_usage
 {{ end }}
 
 groups:
@@ -17,8 +17,8 @@ groups:
 folder: node
 interval: 5m
 rules:
-{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
-- uid: bds8bhxu97pxca
+{{ if (env "ALERT_NODE_DISK_SPACE_LEFT") }}
+- uid: coopcloud_node_disk_space_left
 title: Node Disk Space
 condition: C
 data:
@@ -45,7 +45,7 @@ groups:
 conditions:
 - evaluator:
 params:
-- 10
+- {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}
 type: lt
 operator:
 type: and
@@ -70,13 +70,13 @@ groups:
 annotations:
 description: ""
 runbook_url: ""
-summary: Less than 10% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
+summary: Less than {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
 labels:
 "": ""
 isPaused: false
 {{ end }}
-{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
-- uid: ads8cswmly96oa
+{{ if (env "ALERT_NODE_MEMORY_USAGE") }}
+- uid: coopcloud_node_memory_usage
 title: Node Memory Usage
 condition: C
 data:
@@ -103,7 +103,7 @@ groups:
 conditions:
 - evaluator:
 params:
-- 85
+- {{ env "ALERT_NODE_MEMORY_USAGE" }}
 type: gt
 operator:
 type: and
@@ -126,6 +126,6 @@ groups:
 execErrState: Error
 for: 5m
 annotations:
-summary: Memory usage is above 85% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
+summary: Memory usage is above {{ env "ALERT_NODE_MEMORY_USAGE" }}% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
 isPaused: false
 {{ end }}
```
compose.basic-auth.yml (7 changed lines, new file)

```diff
@@ -0,0 +1,7 @@
+---
+version: "3.8"
+
+secrets:
+basic_auth:
+external: true
+name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}
```
```diff
@@ -32,8 +32,8 @@ services:
 - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/gf_adminpasswd
 - GF_SECURITY_ALLOW_EMBEDDING
 - GF_INSTALL_PLUGINS
-- ALERT_NODE_DISK_SPACE_ENABLED
-- ALERT_NODE_MEMORY_USAGE_ENABLED
+- ALERT_NODE_DISK_SPACE_LEFT
+- ALERT_NODE_MEMORY_USAGE
 deploy:
 labels:
 - "traefik.enable=true"
```
```diff
@@ -30,6 +30,7 @@ services:
 - "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"
 - "traefik.http.routers.${STACK_NAME}-prometheus.tls=true"
 - "traefik.http.routers.${STACK_NAME}-prometheus.tls.certresolver=${LETS_ENCRYPT_ENV}"
+- "traefik.http.routers.${STACK_NAME}-prometheus.middlewares=basicauth@file"
 
 configs:
 prometheus_yml:
```
```diff
@@ -23,8 +23,3 @@ configs:
 name: ${STACK_NAME}_promtail_yml_${PROMTAIL_YML_VERSION}
 file: promtail.yml.tmpl
 template_driver: golang
-
-secrets:
-basic_auth:
-external: true
-name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}
```
@ -1,315 +0,0 @@
|
|||||||
{
|
|
||||||
"apiVersion": 1,
|
|
||||||
"groups": [
|
|
||||||
{
|
|
||||||
"orgId": 1,
|
|
||||||
"name": "backupbot",
|
|
||||||
"folder": "node",
|
|
||||||
"interval": "1m",
|
|
||||||
"rules": [
|
|
||||||
{{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
|
|
||||||
{
|
|
||||||
"uid": "de8e5xxup7t34a",
|
|
||||||
"title": "Backup Failed",
|
|
||||||
"condition": "C",
|
|
||||||
"data": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "PBFA97CFB590B2093",
|
|
||||||
"model": {
|
|
||||||
"disableTextWrap": false,
|
|
||||||
"editorMode": "builder",
|
|
||||||
"expr": "backup",
|
|
||||||
"fullMetaSearch": false,
|
|
||||||
"includeNullMetadata": true,
|
|
||||||
"instant": true,
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"legendFormat": "__auto",
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"range": false,
|
|
||||||
"refId": "A",
|
|
||||||
"useBackend": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"refId": "C",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "__expr__",
|
|
||||||
"model": {
|
|
||||||
"conditions": [
|
|
||||||
{
|
|
||||||
"evaluator": { "params": [0], "type": "lt" },
|
|
||||||
"operator": { "type": "and" },
|
|
||||||
"query": { "params": ["C"] },
|
|
||||||
"reducer": { "params": [], "type": "last" },
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"datasource": { "type": "__expr__", "uid": "__expr__" },
|
|
||||||
"expression": "A",
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"refId": "C",
|
|
||||||
"type": "threshold"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"noDataState": "NoData",
|
|
||||||
"execErrState": "Error",
|
|
||||||
"for": "1m",
|
|
||||||
"isPaused": false
|
|
||||||
},
|
|
||||||
{{ end }}
|
|
||||||
{{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
|
|
||||||
{
|
|
||||||
"uid": "ce8e65uddcwe8d",
|
|
||||||
"title": "Backup Missing",
|
|
||||||
"condition": "B",
|
|
||||||
"data": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "PBFA97CFB590B2093",
|
|
||||||
"model": {
|
|
||||||
"disableTextWrap": false,
|
|
||||||
"editorMode": "builder",
|
|
||||||
"expr": "rate(backup[24h])",
|
|
||||||
"fullMetaSearch": false,
|
|
||||||
"includeNullMetadata": true,
|
|
||||||
"instant": true,
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"legendFormat": "__auto",
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"range": false,
|
|
||||||
"refId": "A",
|
|
||||||
"useBackend": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"refId": "B",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "__expr__",
|
|
||||||
"model": {
|
|
||||||
"conditions": [
|
|
||||||
{
|
|
||||||
"evaluator": { "params": [0, 0], "type": "within_range" },
|
|
||||||
"operator": { "type": "and" },
|
|
||||||
"query": { "params": ["C"] },
|
|
||||||
"reducer": { "params": [], "type": "last" },
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"datasource": { "type": "__expr__", "uid": "__expr__" },
|
|
||||||
"expression": "A",
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"refId": "B",
|
|
||||||
"type": "threshold"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"noDataState": "NoData",
|
|
||||||
"execErrState": "Error",
|
|
||||||
"for": "5m",
|
|
||||||
"isPaused": false
|
|
||||||
},
|
|
||||||
{{ end }}
|
|
||||||
{{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
|
|
||||||
{
|
|
||||||
"uid": "de8e6bc92a8lcc",
|
|
||||||
"title": "Backup Not Successfull",
|
|
||||||
"condition": "B",
|
|
||||||
"data": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"relativeTimeRange": {
|
|
||||||
"from": 60,
|
|
||||||
"to": 0
|
|
||||||
},
|
|
||||||
"datasourceUid": "PBFA97CFB590B2093",
|
|
||||||
"model": {
|
|
||||||
"disableTextWrap": false,
|
|
||||||
"editorMode": "builder",
|
|
||||||
"expr": "backup",
|
|
||||||
"fullMetaSearch": false,
|
|
||||||
"includeNullMetadata": true,
|
|
||||||
"instant": true,
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"legendFormat": "__auto",
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"range": false,
|
|
||||||
"refId": "A",
|
|
||||||
"useBackend": false
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"refId": "B",
|
|
||||||
"relativeTimeRange": {
|
|
||||||
"from": 60,
|
|
||||||
"to": 0
|
|
||||||
},
|
|
||||||
"datasourceUid": "__expr__",
|
|
||||||
"model": {
|
|
||||||
"conditions": [
|
|
||||||
{
|
|
||||||
"evaluator": {
|
|
||||||
"params": [
|
|
||||||
0
|
|
||||||
],
|
|
||||||
"type": "gt"
|
|
||||||
},
|
|
||||||
"operator": {
|
|
||||||
"type": "and"
|
|
||||||
},
|
|
||||||
"query": {
|
|
||||||
"params": [
|
|
||||||
"C"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"reducer": {
|
|
||||||
"params": [],
|
|
||||||
"type": "last"
|
|
||||||
},
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"datasource": {
|
|
||||||
"type": "__expr__",
|
|
||||||
"uid": "__expr__"
|
|
||||||
},
|
|
||||||
"expression": "A",
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"refId": "B",
|
|
||||||
"type": "threshold"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"noDataState": "NoData",
|
|
||||||
"execErrState": "Error",
|
|
||||||
"for": "20m",
|
|
||||||
"annotations": {
|
|
||||||
"summary": "Backup did not finish within 20 minutes"
|
|
||||||
},
|
|
||||||
"labels": {},
|
|
||||||
"isPaused": false
|
|
||||||
}
|
|
||||||
{{ end }}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"orgId": 1,
|
|
||||||
"name": "node",
|
|
||||||
"folder": "node",
|
|
||||||
"interval": "5m",
|
|
||||||
"rules": [
|
|
||||||
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
|
|
||||||
{
|
|
||||||
"uid": "bds8bhxu97pxca",
|
|
||||||
"title": "Node Disk Space",
|
|
||||||
"condition": "C",
|
|
||||||
"data": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "PBFA97CFB590B2093",
|
|
||||||
"model": {
|
|
||||||
"editorMode": "code",
|
|
||||||
"expr": "(node_filesystem_free_bytes{fstype=\"ext4\"} / node_filesystem_size_bytes{fstype=\"ext4\"}) * 100",
|
|
||||||
"instant": true,
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"legendFormat": "__auto",
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"range": false,
|
|
||||||
"refId": "A"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"refId": "C",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "__expr__",
|
|
||||||
"model": {
|
|
||||||
"conditions": [
|
|
||||||
{
|
|
||||||
"evaluator": { "params": [10], "type": "lt" },
|
|
||||||
"operator": { "type": "and" },
|
|
||||||
"query": { "params": ["C"] },
|
|
||||||
"reducer": { "params": [], "type": "last" },
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"datasource": { "type": "__expr__", "uid": "__expr__" },
|
|
||||||
"expression": "A",
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"refId": "C",
|
|
||||||
"type": "threshold"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"noDataState": "NoData",
|
|
||||||
"execErrState": "Error",
|
|
||||||
"for": "5m",
|
|
||||||
"annotations": {},
|
|
||||||
"labels": {},
|
|
||||||
"isPaused": false
|
|
||||||
},
|
|
||||||
{{ end }}
|
|
||||||
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
|
|
||||||
{
|
|
||||||
"uid": "ads8cswmly96oa",
|
|
||||||
"title": "Node Memory Usage",
|
|
||||||
"condition": "C",
|
|
||||||
"data": [
|
|
||||||
{
|
|
||||||
"refId": "A",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "PBFA97CFB590B2093",
|
|
||||||
"model": {
|
|
||||||
"editorMode": "code",
|
|
||||||
"expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
|
|
||||||
"instant": true,
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"legendFormat": "__auto",
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"range": false,
|
|
||||||
"refId": "A"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"refId": "C",
|
|
||||||
"relativeTimeRange": { "from": 600, "to": 0 },
|
|
||||||
"datasourceUid": "__expr__",
|
|
||||||
"model": {
|
|
||||||
"conditions": [
|
|
||||||
{
|
|
||||||
"evaluator": { "params": [90], "type": "gt" },
|
|
||||||
"operator": { "type": "and" },
|
|
||||||
"query": { "params": ["C"] },
|
|
||||||
"reducer": { "params": [], "type": "last" },
|
|
||||||
"type": "query"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"datasource": { "type": "__expr__", "uid": "__expr__" },
|
|
||||||
"expression": "A",
|
|
||||||
"intervalMs": 1000,
|
|
||||||
"maxDataPoints": 43200,
|
|
||||||
"refId": "C",
|
|
||||||
"type": "threshold"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"noDataState": "NoData",
|
|
||||||
"execErrState": "Error",
|
|
||||||
"for": "5m",
|
|
||||||
"annotations": {},
|
|
||||||
"labels": {},
|
|
||||||
"isPaused": false
|
|
||||||
}
|
|
||||||
{{ end }}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
|
|
||||||