2 Commits

Author SHA1 Message Date
02b01e5c23 feat: make alerts configurable (#19)
Alert thresholds can now be configured via env variables.

I also separated the basic auth, so that it is possible to deploy only prometheus without promtail.

Reviewed-on: #19
Reviewed-by: ammaratef45 <ammaratef45@proton.me>
Reviewed-by: Danny Groenewegen <dannygroenewegen@noreply.git.coopcloud.tech>
Co-authored-by: p4u1 <p4u1_f4u1@riseup.net>
Co-committed-by: p4u1 <p4u1_f4u1@riseup.net>
2026-06-12 06:49:17 +00:00
fce8ea5889 docs: Update deploy docs and adds prometheus basic auth back (#7)
Reviewed-on: #7
Reviewed-by: Danny Groenewegen <dannygroenewegen@noreply.git.coopcloud.tech>
Co-authored-by: p4u1 <p4u1_f4u1@riseup.net>
Co-committed-by: p4u1 <p4u1_f4u1@riseup.net>
2026-06-12 06:48:39 +00:00
9 changed files with 47 additions and 378 deletions

@@ -6,7 +6,8 @@ DOMAIN=monitoring-ng.example.com
 ENABLE_BACKUPS=true
 ## Enable this secret for Promtail / Prometheus
-# SECRET_BASIC_AUTH_VERSION=v1
+#COMPOSE_FILE="$COMPOSE_FILE:compose.basic-auth.yml"
+#SECRET_BASIC_AUTH_VERSION=v1
 #
 # Promtail (Gathering Logs)
 # COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
@@ -79,9 +80,10 @@ ENABLE_BACKUPS=true
 #GF_MATRIX_ROOM_ID="<room-id>"
 #GF_MATRIX_HOMESERVER_URL="<homeserver-url>"
-# ALerts
-#ALERT_BACKUP_FAILED_ENABLED=true
-#ALERT_BACKUP_MISSING_ENABLED=true
-#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
-#ALERT_NODE_DISK_SPACE_ENABLED=true
-#ALERT_NODE_MEMORY_USAGE_ENABLED=true
+## ALerts
+
+# Node disk space alert will trigger when free disk space left is below the given number in percent
+#ALERT_NODE_DISK_SPACE_LEFT=10
+
+# Node memory usage alert will trigger when memory usage is above the given number in percent
+#ALERT_NODE_MEMORY_USAGE=85

@@ -18,32 +18,18 @@ It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-clou
 <!-- endmetadata -->
-## Setup a Metrics Gathering
+## Setup Metrics Gathering
 Where gathering.org is the node you want to gather metrics from.
 1. Configure DNS
-- monitoring.gathering.org
 - cadvisor.monitoring.gathering.org
 - node.monitoring.gathering.org
-1. Configure Traefik to use BasicAuth
-* `abra app config traefik.gathering.org`
-uncomment
-```
-# BASIC_AUTH
-COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
-BASIC_AUTH=1
-SECRET_USERSFILE_VERSION=v1
-```
-- Generate userslist with httpasswd hashed password
-`abra app secret insert traefik.gathering.org usersfile v1 'admin:<hashed-secret>'`
-make sure there is no whitespace in between `admin:<hashed-secret>`, it seems to break stuff...
-- `abra app deploy -f traefik`
-1. `abra app new monitoring-ng`
-1. `abra app config monitoring.gathering.org`
-for gathering only the main `compose.yml` is needed, nothing more.
-1. `abra app deploy monitoring.gathering.org`
-1. check that endpoints are up and basic-auth works
+2. [Configure Traefik to use BasicAuth](https://git.coopcloud.tech/coop-cloud/traefik#configuring-wildcard-ssl-using-dns)
+3. `abra app new monitoring-ng`
+4. `abra app config monitoring.gathering.org` (for gathering only the main `compose.yml` is needed, nothing more.)
+5. `abra app deploy monitoring.gathering.org`
+6. check that endpoints are up and basic-auth works
 - cadvisor.monitoring.gathering.org
 - node.monitoring.gathering.org
@@ -56,16 +42,13 @@ In case you have no traefik running on the machine, you can expose the ports dir
 ## Setup Metrics Browser
+This builds upon [Setup Metrics Gathering](#setup-metrics-grathering) so make sure you did that first.
 1. Configure DNS
 - monitoring.example.org
-- prometheus.monitoring.example.org
-- loki.monitoring.example.org
 2. Setup monitoring stack
-- `abra app new monitoring-ng`
-- `abra app config monitoring.example.org`
-Uncomment all the stuff
-- `abra app secret insert monitoring.example.org basic_auth v1 <secret>`
+- `abra app config monitoring.example.org` Uncomment prometheus, loki and grafana
+- `abra app secret insert monitoring.example.org basic_auth v1 <password>`
 this needs the plaintext traefik basic-auth secret, not the hashed one!
 - `abra app secret ls monitoring.example.org`
 - `abra app deploy monitoring.example.org`
@@ -156,13 +139,9 @@ GF_MATRIX_HOME_SERVER_URL=
 ```
 4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`
-## alerts
+## Alerts
-It is possible to enable the following alerts, by setting the corresponding env variable to `true`:
-- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
-- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
-- backupbot not successfull: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
-- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
-- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`
+It is possible to enable the following alerts, by uncommenting the corresponding env variable:
+- node disk space: `ALERT_NODE_DISK_SPACE_LEFT`
+- node memory usage: `ALERT_NODE_MEMORY_USAGE`
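
As a rough illustration of the new README wording, the Python sketch below mimics how the alert template appears to treat these variables: an unset or empty variable disables the rule, while any non-empty value both enables it and becomes the threshold. The helper name `alert_state` is hypothetical, and the truthiness rule is an assumption based on Go template semantics, where an empty string counts as false.

```python
from typing import Mapping, Optional, Tuple


def alert_state(env: Mapping[str, str], name: str) -> Tuple[bool, Optional[str]]:
    """Return (enabled, threshold) for one alert variable.

    Mimics `{{ if (env "NAME") }}`: an unset or empty value is falsy.
    """
    value = env.get(name, "")
    if value == "":
        # Disabled: the template lists the rule UID under deleteRules instead.
        return (False, None)
    # Enabled: the raw string is substituted as the evaluator threshold.
    return (True, value)


# Example values only, matching the defaults in the sample env file.
env = {"ALERT_NODE_DISK_SPACE_LEFT": "10"}
print(alert_state(env, "ALERT_NODE_DISK_SPACE_LEFT"))  # (True, '10')
print(alert_state(env, "ALERT_NODE_MEMORY_USAGE"))     # (False, None)
```

Under this assumption a value such as `0` would still enable the alert, since only the empty string counts as unset.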

@@ -9,9 +9,9 @@ export GF_CUSTOM_INI_VERSION=v4
 export PROMTAIL_YML_VERSION=v3
 export LOKI_YML_VERSION=v3
 export PROMETHEUS_YML_VERSION=v2
-export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
-export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
-export GRAFANA_ALERTS_NODE_VERSION=v1c
+export MATRIX_ALERTMANAGER_CONFIG_VERSION=v1
+export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=v1
+export GRAFANA_ALERTS_NODE_VERSION=v2
 # creates a default prometheus scrape config for a given node
 add_node(){

@@ -2,13 +2,13 @@ apiVersion: 1
 # List of alert rule UIDs that should be deleted
 deleteRules:
-{{ if ne (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
+{{ if not (env "ALERT_NODE_DISK_SPACE_LEFT") }}
 - orgId: 1
-uid: bds8bhxu97pxca
+uid: coopcloud_node_disk_space_left
 {{ end }}
-{{ if ne (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
+{{ if not (env "ALERT_NODE_MEMORY_USAGE") }}
 - orgId: 1
-uid: ads8cswmly96oa
+uid: coopcloud_node_memory_usage
 {{ end }}
 groups:
@@ -17,8 +17,8 @@ groups:
 folder: node
 interval: 5m
 rules:
-{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
-- uid: bds8bhxu97pxca
+{{ if (env "ALERT_NODE_DISK_SPACE_LEFT") }}
+- uid: coopcloud_node_disk_space_left
 title: Node Disk Space
 condition: C
 data:
@@ -45,7 +45,7 @@ groups:
 conditions:
 - evaluator:
 params:
-- 10
+- {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}
 type: lt
 operator:
 type: and
@@ -70,13 +70,13 @@ groups:
 annotations:
 description: ""
 runbook_url: ""
-summary: Less than 10% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
+summary: Less than {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
 labels:
 "": ""
 isPaused: false
 {{ end }}
-{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
-- uid: ads8cswmly96oa
+{{ if (env "ALERT_NODE_MEMORY_USAGE") }}
+- uid: coopcloud_node_memory_usage
 title: Node Memory Usage
 condition: C
 data:
@@ -103,7 +103,7 @@ groups:
 conditions:
 - evaluator:
 params:
-- 85
+- {{ env "ALERT_NODE_MEMORY_USAGE" }}
 type: gt
 operator:
 type: and
@@ -126,6 +126,6 @@ groups:
 execErrState: Error
 for: 5m
 annotations:
-summary: Memory usage is above 85% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
+summary: Memory usage is above {{ env "ALERT_NODE_MEMORY_USAGE" }}% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
 isPaused: false
 {{ end }}
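
To make the two evaluator types in this template concrete, here is a minimal Python sketch of the threshold comparison. It assumes the query result is already a percentage and that the env value parses as a number. The function names are hypothetical, and the sketch ignores the `for: 5m` pending period that the condition has to hold for before the alert fires.

```python
def disk_space_alert(free_percent: float, threshold: str) -> bool:
    """`type: lt`: the condition holds when free disk space is below the threshold."""
    return free_percent < float(threshold)


def memory_usage_alert(usage_percent: float, threshold: str) -> bool:
    """`type: gt`: the condition holds when memory usage is above the threshold."""
    return usage_percent > float(threshold)


# Thresholds arrive as strings because they come from env variables.
print(disk_space_alert(8.5, "10"))     # True
print(memory_usage_alert(72.0, "85"))  # False
```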

compose.basic-auth.yml Normal file
@@ -0,0 +1,7 @@
---
version: "3.8"
secrets:
  basic_auth:
    external: true
    name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}
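
The secret name in this file is assembled from two variables. Below is a small Python sketch of that expansion, using `string.Template` as a stand-in for compose variable substitution. The function name and the stack name `monitoring_example_org` are assumed example values, not taken from the recipe.

```python
from string import Template

# Same pattern as the `name:` field in compose.basic-auth.yml.
SECRET_NAME = Template("${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}")


def basic_auth_secret_name(stack_name: str, version: str) -> str:
    """Expand the external secret name for a given stack and secret version."""
    return SECRET_NAME.substitute(
        STACK_NAME=stack_name,
        SECRET_BASIC_AUTH_VERSION=version,
    )


print(basic_auth_secret_name("monitoring_example_org", "v1"))
# monitoring_example_org_basic_auth_v1
```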

@@ -32,8 +32,8 @@ services:
 - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/gf_adminpasswd
 - GF_SECURITY_ALLOW_EMBEDDING
 - GF_INSTALL_PLUGINS
-- ALERT_NODE_DISK_SPACE_ENABLED
-- ALERT_NODE_MEMORY_USAGE_ENABLED
+- ALERT_NODE_DISK_SPACE_LEFT
+- ALERT_NODE_MEMORY_USAGE
 deploy:
 labels:
 - "traefik.enable=true"

@@ -30,6 +30,7 @@ services:
 - "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"
 - "traefik.http.routers.${STACK_NAME}-prometheus.tls=true"
 - "traefik.http.routers.${STACK_NAME}-prometheus.tls.certresolver=${LETS_ENCRYPT_ENV}"
+- "traefik.http.routers.${STACK_NAME}-prometheus.middlewares=basicauth@file"
 configs:
 prometheus_yml:
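
With the Prometheus router behind the `basicauth@file` middleware, clients have to send credentials with each request. The sketch below shows how an HTTP Basic `Authorization` header is built from a plaintext user and password. This would explain why the README asks for the plaintext secret rather than the hashed one. The function name, user name and password are placeholders.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    # The client sends base64("user:password"); the proxy checks it against its stored hash.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("admin", "s3cret"))  # Basic YWRtaW46czNjcmV0
```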

@@ -23,8 +23,3 @@ configs:
 name: ${STACK_NAME}_promtail_yml_${PROMTAIL_YML_VERSION}
 file: promtail.yml.tmpl
 template_driver: golang
-secrets:
-basic_auth:
-external: true
-name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}

@@ -1,315 +0,0 @@
{
"apiVersion": 1,
"groups": [
{
"orgId": 1,
"name": "backupbot",
"folder": "node",
"interval": "1m",
"rules": [
{{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
{
"uid": "de8e5xxup7t34a",
"title": "Backup Failed",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "1m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
{
"uid": "ce8e65uddcwe8d",
"title": "Backup Missing",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "rate(backup[24h])",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0, 0], "type": "within_range" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
{
"uid": "de8e6bc92a8lcc",
"title": "Backup Not Successfull",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": {
"params": [
0
],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": [
"C"
]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
}
],
"datasource": {
"type": "__expr__",
"uid": "__expr__"
},
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "20m",
"annotations": {
"summary": "Backup did not finish within 20 minutes"
},
"labels": {},
"isPaused": false
}
{{ end }}
]
},
{
"orgId": 1,
"name": "node",
"folder": "node",
"interval": "5m",
"rules": [
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
{
"uid": "bds8bhxu97pxca",
"title": "Node Disk Space",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_filesystem_free_bytes{fstype=\"ext4\"} / node_filesystem_size_bytes{fstype=\"ext4\"}) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [10], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
{
"uid": "ads8cswmly96oa",
"title": "Node Memory Usage",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [90], "type": "gt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
}
{{ end }}
]
}
]
}