19 Commits

Author SHA1 Message Date
4e038327b5 docs: Update deploy docs and adds prometheus basic auth back 2025-11-05 15:45:17 +01:00
7dbe5bf22e fix: Removes duplicate basic auth from prometheus and a few other improvements 2025-02-21 18:31:54 +01:00
89b5fef6ac chore: publish 1.6.0+v1.8.1 release 2025-02-21 17:23:06 +01:00
cd42c64544 feat: Adds option to expose ports for node and cadvisor service 2025-02-21 17:21:46 +01:00
70719dbee8 chore: publish 1.5.0+v1.8.1 release 2025-02-18 16:04:21 +01:00
8900ace6a2 feat: Adds matrix contact point for grafana alerts 2025-02-18 16:03:47 +01:00
8b464156bd chore: publish 1.4.0+v1.8.1 release 2025-02-17 12:18:39 +01:00
73c4ec3e74 feat: Adds optional GRAFANA_DOMAIN 2025-02-17 12:16:29 +01:00
225899785b Merge pull request 'feat: Adds dashboard and alerts for backupbot' (#2) from backup-dashboard into main
Reviewed-on: #2
2025-01-14 13:35:29 +00:00
0352a393de feat: Adds dashboard and alerts for backupbot 2024-12-30 14:43:30 +01:00
92e7bbc730 chore: publish 1.3.0+v1.8.1 release 2024-12-28 13:55:05 +01:00
5bf3d31c0f docs: Improves documentation for pushgateway 2024-12-28 13:53:31 +01:00
a14cb575a2 fix: shorten names for dashboard files 2024-12-21 14:37:10 +01:00
1a59dfac7f add pushgateway 2024-12-21 14:23:50 +01:00
a9b76dff65 update backupbot label 2024-10-24 17:32:29 +02:00
0401de1d16 chore: publish 1.2.0+v1.8.1 release 2024-07-17 13:36:04 +02:00
aa133fcfea add backup label 2024-04-17 16:02:38 +02:00
3750e75439 chore: publish 1.1.0+v1.7.0 release 2024-04-16 16:45:42 +02:00
5d0b6d88fc fix default DOMAIN naming scheme 2024-01-19 12:30:58 +01:00
18 changed files with 820 additions and 48 deletions

View File

@ -3,6 +3,7 @@ LETS_ENCRYPT_ENV=production
COMPOSE_FILE=compose.yml COMPOSE_FILE=compose.yml
DOMAIN=monitoring.example.com DOMAIN=monitoring.example.com
TIMEOUT=120 TIMEOUT=120
ENABLE_BACKUPS=true
## Enable this secret for Promtail / Prometheus ## Enable this secret for Promtail / Prometheus
# SECRET_BASIC_AUTH_VERSION=v1 # SECRET_BASIC_AUTH_VERSION=v1
@ -11,11 +12,18 @@ TIMEOUT=120
# COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml" # COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
# LOKI_PUSH_URL=https://loki.monitoring.example.org/loki/api/v1/push # LOKI_PUSH_URL=https://loki.monitoring.example.org/loki/api/v1/push
## Expose node and cadvisor ports instead of traefik
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
# Monitoring Server # Monitoring Server
# #
## Prometheus ## Prometheus
# COMPOSE_FILE="$COMPOSE_FILE:compose.prometheus.yml" # COMPOSE_FILE="$COMPOSE_FILE:compose.prometheus.yml"
# PROMETHEUS_RETENTION_TIME=1y # PROMETHEUS_RETENTION_TIME=1y
#
## Prometheus Pushgateway
# COMPOSE_FILE="$COMPOSE_FILE:compose.pushgateway.yml"
#
## Loki ## Loki
# Loki Server # Loki Server
# #
@ -36,16 +44,18 @@ TIMEOUT=120
## Grafana ## Grafana
# #
# COMPOSE_FILE="$COMPOSE_FILE:compose.grafana.yml" # COMPOSE_FILE="$COMPOSE_FILE:compose.grafana.yml"
# GF_SERVER_ROOT_URL=https://${DOMAIN} # GF_SERVER_ROOT_URL=https://monitoring.example.com
# SECRET_GRAFANA_ADMIN_PASSWORD_VERSION=v1 # SECRET_GRAFANA_ADMIN_PASSWORD_VERSION=v1
## Seperate domain for Grafana
#GRAFANA_DOMAIN=grafana.example.com
# #
## Single-Sign-On with OIDC ## Single-Sign-On with OIDC
# OIDC_ENABLED=1 # OIDC_ENABLED=1
# SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION=v1 # SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION=v1
# OIDC_CLIENT_ID=grafana # OIDC_CLIENT_ID=grafana
# OIDC_AUTH_URL="https://sso.example.com/auth/realms/autonomic/protocol/openid-connect/auth" # OIDC_AUTH_URL="https://authentik.example.com/application/o/authorize/"
# OIDC_API_URL="https://sso.example.com/auth/realms/autonomic/protocol/openid-connect/userinfo" # OIDC_API_URL="https://authentik.example.com/application/o/userinfo/"
# OIDC_TOKEN_URL="https://sso.example.com/auth/realms/autonomic/protocol/openid-connect/token" # OIDC_TOKEN_URL="https://authentik.example.com/application/o/token/"
# #
## Additional grafana settings (unlikely to require editing) ## Additional grafana settings (unlikely to require editing)
# GF_SECURITY_ALLOW_EMBEDDING=1 # GF_SECURITY_ALLOW_EMBEDDING=1
@ -59,3 +69,17 @@ TIMEOUT=120
# GF_SMTP_SKIP_VERIFY=false # GF_SMTP_SKIP_VERIFY=false
# SECRET_GRAFANA_SMTP_PASSWORD_VERSION=v1 # SECRET_GRAFANA_SMTP_PASSWORD_VERSION=v1
# #
## Grafana Matrix Contact Point (optional)
#COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"
#SECRET_MATRIX_ACCESS_TOKEN_VERSION=v1
#GF_MATRIX_USER_ID="<user-id>"
#GF_MATRIX_ROOM_ID="<room-id>"
#GF_MATRIX_HOMESERVER_URL="<homeserver-url>"
# ALerts
#ALERT_BACKUP_FAILED_ENABLED=true
#ALERT_BACKUP_MISSING_ENABLED=true
#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
#ALERT_NODE_DISK_SPACE_ENABLED=true
#ALERT_NODE_MEMORY_USAGE_ENABLED=true

View File

@ -18,51 +18,41 @@ It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-clou
<!-- endmetadata --> <!-- endmetadata -->
## Setup a Metrics Gathering ## Setup Metrics Gathering
Where gathering.org is the node you want to gather metrics from. Where gathering.org is the node you want to gather metrics from.
1. Configure DNS 1. Configure DNS
- monitoring.gathering.org
- cadvisor.monitoring.gathering.org - cadvisor.monitoring.gathering.org
- node.monitoring.gathering.org - node.monitoring.gathering.org
1. Configure Traefik to use BasicAuth 2. [Configure Traefik to use BasicAuth](https://git.coopcloud.tech/coop-cloud/traefik#configuring-wildcard-ssl-using-dns)
* `abra app config traefik.gathering.org` 3. `abra app new monitoring-ng`
uncomment 4. `abra app config monitoring.gathering.org` (for gathering only the main `compose.yml` is needed, nothing more.)
``` 5. `abra app deploy monitoring.gathering.org`
# BASIC_AUTH 6. check that endpoints are up and basic-auth works
COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
BASIC_AUTH=1
SECRET_USERSFILE_VERSION=v1
```
- Generate userslist with httpasswd hashed password
`abra app secret insert traefik.gathering.org userslist v1 'admin:<hashed-secret>'`
make sure there is no whitespace in between `admin:<hashed-secret>`, it seems to break stuff...
- `abra app deploy -f traefik`
1. `abra app new monitoring-ng`
1. `abra app config monitoring.gathering.org`
for gathering only the main `compose.yml` is needed, nothing more.
1. `abra app deploy monitoring.gathering.org`
1. check that endpoints are up and basic-auth works
- cadvisor.monitoring.gathering.org - cadvisor.monitoring.gathering.org
- node.monitoring.gathering.org - node.monitoring.gathering.org
### Expose node and cadvisor via ports instead of traefik
In case you have no traefik running on the machine, you can expose the ports directly by uncommenting the following line:
```
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
```
## Setup Metrics Browser ## Setup Metrics Browser
This builds upon [Setup Metrics Gathering](#setup-metrics-grathering) so make sure you did that first.
1. Configure DNS 1. Configure DNS
- monitoring.example.org - monitoring.example.org
- prometheus.monitoring.example.org 2. Setup monitoring stack
- loki.monitoring.example.org - `abra app config monitoring.example.org` Uncomment prometheus, loki and grafana
1. Setup monitoring stack - `abra app secret insert monitoring.example.org basic_auth v1 <password>`
- `abra app new monitoring-ng`
- `abra app config monitoring.example.org`
Uncomment all the stuff
- `abra app secret insert monitoring.example.org basic_auth v1 <secret>`
this needs the plaintext traefik basic-auth secret, not the hashed one! this needs the plaintext traefik basic-auth secret, not the hashed one!
- `abra app secret ls monitoring.example.org` - `abra app secret ls monitoring.example.org`
- `abra app deploy monitoring.example.org` - `abra app deploy monitoring.example.org`
1. add scrape config to prometheus 3. Add scrape config to prometheus
- `abra app cmd monitoring.example.org prometheus gathering.org` - `abra app cmd monitoring.example.org prometheus gathering.org`
- or manually - or manually
``` ```
@ -85,7 +75,6 @@ Where gathering.org is the node you want to gather metrics from.
| Cadvisor | traefik basic-auth | cadvisor.monitoring.example.org | | Cadvisor | traefik basic-auth | cadvisor.monitoring.example.org |
| Node Exporter | traefik basic-auth | node.monitoring.example.org | | Node Exporter | traefik basic-auth | node.monitoring.example.org |
### Logging from a docker host to loki server without anything else ### Logging from a docker host to loki server without anything else
``` ```
@ -101,8 +90,18 @@ $ echo '{
$ systemctl restart docker.service $ systemctl restart docker.service
``` ```
## Setup Push Gateway
1. Enable in the env fiöle by uncommenting the following lines:
```
## Prometheus Pushgateway
# COMPOSE_FILE="$COMPOSE_FILE:compose.pushgateway.yml"
```
2. `abra app deploy monitoring.example.org`
This will expose the pushgateway at `https://pushgateway.${DOMAIN}`.
It is secured behind the same basic auth as the other services.
After that you need to add the `pushgateway.${DOMAIN}` to the scare config.
## Post-setup guide ## Post-setup guide
@ -119,3 +118,34 @@ $ systemctl restart docker.service
--- ---
THX to the previous work of @decentral1se @knooflok @3wc @cellarspoon @mirsal THX to the previous work of @decentral1se @knooflok @3wc @cellarspoon @mirsal
## Adding Matrix as Alert Contact point
1. Enable the [matrix-alertmanager-receiver](https://github.com/metio/matrix-alertmanager-receiver/):
```
COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"
```
2. Insert the matrix access token secret:
```
abra app secret insert monitoring.marx.klasse-methode.it matrix_access_token v1
```
3. Set required configurations:
```
GF_MATRIX_USER_ID=
GF_MATRIX_ROOM_ID=
GF_MATRIX_HOME_SERVER_URL=
```
4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`
## alerts
It is possible to enable the following alerts, by setting the corresponding env variable to `true`:
- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
- backupbot not successfull: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`

View File

@ -1,13 +1,17 @@
export ENTRYPOINT_VERSION=v1 export ENTRYPOINT_VERSION=v1
export GRAFANA_DATASOURCES_YML_VERSION=v1 export GRAFANA_DATASOURCES_YML_VERSION=v1
export GRAFANA_DASHBOARDS_YML_VERSION=v1 export GRAFANA_DASHBOARDS_YML_VERSION=v2
export GRAFANA_SWARM_DASHBOARD_JSON_VERSION=v2 export GRAFANA_SWARM_DASHBOARD_JSON_VERSION=v2
export GRAFANA_STACKS_DASHBOARD_JSON_VERSION=v2 export GRAFANA_STACKS_DASHBOARD_JSON_VERSION=v2
export GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION=v2 export GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION=v2
export GRAFANA_BACKUP_DASHBOARD_JSON_VERSION=v1
export GRAFANA_ALERTS_JSON_VERSION=v3
export GRAFANA_CUSTOM_INI_VERSION=v4 export GRAFANA_CUSTOM_INI_VERSION=v4
export PROMTAIL_YML_VERSION=v3 export PROMTAIL_YML_VERSION=v3
export LOKI_YML_VERSION=v2 export LOKI_YML_VERSION=v2
export PROMETHEUS_YML_VERSION=v2 export PROMETHEUS_YML_VERSION=v2
export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
# creates a default prometheus scrape config for a given node # creates a default prometheus scrape config for a given node
add_node(){ add_node(){

View File

@ -0,0 +1,74 @@
# configuration of the HTTP server
http:
## address: 127.0.0.1 # bind address for this service. Can be left unspecified to bind on all interfaces
port: 12345 # port used by this service
alerts-path-prefix: /alerts # URL path for the webhook receiver called by an Alertmanager. Defaults to /alerts
metrics-path: /metrics # URL path to collect metrics. Defaults to /metrics
metrics-enabled: true # Whether to enable metrics or not. Defaults to false
# basic-username: alertmanager # Username for basic authentication. Defaults to alertmanager
# basic-password: secret # If set, the alerts endpoint expects basic-auth credentials with the configured username and password
# configuration for the Matrix connection
matrix:
homeserver-url: "{{ env "GF_MATRIX_HOMESERVER_URL" }}"
user-id: "{{ env "GF_MATRIX_USER_ID" }}"
access-token: "{{ secret "matrix_access_token" }}"
room-mapping:
matrixroom: "{{ env "GF_MATRIX_ROOM_ID" }}"
templating:
# mapping of ExternalURL values
external-url-mapping:
# key is the original value taken from the Alertmanager payload
# value is the mapped value which will be available as '.ExternalURL' in templates
"http://alertmanager:9093": https://alertmanager.example.com
# mapping of GeneratorURL values
generator-url-mapping:
# key is the original value taken from the Alertmanager payload
# value is the mapped value which will be available as '.GeneratorURL' in templates
"http://prometheus:8080": https://prometheus.example.com
# computation of arbitrary values based on matching alert annotations, labels, or status
# values will be evaluated top to bottom, last entry wins
computed-values:
- values: # always set 'color' to 'yellow'
color: yellow
- values: # set 'color' to 'orange' when alert label 'severity' is 'warning'
color: orange
when-matching-labels:
severity: warning
- values: # set 'color' to 'red' when alert label 'severity' is 'critical'
color: red
when-matching-labels:
severity: critical
- values: # set 'color' to 'green' when alert status is 'resolved'
color: green
when-matching-status: resolved
# template for alerts in status 'firing'
firing-template: '{{`
<p>
<strong><font color="{{ .ComputedValues.color }}">{{ .Alert.Status | ToUpper }}</font></strong>
{{ if .Alert.Labels.name }}
{{ .Alert.Labels.name }}
{{ else if .Alert.Labels.alertname }}
{{ .Alert.Labels.alertname }}
{{ end }}
>>
{{ if .Alert.Labels.severity }}
{{ .Alert.Labels.severity | ToUpper }}:
{{ end }}
{{ if .Alert.Annotations.description }}
{{ .Alert.Annotations.description }}
{{ else if .Alert.Annotations.summary }}
{{ .Alert.Annotations.summary }}
{{ end }}
>>
{{ if .Alert.Annotations.runbook }}
<a href="{{ .Alert.Annotations.runbook }}">Runbook</a> |
{{ end }}
{{ if .Alert.Annotations.dashboard }}
<a href="{{ .Alert.Annotations.dashboard }}">Dashboard</a> |
{{ end }}
<a href="{{ .SilenceURL }}">Silence</a>
</p>`}}'

13
compose.expose-ports.yml Normal file
View File

@ -0,0 +1,13 @@
---
version: "3.8"
services:
app:
ports:
- "9100:9100"
deploy:
cadvisor:
ports:
- "9101:8080"
deploy:

View File

@ -2,7 +2,7 @@ version: '3.8'
services: services:
grafana: grafana:
image: grafana/grafana:10.2.3 image: grafana/grafana:10.4.14
volumes: volumes:
- grafana-data:/var/lib/grafana:rw - grafana-data:/var/lib/grafana:rw
secrets: secrets:
@ -22,6 +22,10 @@ services:
target: /var/lib/grafana/dashboards/docker-swarm-stacks.json target: /var/lib/grafana/dashboards/docker-swarm-stacks.json
- source: grafana_traefik_dashboard_json - source: grafana_traefik_dashboard_json
target: /var/lib/grafana/dashboards/traefik.json target: /var/lib/grafana/dashboards/traefik.json
- source: grafana_backup_dashboard_json
target: /var/lib/grafana/dashboards/backup.json
- source: grafana_alerts_json
target: /var/lib/grafana/alerts/alerts.json
networks: networks:
- proxy - proxy
- internal - internal
@ -44,8 +48,9 @@ services:
deploy: deploy:
labels: labels:
- "traefik.enable=true" - "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-grafana.loadbalancer.server.port=3000" - "traefik.http.services.${STACK_NAME}-grafana.loadbalancer.server.port=3000"
- "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${DOMAIN}`)" - "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${GRAFANA_DOMAIN:-$DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-grafana.entrypoints=web-secure" - "traefik.http.routers.${STACK_NAME}-grafana.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-grafana.tls=true" - "traefik.http.routers.${STACK_NAME}-grafana.tls=true"
- "traefik.http.routers.${STACK_NAME}-grafana.tls.certresolver=${LETS_ENCRYPT_ENV}" - "traefik.http.routers.${STACK_NAME}-grafana.tls.certresolver=${LETS_ENCRYPT_ENV}"
@ -62,20 +67,27 @@ configs:
name: ${STACK_NAME}_grafana_custom_ini_${GRAFANA_CUSTOM_INI_VERSION} name: ${STACK_NAME}_grafana_custom_ini_${GRAFANA_CUSTOM_INI_VERSION}
file: grafana_custom.ini file: grafana_custom.ini
grafana_datasources_yml: grafana_datasources_yml:
name: ${STACK_NAME}_grafana_datasources_yml_${GRAFANA_DATASOURCES_YML_VERSION} name: ${STACK_NAME}_g_datasources_yml_${GRAFANA_DATASOURCES_YML_VERSION}
file: grafana-datasources.yml file: grafana-datasources.yml
grafana_dashboards_yml: grafana_dashboards_yml:
name: ${STACK_NAME}_grafana_dashboards_yml_${GRAFANA_DASHBOARDS_YML_VERSION} name: ${STACK_NAME}_g_dashboards_yml_${GRAFANA_DASHBOARDS_YML_VERSION}
file: grafana-dashboards.yml file: grafana-dashboards.yml
grafana_swarm_dashboard_json: grafana_swarm_dashboard_json:
name: ${STACK_NAME}_grafana_swarm_dashboard_json_${GRAFANA_SWARM_DASHBOARD_JSON_VERSION} name: ${STACK_NAME}_g_swarm_dashboard_json_${GRAFANA_SWARM_DASHBOARD_JSON_VERSION}
file: grafana-swarm-dashboard.json file: grafana-swarm-dashboard.json
grafana_stacks_dashboard_json: grafana_stacks_dashboard_json:
name: ${STACK_NAME}_grafana_stacks_dashboard_json_${GRAFANA_STACKS_DASHBOARD_JSON_VERSION} name: ${STACK_NAME}_g_stacks_dashboard_json_${GRAFANA_STACKS_DASHBOARD_JSON_VERSION}
file: grafana-stacks-dashboard.json file: grafana-stacks-dashboard.json
grafana_traefik_dashboard_json: grafana_traefik_dashboard_json:
name: ${STACK_NAME}_grafana_traefik_dashboard_json_${GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION} name: ${STACK_NAME}_g_traefik_dashboard_json_${GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION}
file: grafana-traefik-dashboard.json file: grafana-traefik-dashboard.json
grafana_backup_dashboard_json:
name: ${STACK_NAME}_g_backup_dashboard_json_${GRAFANA_BACKUP_DASHBOARD_JSON_VERSION}
file: grafana-backup-dashboard.json
grafana_alerts_json:
template_driver: golang
name: ${STACK_NAME}_g_alerts_json_${GRAFANA_ALERTS_JSON_VERSION}
file: grafana-alerts.json.tmpl
volumes: volumes:
grafana-data: grafana-data:

View File

@ -2,7 +2,7 @@ version: '3.8'
services: services:
loki: loki:
image: grafana/loki:2.9.3 image: grafana/loki:2.9.11
command: -config.file=/etc/loki/local-config.yaml command: -config.file=/etc/loki/local-config.yaml
networks: networks:
- proxy - proxy
@ -27,6 +27,7 @@ services:
condition: on-failure condition: on-failure
labels: labels:
- "traefik.enable=true" - "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-loki.loadbalancer.server.port=3100" - "traefik.http.services.${STACK_NAME}-loki.loadbalancer.server.port=3100"
- "traefik.http.routers.${STACK_NAME}-loki.rule=Host(`loki.${DOMAIN}`)" - "traefik.http.routers.${STACK_NAME}-loki.rule=Host(`loki.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-loki.entrypoints=web-secure" - "traefik.http.routers.${STACK_NAME}-loki.entrypoints=web-secure"

View File

@ -0,0 +1,28 @@
version: '3.8'
services:
matrix-alertmanager-receiver:
image: metio/matrix-alertmanager-receiver:2025.2.9
secrets:
- matrix_access_token
configs:
- source: matrix-alertmanager-receiver-config
target: /etc/matrix-alertmanager-receiver/config.yml
networks:
- internal
environment:
- GF_MATRIX_USER_ID
- GF_MATRIX_ROOM_ID
- GF_MATRIX_HOMESERVER_URL
command: "--config-path=/etc/matrix-alertmanager-receiver/config.yml"
configs:
matrix-alertmanager-receiver-config:
template_driver: golang
name: ${STACK_NAME}_mar_config_${MATRIX_ALERTMANAGER_CONFIG_VERSION}
file: alertmanager-matrix-config.yml.tmpl
secrets:
matrix_access_token:
external: true
name: ${STACK_NAME}_matrix_access_token_${SECRET_MATRIX_ACCESS_TOKEN_VERSION}

View File

@ -2,7 +2,7 @@ version: '3.8'
services: services:
prometheus: prometheus:
image: prom/prometheus:v2.48.1 image: prom/prometheus:v2.55.1
secrets: secrets:
- basic_auth - basic_auth
volumes: volumes:
@ -24,6 +24,7 @@ services:
condition: on-failure condition: on-failure
labels: labels:
- "traefik.enable=true" - "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-prometheus.loadbalancer.server.port=9090" - "traefik.http.services.${STACK_NAME}-prometheus.loadbalancer.server.port=9090"
- "traefik.http.routers.${STACK_NAME}-prometheus.rule=Host(`prometheus.${DOMAIN}`)" - "traefik.http.routers.${STACK_NAME}-prometheus.rule=Host(`prometheus.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure" - "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"

View File

@ -2,7 +2,7 @@ version: "3.8"
services: services:
promtail: promtail:
image: grafana/promtail:2.9.3 image: grafana/promtail:2.9.11
volumes: volumes:
- /var/log:/var/log:ro - /var/log:/var/log:ro
- /var/run/docker.sock:/var/run/docker.sock - /var/run/docker.sock:/var/run/docker.sock

26
compose.pushgateway.yml Normal file
View File

@ -0,0 +1,26 @@
version: '3.8'
services:
pushgateway:
image: prom/pushgateway:v1.10.0
command:
- '--web.listen-address=:9191'
- '--push.disable-consistency-check'
- '--persistence.interval=5m'
ports:
- 9191:9191
networks:
- internal
- proxy
deploy:
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-pushgateway.loadbalancer.server.port=9191"
- "traefik.http.routers.${STACK_NAME}-pushgateway.rule=Host(`pushgateway.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-pushgateway.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-pushgateway.tls=true"
- "traefik.http.routers.${STACK_NAME}-pushgateway.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-pushgateway.middlewares=basicauth@file"

View File

@ -3,7 +3,7 @@ version: "3.8"
services: services:
app: app:
image: prom/node-exporter:v1.7.0 image: prom/node-exporter:v1.8.1
user: root user: root
environment: environment:
- NODE_ID={{.Node.ID}} - NODE_ID={{.Node.ID}}
@ -30,18 +30,20 @@ services:
restart_policy: restart_policy:
condition: on-failure condition: on-failure
labels: labels:
- "backupbot.backup=${ENABLE_BACKUPS:-true}"
- "traefik.enable=true" - "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-node.loadbalancer.server.port=9100" - "traefik.http.services.${STACK_NAME}-node.loadbalancer.server.port=9100"
- "traefik.http.routers.${STACK_NAME}-node.rule=Host(`node.${DOMAIN}`)" - "traefik.http.routers.${STACK_NAME}-node.rule=Host(`node.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-node.entrypoints=web-secure" - "traefik.http.routers.${STACK_NAME}-node.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-node.tls=true" - "traefik.http.routers.${STACK_NAME}-node.tls=true"
- "traefik.http.routers.${STACK_NAME}-node.tls.certresolver=${LETS_ENCRYPT_ENV}" - "traefik.http.routers.${STACK_NAME}-node.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-node.middlewares=basicauth@file" - "traefik.http.routers.${STACK_NAME}-node.middlewares=basicauth@file"
- "coop-cloud.${STACK_NAME}.version=1.0.0+v1.7.0" - "coop-cloud.${STACK_NAME}.version=1.6.0+v1.8.1"
- "coop-cloud.${STACK_NAME}.timeout=${TIMEOUT:-120}" - "coop-cloud.${STACK_NAME}.timeout=${TIMEOUT:-120}"
cadvisor: cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2 image: gcr.io/cadvisor/cadvisor:v0.49.2
command: command:
- "-logtostderr" - "-logtostderr"
- "--enable_metrics=cpu,cpuLoad,disk,diskIO,process,memory,network" - "--enable_metrics=cpu,cpuLoad,disk,diskIO,process,memory,network"
@ -62,6 +64,7 @@ services:
condition: on-failure condition: on-failure
labels: labels:
- "traefik.enable=true" - "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-cadvisor.loadbalancer.server.port=8080" - "traefik.http.services.${STACK_NAME}-cadvisor.loadbalancer.server.port=8080"
- "traefik.http.routers.${STACK_NAME}-cadvisor.rule=Host(`cadvisor.${DOMAIN}`)" - "traefik.http.routers.${STACK_NAME}-cadvisor.rule=Host(`cadvisor.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-cadvisor.entrypoints=web-secure" - "traefik.http.routers.${STACK_NAME}-cadvisor.entrypoints=web-secure"

315
grafana-alerts.json.tmpl Normal file
View File

@ -0,0 +1,315 @@
{
"apiVersion": 1,
"groups": [
{
"orgId": 1,
"name": "backupbot",
"folder": "node",
"interval": "1m",
"rules": [
{{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
{
"uid": "de8e5xxup7t34a",
"title": "Backup Failed",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "1m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
{
"uid": "ce8e65uddcwe8d",
"title": "Backup Missing",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "rate(backup[24h])",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0, 0], "type": "within_range" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
{
"uid": "de8e6bc92a8lcc",
"title": "Backup Not Successfull",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": {
"params": [
0
],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": [
"C"
]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
}
],
"datasource": {
"type": "__expr__",
"uid": "__expr__"
},
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "20m",
"annotations": {
"summary": "Backup did not finish within 20 minutes"
},
"labels": {},
"isPaused": false
}
{{ end }}
]
},
{
"orgId": 1,
"name": "node",
"folder": "node",
"interval": "5m",
"rules": [
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
{
"uid": "bds8bhxu97pxca",
"title": "Node Disk Space",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_filesystem_free_bytes{fstype=\"ext4\",mountpoint=~\"(/$)|(/media.*)\"} / node_filesystem_size_bytes{fstype=\"ext4\",mountpoint=~\"(/$)|(/media.*)\"}) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [10], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
{
"uid": "ads8cswmly96oa",
"title": "Node Memory Usage",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [90], "type": "gt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
}
{{ end }}
]
}
]
}

View File

@ -0,0 +1,228 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 6,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"axisSoftMax": 2,
"axisSoftMin": -2,
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [
{
"options": {
"0": {
"color": "dark-green",
"index": 0
},
"1": {
"color": "dark-yellow",
"index": 1,
"text": "Running"
},
"-1": {
"index": 2,
"text": "Fail"
}
},
"type": "value"
}
],
"max": 1,
"min": -1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "string"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
},
"disableTextWrap": false,
"editorMode": "builder",
"exemplar": false,
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "A",
"useBackend": false
}
],
"title": "Backup Status",
"type": "timeseries"
},
{
"datasource": {
"type": "loki",
"uid": "P8E80F9AEF21F6940"
},
"gridPos": {
"h": 11,
"w": 24,
"x": 0,
"y": 7
},
"id": 2,
"options": {
"dedupStrategy": "none",
"enableLogDetails": true,
"prettifyLogMessage": false,
"showCommonLabels": false,
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": false
},
"targets": [
{
"datasource": {
"type": "loki",
"uid": "P8E80F9AEF21F6940"
},
"editorMode": "builder",
"expr": "{service_name=\"$ServiceName\"} |= ``",
"queryType": "range",
"refId": "A"
}
],
"title": "Backupbot Logs",
"type": "logs"
}
],
"refresh": "auto",
"schemaVersion": 39,
"tags": [],
"templating": {
"list": [
{
"current": {
"selected": true,
"text": "backup_marx_klasse-methode_it_app",
"value": "backup_marx_klasse-methode_it_app"
},
"datasource": {
"type": "loki",
"uid": "P8E80F9AEF21F6940"
},
"definition": "",
"hide": 0,
"includeAll": false,
"label": "Backupbot Service",
"multi": false,
"name": "ServiceName",
"options": [],
"query": {
"label": "service_name",
"refId": "LokiVariableQueryEditor-VariableQuery",
"stream": "",
"type": 1
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"type": "query"
}
]
},
"time": {
"from": "now-24h",
"to": "now"
},
"timepicker": {},
"timezone": "browser",
"title": "backupbot-two",
"uid": "be8e2xeofw4xsa",
"version": 3,
"weekStart": ""
}

View File

@ -11,3 +11,13 @@ providers:
options: options:
path: /var/lib/grafana/dashboards path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true foldersFromFilesStructure: true
- name: 'default-alert-provider'
orgId: 1
folder: 'default-alerts'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/alerts
foldersFromFilesStructure: true

1
release/1.4.0+v1.8.1 Normal file
View File

@ -0,0 +1 @@
Adds an optional GRAFANA_DOMAIN

1
release/1.5.0+v1.8.1 Normal file
View File

@ -0,0 +1 @@
Adds an optional matrix contact point for grafana

1
release/1.6.0+v1.8.1 Normal file
View File

@ -0,0 +1 @@
Adds option to expose ports for node and cadvisor service