36 Commits

Author SHA1 Message Date
e247677433 feat: scrape metrics from containers via Docker label discovery
Containers opt in with prometheus.io/scrape=true and optionally set
prometheus.io/port, prometheus.io/path, and prometheus.io/auth=basic.
2026-06-14 21:38:43 +02:00
f2310f2b86 improve alloy config and match with main branch
- Restrict Alloy UI to loopback
- Narrow volume mounts: drop /dev, reduce /var/run to docker.sock:ro
- Replace HTTP scrape of :12345 with prometheus.exporter.self
Match with main branch (node-exporter / promtail / cadvisor):
- Add docker_only and explicit enabled_metrics to cadvisor exporter
- Match node-exporter collector config
- Match promtail relabeling (container_name, container_id, stack_namespace,
  service_name) and external hostname label
- Add SYSLOG_FILES option to tail /var/log/*log (matches promtail)
- Fix journal path and syslog listener address
2026-06-13 22:07:55 +02:00
f
f2711fa16e fix: upgrades 2026-06-03 00:19:41 -03:00
f
2870b9486c fix: use the actual health check path 2026-06-02 21:06:05 -03:00
f
3a1fabe4f9 fix: prevent redirections on health check 2026-06-02 21:04:56 -03:00
f
a358837922 wip: relabel syslog according to docs 2026-06-02 21:00:34 -03:00
f
dd0a0c1bb0 fixup! feat: read syslog 2026-06-02 20:19:45 -03:00
f
31cabc36ae fix: prevent traefik deprecation warnings 2026-06-02 19:16:49 -03:00
f
d25986d5cb fix: README 2026-06-02 18:51:10 -03:00
f
f8f8004445 feat: read syslog 2026-06-02 18:50:41 -03:00
f
aa05d022da feat: optionally push to prometheus and loki 2026-06-02 18:50:20 -03:00
f
fb52a76247 BREAKING CHANGE: deprecate node-exporter 2026-06-02 18:49:05 -03:00
f
2e2a52eae0 BREAKING CHANGE: deprecate promtail 2026-06-02 18:48:20 -03:00
f
48419d5afa fixup! BREAKING CHANGE: no need to expose exporters 2026-06-02 18:46:02 -03:00
f
a0a6e2c509 fix: basic auth secret is always needed 2026-06-02 18:44:32 -03:00
f
024f2a8aec feat: send docker logs to loki 2026-06-02 18:39:24 -03:00
f
38095e23fa BREAKING CHANGE: no need to expose exporters 2026-06-02 18:37:56 -03:00
f
641161329e fix: grafana alternate domain doesn't work
the variable is not expanded and the domain name label ends up as a
literal "$DOMAIN".
2026-06-02 18:00:00 -03:00
f
cdacfd035e fix: prometheus querying panel is accessible through basic auth 2026-06-02 17:52:25 -03:00
f
b2d3901f61 fix: bind mounts recommended by docs 2026-06-02 13:24:28 -03:00
f
8becf1c1d6 fixup! feat: node exporter 2026-05-29 16:16:37 -03:00
f
777b1355dd fixup! feat: node exporter 2026-05-29 16:16:08 -03:00
f
e83433cebd feat: node exporter 2026-05-29 16:04:19 -03:00
f
a713f98ffb feat: instance name is domain 2026-05-29 16:03:59 -03:00
f
8dc84c591c fixup! feat: enable prometheus remote write receiver 2026-05-29 15:38:52 -03:00
f
d9aa05a4b5 feat: send metrics to prometheus 2026-05-28 21:00:10 -03:00
f
349df12204 feat: enable prometheus remote write receiver 2026-05-28 20:44:00 -03:00
f
6c33089078 feat: cadvisor 2026-05-28 20:38:50 -03:00
f
4bedebfab1 BREAKING CHANGES: replace promtail and cadvisor for alloy 2026-05-28 20:33:36 -03:00
dd320e9f1c fix: Shorten all secret and config names to max 14 characters (#13)
Reviewed-on: #13
Reviewed-by: p4u1 <p4u1@noreply.git.coopcloud.tech>
Reviewed-by: moritz <moritz@noreply.git.coopcloud.tech>
2026-05-11 15:38:15 +00:00
9cb997b25a delete_request_store based on env variable 2026-04-09 04:36:03 +00:00
48d137d194 update loki config file 2026-04-09 04:36:03 +00:00
1acb5ebd6a chore: update image tags 2026-04-09 04:36:03 +00:00
682f30cef1 Add migrate_secret_names() to abra.sh to reinsert all secrets with shortened names in docker 2026-03-25 16:11:37 +01:00
694c8a9875 Add instructions for shorter secret names to release notes 2026-03-25 16:11:28 +01:00
9dfa9cad2a Shortened all the secret and config names to max 14 characters to prevent running into Docker's 64 character limit when STACK_NAME is appended to it. 2026-03-25 15:58:28 +01:00
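The 64-character limit named in commit 9dfa9cad2a applies to the full Docker object name, which the compose files build as `${STACK_NAME}_<name>_<version>`. A minimal sketch of that arithmetic, assuming a made-up stack name `monitoring_example_com`; the helpers `docker_name_length` and `fits_docker_limit` are hypothetical and not part of the recipe:

```shell
# Hypothetical helpers, not part of abra.sh.
# Length of the name Docker sees: <stack>_<name>_<version>
docker_name_length() {
  full_name="${1}_${2}_${3}"
  printf '%s\n' "${#full_name}"
}

# Succeeds when the full name stays within the 64-character limit
fits_docker_limit() {
  [ "$(docker_name_length "$1" "$2" "$3")" -le 64 ]
}

# Example: a 14-character secret name with a short version suffix
docker_name_length monitoring_example_com gf_adminpasswd v1
```

Capping the middle part at 14 characters leaves the remaining budget for the stack name and the version suffix.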
24 changed files with 824 additions and 289 deletions


@ -5,16 +5,28 @@ DOMAIN=monitoring-ng.example.com
#TIMEOUT=120
ENABLE_BACKUPS=true
## Enable this secret for Promtail / Prometheus
#COMPOSE_FILE="$COMPOSE_FILE:compose.basic-auth.yml"
#SECRET_BASIC_AUTH_VERSION=v1
SECRET_BASIC_AUTH_VERSION=v1
# Enable this to send logs to a Loki server, adapt DOMAIN if server is
# remote
# LOKI_PUSH_URL=https://loki.$DOMAIN/loki/api/v1/push
# Enable on systemd hosts to read logs from the journal
# JOURNALD=1
#
# Promtail (Gathering Logs)
# COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
# LOKI_PUSH_URL=https://loki.monitoring.example.org/loki/api/v1/push
# Enable on non-systemd hosts (Alpine, older Debian/Ubuntu) to tail
# /var/log/*log files (syslog, auth.log, kern.log, etc.) that a local
# syslogd writes. No syslogd reconfiguration needed.
# SYSLOG_FILES=1
#
# Enable to receive syslog messages over the network on port 514/tcp.
# Use for remote devices that push syslog to this host, or for a
# local syslogd configured to forward over the network.
# Not needed if you just want to read local log files — use SYSLOG_FILES instead.
# SYSLOG=1
# COMPOSE_FILE="$COMPOSE_FILE:compose.syslog.yml"
## Expose node and cadvisor ports instead of traefik
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
# Enable this to send metrics to a Prometheus server, adapt DOMAIN if
# server is remote
# PROMETHEUS_REMOTE_WRITE_URL=https://prometheus.$DOMAIN/api/v1/write
# Monitoring Server
#
@ -40,20 +52,18 @@ ENABLE_BACKUPS=true
# LOKI_AWS_REGION=eu-west-1
# LOKI_ACCESS_KEY_ID=bush-debrief-approval-robust-scraggly-molecule
# LOKI_BUCKET_NAMES=loki
# SECRET_LOKI_AWS_SECRET_ACCESS_KEY_VERSION=v1
# SECRET_LOKI_AWS_KEY_VERSION=v1
#
## Grafana
#
# COMPOSE_FILE="$COMPOSE_FILE:compose.grafana.yml"
# GF_SERVER_ROOT_URL=https://monitoring.example.com
# SECRET_GRAFANA_ADMIN_PASSWORD_VERSION=v1
## Separate domain for Grafana
#GRAFANA_DOMAIN=grafana.example.com
# SECRET_GF_ADMINPASSWD_VERSION=v1
#
## Single-Sign-On with OIDC
# COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-oidc.yml"
# OIDC_ENABLED=1
# SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION=v1
# SECRET_GF_OIDC_SECRET_VERSION=v1
# OIDC_CLIENT_ID=grafana
# OIDC_AUTH_URL="https://authentik.example.com/application/o/authorize/"
# OIDC_API_URL="https://authentik.example.com/application/o/userinfo/"
@ -70,20 +80,19 @@ ENABLE_BACKUPS=true
# GF_SMTP_ENABLED=true
# GF_SMTP_FROM_ADDRESS=grafana@example.com
# GF_SMTP_SKIP_VERIFY=false
# SECRET_GRAFANA_SMTP_PASSWORD_VERSION=v1
# SECRET_GF_SMTP_PASSWD_VERSION=v1
#
## Grafana Matrix Contact Point (optional)
#COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"
#SECRET_MATRIX_ACCESS_TOKEN_VERSION=v1
#SECRET_MATRIX_TOKEN_VERSION=v1
#GF_MATRIX_USER_ID="<user-id>"
#GF_MATRIX_ROOM_ID="<room-id>"
#GF_MATRIX_HOMESERVER_URL="<homeserver-url>"
## Alerts
# Node disk space alert will trigger when free disk space left is below the given number in percent
#ALERT_NODE_DISK_SPACE_LEFT=10
# Node memory usage alert will trigger when memory usage is above the given number in percent
#ALERT_NODE_MEMORY_USAGE=85
# Alerts
#ALERT_BACKUP_FAILED_ENABLED=true
#ALERT_BACKUP_MISSING_ENABLED=true
#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
#ALERT_NODE_DISK_SPACE_ENABLED=true
#ALERT_NODE_MEMORY_USAGE_ENABLED=true


@ -1,8 +1,8 @@
# monitoring-ng
Yet another monitoring stack ...
This time it's an all-in-one grafana/prometheus/loki/node_exporter/cadvisor/promtail stack.
It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but now includes everything in one recipe. So you can deploy monitoring instances that only gather metrics / logs (node_exporter/cadvisor/promtail) and also deploy instances with the full monitoring stack (grafana/prometheus/loki) with the same recipe and just a different .env configuration.
This time it's an all-in-one grafana/prometheus/loki/alloy stack.
It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but now includes everything in one recipe. So you can deploy monitoring instances that only gather metrics / logs (alloy) and also deploy instances with the full monitoring stack (grafana/prometheus/loki) with the same recipe and just a different .env configuration.
<!-- metadata -->
@ -47,13 +47,6 @@ Where gathering.org is the node you want to gather metrics from.
- cadvisor.monitoring.gathering.org
- node.monitoring.gathering.org
### Expose node and cadvisor via ports instead of traefik
In case you have no traefik running on the machine, you can expose the ports directly by uncommenting the following line:
```
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
```
## Setup Metrics Browser
@ -145,7 +138,7 @@ COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"
2. Insert the matrix access token secret:
```
abra app secret insert monitoring.marx.klasse-methode.it matrix_access_token v1
abra app secret insert monitoring.marx.klasse-methode.it matrix_token v1
```
3. Set required configurations:
@ -156,9 +149,13 @@ GF_MATRIX_HOME_SERVER_URL=
```
4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`
## Alerts
## alerts
It is possible to enable the following alerts, by setting the corresponding env variable to `true`:
- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
- backupbot not successful: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`
It is possible to enable the following alerts, by uncommenting the corresponding env variable:
- node disk space: `ALERT_NODE_DISK_SPACE_LEFT`
- node memory usage: `ALERT_NODE_MEMORY_USAGE`

119
abra.sh

@ -1,17 +1,17 @@
export ENTRYPOINT_VERSION=v1
export GRAFANA_DATASOURCES_YML_VERSION=v1
export GRAFANA_DASHBOARDS_YML_VERSION=v2
export GRAFANA_SWARM_DASHBOARD_JSON_VERSION=v2
export GRAFANA_STACKS_DASHBOARD_JSON_VERSION=v2
export GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION=v2
export GRAFANA_BACKUP_DASHBOARD_JSON_VERSION=v1
export GRAFANA_CUSTOM_INI_VERSION=v4
export PROMTAIL_YML_VERSION=v3
export LOKI_YML_VERSION=v2
export GF_DATASOURCES_VERSION=v1
export GF_DASHBOARDS_VERSION=v2
export GF_SWARM_DASH_VERSION=v2
export GF_STACKS_DASH_VERSION=v2
export GF_TRAEFIK_DASH_VERSION=v2
export GF_BACKUP_DASH_VERSION=v1
export GF_CUSTOM_INI_VERSION=v4
export LOKI_YML_VERSION=v3
export PROMETHEUS_YML_VERSION=v2
export MATRIX_ALERTMANAGER_CONFIG_VERSION=v1
export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=v1
export GRAFANA_ALERTS_NODE_VERSION=v2
export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
export GRAFANA_ALERTS_NODE_VERSION=v1c
export CONFIG_ALLOY_VERSION=v9
# creates a default prometheus scrape config for a given node
add_node(){
@ -22,6 +22,101 @@ add_node(){
cat "/prometheus/scrape_configs/$name.yml"
}
# migrates secrets from old names to new names by reading values from the
# running containers on the server and re-inserting them under the new names.
# preview changes: abra app cmd --local <app> migrate_secret_names
# execute changes: abra app cmd --local <app> migrate_secret_names execute
migrate_secret_names() {
if ! command -v jq &> /dev/null; then
echo "jq is required on your local machine to migrate secret names"
echo "It could not be found in your PATH, please install jq to proceed."
echo "For example: On a Debian/Ubuntu system, run 'apt install jq'"
exit 1
fi
# Hardcoded migration mappings: old_secret_name|new_secret_name
MIGRATIONS="
grafana_admin_password|gf_adminpasswd
grafana_smtp_password|gf_smtp_passwd
grafana_oidc_client_secret|gf_oidc_secret
matrix_access_token|matrix_token
loki_aws_secret_access_key|loki_aws_key
"
# Determine which server the app is deployed on
SERVER=$(abra app ls -m | jq -r --arg domain "$APP_NAME" '[.[].apps[] | select(.domain == $domain) | .server] | first' 2>/dev/null)
if [ -z "$SERVER" ]; then
echo "Error: could not determine server for app '$APP_NAME'"
exit 1
fi
# Build a lookup table of all secrets currently mounted in this stack.
# Each line: <secretID> <containerID> <secretName>
LOOKUP=$(ssh "$SERVER" "
docker stack services ${STACK_NAME} --format '{{.Name}}' | while read svc; do
CID=\$(docker ps --no-trunc -q --filter \"name=\${svc}\" | head -1)
docker service inspect \"\$svc\" --format '{{json .Spec.TaskTemplate.ContainerSpec.Secrets}}' | \
jq -r --arg cid \"\$CID\" '.[]? | .SecretID + \" \" + \$cid + \" \" + .SecretName'
done | sort -k3 -r
" 2>/dev/null)
echo "Secret migration plan for: $APP_NAME (server: $SERVER)"
echo ""
printf " %-24s %-8s %s\n" "OLD NAME" "FOUND" "ACTION"
printf " %-24s %-8s %s\n" "--------" "-----" "------"
# Check each old name against the lookup table and display the plan
ANY_FOUND=false
while IFS='|' read -r OLD_NAME NEW_NAME; do
[ -z "$OLD_NAME" ] && continue
MATCH=$(echo "$LOOKUP" | grep " ${STACK_NAME}_${OLD_NAME}_" | head -1)
if [ -n "$MATCH" ]; then
printf " %-24s %-8s %s\n" "$OLD_NAME" "yes" "recreate as '$NEW_NAME' version v1"
ANY_FOUND=true
else
printf " %-24s %-8s %s\n" "$OLD_NAME" "no" "nothing (not found on server)"
fi
done <<< "$MIGRATIONS"
echo ""
if [ "$ANY_FOUND" = false ]; then
echo "No old secrets found on server. Nothing to migrate."
return 0
fi
if [ "$1" != "execute" ]; then
echo "To apply the above changes, run:"
echo " abra app cmd --local $APP_NAME migrate_secret_names execute"
return 0
fi
# read each found secret from its container and re-insert with the new name
while IFS='|' read -r OLD_NAME NEW_NAME; do
[ -z "$OLD_NAME" ] && continue
MATCH=$(echo "$LOOKUP" | grep " ${STACK_NAME}_${OLD_NAME}_" | head -1)
[ -z "$MATCH" ] && continue
SECRET_ID=$(echo "$MATCH" | awk '{print $1}')
CID=$(echo "$MATCH" | awk '{print $2}')
SECRET_VALUE=$(ssh "$SERVER" "cat /var/lib/docker/containers/${CID}/mounts/secrets/${SECRET_ID} 2>/dev/null || sudo cat /var/lib/docker/containers/${CID}/mounts/secrets/${SECRET_ID} 2>/dev/null")
if [ -z "$SECRET_VALUE" ]; then
echo "Error: could not read value for '$OLD_NAME', skipping"
continue
fi
echo "Migrating: '$OLD_NAME' -> '$NEW_NAME' (v1)"
printf '%s' "$SECRET_VALUE" | abra app secret insert -C "$APP_NAME" "$NEW_NAME" v1
done <<< "$MIGRATIONS"
echo ""
echo "Done."
}
# adds a domain to a scrape config or creates a new one
add_domain(){
name=$1
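
The matching step in `migrate_secret_names` relies on lookup lines of the form `<secretID> <containerID> <secretName>` and on secret names following `<stack>_<name>_<version>`. A small sketch of that lookup, run against made-up sample data instead of a live server; the IDs, the stack name `mystack`, and the helper names are hypothetical:

```shell
# Made-up sample of the lookup table that migrate_secret_names builds over ssh.
LOOKUP="sec111 cid111 mystack_grafana_admin_password_v1
sec222 cid222 mystack_matrix_access_token_v2"

# Print the first lookup line whose secret name is <stack>_<old name>_<version>
find_secret() {
  printf '%s\n' "$LOOKUP" | grep " ${1}_${2}_" | head -1
}

# Print only the secret ID (first column), or nothing when there is no match
secret_id() {
  find_secret "$1" "$2" | awk '{print $1}'
}

secret_id mystack matrix_access_token
```

The leading space and trailing underscore in the `grep` pattern keep a short old name from matching a longer secret name that merely contains it.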


@ -12,7 +12,7 @@ http:
matrix:
homeserver-url: "{{ env "GF_MATRIX_HOMESERVER_URL" }}"
user-id: "{{ env "GF_MATRIX_USER_ID" }}"
access-token: "{{ secret "matrix_access_token" }}"
access-token: "{{ secret "matrix_token" }}"
room-mapping:
matrixroom: "{{ env "GF_MATRIX_ROOM_ID" }}"


@ -2,13 +2,13 @@ apiVersion: 1
# List of alert rule UIDs that should be deleted
deleteRules:
{{ if not (env "ALERT_NODE_DISK_SPACE_LEFT") }}
{{ if ne (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
- orgId: 1
uid: coopcloud_node_disk_space_left
uid: bds8bhxu97pxca
{{ end }}
{{ if not (env "ALERT_NODE_MEMORY_USAGE") }}
{{ if ne (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
- orgId: 1
uid: coopcloud_node_memory_usage
uid: ads8cswmly96oa
{{ end }}
groups:
@ -17,8 +17,8 @@ groups:
folder: node
interval: 5m
rules:
{{ if (env "ALERT_NODE_DISK_SPACE_LEFT") }}
- uid: coopcloud_node_disk_space_left
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
- uid: bds8bhxu97pxca
title: Node Disk Space
condition: C
data:
@ -45,7 +45,7 @@ groups:
conditions:
- evaluator:
params:
- {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}
- 10
type: lt
operator:
type: and
@ -70,13 +70,13 @@ groups:
annotations:
description: ""
runbook_url: ""
summary: Less than {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
summary: Less than 10% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
labels:
"": ""
isPaused: false
{{ end }}
{{ if (env "ALERT_NODE_MEMORY_USAGE") }}
- uid: coopcloud_node_memory_usage
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
- uid: ads8cswmly96oa
title: Node Memory Usage
condition: C
data:
@ -103,7 +103,7 @@ groups:
conditions:
- evaluator:
params:
- {{ env "ALERT_NODE_MEMORY_USAGE" }}
- 85
type: gt
operator:
type: and
@ -126,6 +126,6 @@ groups:
execErrState: Error
for: 5m
annotations:
summary: Memory usage is above {{ env "ALERT_NODE_MEMORY_USAGE" }}% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
summary: Memory usage is above 85% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
isPaused: false
{{ end }}
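
With the `eq`/`ne` comparisons, a rule is provisioned only when its variable is exactly the string `true`; any other value, including an unset variable, puts the rule's UID under `deleteRules`. A rough shell equivalent of that decision, using a hypothetical helper name:

```shell
# Hypothetical helper mirroring the template condition
#   {{ if eq (env "ALERT_..._ENABLED") "true" }} ... {{ end }}
alert_action() {
  if [ "${1:-}" = "true" ]; then
    echo "provision"
  else
    echo "delete"
  fi
}

alert_action "true"
alert_action "1"
alert_action ""
```

The comparison is case-sensitive, so a value such as `True` is treated like any other non-matching string.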


@ -1,7 +0,0 @@
---
version: "3.8"
secrets:
basic_auth:
external: true
name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}


@ -1,13 +0,0 @@
---
version: "3.8"
services:
app:
ports:
- "9100:9100"
deploy:
cadvisor:
ports:
- "9101:8080"
deploy:


@ -3,7 +3,7 @@ version: '3.8'
services:
grafana:
secrets:
- grafana_oidc_client_secret
- gf_oidc_secret
environment:
- OIDC_API_URL
- OIDC_AUTH_URL
@ -12,6 +12,6 @@ services:
- OIDC_TOKEN_URL
secrets:
grafana_oidc_client_secret:
gf_oidc_secret:
external: true
name: ${STACK_NAME}_grafana_oidc_client_secret_${SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION}
name: ${STACK_NAME}_gf_oidc_secret_${SECRET_GF_OIDC_SECRET_VERSION}


@ -3,16 +3,16 @@ version: '3.8'
services:
grafana:
secrets:
- grafana_smtp_password
- gf_smtp_passwd
environment:
- GF_SMTP_HOST
- GF_SMTP_USER
- GF_SMTP_PASSWORD__FILE=/run/secrets/grafana_smtp_password
- GF_SMTP_PASSWORD__FILE=/run/secrets/gf_smtp_passwd
- GF_SMTP_ENABLED
- GF_SMTP_FROM_ADDRESS
- GF_SMTP_SKIP_VERIFY
secrets:
grafana_smtp_password:
gf_smtp_passwd:
external: true
name: ${STACK_NAME}_grafana_smtp_password_${SECRET_GRAFANA_SMTP_PASSWORD_VERSION}
name: ${STACK_NAME}_gf_smtp_passwd_${SECRET_GF_SMTP_PASSWD_VERSION}


@ -2,25 +2,25 @@ version: '3.8'
services:
grafana:
image: grafana/grafana:10.4.14
image: grafana/grafana:12.4.3
volumes:
- grafana-data:/var/lib/grafana:rw
secrets:
- grafana_admin_password
- gf_adminpasswd
configs:
- source: grafana_custom_ini
- source: gf_custom_ini
target: /etc/grafana/grafana.ini
- source: grafana_datasources_yml
- source: gf_datasources
target: /etc/grafana/provisioning/datasources/datasources.yml
- source: grafana_dashboards_yml
- source: gf_dashboards
target: /etc/grafana/provisioning/dashboards/dashboards.yml
- source: grafana_swarm_dashboard_json
- source: gf_swarm_dash
target: /var/lib/grafana/dashboards/docker-swarm-nodes.json
- source: grafana_stacks_dashboard_json
- source: gf_stacks_dash
target: /var/lib/grafana/dashboards/docker-swarm-stacks.json
- source: grafana_traefik_dashboard_json
- source: gf_traefik_dash
target: /var/lib/grafana/dashboards/traefik.json
- source: grafana_backup_dashboard_json
- source: gf_backup_dash
target: /var/lib/grafana/dashboards/backup.json
- source: gf_alerts_node
target: /etc/grafana/provisioning/alerting/node.yml
@ -29,49 +29,49 @@ services:
- internal
environment:
- GF_SERVER_ROOT_URL
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/gf_adminpasswd
- GF_SECURITY_ALLOW_EMBEDDING
- GF_INSTALL_PLUGINS
- ALERT_NODE_DISK_SPACE_LEFT
- ALERT_NODE_MEMORY_USAGE
- ALERT_NODE_DISK_SPACE_ENABLED
- ALERT_NODE_MEMORY_USAGE_ENABLED
deploy:
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-grafana.loadbalancer.server.port=3000"
- "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${GRAFANA_DOMAIN:-$DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-grafana.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-grafana.tls=true"
- "traefik.http.routers.${STACK_NAME}-grafana.tls.certresolver=${LETS_ENCRYPT_ENV}"
healthcheck:
test: "wget -q http://localhost:3000/ -O/dev/null"
test: "wget -q http://localhost:3000/healthz -O/dev/null"
interval: 5s
timeout: 10s
retries: 3
start_period: 10s
configs:
grafana_custom_ini:
gf_custom_ini:
template_driver: golang
name: ${STACK_NAME}_grafana_custom_ini_${GRAFANA_CUSTOM_INI_VERSION}
name: ${STACK_NAME}_gf_custom_ini_${GF_CUSTOM_INI_VERSION}
file: grafana_custom.ini
grafana_datasources_yml:
name: ${STACK_NAME}_g_datasources_yml_${GRAFANA_DATASOURCES_YML_VERSION}
gf_datasources:
name: ${STACK_NAME}_gf_datasources_${GF_DATASOURCES_VERSION}
file: grafana-datasources.yml
grafana_dashboards_yml:
name: ${STACK_NAME}_g_dashboards_yml_${GRAFANA_DASHBOARDS_YML_VERSION}
gf_dashboards:
name: ${STACK_NAME}_gf_dashboards_${GF_DASHBOARDS_VERSION}
file: grafana-dashboards.yml
grafana_swarm_dashboard_json:
name: ${STACK_NAME}_g_swarm_dashboard_json_${GRAFANA_SWARM_DASHBOARD_JSON_VERSION}
gf_swarm_dash:
name: ${STACK_NAME}_gf_swarm_dash_${GF_SWARM_DASH_VERSION}
file: grafana-swarm-dashboard.json
grafana_stacks_dashboard_json:
name: ${STACK_NAME}_g_stacks_dashboard_json_${GRAFANA_STACKS_DASHBOARD_JSON_VERSION}
gf_stacks_dash:
name: ${STACK_NAME}_gf_stacks_dash_${GF_STACKS_DASH_VERSION}
file: grafana-stacks-dashboard.json
grafana_traefik_dashboard_json:
name: ${STACK_NAME}_g_traefik_dashboard_json_${GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION}
gf_traefik_dash:
name: ${STACK_NAME}_gf_traefik_dash_${GF_TRAEFIK_DASH_VERSION}
file: grafana-traefik-dashboard.json
grafana_backup_dashboard_json:
name: ${STACK_NAME}_g_backup_dashboard_json_${GRAFANA_BACKUP_DASHBOARD_JSON_VERSION}
gf_backup_dash:
name: ${STACK_NAME}_gf_backup_dash_${GF_BACKUP_DASH_VERSION}
file: grafana-backup-dashboard.json
gf_alerts_node:
template_driver: golang
@ -83,6 +83,6 @@ volumes:
secrets:
grafana_admin_password:
gf_adminpasswd:
external: true
name: ${STACK_NAME}_grafana_admin_password_${SECRET_GRAFANA_ADMIN_PASSWORD_VERSION}
name: ${STACK_NAME}_gf_adminpasswd_${SECRET_GF_ADMINPASSWD_VERSION}


@ -2,7 +2,7 @@ version: '3.8'
services:
loki:
image: grafana/loki:2.9.11
image: grafana/loki:3.7.2
command: -config.file=/etc/loki/local-config.yaml
networks:
- proxy
@ -12,7 +12,7 @@ services:
volumes:
- loki-data:/loki
# secrets:
# - loki_aws_secret_access_key
# - loki_aws_key
environment:
- LOKI_ACCESS_KEY_ID
- LOKI_AWS_ENDPOINT
@ -27,7 +27,7 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-loki.loadbalancer.server.port=3100"
- "traefik.http.routers.${STACK_NAME}-loki.rule=Host(`loki.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-loki.entrypoints=web-secure"
@ -47,6 +47,6 @@ volumes:
loki-data:
# secrets:
# loki_aws_secret_access_key:
# loki_aws_key:
# external: true
# name: ${STACK_NAME}_loki_aws_secret_access_key_${SECRET_LOKI_AWS_SECRET_ACCESS_KEY_VERSION}
# name: ${STACK_NAME}_loki_aws_key_${SECRET_LOKI_AWS_KEY_VERSION}


@ -2,9 +2,9 @@ version: '3.8'
services:
matrix-alertmanager-receiver:
image: metio/matrix-alertmanager-receiver:2025.2.9
image: metio/matrix-alertmanager-receiver:2026.2.25
secrets:
- matrix_access_token
- matrix_token
configs:
- source: matrix-alertmanager-receiver-config
target: /etc/matrix-alertmanager-receiver/config.yml
@ -23,6 +23,6 @@ configs:
file: alertmanager-matrix-config.yml.tmpl
secrets:
matrix_access_token:
matrix_token:
external: true
name: ${STACK_NAME}_matrix_access_token_${SECRET_MATRIX_ACCESS_TOKEN_VERSION}
name: ${STACK_NAME}_matrix_token_${SECRET_MATRIX_TOKEN_VERSION}


@ -2,7 +2,7 @@ version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.55.1
image: prom/prometheus:v3.12.0
secrets:
- basic_auth
volumes:
@ -16,6 +16,8 @@ services:
- "--web.console.libraries=/usr/share/prometheus/console_libraries"
- "--web.console.templates=/usr/share/prometheus/consoles"
- "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION_TIME}"
- "--enable-feature=remote-write-receiver"
- "--web.enable-remote-write-receiver"
networks:
- proxy
- internal
@ -24,12 +26,13 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-prometheus.loadbalancer.server.port=9090"
- "traefik.http.routers.${STACK_NAME}-prometheus.rule=Host(`prometheus.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-prometheus.tls=true"
- "traefik.http.routers.${STACK_NAME}-prometheus.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-prometheus.middlewares=basicauth@file"
configs:
prometheus_yml:


@ -1,25 +0,0 @@
version: "3.8"
services:
promtail:
image: grafana/promtail:2.9.11
volumes:
- /var/log:/var/log:ro
- /var/run/docker.sock:/var/run/docker.sock
command: -config.file=/etc/promtail/config.yml
configs:
- source: promtail_yml
target: /etc/promtail/config.yml
networks:
- internal
secrets:
- basic_auth
environment:
- DOMAIN
- LOKI_PUSH_URL
configs:
promtail_yml:
name: ${STACK_NAME}_promtail_yml_${PROMTAIL_YML_VERSION}
file: promtail.yml.tmpl
template_driver: golang


@ -2,7 +2,7 @@ version: '3.8'
services:
pushgateway:
image: prom/pushgateway:v1.10.0
image: prom/pushgateway:v1.11.2
command:
- '--web.listen-address=:9191'
- '--push.disable-consistency-check'
@ -17,7 +17,7 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-pushgateway.loadbalancer.server.port=9191"
- "traefik.http.routers.${STACK_NAME}-pushgateway.rule=Host(`pushgateway.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-pushgateway.entrypoints=web-secure"

6
compose.syslog.yml Normal file

@ -0,0 +1,6 @@
---
version: "3.8"
services:
app:
ports:
- "514:514"


@ -3,89 +3,45 @@ version: "3.8"
services:
app:
image: prom/node-exporter:v1.8.1
user: root
environment:
- NODE_ID={{.Node.ID}}
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /etc/hostname:/etc/nodename:ro
command:
- "--path.sysfs=/host/sys"
- "--path.procfs=/host/proc"
- "--path.rootfs=/rootfs"
- "--collector.textfile.directory=/etc/node-exporter/"
- "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)"
- "--no-collector.ipvs"
image: grafana/alloy:v1.16.1
hostname: "${DOMAIN}"
configs:
- source: entrypoint
target: /entrypoint.sh
- source: config_alloy
target: /etc/alloy/config.alloy
volumes:
- /:/rootfs:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- alloy-data:/var/lib/alloy/data
command:
- "run"
- "--storage.path=/var/lib/alloy/data"
- "--server.http.listen-addr=127.0.0.1:12345"
- "/etc/alloy/config.alloy"
networks:
- internal
- proxy
entrypoint: [ "/bin/sh", "-e", "/entrypoint.sh" ]
secrets:
- basic_auth
deploy:
restart_policy:
condition: on-failure
labels:
- "backupbot.backup=${ENABLE_BACKUPS:-true}"
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-node.loadbalancer.server.port=9100"
- "traefik.http.routers.${STACK_NAME}-node.rule=Host(`node.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-node.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-node.tls=true"
- "traefik.http.routers.${STACK_NAME}-node.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-node.middlewares=basicauth@file"
- "traefik.enable=false"
- "coop-cloud.${STACK_NAME}.version=1.6.0+v1.8.1"
- "coop-cloud.${STACK_NAME}.timeout=${TIMEOUT}"
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.2
command:
- "-logtostderr"
- "--enable_metrics=cpu,cpuLoad,disk,diskIO,process,memory,network"
# all possible metrics: advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp.
- "--housekeeping_interval=120s"
- "--docker_only=true"
volumes:
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
- /sys:/sys:ro
- /var/run:/var/run:ro
- /:/rootfs:ro
networks:
- internal
- proxy
deploy:
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-cadvisor.loadbalancer.server.port=8080"
- "traefik.http.routers.${STACK_NAME}-cadvisor.rule=Host(`cadvisor.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-cadvisor.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-cadvisor.tls=true"
- "traefik.http.routers.${STACK_NAME}-cadvisor.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-cadvisor.middlewares=basicauth@file"
healthcheck:
test: wget --quiet --tries=1 --spider http://localhost:8080/healthz || exit 1
interval: 15s
timeout: 15s
retries: 5
start_period: 30s
configs:
entrypoint:
name: ${STACK_NAME}_entrypoint_${ENTRYPOINT_VERSION}
file: node-exporter-entrypoint.sh
config_alloy:
template_driver: golang
name: ${STACK_NAME}_config_alloy_${CONFIG_ALLOY_VERSION}
file: config.alloy.tmpl
networks:
proxy:
external: true
internal:
volumes:
alloy-data:
secrets:
basic_auth:
external: true
name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}

245
config.alloy.tmpl Normal file

@ -0,0 +1,245 @@
logging {
level = "info"
format = "logfmt"
}
discovery.docker "linux" {
host = "unix:///var/run/docker.sock"
}
{{ if ne (env "PROMETHEUS_REMOTE_WRITE_URL") "" }}
prometheus.exporter.cadvisor "docker" {
docker_only = true
enabled_metrics = ["cpu", "cpuLoad", "disk", "diskIO", "memory", "network", "process"]
}
prometheus.exporter.unix "default" {
include_exporter_metrics = true
rootfs_path = "/rootfs"
procfs_path = "/rootfs/proc"
sysfs_path = "/rootfs/sys"
disable_collectors = ["ipvs"]
filesystem {
fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
mount_points_exclude = "^/(sys|proc|dev|host|etc)($|/)"
mount_timeout = "5s"
}
netclass { ignored_devices = "^(veth.*)$" }
netdev { device_exclude = "^(veth.*)$" }
}
prometheus.exporter.self "alloy" {}
prometheus.scrape "default" {
scrape_interval = "120s"
targets = array.concat(
prometheus.exporter.self.alloy.targets,
prometheus.exporter.unix.default.targets,
prometheus.exporter.cadvisor.docker.targets,
)
forward_to = [prometheus.remote_write.prometheus.receiver]
}
prometheus.remote_write "prometheus" {
endpoint {
url = "{{ env "PROMETHEUS_REMOTE_WRITE_URL" }}"
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
}
// Scrape Prometheus metrics from other containers on this host.
// Containers opt in via Docker labels:
// prometheus.io/scrape=true required: enable scraping
// prometheus.io/port=9090 optional: port exposing /metrics (defaults to first exposed port)
// prometheus.io/path=/metrics optional: path to metrics endpoint (default: /metrics)
// prometheus.io/auth=basic optional: use basic auth with the shared basic_auth secret
//
// Uses docker_gwbridge — the host-local bridge network Docker attaches all
// Swarm containers to for outbound connectivity. Alloy can reach any container
// on the same host via this network without needing to join each stack's
// overlay network.
discovery.docker "containers" {
host = "unix:///var/run/docker.sock"
match_first_network = false
}
discovery.relabel "metrics" {
targets = discovery.docker.containers.targets
rule {
source_labels = ["__meta_docker_network_name"]
regex = "docker_gwbridge"
action = "keep"
}
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_scrape"]
regex = "true"
action = "keep"
}
rule {
source_labels = ["__address__", "__meta_docker_container_label_prometheus_io_port"]
regex = `(.+):\d+;(\d+)`
target_label = "__address__"
replacement = "$1:$2"
}
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_path"]
regex = `(.+)`
target_label = "__metrics_path__"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
target_label = "job"
}
}
discovery.relabel "metrics_noauth" {
targets = discovery.relabel.metrics.output
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
regex = "^$"
action = "keep"
}
}
discovery.relabel "metrics_basicauth" {
targets = discovery.relabel.metrics.output
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
regex = "basic"
action = "keep"
}
}
prometheus.scrape "containers" {
scrape_interval = "120s"
targets = discovery.relabel.metrics_noauth.output
forward_to = [prometheus.remote_write.prometheus.receiver]
}
prometheus.scrape "containers_basicauth" {
scrape_interval = "120s"
targets = discovery.relabel.metrics_basicauth.output
forward_to = [prometheus.remote_write.prometheus.receiver]
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
{{ end }}
{{ if ne (env "LOKI_PUSH_URL") "" }}
discovery.relabel "docker" {
targets = discovery.docker.linux.targets
rule {
source_labels = ["__meta_docker_container_name"]
target_label = "container_name"
}
rule {
source_labels = ["__meta_docker_container_id"]
target_label = "container_id"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_stack_namespace"]
target_label = "stack_namespace"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
target_label = "service_name"
}
rule {
source_labels = ["__meta_docker_container_log_stream"]
target_label = "stream"
}
}
loki.source.docker "docker" {
host = "unix:///var/run/docker.sock"
targets = discovery.relabel.docker.output
labels = {"app" = "docker"}
forward_to = [loki.write.loki.receiver]
}
// JOURNALD: reads the systemd journal binary log directly.
// Use on systemd hosts (most modern Linux distros). Requires no syslogd.
{{ if eq (env "JOURNALD") "1" }}
loki.source.journal "journal" {
path = "/rootfs/var/log/journal"
labels = { job = "{{ env "DOMAIN" }}" }
forward_to = [loki.write.loki.receiver]
}
{{ end }}
// SYSLOG_FILES: tails all /var/log/*log files (syslog, auth.log, kern.log, etc.).
// Use on non-systemd hosts where a syslogd writes to /var/log.
{{ if eq (env "SYSLOG_FILES") "1" }}
local.file_match "syslog_files" {
path_targets = [{ __path__ = "/rootfs/var/log/*log" }]
}
loki.source.file "syslog_files" {
targets = local.file_match.syslog_files.targets
forward_to = [loki.process.syslog_files.receiver]
}
loki.process "syslog_files" {
stage.static_labels {
values = { job = "syslog" }
}
forward_to = [loki.write.loki.receiver]
}
{{ end }}
// SYSLOG: opens a network syslog listener on port 514.
// Use when a remote device, or a local syslogd configured to forward
// over the network, sends logs to this host.
// Requires compose.syslog.yml to publish port 514 to the host.
// This is NOT needed for reading local log files — use SYSLOG_FILES instead.
{{ if eq (env "SYSLOG") "1" }}
loki.relabel "syslog" {
rule {
action = "labelmap"
regex = "__syslog_(.+)"
}
forward_to = []
}
loki.source.syslog "syslog" {
listener {
address = "[::]:514"
label_structured_data = true
labels = { component = "loki.source.syslog" }
}
relabel_rules = loki.relabel.syslog.rules
forward_to = [loki.write.loki.receiver]
}
{{ end }}
loki.write "loki" {
endpoint {
url = "{{ env "LOKI_PUSH_URL" }}"
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
external_labels = { hostname = "{{ env "DOMAIN" }}" }
}
{{ end }}
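
To make the address rewrite in `discovery.relabel "metrics"` above concrete, here is a minimal Python sketch of how a Prometheus-style `replace` rule behaves: the source label values are joined with `;`, the regex has to match the whole joined string, and `$1:$2` is filled in from the capture groups. The function name `rewrite_address` and the sample addresses are hypothetical. The code only imitates the relabeling semantics and is not Alloy code.

```python
import re

# Same pattern as the relabel rule: "<host>:<discovered port>;<label port>".
ADDRESS_RULE = re.compile(r"(.+):\d+;(\d+)")


def rewrite_address(address: str, port_label: str) -> str:
    """Imitate the `replace` rule that swaps in the prometheus.io/port value.

    Relabeling joins the source label values with ";" and requires the regex
    to match the entire joined string. When it does not match, the target
    label is left unchanged.
    """
    joined = f"{address};{port_label}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    # Label present: the discovered port is replaced by the labelled one.
    print(rewrite_address("172.18.0.5:8080", "9090"))
    # Label missing: the joined string ends in ";" and the rule is a no-op.
    print(rewrite_address("172.18.0.5:8080", ""))
```

The no-op case is why `prometheus.io/port` can stay optional: without the label, the address found by Docker discovery is scraped as-is.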

grafana-alerts.json.tmpl Normal file

@ -0,0 +1,315 @@
{
"apiVersion": 1,
"groups": [
{
"orgId": 1,
"name": "backupbot",
"folder": "node",
"interval": "1m",
"rules": [
{{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
{
"uid": "de8e5xxup7t34a",
"title": "Backup Failed",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "1m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
{
"uid": "ce8e65uddcwe8d",
"title": "Backup Missing",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "rate(backup[24h])",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0, 0], "type": "within_range" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
{
"uid": "de8e6bc92a8lcc",
"title": "Backup Not Successful",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": {
"params": [
0
],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": [
"C"
]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
}
],
"datasource": {
"type": "__expr__",
"uid": "__expr__"
},
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "20m",
"annotations": {
"summary": "Backup did not finish within 20 minutes"
},
"labels": {},
"isPaused": false
}
{{ end }}
]
},
{
"orgId": 1,
"name": "node",
"folder": "node",
"interval": "5m",
"rules": [
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
{
"uid": "bds8bhxu97pxca",
"title": "Node Disk Space",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_filesystem_free_bytes{fstype=\"ext4\"} / node_filesystem_size_bytes{fstype=\"ext4\"}) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [10], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
{
"uid": "ads8cswmly96oa",
"title": "Node Memory Usage",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [90], "type": "gt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
}
{{ end }}
]
}
]
}


@ -21,7 +21,7 @@ tls_skip_verify_insecure = false
allow_sign_up = true
auto_login = true
client_id = {{ env "OIDC_CLIENT_ID" }}
client_secret = {{ secret "grafana_oidc_client_secret" }}
client_secret = {{ secret "gf_oidc_secret" }}
auth_url = {{ env "OIDC_AUTH_URL" }}
token_url = {{ env "OIDC_TOKEN_URL" }}
api_url = {{ env "OIDC_API_URL" }}


@ -34,7 +34,6 @@ ingester:
max_chunk_age: 1h # All chunks will be flushed when they hit this age, default is 1h
chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1 MiB (1048576 bytes), flushing earlier if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 30s # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
max_transfer_retries: 0 # Chunk transfers disabled
wal:
dir: "/tmp/wal"
@ -53,7 +52,7 @@ schema_config:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
schema: v13
index:
prefix: index_
period: 24h
@ -63,7 +62,6 @@ storage_config:
active_index_directory: /loki/boltdb-shipper-active
cache_location: /loki/boltdb-shipper-cache
cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
shared_store: filesystem
filesystem:
directory: /loki/chunks
{{ end }}
@ -72,7 +70,6 @@ schema_config:
configs:
- from: 2020-11-25
store: boltdb-shipper
object_store: aws
schema: v11
index:
prefix: index_
@ -89,7 +86,7 @@ storage_config:
endpoint: {{ env "LOKI_AWS_ENDPOINT" }}
region: {{ env "LOKI_AWS_REGION" }}
access_key_id: {{ env "LOKI_ACCESS_KEY_ID" }}
secret_access_key: {{ secret "loki_aws_secret_access_key" }}
secret_access_key: {{ secret "loki_aws_key" }}
bucketnames: {{ env "LOKI_BUCKET_NAMES" }}
insecure: false
sse_encryption: false
@ -103,19 +100,24 @@ storage_config:
compactor:
working_directory: /loki/boltdb-shipper-compactor
shared_store: filesystem
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
{{ if eq (env "LOKI_STORAGE_FILESYSTEM") "1" }}
delete_request_store: filesystem
{{ end }}
{{ if eq (env "LOKI_STORAGE_S3") "1" }}
delete_request_store: aws
{{ end }}
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
retention_period: {{ env "LOKI_RETENTION_PERIOD" }}
split_queries_by_interval: 24h
max_query_parallelism: 100
allow_structured_metadata: false
query_scheduler:
max_outstanding_requests_per_tenant: 4096
@ -123,9 +125,6 @@ query_scheduler:
frontend:
max_outstanding_per_tenant: 4096
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0s


@ -1,11 +0,0 @@
#!/bin/sh -e
NODE_NAME=$(cat /etc/nodename)
mkdir -p /etc/node-exporter
echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /etc/node-exporter/node-meta.prom
set -- /bin/node_exporter "$@"
exec "$@"


@ -1,37 +0,0 @@
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: {{ env "LOKI_PUSH_URL" }}
basic_auth:
username: admin
password: {{ secret "basic_auth" }}
external_labels:
hostname: {{ env "DOMAIN" }}
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log
- job_name: "docker"
docker_sd_configs:
- host: "unix:///var/run/docker.sock"
refresh_interval: "10s"
relabel_configs:
- source_labels: ['__meta_docker_container_name']
target_label: "container_name"
- source_labels: ['__meta_docker_container_id']
target_label: "container_id"
- source_labels: ['__meta_docker_container_label_com_docker_stack_namespace']
target_label: "stack_namespace"
- source_labels: ['__meta_docker_container_label_com_docker_swarm_service_name']
target_label: "service_name"


@ -6,4 +6,7 @@ COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-oidc.yml"
COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-smtp.yml"
3. The scape-config.example.yml file and add_node() command were updated to use a secure endpoint for the traefik metrics instead http. This requires an updated Traefik recipe that publishes the metrics on https.
3. The scrape-config.example.yml file and add_node() command were updated to use a secure endpoint for the traefik metrics instead of http. This requires an updated Traefik recipe that publishes the metrics on https.
4. Secret and config names were shortened to a maximum of 14 characters so that they stay within Docker's 64-character limit once STACK_NAME and VERSION are added to them.
When upgrading, you need to reinsert the secrets under their shorter names. Run `abra app secret list <domain>` to see which secrets are missing on the server (because their names were shortened), then run `abra app secret insert <domain> <secret_name> v1 <value>` to reinsert each one. Alternatively, the migrate_secret_names function in abra.sh reinserts all existing secrets under their shorter names automatically: `abra app cmd --local <domain> migrate_secret_names`
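
As a rough illustration of the length budget behind item 4, the sketch below checks a full secret name against the 64-character limit. It assumes the deployed name has the form `<STACK_NAME>_<secret>_<VERSION>`, which is not spelled out in this README; `full_secret_name`, `fits_limit`, and `max_stack_name_length` are hypothetical helpers, not part of abra.

```python
DOCKER_NAME_LIMIT = 64  # limit quoted in the README text above


def full_secret_name(stack_name: str, secret: str, version: str) -> str:
    """Build the name under the assumed <stack>_<secret>_<version> scheme."""
    return f"{stack_name}_{secret}_{version}"


def fits_limit(stack_name: str, secret: str, version: str) -> bool:
    """Return True when the assembled name stays within the limit."""
    return len(full_secret_name(stack_name, secret, version)) <= DOCKER_NAME_LIMIT


def max_stack_name_length(secret: str, version: str) -> int:
    """Longest STACK_NAME that still fits, given two "_" separators."""
    return DOCKER_NAME_LIMIT - len(secret) - len(version) - 2


if __name__ == "__main__":
    # Old and new secret names taken from the diff above.
    for secret in ("grafana_oidc_client_secret", "gf_oidc_secret"):
        print(secret, len(secret), max_stack_name_length(secret, "v1"))
```

Under that assumed scheme, shortening a secret name directly increases the room left for STACK_NAME, which is derived from the app's domain.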