29 Commits

Author SHA1 Message Date
e247677433 feat: scrape metrics from containers via Docker label discovery
Containers opt in with prometheus.io/scrape=true and optionally set
prometheus.io/port, prometheus.io/path, and prometheus.io/auth=basic.
2026-06-14 21:38:43 +02:00
f2310f2b86 improve alloy config and match with main branch
- Restrict Alloy UI to loopback
- Narrow volume mounts: drop /dev, reduce /var/run to docker.sock:ro
- Replace HTTP scrape of :12345 with prometheus.exporter.self
Match with main branch (node-exporter / promtail / cadvisor):
- Add docker_only and explicit enabled_metrics to cadvisor exporter
- Match node-exporter collector config
- Match promtail relabeling (container_name, container_id, stack_namespace,
  service_name) and external hostname label
- Add SYSLOG_FILES option to tail /var/log/*log (matches promtail)
- Fix journal path and syslog listener address
2026-06-13 22:07:55 +02:00
f f2711fa16e fix: upgrades 2026-06-03 00:19:41 -03:00
f 2870b9486c fix: use the actual health check path 2026-06-02 21:06:05 -03:00
f 3a1fabe4f9 fix: prevent redirections on health check 2026-06-02 21:04:56 -03:00
f a358837922 wip: relabel syslog according to docs 2026-06-02 21:00:34 -03:00
f dd0a0c1bb0 fixup! feat: read syslog 2026-06-02 20:19:45 -03:00
f 31cabc36ae fix: prevent traefik deprecation warnings 2026-06-02 19:16:49 -03:00
f d25986d5cb fix: README 2026-06-02 18:51:10 -03:00
f f8f8004445 feat: read syslog 2026-06-02 18:50:41 -03:00
f aa05d022da feat: optionally push to prometheus and loki 2026-06-02 18:50:20 -03:00
f fb52a76247 BREAKING CHANGE: deprecate node-exporter 2026-06-02 18:49:05 -03:00
f 2e2a52eae0 BREAKING CHANGE: deprecate promtail 2026-06-02 18:48:20 -03:00
f 48419d5afa fixup! BREAKING CHANGE: no need to expose exporters 2026-06-02 18:46:02 -03:00
f a0a6e2c509 fix: basic auth secret is always needed 2026-06-02 18:44:32 -03:00
f 024f2a8aec feat: send docker logs to loki 2026-06-02 18:39:24 -03:00
f 38095e23fa BREAKING CHANGE: no need to expose exporters 2026-06-02 18:37:56 -03:00
f 641161329e fix: grafana alternate domain doesn't work
the variable is not expanded and the domain name label ends up as a
literal "$DOMAIN".
2026-06-02 18:00:00 -03:00
f cdacfd035e fix: prometheus querying panel is accessible through basic auth 2026-06-02 17:52:25 -03:00
f b2d3901f61 fix: bind mounts recommended by docs 2026-06-02 13:24:28 -03:00
f 8becf1c1d6 fixup! feat: node exporter 2026-05-29 16:16:37 -03:00
f 777b1355dd fixup! feat: node exporter 2026-05-29 16:16:08 -03:00
f e83433cebd feat: node exporter 2026-05-29 16:04:19 -03:00
f a713f98ffb feat: instance name is domain 2026-05-29 16:03:59 -03:00
f 8dc84c591c fixup! feat: enable prometheus remote write receiver 2026-05-29 15:38:52 -03:00
f d9aa05a4b5 feat: send metrics to prometheus 2026-05-28 21:00:10 -03:00
f 349df12204 feat: enable prometheus remote write receiver 2026-05-28 20:44:00 -03:00
f 6c33089078 feat: cadvisor 2026-05-28 20:38:50 -03:00
f 4bedebfab1 BREAKING CHANGES: replace promtail and cadvisor for alloy 2026-05-28 20:33:36 -03:00
17 changed files with 685 additions and 231 deletions


@ -5,16 +5,28 @@ DOMAIN=monitoring-ng.example.com
#TIMEOUT=120
ENABLE_BACKUPS=true
## Enable this secret for Promtail / Prometheus
#COMPOSE_FILE="$COMPOSE_FILE:compose.basic-auth.yml"
#SECRET_BASIC_AUTH_VERSION=v1
SECRET_BASIC_AUTH_VERSION=v1
# Enable this to send logs to a Loki server, adapt DOMAIN if server is
# remote
# LOKI_PUSH_URL=https://loki.$DOMAIN/loki/api/v1/push
# Enable on systemd hosts to read logs from the journal
# JOURNALD=1
#
# Promtail (Gathering Logs)
# COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
# LOKI_PUSH_URL=https://loki.monitoring.example.org/loki/api/v1/push
# Enable on non-systemd hosts (Alpine, older Debian/Ubuntu) to tail
# /var/log/*log files (syslog, auth.log, kern.log, etc.) that a local
# syslogd writes. No syslogd reconfiguration needed.
# SYSLOG_FILES=1
#
# Enable to receive syslog messages over the network on port 514/tcp.
# Use for remote devices that push syslog to this host, or for a
# local syslogd configured to forward over the network.
# Not needed if you just want to read local log files — use SYSLOG_FILES instead.
# SYSLOG=1
# COMPOSE_FILE="$COMPOSE_FILE:compose.syslog.yml"
## Expose node and cadvisor ports instead of traefik
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
# Enable this to send metrics to a Prometheus server, adapt DOMAIN if
# server is remote
# PROMETHEUS_REMOTE_WRITE_URL=https://prometheus.$DOMAIN/api/v1/write
# Monitoring Server
#
@ -47,8 +59,6 @@ ENABLE_BACKUPS=true
# COMPOSE_FILE="$COMPOSE_FILE:compose.grafana.yml"
# GF_SERVER_ROOT_URL=https://monitoring.example.com
# SECRET_GF_ADMINPASSWD_VERSION=v1
## Seperate domain for Grafana
#GRAFANA_DOMAIN=grafana.example.com
#
## Single-Sign-On with OIDC
# COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-oidc.yml"
@ -80,10 +90,9 @@ ENABLE_BACKUPS=true
#GF_MATRIX_ROOM_ID="<room-id>"
#GF_MATRIX_HOMESERVER_URL="<homeserver-url>"
## ALerts
# Node disk space alert will trigger when free disk space left is below the given number in percent
#ALERT_NODE_DISK_SPACE_LEFT=10
# Node memory usage alert will trigger when memory usage is above the given number in percent
#ALERT_NODE_MEMORY_USAGE=85
# ALerts
#ALERT_BACKUP_FAILED_ENABLED=true
#ALERT_BACKUP_MISSING_ENABLED=true
#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
#ALERT_NODE_DISK_SPACE_ENABLED=true
#ALERT_NODE_MEMORY_USAGE_ENABLED=true
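
The log and metric options in this sample act as independent switches. Judging by the conditionals in `config.alloy.tmpl` further down in this diff, the set of Alloy components that ends up in the rendered config could be predicted roughly as in the Python sketch below. The function name `rendered_components` is hypothetical and the list is abbreviated; the component names are copied from the template.

```python
def rendered_components(env):
    """Return an abbreviated list of Alloy components the template would emit.

    `env` is a dict of the .env variables; unset variables count as "".
    This is an illustration of the template's conditionals, not part of the recipe.
    """
    components = ['discovery.docker "linux"']  # rendered unconditionally

    # Metrics pipeline: only when a remote write URL is configured.
    if env.get("PROMETHEUS_REMOTE_WRITE_URL", "") != "":
        components += [
            'prometheus.exporter.cadvisor "docker"',
            'prometheus.exporter.unix "default"',
            'prometheus.exporter.self "alloy"',
            'prometheus.remote_write "prometheus"',
        ]

    # Log pipeline: only when a Loki push URL is configured.
    if env.get("LOKI_PUSH_URL", "") != "":
        components.append('loki.source.docker "docker"')
        # The three log sources below are nested inside the Loki block,
        # so they have no effect without LOKI_PUSH_URL.
        if env.get("JOURNALD") == "1":
            components.append('loki.source.journal "journal"')
        if env.get("SYSLOG_FILES") == "1":
            components.append('loki.source.file "syslog_files"')
        if env.get("SYSLOG") == "1":
            components.append('loki.source.syslog "syslog"')
        components.append('loki.write "loki"')

    return components
```

For example, setting `JOURNALD=1` without `LOKI_PUSH_URL` would leave only the Docker discovery component in this sketch, because the journal source sits inside the Loki conditional.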


@ -1,8 +1,8 @@
# monitoring-ng
Yet another monitoring stack ...
This time it's an all-in-one grafana/prometheus/loki/node_exporter/cadvisor/promtail stack.
It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but has everything in one recipe included now. So you can deploy monitoring instances to only gather metrics / logs (node_exporter/cadvisor/promtail) and also deploy instances with the full monitoring stack (grafana/prometheus/loki) with the same recipe and just different .env configuration.
This time it's an all-in-one grafana/prometheus/loki/alloy stack.
It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but has everything in one recipe included now. So you can deploy monitoring instances to only gather metrics / logs (alloy) and also deploy instances with the full monitoring stack (grafana/prometheus/loki) with the same recipe and just different .env configuration.
<!-- metadata -->
@ -18,37 +18,47 @@ It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-clou
<!-- endmetadata -->
## Setup Metrics Gathering
## Setup a Metrics Gathering
Where gathering.org is the node you want to gather metrics from.
1. Configure DNS
- monitoring.gathering.org
- cadvisor.monitoring.gathering.org
- node.monitoring.gathering.org
2. [Configure Traefik to use BasicAuth](https://git.coopcloud.tech/coop-cloud/traefik#configuring-wildcard-ssl-using-dns)
3. `abra app new monitoring-ng`
4. `abra app config monitoring.gathering.org` (for gathering only the main `compose.yml` is needed, nothing more.)
5. `abra app deploy monitoring.gathering.org`
6. check that endpoints are up and basic-auth works
1. Configure Traefik to use BasicAuth
* `abra app config traefik.gathering.org`
uncomment
```
# BASIC_AUTH
COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
BASIC_AUTH=1
SECRET_USERSFILE_VERSION=v1
```
- Generate a users list with an htpasswd-hashed password
`abra app secret insert traefik.gathering.org usersfile v1 'admin:<hashed-secret>'`
Make sure `admin:<hashed-secret>` contains no whitespace; it seems to break things.
- `abra app deploy -f traefik`
1. `abra app new monitoring-ng`
1. `abra app config monitoring.gathering.org`
for gathering only the main `compose.yml` is needed, nothing more.
1. `abra app deploy monitoring.gathering.org`
1. check that endpoints are up and basic-auth works
- cadvisor.monitoring.gathering.org
- node.monitoring.gathering.org
### Expose node and cadvisor via ports instead of traefik
In case you have no traefik running on the machine, you can expose the ports directly by uncommenting the following line:
```
# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
```
## Setup Metrics Browser
This builds upon [Setup Metrics Gathering](#setup-metrics-grathering) so make sure you did that first.
1. Configure DNS
- monitoring.example.org
- prometheus.monitoring.example.org
- loki.monitoring.example.org
2. Setup monitoring stack
- `abra app config monitoring.example.org` Uncomment prometheus, loki and grafana
- `abra app secret insert monitoring.example.org basic_auth v1 <password>`
- `abra app new monitoring-ng`
- `abra app config monitoring.example.org`
Uncomment all the stuff
- `abra app secret insert monitoring.example.org basic_auth v1 <secret>`
this needs the plaintext traefik basic-auth secret, not the hashed one!
- `abra app secret ls monitoring.example.org`
- `abra app deploy monitoring.example.org`
@ -139,9 +149,13 @@ GF_MATRIX_HOME_SERVER_URL=
```
4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`
## Alerts
## alerts
It is possible to enable the following alerts, by setting the corresponding env variable to `true`:
- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
- backupbot not successfull: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`
It is possible to enable the following alerts, by uncommenting the corresponding env variable:
- node disk space: `ALERT_NODE_DISK_SPACE_LEFT`
- node memory usage: `ALERT_NODE_MEMORY_USAGE`


@ -6,12 +6,12 @@ export GF_STACKS_DASH_VERSION=v2
export GF_TRAEFIK_DASH_VERSION=v2
export GF_BACKUP_DASH_VERSION=v1
export GF_CUSTOM_INI_VERSION=v4
export PROMTAIL_YML_VERSION=v3
export LOKI_YML_VERSION=v3
export PROMETHEUS_YML_VERSION=v2
export MATRIX_ALERTMANAGER_CONFIG_VERSION=v1
export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=v1
export GRAFANA_ALERTS_NODE_VERSION=v2
export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
export GRAFANA_ALERTS_NODE_VERSION=v1c
export CONFIG_ALLOY_VERSION=v9
# creates a default prometheus scrape config for a given node
add_node(){


@ -2,13 +2,13 @@ apiVersion: 1
# List of alert rule UIDs that should be deleted
deleteRules:
{{ if not (env "ALERT_NODE_DISK_SPACE_LEFT") }}
{{ if ne (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
- orgId: 1
uid: coopcloud_node_disk_space_left
uid: bds8bhxu97pxca
{{ end }}
{{ if not (env "ALERT_NODE_MEMORY_USAGE") }}
{{ if ne (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
- orgId: 1
uid: coopcloud_node_memory_usage
uid: ads8cswmly96oa
{{ end }}
groups:
@ -17,8 +17,8 @@ groups:
folder: node
interval: 5m
rules:
{{ if (env "ALERT_NODE_DISK_SPACE_LEFT") }}
- uid: coopcloud_node_disk_space_left
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
- uid: bds8bhxu97pxca
title: Node Disk Space
condition: C
data:
@ -45,7 +45,7 @@ groups:
conditions:
- evaluator:
params:
- {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}
- 10
type: lt
operator:
type: and
@ -70,13 +70,13 @@ groups:
annotations:
description: ""
runbook_url: ""
summary: Less than {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
summary: Less than 10% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
labels:
"": ""
isPaused: false
{{ end }}
{{ if (env "ALERT_NODE_MEMORY_USAGE") }}
- uid: coopcloud_node_memory_usage
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
- uid: ads8cswmly96oa
title: Node Memory Usage
condition: C
data:
@ -103,7 +103,7 @@ groups:
conditions:
- evaluator:
params:
- {{ env "ALERT_NODE_MEMORY_USAGE" }}
- 85
type: gt
operator:
type: and
@ -126,6 +126,6 @@ groups:
execErrState: Error
for: 5m
annotations:
summary: Memory usage is above {{ env "ALERT_NODE_MEMORY_USAGE" }}% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
summary: Memory usage is above 85% on {{`{{ $labels.instance }}`}} ({{`{{ printf "%.2f" (index $values "A").Value }}`}}% usage)
isPaused: false
{{ end }}
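
The `eq (env "...") "true"` and `ne (env "...") "true"` guards in this template compare strings exactly, so only the literal value `true` keeps a rule provisioned; any other value lands the rule's UID in `deleteRules`. A minimal Python sketch of that decision follows. The helper `plan_rules` is hypothetical, and it assumes an unset variable reads as an empty string.

```python
# UIDs copied from the template above.
NODE_RULES = {
    "ALERT_NODE_DISK_SPACE_ENABLED": "bds8bhxu97pxca",
    "ALERT_NODE_MEMORY_USAGE_ENABLED": "ads8cswmly96oa",
}


def plan_rules(env):
    """Split the node rule UIDs into (provisioned, deleted) for a given env dict."""
    provisioned, deleted = [], []
    for var, uid in NODE_RULES.items():
        # Exact string comparison: "1", "True" or "" do not enable the rule.
        if env.get(var, "") == "true":
            provisioned.append(uid)
        else:
            deleted.append(uid)
    return provisioned, deleted
```

With `ALERT_NODE_DISK_SPACE_ENABLED=true` and `ALERT_NODE_MEMORY_USAGE_ENABLED=1`, the sketch provisions only the disk space rule and marks the memory rule for deletion.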


@ -1,7 +0,0 @@
---
version: "3.8"
secrets:
basic_auth:
external: true
name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}


@ -1,13 +0,0 @@
---
version: "3.8"
services:
app:
ports:
- "9100:9100"
deploy:
cadvisor:
ports:
- "9101:8080"
deploy:


@ -2,7 +2,7 @@ version: '3.8'
services:
grafana:
image: grafana/grafana:12.4.0
image: grafana/grafana:12.4.3
volumes:
- grafana-data:/var/lib/grafana:rw
secrets:
@ -32,19 +32,19 @@ services:
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/gf_adminpasswd
- GF_SECURITY_ALLOW_EMBEDDING
- GF_INSTALL_PLUGINS
- ALERT_NODE_DISK_SPACE_LEFT
- ALERT_NODE_MEMORY_USAGE
- ALERT_NODE_DISK_SPACE_ENABLED
- ALERT_NODE_MEMORY_USAGE_ENABLED
deploy:
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-grafana.loadbalancer.server.port=3000"
- "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${GRAFANA_DOMAIN:-$DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-grafana.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-grafana.tls=true"
- "traefik.http.routers.${STACK_NAME}-grafana.tls.certresolver=${LETS_ENCRYPT_ENV}"
healthcheck:
test: "wget -q http://localhost:3000/ -O/dev/null"
test: "wget -q http://localhost:3000/healthz -O/dev/null"
interval: 5s
timeout: 10s
retries: 3


@ -2,7 +2,7 @@ version: '3.8'
services:
loki:
image: grafana/loki:3.6.7
image: grafana/loki:3.7.2
command: -config.file=/etc/loki/local-config.yaml
networks:
- proxy
@ -27,7 +27,7 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-loki.loadbalancer.server.port=3100"
- "traefik.http.routers.${STACK_NAME}-loki.rule=Host(`loki.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-loki.entrypoints=web-secure"


@ -2,7 +2,7 @@ version: '3.8'
services:
prometheus:
image: prom/prometheus:v3.10.0
image: prom/prometheus:v3.12.0
secrets:
- basic_auth
volumes:
@ -16,6 +16,8 @@ services:
- "--web.console.libraries=/usr/share/prometheus/console_libraries"
- "--web.console.templates=/usr/share/prometheus/consoles"
- "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION_TIME}"
- "--enable-feature=remote-write-receiver"
- "--web.enable-remote-write-receiver"
networks:
- proxy
- internal
@ -24,7 +26,7 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-prometheus.loadbalancer.server.port=9090"
- "traefik.http.routers.${STACK_NAME}-prometheus.rule=Host(`prometheus.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"


@ -1,25 +0,0 @@
version: "3.8"
services:
promtail:
image: grafana/promtail:3.6.7
volumes:
- /var/log:/var/log:ro
- /var/run/docker.sock:/var/run/docker.sock
command: -config.file=/etc/promtail/config.yml
configs:
- source: promtail_yml
target: /etc/promtail/config.yml
networks:
- internal
secrets:
- basic_auth
environment:
- DOMAIN
- LOKI_PUSH_URL
configs:
promtail_yml:
name: ${STACK_NAME}_promtail_yml_${PROMTAIL_YML_VERSION}
file: promtail.yml.tmpl
template_driver: golang


@ -17,7 +17,7 @@ services:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.swarm.network=proxy"
- "traefik.http.services.${STACK_NAME}-pushgateway.loadbalancer.server.port=9191"
- "traefik.http.routers.${STACK_NAME}-pushgateway.rule=Host(`pushgateway.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-pushgateway.entrypoints=web-secure"

compose.syslog.yml (new file, 6 lines added)

@ -0,0 +1,6 @@
---
version: "3.8"
services:
app:
ports:
- "514:514"


@ -3,89 +3,45 @@ version: "3.8"
services:
app:
image: prom/node-exporter:v1.10.2
user: root
environment:
- NODE_ID={{.Node.ID}}
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /etc/hostname:/etc/nodename:ro
command:
- "--path.sysfs=/host/sys"
- "--path.procfs=/host/proc"
- "--path.rootfs=/rootfs"
- "--collector.textfile.directory=/etc/node-exporter/"
- "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)"
- "--no-collector.ipvs"
image: grafana/alloy:v1.16.1
hostname: "${DOMAIN}"
configs:
- source: entrypoint
target: /entrypoint.sh
- source: config_alloy
target: /etc/alloy/config.alloy
volumes:
- /:/rootfs:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- alloy-data:/var/lib/alloy/data
command:
- "run"
- "--storage.path=/var/lib/alloy/data"
- "--server.http.listen-addr=127.0.0.1:12345"
- "/etc/alloy/config.alloy"
networks:
- internal
- proxy
entrypoint: [ "/bin/sh", "-e", "/entrypoint.sh" ]
secrets:
- basic_auth
deploy:
restart_policy:
condition: on-failure
labels:
- "backupbot.backup=${ENABLE_BACKUPS:-true}"
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-node.loadbalancer.server.port=9100"
- "traefik.http.routers.${STACK_NAME}-node.rule=Host(`node.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-node.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-node.tls=true"
- "traefik.http.routers.${STACK_NAME}-node.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-node.middlewares=basicauth@file"
- "traefik.enable=false"
- "coop-cloud.${STACK_NAME}.version=1.6.0+v1.8.1"
- "coop-cloud.${STACK_NAME}.timeout=${TIMEOUT}"
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.55.1
command:
- "-logtostderr"
- "--enable_metrics=cpu,cpuLoad,disk,diskIO,process,memory,network"
# all possible metrics: advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp.
- "--housekeeping_interval=120s"
- "--docker_only=true"
volumes:
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
- /sys:/sys:ro
- /var/run:/var/run:ro
- /:/rootfs:ro
networks:
- internal
- proxy
deploy:
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.docker.network=proxy"
- "traefik.http.services.${STACK_NAME}-cadvisor.loadbalancer.server.port=8080"
- "traefik.http.routers.${STACK_NAME}-cadvisor.rule=Host(`cadvisor.${DOMAIN}`)"
- "traefik.http.routers.${STACK_NAME}-cadvisor.entrypoints=web-secure"
- "traefik.http.routers.${STACK_NAME}-cadvisor.tls=true"
- "traefik.http.routers.${STACK_NAME}-cadvisor.tls.certresolver=${LETS_ENCRYPT_ENV}"
- "traefik.http.routers.${STACK_NAME}-cadvisor.middlewares=basicauth@file"
healthcheck:
test: wget --quiet --tries=1 --spider http://localhost:8080/healthz || exit 1
interval: 15s
timeout: 15s
retries: 5
start_period: 30s
configs:
entrypoint:
name: ${STACK_NAME}_entrypoint_${ENTRYPOINT_VERSION}
file: node-exporter-entrypoint.sh
config_alloy:
template_driver: golang
name: ${STACK_NAME}_config_alloy_${CONFIG_ALLOY_VERSION}
file: config.alloy.tmpl
networks:
proxy:
external: true
internal:
volumes:
alloy-data:
secrets:
basic_auth:
external: true
name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}

config.alloy.tmpl (new file, 245 lines added)

@ -0,0 +1,245 @@
logging {
level = "info"
format = "logfmt"
}
discovery.docker "linux" {
host = "unix:///var/run/docker.sock"
}
{{ if ne (env "PROMETHEUS_REMOTE_WRITE_URL") "" }}
prometheus.exporter.cadvisor "docker" {
docker_only = true
enabled_metrics = ["cpu", "cpuLoad", "disk", "diskIO", "memory", "network", "process"]
}
prometheus.exporter.unix "default" {
include_exporter_metrics = true
rootfs_path = "/rootfs"
procfs_path = "/rootfs/proc"
sysfs_path = "/rootfs/sys"
disable_collectors = ["ipvs"]
filesystem {
fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
mount_points_exclude = "^/(sys|proc|dev|host|etc)($|/)"
mount_timeout = "5s"
}
netclass { ignored_devices = "^(veth.*)$" }
netdev { device_exclude = "^(veth.*)$" }
}
prometheus.exporter.self "alloy" {}
prometheus.scrape "default" {
scrape_interval = "120s"
targets = array.concat(
prometheus.exporter.self.alloy.targets,
prometheus.exporter.unix.default.targets,
prometheus.exporter.cadvisor.docker.targets,
)
forward_to = [prometheus.remote_write.prometheus.receiver]
}
prometheus.remote_write "prometheus" {
endpoint {
url = "{{ env "PROMETHEUS_REMOTE_WRITE_URL" }}"
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
}
// Scrape Prometheus metrics from other containers on this host.
// Containers opt in via Docker labels:
// prometheus.io/scrape=true required: enable scraping
// prometheus.io/port=9090 optional: port exposing /metrics (defaults to first exposed port)
// prometheus.io/path=/metrics optional: path to metrics endpoint (default: /metrics)
// prometheus.io/auth=basic optional: use basic auth with the shared basic_auth secret
//
// Uses docker_gwbridge — the host-local bridge network Docker attaches all
// Swarm containers to for outbound connectivity. Alloy can reach any container
// on the same host via this network without needing to join each stack's
// overlay network.
discovery.docker "containers" {
host = "unix:///var/run/docker.sock"
match_first_network = false
}
discovery.relabel "metrics" {
targets = discovery.docker.containers.targets
rule {
source_labels = ["__meta_docker_network_name"]
regex = "docker_gwbridge"
action = "keep"
}
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_scrape"]
regex = "true"
action = "keep"
}
rule {
source_labels = ["__address__", "__meta_docker_container_label_prometheus_io_port"]
regex = `(.+):\d+;(\d+)`
target_label = "__address__"
replacement = "$1:$2"
}
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_path"]
regex = `(.+)`
target_label = "__metrics_path__"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
target_label = "job"
}
}
discovery.relabel "metrics_noauth" {
targets = discovery.relabel.metrics.output
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
regex = "^$"
action = "keep"
}
}
discovery.relabel "metrics_basicauth" {
targets = discovery.relabel.metrics.output
rule {
source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
regex = "basic"
action = "keep"
}
}
prometheus.scrape "containers" {
scrape_interval = "120s"
targets = discovery.relabel.metrics_noauth.output
forward_to = [prometheus.remote_write.prometheus.receiver]
}
prometheus.scrape "containers_basicauth" {
scrape_interval = "120s"
targets = discovery.relabel.metrics_basicauth.output
forward_to = [prometheus.remote_write.prometheus.receiver]
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
{{ end }}
{{ if ne (env "LOKI_PUSH_URL") "" }}
discovery.relabel "docker" {
targets = discovery.docker.linux.targets
rule {
source_labels = ["__meta_docker_container_name"]
target_label = "container_name"
}
rule {
source_labels = ["__meta_docker_container_id"]
target_label = "container_id"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_stack_namespace"]
target_label = "stack_namespace"
}
rule {
source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
target_label = "service_name"
}
rule {
source_labels = ["__meta_docker_container_log_stream"]
target_label = "stream"
}
}
loki.source.docker "docker" {
host = "unix:///var/run/docker.sock"
targets = discovery.relabel.docker.output
labels = {"app" = "docker"}
forward_to = [loki.write.loki.receiver]
}
// JOURNALD: reads the systemd journal binary log directly.
// Use on systemd hosts (most modern Linux distros). Requires no syslogd.
{{ if eq (env "JOURNALD") "1" }}
loki.source.journal "journal" {
path = "/rootfs/var/log/journal"
labels = { job = "{{ env "DOMAIN" }}" }
forward_to = [loki.write.loki.receiver]
}
{{ end }}
// SYSLOG_FILES: tails all /var/log/*log files (syslog, auth.log, kern.log, etc.).
// Use on non-systemd hosts where a syslogd writes to /var/log.
{{ if eq (env "SYSLOG_FILES") "1" }}
local.file_match "syslog_files" {
path_targets = [{ __path__ = "/rootfs/var/log/*log" }]
}
loki.source.file "syslog_files" {
targets = local.file_match.syslog_files.targets
forward_to = [loki.process.syslog_files.receiver]
}
loki.process "syslog_files" {
stage.static_labels {
values = { job = "syslog" }
}
forward_to = [loki.write.loki.receiver]
}
{{ end }}
// SYSLOG: opens a network syslog listener on port 514.
// Use when a remote device or a local syslogd configured to
// forward over the network sends logs to this host.
// Requires compose.syslog.yml to publish port 514 to the host.
// This is NOT needed for reading local log files — use SYSLOG_FILES instead.
{{ if eq (env "SYSLOG") "1" }}
loki.relabel "syslog" {
rule {
action = "labelmap"
regex = "__syslog_(.+)"
}
forward_to = []
}
loki.source.syslog "syslog" {
listener {
address = "[::]:514"
label_structured_data = true
labels = { component = "loki.source.syslog" }
}
relabel_rules = loki.relabel.syslog.rules
forward_to = [loki.write.loki.receiver]
}
{{ end }}
loki.write "loki" {
endpoint {
url = "{{ env "LOKI_PUSH_URL" }}"
basic_auth {
username = "admin"
password = "{{ secret "basic_auth" }}"
}
}
external_labels = { hostname = "{{ env "DOMAIN" }}" }
}
{{ end }}
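
The `__address__` rewrite in `discovery.relabel "metrics"` follows Prometheus relabeling semantics: the source label values are joined with `;`, the regex has to match the whole joined string, and a non-matching target is left unchanged. The two auth relabel blocks then sort targets by the `prometheus.io/auth` label. A Python sketch of both steps, with hypothetical helper names and made-up sample addresses:

```python
import re

# Same pattern as the relabel rule; fullmatch mimics the implicit anchoring.
ADDRESS_RULE = re.compile(r"(.+):\d+;(\d+)")


def rewrite_address(address, port_label):
    """Swap the discovered port for the prometheus.io/port label, if one is set."""
    joined = f"{address};{port_label}"  # source labels joined with ";"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address  # no port label: keep the discovered address
    return f"{match.group(1)}:{match.group(2)}"


def scrape_job(auth_label):
    """Name the prometheus.scrape block that would pick up a target, or None."""
    if re.fullmatch(r"", auth_label):       # regex "^$": label unset or empty
        return "containers"
    if re.fullmatch(r"basic", auth_label):  # regex "basic"
        return "containers_basicauth"
    return None  # any other value is kept by neither relabel block


print(rewrite_address("172.18.0.5:80", "9100"))
print(scrape_job("basic"))
```

In this sketch a container labelled with an auth value other than `basic` is scraped by neither job, which appears to match the two `keep` rules in the template.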

grafana-alerts.json.tmpl (new file, 315 lines added)

@ -0,0 +1,315 @@
{
"apiVersion": 1,
"groups": [
{
"orgId": 1,
"name": "backupbot",
"folder": "node",
"interval": "1m",
"rules": [
{{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
{
"uid": "de8e5xxup7t34a",
"title": "Backup Failed",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "1m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
{
"uid": "ce8e65uddcwe8d",
"title": "Backup Missing",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "rate(backup[24h])",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [0, 0], "type": "within_range" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
{
"uid": "de8e6bc92a8lcc",
"title": "Backup Not Successfull",
"condition": "B",
"data": [
{
"refId": "A",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"disableTextWrap": false,
"editorMode": "builder",
"expr": "backup",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A",
"useBackend": false
}
},
{
"refId": "B",
"relativeTimeRange": {
"from": 60,
"to": 0
},
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": {
"params": [
0
],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": [
"C"
]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
}
],
"datasource": {
"type": "__expr__",
"uid": "__expr__"
},
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "B",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "20m",
"annotations": {
"summary": "Backup did not finish within 20 minutes"
},
"labels": {},
"isPaused": false
}
{{ end }}
]
},
{
"orgId": 1,
"name": "node",
"folder": "node",
"interval": "5m",
"rules": [
{{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
{
"uid": "bds8bhxu97pxca",
"title": "Node Disk Space",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_filesystem_free_bytes{fstype=\"ext4\"} / node_filesystem_size_bytes{fstype=\"ext4\"}) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [10], "type": "lt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
},
{{ end }}
{{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
{
"uid": "ads8cswmly96oa",
"title": "Node Memory Usage",
"condition": "C",
"data": [
{
"refId": "A",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "PBFA97CFB590B2093",
"model": {
"editorMode": "code",
"expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
"instant": true,
"intervalMs": 1000,
"legendFormat": "__auto",
"maxDataPoints": 43200,
"range": false,
"refId": "A"
}
},
{
"refId": "C",
"relativeTimeRange": { "from": 600, "to": 0 },
"datasourceUid": "__expr__",
"model": {
"conditions": [
{
"evaluator": { "params": [90], "type": "gt" },
"operator": { "type": "and" },
"query": { "params": ["C"] },
"reducer": { "params": [], "type": "last" },
"type": "query"
}
],
"datasource": { "type": "__expr__", "uid": "__expr__" },
"expression": "A",
"intervalMs": 1000,
"maxDataPoints": 43200,
"refId": "C",
"type": "threshold"
}
}
],
"noDataState": "NoData",
"execErrState": "Error",
"for": "5m",
"annotations": {},
"labels": {},
"isPaused": false
}
{{ end }}
]
}
]
}
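
A worked example of the arithmetic behind the two node rules in this file, written as a Python sketch with made-up byte counts. The function names are hypothetical; the expressions and thresholds are the ones in the JSON above. Note that the memory expression, as written, yields the percentage of memory that is available rather than the percentage in use.

```python
def disk_free_percent(free_bytes, size_bytes):
    # (node_filesystem_free_bytes / node_filesystem_size_bytes) * 100
    return free_bytes / size_bytes * 100


def mem_available_percent(available_bytes, total_bytes):
    # (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
    return available_bytes / total_bytes * 100


def disk_rule_breached(free_bytes, size_bytes):
    # evaluator: {"params": [10], "type": "lt"}
    return disk_free_percent(free_bytes, size_bytes) < 10


def memory_rule_breached(available_bytes, total_bytes):
    # evaluator: {"params": [90], "type": "gt"}
    return mem_available_percent(available_bytes, total_bytes) > 90


GIB = 1024 ** 3
print(disk_rule_breached(5 * GIB, 100 * GIB))   # 5% free
print(memory_rule_breached(1 * GIB, 16 * GIB))  # 6.25% available
```

A breach still has to persist for the rule's `for` duration (5m here) before the alert fires. If the intent is to alert on high memory usage, the usage percentage would be 100 minus the value computed by `mem_available_percent`.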


@ -1,11 +0,0 @@
#!/bin/sh -e
NODE_NAME=$(cat /etc/nodename)
mkdir -p /etc/node-exporter
echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /etc/node-exporter/node-meta.prom
set -- /bin/node_exporter "$@"
exec "$@"


@ -1,37 +0,0 @@
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: {{ env "LOKI_PUSH_URL" }}
basic_auth:
username: admin
password: {{ secret "basic_auth" }}
external_labels:
hostname: {{ env "DOMAIN" }}
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log
- job_name: "docker"
docker_sd_configs:
- host: "unix:///var/run/docker.sock"
refresh_interval: "10s"
relabel_configs:
- source_labels: ['__meta_docker_container_name']
target_label: "container_name"
- source_labels: ['__meta_docker_container_id']
target_label: "container_id"
- source_labels: ['__meta_docker_container_label_com_docker_stack_namespace']
target_label: "stack_namespace"
- source_labels: ['__meta_docker_container_label_com_docker_swarm_service_name']
target_label: "service_name"