feat: scrape metrics from containers via Docker label discovery

Containers opt in with prometheus.io/scrape=true and optionally set prometheus.io/port, prometheus.io/path, and prometheus.io/auth=basic.
improve alloy config and match with main branch
2026-06-14 21:38:43 +02:00 · 2026-06-13 22:07:55 +02:00 · 2026-06-03 00:19:41 -03:00 · 2026-06-02 21:06:05 -03:00 · 2026-06-02 21:04:56 -03:00 · 2026-06-02 21:00:34 -03:00
24 changed files with 824 additions and 289 deletions
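
A minimal sketch of how another stack on the same host might opt in to the label discovery described above. The service name, image, and port are placeholders and not part of this commit. It assumes the target runs as a Swarm service, and that the labels need to be container labels (service-level `labels:`, not `deploy.labels`), since the relabel rules in config.alloy.tmpl match on `__meta_docker_container_label_*`.

```yaml
# Hypothetical compose file for an app that wants to be scraped.
# Service name, image, and port are placeholders.
version: "3.8"

services:
  app:
    image: example/exporter:latest
    # Container-level labels: these end up on the container itself,
    # which is what Docker container discovery exposes as
    # __meta_docker_container_label_prometheus_io_*.
    labels:
      - "prometheus.io/scrape=true"    # required: opt in to scraping
      - "prometheus.io/port=9100"      # optional: port serving the metrics
      - "prometheus.io/path=/metrics"  # optional: defaults to /metrics
      - "prometheus.io/auth=basic"     # optional: scrape with the shared basic_auth secret
```

Per the template in this diff, the container scrape jobs are only rendered when `PROMETHEUS_REMOTE_WRITE_URL` is set on the monitoring app, so the labels alone have no effect otherwise.
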
--- a/.env.sample
+++ b/.env.sample
@@ -5,16 +5,28 @@ DOMAIN=monitoring-ng.example.com
 #TIMEOUT=120
 ENABLE_BACKUPS=true

-## Enable this secret for Promtail / Prometheus
-#COMPOSE_FILE="$COMPOSE_FILE:compose.basic-auth.yml"
-#SECRET_BASIC_AUTH_VERSION=v1
+SECRET_BASIC_AUTH_VERSION=v1
+# Enable this to send logs to a Loki server, adapt DOMAIN if server is
+# remote
+# LOKI_PUSH_URL=https://loki.$DOMAIN/loki/api/v1/push
+# Enable on systemd hosts to read logs from the journal
+# JOURNALD=1
 #
-# Promtail (Gathering Logs)
-# COMPOSE_FILE="$COMPOSE_FILE:compose.promtail.yml"
-# LOKI_PUSH_URL=https://loki.monitoring.example.org/loki/api/v1/push
+# Enable on non-systemd hosts (Alpine, older Debian/Ubuntu) to tail
+# /var/log/*log files (syslog, auth.log, kern.log, etc.) that a local
+# syslogd writes. No syslogd reconfiguration needed.
+# SYSLOG_FILES=1
+#
+# Enable to receive syslog messages over the network on port 514/tcp.
+# Use for remote devices that push syslog to this host, or for a 
+# local syslogd configured to forward over the network.
+# Not needed if you just want to read local log files — use SYSLOG_FILES instead.
+# SYSLOG=1
+# COMPOSE_FILE="$COMPOSE_FILE:compose.syslog.yml"

-## Expose node and cadvisor ports instead of traefik
-# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
+# Enable this to send metrics to a Prometheus server, adapt DOMAIN if
+# server is remote
+# PROMETHEUS_REMOTE_WRITE_URL=https://prometheus.$DOMAIN/api/v1/write

 # Monitoring Server
 #
@@ -40,20 +52,18 @@ ENABLE_BACKUPS=true
 # LOKI_AWS_REGION=eu-west-1
 # LOKI_ACCESS_KEY_ID=bush-debrief-approval-robust-scraggly-molecule
 # LOKI_BUCKET_NAMES=loki
-# SECRET_LOKI_AWS_SECRET_ACCESS_KEY_VERSION=v1
+# SECRET_LOKI_AWS_KEY_VERSION=v1
 #
 ## Grafana
 #
 # COMPOSE_FILE="$COMPOSE_FILE:compose.grafana.yml"
 # GF_SERVER_ROOT_URL=https://monitoring.example.com
-# SECRET_GRAFANA_ADMIN_PASSWORD_VERSION=v1
-## Seperate domain for Grafana
-#GRAFANA_DOMAIN=grafana.example.com
+# SECRET_GF_ADMINPASSWD_VERSION=v1
 #
 ## Single-Sign-On with OIDC
 # COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-oidc.yml"
 # OIDC_ENABLED=1
-# SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION=v1
+# SECRET_GF_OIDC_SECRET_VERSION=v1
 # OIDC_CLIENT_ID=grafana
 # OIDC_AUTH_URL="https://authentik.example.com/application/o/authorize/"
 # OIDC_API_URL="https://authentik.example.com/application/o/userinfo/"
@@ -70,20 +80,19 @@ ENABLE_BACKUPS=true
 # GF_SMTP_ENABLED=true
 # GF_SMTP_FROM_ADDRESS=grafana@example.com
 # GF_SMTP_SKIP_VERIFY=false
-# SECRET_GRAFANA_SMTP_PASSWORD_VERSION=v1
+# SECRET_GF_SMTP_PASSWD_VERSION=v1
 #

 ## Grafana Matrix Contact Point (optional)
 #COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"
-#SECRET_MATRIX_ACCESS_TOKEN_VERSION=v1
+#SECRET_MATRIX_TOKEN_VERSION=v1
 #GF_MATRIX_USER_ID="<user-id>"
 #GF_MATRIX_ROOM_ID="<room-id>"
 #GF_MATRIX_HOMESERVER_URL="<homeserver-url>"

-## ALerts
-
-# Node disk space alert will trigger when free disk space left is below the given number in percent
-#ALERT_NODE_DISK_SPACE_LEFT=10
-
-# Node memory usage alert will trigger when memory usage is above the given number in percent
-#ALERT_NODE_MEMORY_USAGE=85
+# Alerts
+#ALERT_BACKUP_FAILED_ENABLED=true
+#ALERT_BACKUP_MISSING_ENABLED=true
+#ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED=true
+#ALERT_NODE_DISK_SPACE_ENABLED=true
+#ALERT_NODE_MEMORY_USAGE_ENABLED=true
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # monitoring-ng

 Yet another monitoring stack ...
-This time its a all-in-one grafana/prometheus/loki/node_exporter/cadvisor/promtail stack.
-It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but has everything in one recipe included now. So you can deploy monitoring instances to only gather metrics / logs (node_exporter/cadvisor/promtail) and also deploy instances with the full monitoring stack (grafana/prometheus/loki) with the same recipe and just different .env configuration.
+This time it's an all-in-one grafana/prometheus/loki/alloy stack.
+It's based heavily on the [monitoring-lite](https://git.coopcloud.tech/coop-cloud/monitoring-lite) stack, but now includes everything in one recipe. You can deploy monitoring instances that only gather metrics / logs (alloy) as well as instances with the full monitoring stack (grafana/prometheus/loki), using the same recipe with a different .env configuration.


 <!-- metadata -->
@@ -47,13 +47,6 @@ Where gathering.org is the node you want to gather metrics from.
  - cadvisor.monitoring.gathering.org
  - node.monitoring.gathering.org

-### Expose node and cadvisor via ports instead of traefik
-
-In case you have no traefik running on the machine, you can expose the ports directly by uncommenting the following line:
-```
-# COMPOSE_FILE="$COMPOSE_FILE:compose.expose-ports.yml"
-```
-
 ## Setup Metrics Browser


@@ -145,7 +138,7 @@ COMPOSE_FILE="$COMPOSE_FILE:compose.matrix-alertmanager-receiver.yml"

 2. Insert the matrix access token secret:
 ```
-abra app secret insert monitoring.marx.klasse-methode.it matrix_access_token v1
+abra app secret insert monitoring.marx.klasse-methode.it matrix_token v1
 ```

 3. Set required configurations:
@@ -156,9 +149,13 @@ GF_MATRIX_HOME_SERVER_URL=
 ```
 4. Configure Alertmanager webhook and set the url to `http://matrix-alertmanager-receiver:12345/alerts/<room-id>`

-## Alerts
+## alerts
+
+The following alerts can be enabled by setting the corresponding env variable to `true`:
+- backupbot failed: `ALERT_BACKUP_FAILED_ENABLED`
+- backupbot missing: `ALERT_BACKUP_MISSING_ENABLED`
+- backupbot not successful: `ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED`
+- node disk space: `ALERT_NODE_DISK_SPACE_ENABLED`
+- node memory usage: `ALERT_NODE_MEMORY_USAGE_ENABLED`

-It is possible to enable the following alerts, by uncommenting the corresponding env variable:

- node disk space: `ALERT_NODE_DISK_SPACE_LEFT`
- node memory usage: `ALERT_NODE_MEMORY_USAGE`
--- a/abra.sh
+++ b/abra.sh
@@ -1,17 +1,17 @@
 export ENTRYPOINT_VERSION=v1
-export GRAFANA_DATASOURCES_YML_VERSION=v1
-export GRAFANA_DASHBOARDS_YML_VERSION=v2
-export GRAFANA_SWARM_DASHBOARD_JSON_VERSION=v2
-export GRAFANA_STACKS_DASHBOARD_JSON_VERSION=v2
-export GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION=v2
-export GRAFANA_BACKUP_DASHBOARD_JSON_VERSION=v1
-export GRAFANA_CUSTOM_INI_VERSION=v4
-export PROMTAIL_YML_VERSION=v3
-export LOKI_YML_VERSION=v2
+export GF_DATASOURCES_VERSION=v1
+export GF_DASHBOARDS_VERSION=v2
+export GF_SWARM_DASH_VERSION=v2
+export GF_STACKS_DASH_VERSION=v2
+export GF_TRAEFIK_DASH_VERSION=v2
+export GF_BACKUP_DASH_VERSION=v1
+export GF_CUSTOM_INI_VERSION=v4
+export LOKI_YML_VERSION=v3
 export PROMETHEUS_YML_VERSION=v2
-export MATRIX_ALERTMANAGER_CONFIG_VERSION=v1
-export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=v1
-export GRAFANA_ALERTS_NODE_VERSION=v2
+export MATRIX_ALERTMANAGER_CONFIG_VERSION=e
+export MATRIX_ALERTMANAGER_ENTRYPOINT_VERSION=a
+export GRAFANA_ALERTS_NODE_VERSION=v1c
+export CONFIG_ALLOY_VERSION=v9

 # creates a default prometheus scrape config for a given node
 add_node(){
@@ -22,6 +22,101 @@ add_node(){
  cat "/prometheus/scrape_configs/$name.yml"
 }

+# migrates secrets from old names to new names by reading values from the
+# running containers on the server and re-inserting them under the new names.
+# preview changes: abra app cmd --local <app> migrate_secret_names
+# execute changes: abra app cmd --local <app> migrate_secret_names execute
+migrate_secret_names() {
+  if ! command -v jq &> /dev/null; then
+    echo "jq is required on your local machine to migrate secret names"
+    echo "It could not be found in your PATH, please install jq to proceed."
+    echo "For example: On a debian/ubuntu system, run 'apt install jq'"
+    exit 1
+  fi
+
+  # Hardcoded migration mappings: old_secret_name|new_secret_name
+  MIGRATIONS="
+grafana_admin_password|gf_adminpasswd
+grafana_smtp_password|gf_smtp_passwd
+grafana_oidc_client_secret|gf_oidc_secret
+matrix_access_token|matrix_token
+loki_aws_secret_access_key|loki_aws_key
+"
+
+  # Determine which server the app is deployed on
+  SERVER=$(abra app ls -m | jq -r --arg domain "$APP_NAME" '[.[].apps[] | select(.domain == $domain) | .server] | first' 2>/dev/null)
+
+  if [ -z "$SERVER" ]; then
+    echo "Error: could not determine server for app '$APP_NAME'"
+    exit 1
+  fi
+
+  # Build a lookup table of all secrets currently mounted in this stack.
+  # Each line: <secretID> <containerID> <secretName>
+  LOOKUP=$(ssh "$SERVER" "
+    docker stack services ${STACK_NAME} --format '{{.Name}}' | while read svc; do
+      CID=\$(docker ps --no-trunc -q --filter \"name=\${svc}\" | head -1)
+      docker service inspect \"\$svc\" --format '{{json .Spec.TaskTemplate.ContainerSpec.Secrets}}' | \
+        jq -r --arg cid \"\$CID\" '.[]? | .SecretID + \" \" + \$cid + \" \" + .SecretName'
+    done | sort -k3 -r
+  " 2>/dev/null)
+
+  echo "Secret migration plan for: $APP_NAME (server: $SERVER)"
+  echo ""
+  printf "  %-24s %-8s %s\n" "OLD NAME" "FOUND" "ACTION"
+  printf "  %-24s %-8s %s\n" "--------" "-----" "------"
+
+  # Check each old name against the lookup table and display the plan
+  ANY_FOUND=false
+  while IFS='|' read -r OLD_NAME NEW_NAME; do
+    [ -z "$OLD_NAME" ] && continue
+    MATCH=$(echo "$LOOKUP" | grep " ${STACK_NAME}_${OLD_NAME}_" | head -1)
+    if [ -n "$MATCH" ]; then
+      printf "  %-24s %-8s %s\n" "$OLD_NAME" "yes" "recreate as '$NEW_NAME' version v1"
+      ANY_FOUND=true
+    else
+      printf "  %-24s %-8s %s\n" "$OLD_NAME" "no" "nothing (not found on server)"
+    fi
+  done <<< "$MIGRATIONS"
+
+  echo ""
+
+  if [ "$ANY_FOUND" = false ]; then
+    echo "No old secrets found on server. Nothing to migrate."
+    return 0
+  fi
+
+  if [ "$1" != "execute" ]; then
+    echo "To apply the above changes, run:"
+    echo "  abra app cmd --local $APP_NAME migrate_secret_names execute"
+    return 0
+  fi
+
+  # read each found secret from its container and re-insert with the new name
+  while IFS='|' read -r OLD_NAME NEW_NAME; do
+    [ -z "$OLD_NAME" ] && continue
+
+    MATCH=$(echo "$LOOKUP" | grep " ${STACK_NAME}_${OLD_NAME}_" | head -1)
+    [ -z "$MATCH" ] && continue
+
+    SECRET_ID=$(echo "$MATCH" | awk '{print $1}')
+    CID=$(echo "$MATCH" | awk '{print $2}')
+    SECRET_VALUE=$(ssh "$SERVER" "cat /var/lib/docker/containers/${CID}/mounts/secrets/${SECRET_ID} 2>/dev/null || sudo cat /var/lib/docker/containers/${CID}/mounts/secrets/${SECRET_ID} 2>/dev/null")
+
+    if [ -z "$SECRET_VALUE" ]; then
+      echo "Error: could not read value for '$OLD_NAME', skipping"
+      continue
+    fi
+
+    echo "Migrating: '$OLD_NAME' -> '$NEW_NAME' (v1)"
+    printf '%s' "$SECRET_VALUE" | abra app secret insert -C "$APP_NAME" "$NEW_NAME" v1
+    
+  done <<< "$MIGRATIONS"
+
+  echo ""
+  echo "Done."
+}
+
 # adds a domain to a scrape config or creates a new one
 add_domain(){
  name=$1
--- a/alertmanager-matrix-config.yml.tmpl
+++ b/alertmanager-matrix-config.yml.tmpl
@@ -12,7 +12,7 @@ http:
 matrix:
  homeserver-url: "{{ env "GF_MATRIX_HOMESERVER_URL" }}"
  user-id: "{{ env "GF_MATRIX_USER_ID" }}"
-  access-token: "{{ secret "matrix_access_token" }}"
+  access-token: "{{ secret "matrix_token" }}"
  room-mapping:
    matrixroom: "{{ env "GF_MATRIX_ROOM_ID" }}"

--- a/alerts/node.yml.tmpl
+++ b/alerts/node.yml.tmpl
@@ -2,13 +2,13 @@ apiVersion: 1

 # List of alert rule UIDs that should be deleted
 deleteRules:
-  {{ if not (env "ALERT_NODE_DISK_SPACE_LEFT") }}
+  {{ if ne (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
  - orgId: 1
-    uid: coopcloud_node_disk_space_left
+    uid: bds8bhxu97pxca
  {{ end }}
-  {{ if not (env "ALERT_NODE_MEMORY_USAGE") }}
+  {{ if ne (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
  - orgId: 1
-    uid: coopcloud_node_memory_usage
+    uid: ads8cswmly96oa
  {{ end }}

 groups:
@@ -17,8 +17,8 @@ groups:
  folder: node
  interval: 5m
  rules:
-    {{ if (env "ALERT_NODE_DISK_SPACE_LEFT") }}
-    - uid: coopcloud_node_disk_space_left
+    {{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
+    - uid: bds8bhxu97pxca
      title: Node Disk Space
      condition: C
      data:
@@ -45,7 +45,7 @@ groups:
            conditions:
                - evaluator:
                    params:
-                        - {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}
+                        - 10
                    type: lt
                  operator:
                    type: and
@@ -70,13 +70,13 @@ groups:
      annotations:
        description: ""
        runbook_url: ""
-        summary: Less than {{ env "ALERT_NODE_DISK_SPACE_LEFT" }}% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
+        summary: Less than 10% disk space left on {{`{{ $labels.instance }}`}} ({{`{{ (index $values "A").Value }}`}}% left)
      labels:
        "": ""
      isPaused: false
    {{ end }}
-    {{ if (env "ALERT_NODE_MEMORY_USAGE")  }}
-    - uid: coopcloud_node_memory_usage
+    {{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
+    - uid: ads8cswmly96oa
      title: Node Memory Usage
      condition: C
      data:
@@ -103,7 +103,7 @@ groups:
            conditions:
                - evaluator:
                    params:
-                        - {{ env "ALERT_NODE_MEMORY_USAGE"  }}
+                        - 85
                    type: gt
                  operator:
                    type: and
@@ -126,6 +126,6 @@ groups:
      execErrState: Error
      for: 5m
      annotations:
-        summary: Memory usage is above {{ env "ALERT_NODE_MEMORY_USAGE"  }}% on {{`{{ $labels.instance }}`}}  ({{`{{ printf "%.2f" (index $values "A").Value }}`}}%  usage)
+        summary: Memory usage is above 85% on {{`{{ $labels.instance }}`}}  ({{`{{ printf "%.2f" (index $values "A").Value }}`}}%  usage)
      isPaused: false
    {{ end }}
--- a/compose.basic-auth.yml
+++ b/compose.basic-auth.yml
@@ -1,7 +0,0 @@
---
-version: "3.8"
-
-secrets:
-  basic_auth:
-    external: true
-    name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}
--- a/compose.expose-ports.yml
+++ b/compose.expose-ports.yml
@@ -1,13 +0,0 @@
---
-version: "3.8"
-
-services:
-  app:
-    ports:
-      - "9100:9100"
-    deploy:
-
-  cadvisor:
-    ports:
-      - "9101:8080"
-    deploy:
--- a/compose.grafana-oidc.yml
+++ b/compose.grafana-oidc.yml
@@ -3,7 +3,7 @@ version: '3.8'
 services:
  grafana:
    secrets:
-      - grafana_oidc_client_secret
+      - gf_oidc_secret
    environment:
      - OIDC_API_URL
      - OIDC_AUTH_URL
@@ -12,6 +12,6 @@ services:
      - OIDC_TOKEN_URL

 secrets:
-  grafana_oidc_client_secret:
+  gf_oidc_secret:
    external: true
-    name: ${STACK_NAME}_grafana_oidc_client_secret_${SECRET_GRAFANA_OIDC_CLIENT_SECRET_VERSION}
+    name: ${STACK_NAME}_gf_oidc_secret_${SECRET_GF_OIDC_SECRET_VERSION}
--- a/compose.grafana-smtp.yml
+++ b/compose.grafana-smtp.yml
@@ -3,16 +3,16 @@ version: '3.8'
 services:
  grafana:
    secrets:
-      - grafana_smtp_password
+      - gf_smtp_passwd
    environment:
      - GF_SMTP_HOST
      - GF_SMTP_USER
-      - GF_SMTP_PASSWORD__FILE=/run/secrets/grafana_smtp_password
+      - GF_SMTP_PASSWORD__FILE=/run/secrets/gf_smtp_passwd
      - GF_SMTP_ENABLED
      - GF_SMTP_FROM_ADDRESS
      - GF_SMTP_SKIP_VERIFY

 secrets:
-  grafana_smtp_password:
+  gf_smtp_passwd:
    external: true
-    name: ${STACK_NAME}_grafana_smtp_password_${SECRET_GRAFANA_SMTP_PASSWORD_VERSION}
+    name: ${STACK_NAME}_gf_smtp_passwd_${SECRET_GF_SMTP_PASSWD_VERSION}
--- a/compose.grafana.yml
+++ b/compose.grafana.yml
@@ -2,25 +2,25 @@ version: '3.8'

 services:
  grafana:
-    image: grafana/grafana:10.4.14
+    image: grafana/grafana:12.4.3
    volumes:
      - grafana-data:/var/lib/grafana:rw
    secrets:
-      - grafana_admin_password
+      - gf_adminpasswd
    configs:
-      - source: grafana_custom_ini
+      - source: gf_custom_ini
        target: /etc/grafana/grafana.ini
-      - source: grafana_datasources_yml
+      - source: gf_datasources
        target: /etc/grafana/provisioning/datasources/datasources.yml
-      - source: grafana_dashboards_yml
+      - source: gf_dashboards
        target: /etc/grafana/provisioning/dashboards/dashboards.yml
-      - source: grafana_swarm_dashboard_json
+      - source: gf_swarm_dash
        target: /var/lib/grafana/dashboards/docker-swarm-nodes.json
-      - source: grafana_stacks_dashboard_json
+      - source: gf_stacks_dash
        target: /var/lib/grafana/dashboards/docker-swarm-stacks.json
-      - source: grafana_traefik_dashboard_json
+      - source: gf_traefik_dash
        target: /var/lib/grafana/dashboards/traefik.json
-      - source: grafana_backup_dashboard_json
+      - source: gf_backup_dash
        target: /var/lib/grafana/dashboards/backup.json
      - source: gf_alerts_node
        target: /etc/grafana/provisioning/alerting/node.yml
@@ -29,49 +29,49 @@ services:
      - internal
    environment:
      - GF_SERVER_ROOT_URL
-      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
+      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/gf_adminpasswd
      - GF_SECURITY_ALLOW_EMBEDDING
      - GF_INSTALL_PLUGINS
-      - ALERT_NODE_DISK_SPACE_LEFT
-      - ALERT_NODE_MEMORY_USAGE
+      - ALERT_NODE_DISK_SPACE_ENABLED
+      - ALERT_NODE_MEMORY_USAGE_ENABLED
    deploy:
      labels:
        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
+        - "traefik.swarm.network=proxy"
        - "traefik.http.services.${STACK_NAME}-grafana.loadbalancer.server.port=3000"
-        - "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${GRAFANA_DOMAIN:-$DOMAIN}`)"
+        - "traefik.http.routers.${STACK_NAME}-grafana.rule=Host(`${DOMAIN}`)"
        - "traefik.http.routers.${STACK_NAME}-grafana.entrypoints=web-secure"
        - "traefik.http.routers.${STACK_NAME}-grafana.tls=true"
        - "traefik.http.routers.${STACK_NAME}-grafana.tls.certresolver=${LETS_ENCRYPT_ENV}"
    healthcheck:
-      test: "wget -q http://localhost:3000/ -O/dev/null"
+      test: "wget -q http://localhost:3000/healthz -O/dev/null"
      interval: 5s
      timeout: 10s
      retries: 3
      start_period: 10s

 configs:
-  grafana_custom_ini:
+  gf_custom_ini:
    template_driver: golang
-    name: ${STACK_NAME}_grafana_custom_ini_${GRAFANA_CUSTOM_INI_VERSION}
+    name: ${STACK_NAME}_gf_custom_ini_${GF_CUSTOM_INI_VERSION}
    file: grafana_custom.ini
-  grafana_datasources_yml:
-    name: ${STACK_NAME}_g_datasources_yml_${GRAFANA_DATASOURCES_YML_VERSION}
+  gf_datasources:
+    name: ${STACK_NAME}_gf_datasources_${GF_DATASOURCES_VERSION}
    file: grafana-datasources.yml
-  grafana_dashboards_yml:
-    name: ${STACK_NAME}_g_dashboards_yml_${GRAFANA_DASHBOARDS_YML_VERSION}
+  gf_dashboards:
+    name: ${STACK_NAME}_gf_dashboards_${GF_DASHBOARDS_VERSION}
    file: grafana-dashboards.yml
-  grafana_swarm_dashboard_json:
-    name: ${STACK_NAME}_g_swarm_dashboard_json_${GRAFANA_SWARM_DASHBOARD_JSON_VERSION}
+  gf_swarm_dash:
+    name: ${STACK_NAME}_gf_swarm_dash_${GF_SWARM_DASH_VERSION}
    file: grafana-swarm-dashboard.json
-  grafana_stacks_dashboard_json:
-    name: ${STACK_NAME}_g_stacks_dashboard_json_${GRAFANA_STACKS_DASHBOARD_JSON_VERSION}
+  gf_stacks_dash:
+    name: ${STACK_NAME}_gf_stacks_dash_${GF_STACKS_DASH_VERSION}
    file: grafana-stacks-dashboard.json
-  grafana_traefik_dashboard_json:
-    name: ${STACK_NAME}_g_traefik_dashboard_json_${GRAFANA_TRAEFIK_DASHBOARD_JSON_VERSION}
+  gf_traefik_dash:
+    name: ${STACK_NAME}_gf_traefik_dash_${GF_TRAEFIK_DASH_VERSION}
    file: grafana-traefik-dashboard.json
-  grafana_backup_dashboard_json:
-    name: ${STACK_NAME}_g_backup_dashboard_json_${GRAFANA_BACKUP_DASHBOARD_JSON_VERSION}
+  gf_backup_dash:
+    name: ${STACK_NAME}_gf_backup_dash_${GF_BACKUP_DASH_VERSION}
    file: grafana-backup-dashboard.json
  gf_alerts_node:
    template_driver: golang
@@ -83,6 +83,6 @@ volumes:


 secrets:
-  grafana_admin_password:
+  gf_adminpasswd:
    external: true
-    name: ${STACK_NAME}_grafana_admin_password_${SECRET_GRAFANA_ADMIN_PASSWORD_VERSION}
+    name: ${STACK_NAME}_gf_adminpasswd_${SECRET_GF_ADMINPASSWD_VERSION}
--- a/compose.loki.yml
+++ b/compose.loki.yml
@@ -2,7 +2,7 @@ version: '3.8'

 services:
  loki:
-    image: grafana/loki:2.9.11
+    image: grafana/loki:3.7.2
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - proxy
@@ -12,7 +12,7 @@ services:
    volumes:
      - loki-data:/loki
    # secrets:
-    #   - loki_aws_secret_access_key
+    #   - loki_aws_key
    environment:
      - LOKI_ACCESS_KEY_ID
      - LOKI_AWS_ENDPOINT
@@ -27,7 +27,7 @@ services:
        condition: on-failure
      labels:
        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
+        - "traefik.swarm.network=proxy"
        - "traefik.http.services.${STACK_NAME}-loki.loadbalancer.server.port=3100"
        - "traefik.http.routers.${STACK_NAME}-loki.rule=Host(`loki.${DOMAIN}`)"
        - "traefik.http.routers.${STACK_NAME}-loki.entrypoints=web-secure"
@@ -47,6 +47,6 @@ volumes:
  loki-data:

 # secrets:
-#   loki_aws_secret_access_key:
+#   loki_aws_key:
 #     external: true
-#     name: ${STACK_NAME}_loki_aws_secret_access_key_${SECRET_LOKI_AWS_SECRET_ACCESS_KEY_VERSION}
+#     name: ${STACK_NAME}_loki_aws_key_${SECRET_LOKI_AWS_KEY_VERSION}
--- a/compose.matrix-alertmanager-receiver.yml
+++ b/compose.matrix-alertmanager-receiver.yml
@@ -2,9 +2,9 @@ version: '3.8'

 services:
  matrix-alertmanager-receiver:
-    image: metio/matrix-alertmanager-receiver:2025.2.9
+    image: metio/matrix-alertmanager-receiver:2026.2.25
    secrets:
-      - matrix_access_token
+      - matrix_token
    configs:
      - source: matrix-alertmanager-receiver-config
        target: /etc/matrix-alertmanager-receiver/config.yml
@@ -23,6 +23,6 @@ configs:
    file: alertmanager-matrix-config.yml.tmpl

 secrets:
-  matrix_access_token:
+  matrix_token:
    external: true
-    name: ${STACK_NAME}_matrix_access_token_${SECRET_MATRIX_ACCESS_TOKEN_VERSION}
+    name: ${STACK_NAME}_matrix_token_${SECRET_MATRIX_TOKEN_VERSION}
--- a/compose.prometheus.yml
+++ b/compose.prometheus.yml
@@ -2,7 +2,7 @@ version: '3.8'

 services:
  prometheus:
-    image: prom/prometheus:v2.55.1
+    image: prom/prometheus:v3.12.0
    secrets:
      - basic_auth
    volumes:
@@ -16,6 +16,8 @@ services:
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION_TIME}"
+      - "--enable-feature=remote-write-receiver"
+      - "--web.enable-remote-write-receiver"
    networks:
      - proxy
      - internal
@@ -24,12 +26,13 @@ services:
        condition: on-failure
      labels:
        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
+        - "traefik.swarm.network=proxy"
        - "traefik.http.services.${STACK_NAME}-prometheus.loadbalancer.server.port=9090"
        - "traefik.http.routers.${STACK_NAME}-prometheus.rule=Host(`prometheus.${DOMAIN}`)"
        - "traefik.http.routers.${STACK_NAME}-prometheus.entrypoints=web-secure"
        - "traefik.http.routers.${STACK_NAME}-prometheus.tls=true"
        - "traefik.http.routers.${STACK_NAME}-prometheus.tls.certresolver=${LETS_ENCRYPT_ENV}"
+        - "traefik.http.routers.${STACK_NAME}-prometheus.middlewares=basicauth@file"

 configs:
  prometheus_yml:
--- a/compose.promtail.yml
+++ b/compose.promtail.yml
@@ -1,25 +0,0 @@
-version: "3.8"
-
-services:
-  promtail:
-    image: grafana/promtail:2.9.11
-    volumes:
-      - /var/log:/var/log:ro
-      - /var/run/docker.sock:/var/run/docker.sock
-    command: -config.file=/etc/promtail/config.yml
-    configs:
-      - source: promtail_yml
-        target: /etc/promtail/config.yml
-    networks:
-      - internal
-    secrets:
-      - basic_auth
-    environment:
-      - DOMAIN
-      - LOKI_PUSH_URL
-
-configs:
-  promtail_yml:
-    name: ${STACK_NAME}_promtail_yml_${PROMTAIL_YML_VERSION}
-    file: promtail.yml.tmpl
-    template_driver: golang
--- a/compose.pushgateway.yml
+++ b/compose.pushgateway.yml
@@ -2,7 +2,7 @@ version: '3.8'

 services:
  pushgateway:
-    image: prom/pushgateway:v1.10.0
+    image: prom/pushgateway:v1.11.2
    command:
      - '--web.listen-address=:9191'
      - '--push.disable-consistency-check'
@@ -17,7 +17,7 @@ services:
        condition: on-failure
      labels:
        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
+        - "traefik.swarm.network=proxy"
        - "traefik.http.services.${STACK_NAME}-pushgateway.loadbalancer.server.port=9191"
        - "traefik.http.routers.${STACK_NAME}-pushgateway.rule=Host(`pushgateway.${DOMAIN}`)"
        - "traefik.http.routers.${STACK_NAME}-pushgateway.entrypoints=web-secure"
--- a/compose.syslog.yml
+++ b/compose.syslog.yml
@@ -0,0 +1,6 @@
+---
+version: "3.8"
+services:
+  app:
+    ports:
+    - "514:514"
--- a/compose.yml
+++ b/compose.yml
@@ -3,89 +3,45 @@ version: "3.8"

 services:
  app:
-    image: prom/node-exporter:v1.8.1
-    user: root
-    environment:
-      - NODE_ID={{.Node.ID}}
-    volumes:
-      - /proc:/host/proc:ro
-      - /sys:/host/sys:ro
-      - /:/rootfs:ro
-      - /etc/hostname:/etc/nodename:ro
-    command:
-      - "--path.sysfs=/host/sys"
-      - "--path.procfs=/host/proc"
-      - "--path.rootfs=/rootfs"
-      - "--collector.textfile.directory=/etc/node-exporter/"
-      - "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)"
-      - "--no-collector.ipvs"
+    image: grafana/alloy:v1.16.1
+    hostname: "${DOMAIN}"
    configs:
-      - source: entrypoint
-        target: /entrypoint.sh
+      - source: config_alloy
+        target: /etc/alloy/config.alloy
+    volumes:
+      - /:/rootfs:ro
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - /sys:/sys:ro
+      - /var/lib/docker:/var/lib/docker:ro
+      - alloy-data:/var/lib/alloy/data
+    command:
+      - "run"
+      - "--storage.path=/var/lib/alloy/data"
+      - "--server.http.listen-addr=127.0.0.1:12345"
+      - "/etc/alloy/config.alloy"
    networks:
      - internal
-      - proxy
-    entrypoint: [ "/bin/sh", "-e", "/entrypoint.sh" ]
+    secrets:
+      - basic_auth
    deploy:
      restart_policy:
        condition: on-failure
      labels:
        - "backupbot.backup=${ENABLE_BACKUPS:-true}"
-        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
-        - "traefik.http.services.${STACK_NAME}-node.loadbalancer.server.port=9100"
-        - "traefik.http.routers.${STACK_NAME}-node.rule=Host(`node.${DOMAIN}`)"
-        - "traefik.http.routers.${STACK_NAME}-node.entrypoints=web-secure"
-        - "traefik.http.routers.${STACK_NAME}-node.tls=true"
-        - "traefik.http.routers.${STACK_NAME}-node.tls.certresolver=${LETS_ENCRYPT_ENV}"
-        - "traefik.http.routers.${STACK_NAME}-node.middlewares=basicauth@file"
+        - "traefik.enable=false"
        - "coop-cloud.${STACK_NAME}.version=1.6.0+v1.8.1"
-        - "coop-cloud.${STACK_NAME}.timeout=${TIMEOUT}"
-
-  cadvisor:
-    image: gcr.io/cadvisor/cadvisor:v0.49.2
-    command:
-      - "-logtostderr"
-      - "--enable_metrics=cpu,cpuLoad,disk,diskIO,process,memory,network"
-      # all possible metrics: advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp.
-      - "--housekeeping_interval=120s"
-      - "--docker_only=true"
-    volumes:
-      - /var/lib/docker/:/var/lib/docker:ro
-      - /dev/disk/:/dev/disk:ro
-      - /sys:/sys:ro
-      - /var/run:/var/run:ro
-      - /:/rootfs:ro
-    networks:
-      - internal
-      - proxy
-    deploy:
-      restart_policy:
-        condition: on-failure
-      labels:
-        - "traefik.enable=true"
-        - "traefik.docker.network=proxy"
-        - "traefik.http.services.${STACK_NAME}-cadvisor.loadbalancer.server.port=8080"
-        - "traefik.http.routers.${STACK_NAME}-cadvisor.rule=Host(`cadvisor.${DOMAIN}`)"
-        - "traefik.http.routers.${STACK_NAME}-cadvisor.entrypoints=web-secure"
-        - "traefik.http.routers.${STACK_NAME}-cadvisor.tls=true"
-        - "traefik.http.routers.${STACK_NAME}-cadvisor.tls.certresolver=${LETS_ENCRYPT_ENV}"
-        - "traefik.http.routers.${STACK_NAME}-cadvisor.middlewares=basicauth@file"
-    healthcheck:
-      test: wget --quiet --tries=1 --spider http://localhost:8080/healthz || exit 1
-      interval: 15s
-      timeout: 15s
-      retries: 5
-      start_period: 30s
-
 configs:
-  entrypoint:
-    name: ${STACK_NAME}_entrypoint_${ENTRYPOINT_VERSION}
-    file: node-exporter-entrypoint.sh
-
-
-
+  config_alloy:
+    template_driver: golang
+    name: ${STACK_NAME}_config_alloy_${CONFIG_ALLOY_VERSION}
+    file: config.alloy.tmpl
 networks:
  proxy:
    external: true
  internal:
+volumes:
+  alloy-data:
+secrets:
+  basic_auth:
+    external: true
+    name: ${STACK_NAME}_basic_auth_${SECRET_BASIC_AUTH_VERSION}
--- a/config.alloy.tmpl
+++ b/config.alloy.tmpl
@@ -0,0 +1,245 @@
+logging {
+  level = "info"
+  format = "logfmt"
+}
+
+discovery.docker "linux" {
+  host = "unix:///var/run/docker.sock"
+}
+
+{{ if ne (env "PROMETHEUS_REMOTE_WRITE_URL") "" }}
+prometheus.exporter.cadvisor "docker" {
+  docker_only = true
+  enabled_metrics = ["cpu", "cpuLoad", "disk", "diskIO", "memory", "network", "process"]
+}
+
+prometheus.exporter.unix "default" {
+  include_exporter_metrics = true
+  rootfs_path = "/rootfs"
+  procfs_path = "/rootfs/proc"
+  sysfs_path = "/rootfs/sys"
+
+  disable_collectors = ["ipvs"]
+
+  filesystem {
+    fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
+    mount_points_exclude = "^/(sys|proc|dev|host|etc)($|/)"
+    mount_timeout        = "5s"
+  }
+
+  netclass { ignored_devices = "^(veth.*)$" }
+  netdev   { device_exclude  = "^(veth.*)$" }
+}
+
+prometheus.exporter.self "alloy" {}
+
+prometheus.scrape "default" {
+  scrape_interval = "120s"
+
+  targets = array.concat(
+    prometheus.exporter.self.alloy.targets,
+    prometheus.exporter.unix.default.targets,
+    prometheus.exporter.cadvisor.docker.targets,
+  )
+
+  forward_to = [prometheus.remote_write.prometheus.receiver]
+}
+
+prometheus.remote_write "prometheus" {
+  endpoint {
+    url = "{{ env "PROMETHEUS_REMOTE_WRITE_URL" }}"
+
+    basic_auth {
+      username = "admin"
+      password = "{{ secret "basic_auth" }}"
+    }
+  }
+}
+
+// Scrape Prometheus metrics from other containers on this host.
+// Containers opt in via Docker labels:
+//   prometheus.io/scrape=true        required: enable scraping
+//   prometheus.io/port=9090          optional: port exposing /metrics (defaults to first exposed port)
+//   prometheus.io/path=/metrics      optional: path to metrics endpoint (default: /metrics)
+//   prometheus.io/auth=basic         optional: use basic auth with the shared basic_auth secret
+//
+// Uses docker_gwbridge — the host-local bridge network Docker attaches all
+// Swarm containers to for outbound connectivity. Alloy can reach any container
+// on the same host via this network without needing to join each stack's
+// overlay network.
+discovery.docker "containers" {
+  host                = "unix:///var/run/docker.sock"
+  match_first_network = false
+}
+
+discovery.relabel "metrics" {
+  targets = discovery.docker.containers.targets
+
+  rule {
+    source_labels = ["__meta_docker_network_name"]
+    regex         = "docker_gwbridge"
+    action        = "keep"
+  }
+
+  rule {
+    source_labels = ["__meta_docker_container_label_prometheus_io_scrape"]
+    regex         = "true"
+    action        = "keep"
+  }
+
+  rule {
+    source_labels = ["__address__", "__meta_docker_container_label_prometheus_io_port"]
+    regex         = `(.+):\d+;(\d+)`
+    target_label  = "__address__"
+    replacement   = "$1:$2"
+  }
+
+  rule {
+    source_labels = ["__meta_docker_container_label_prometheus_io_path"]
+    regex         = `(.+)`
+    target_label  = "__metrics_path__"
+  }
+
+  rule {
+    source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
+    target_label  = "job"
+  }
+}
+
+discovery.relabel "metrics_noauth" {
+  targets = discovery.relabel.metrics.output
+  rule {
+    source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
+    regex         = "^$"
+    action        = "keep"
+  }
+}
+
+discovery.relabel "metrics_basicauth" {
+  targets = discovery.relabel.metrics.output
+  rule {
+    source_labels = ["__meta_docker_container_label_prometheus_io_auth"]
+    regex         = "basic"
+    action        = "keep"
+  }
+}
+
+prometheus.scrape "containers" {
+  scrape_interval = "120s"
+  targets         = discovery.relabel.metrics_noauth.output
+  forward_to      = [prometheus.remote_write.prometheus.receiver]
+}
+
+prometheus.scrape "containers_basicauth" {
+  scrape_interval = "120s"
+  targets         = discovery.relabel.metrics_basicauth.output
+  forward_to      = [prometheus.remote_write.prometheus.receiver]
+  basic_auth {
+    username = "admin"
+    password = "{{ secret "basic_auth" }}"
+  }
+}
+{{ end }}
+
+{{ if ne (env "LOKI_PUSH_URL") "" }}
+discovery.relabel "docker" {
+  targets = discovery.docker.linux.targets
+
+  rule {
+    source_labels = ["__meta_docker_container_name"]
+    target_label  = "container_name"
+  }
+  rule {
+    source_labels = ["__meta_docker_container_id"]
+    target_label  = "container_id"
+  }
+  rule {
+    source_labels = ["__meta_docker_container_label_com_docker_stack_namespace"]
+    target_label  = "stack_namespace"
+  }
+  rule {
+    source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
+    target_label  = "service_name"
+  }
+  rule {
+    source_labels = ["__meta_docker_container_log_stream"]
+    target_label  = "stream"
+  }
+}
+
+loki.source.docker "docker" {
+  host       = "unix:///var/run/docker.sock"
+  targets    = discovery.relabel.docker.output
+  labels     = {"app" = "docker"}
+  forward_to = [loki.write.loki.receiver]
+}
+
+// JOURNALD: reads the systemd journal binary log directly.
+// Use on systemd hosts (most modern Linux distros). Requires no syslogd.
+{{ if eq (env "JOURNALD") "1" }}
+loki.source.journal "journal" {
+  path = "/rootfs/var/log/journal"
+  labels = { job = "{{ env "DOMAIN" }}" }
+  forward_to = [loki.write.loki.receiver]
+}
+{{ end }}
+
+// SYSLOG_FILES: tails all /var/log/*log files (syslog, auth.log, kern.log, etc.).
+// Use on non-systemd hosts where a syslogd writes to /var/log.
+{{ if eq (env "SYSLOG_FILES") "1" }}
+local.file_match "syslog_files" {
+  path_targets = [{ __path__ = "/rootfs/var/log/*log" }]
+}
+
+loki.source.file "syslog_files" {
+  targets    = local.file_match.syslog_files.targets
+  forward_to = [loki.process.syslog_files.receiver]
+}
+
+loki.process "syslog_files" {
+  stage.static_labels {
+    values = { job = "syslog" }
+  }
+  forward_to = [loki.write.loki.receiver]
+}
+{{ end }}
+
+// SYSLOG: opens a network syslog listener on port 514.
+// Use when a remote device or a local syslogd configured to 
+// forward over the network sends logs to this host.
+// Requires compose.syslog.yml to publish port 514 to the host.
+// This is NOT needed for reading local log files — use SYSLOG_FILES instead.
+{{ if eq (env "SYSLOG") "1" }}
+loki.relabel "syslog" {
+  rule {
+    action = "labelmap"
+    regex = "__syslog_(.+)"
+  }
+
+  forward_to = []
+}
+
+loki.source.syslog "syslog" {
+  listener {
+    address = "[::]:514"
+    label_structured_data = true
+    labels = { component = "loki.source.syslog" }
+  }
+
+  relabel_rules = loki.relabel.syslog.rules
+  forward_to = [loki.write.loki.receiver]
+}
+{{ end }}
+
+loki.write "loki" {
+  endpoint {
+    url = "{{ env "LOKI_PUSH_URL" }}"
+
+    basic_auth {
+      username = "admin"
+      password = "{{ secret "basic_auth" }}"
+    }
+  }
+  external_labels = { hostname = "{{ env "DOMAIN" }}" }
+}
+{{ end }}
--- a/grafana-alerts.json.tmpl
+++ b/grafana-alerts.json.tmpl
@ -0,0 +1,315 @@
+{
+  "apiVersion": 1,
+  "groups": [
+    {
+      "orgId": 1,
+      "name": "backupbot",
+      "folder": "node",
+      "interval": "1m",
+      "rules": [
+        {{ if eq (env "ALERT_BACKUP_FAILED_ENABLED") "true" }}
+        {
+          "uid": "de8e5xxup7t34a",
+          "title": "Backup Failed",
+          "condition": "C",
+          "data": [
+            {
+              "refId": "A",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "PBFA97CFB590B2093",
+              "model": {
+                "disableTextWrap": false,
+                "editorMode": "builder",
+                "expr": "backup",
+                "fullMetaSearch": false,
+                "includeNullMetadata": true,
+                "instant": true,
+                "intervalMs": 1000,
+                "legendFormat": "__auto",
+                "maxDataPoints": 43200,
+                "range": false,
+                "refId": "A",
+                "useBackend": false
+              }
+            },
+            {
+              "refId": "C",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "__expr__",
+              "model": {
+                "conditions": [
+                  {
+                    "evaluator": { "params": [0], "type": "lt" },
+                    "operator": { "type": "and" },
+                    "query": { "params": ["C"] },
+                    "reducer": { "params": [], "type": "last" },
+                    "type": "query"
+                  }
+                ],
+                "datasource": { "type": "__expr__", "uid": "__expr__" },
+                "expression": "A",
+                "intervalMs": 1000,
+                "maxDataPoints": 43200,
+                "refId": "C",
+                "type": "threshold"
+              }
+            }
+          ],
+          "noDataState": "NoData",
+          "execErrState": "Error",
+          "for": "1m",
+          "isPaused": false
+        },
+        {{ end }}
+        {{ if eq (env "ALERT_BACKUP_MISSING_ENABLED") "true" }}
+        {
+          "uid": "ce8e65uddcwe8d",
+          "title": "Backup Missing",
+          "condition": "B",
+          "data": [
+            {
+              "refId": "A",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "PBFA97CFB590B2093",
+              "model": {
+                "disableTextWrap": false,
+                "editorMode": "builder",
+                "expr": "rate(backup[24h])",
+                "fullMetaSearch": false,
+                "includeNullMetadata": true,
+                "instant": true,
+                "intervalMs": 1000,
+                "legendFormat": "__auto",
+                "maxDataPoints": 43200,
+                "range": false,
+                "refId": "A",
+                "useBackend": false
+              }
+            },
+            {
+              "refId": "B",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "__expr__",
+              "model": {
+                "conditions": [
+                  {
+                    "evaluator": { "params": [0, 0], "type": "within_range" },
+                    "operator": { "type": "and" },
+                    "query": { "params": ["C"] },
+                    "reducer": { "params": [], "type": "last" },
+                    "type": "query"
+                  }
+                ],
+                "datasource": { "type": "__expr__", "uid": "__expr__" },
+                "expression": "A",
+                "intervalMs": 1000,
+                "maxDataPoints": 43200,
+                "refId": "B",
+                "type": "threshold"
+              }
+            }
+          ],
+          "noDataState": "NoData",
+          "execErrState": "Error",
+          "for": "5m",
+          "isPaused": false
+        },
+        {{ end }}
+        {{ if eq (env "ALERT_BACKUP_NOT_SUCCESSFULL_ENABLED") "true" }}
+        {
+            "uid": "de8e6bc92a8lcc",
+            "title": "Backup Not Successful",
+            "condition": "B",
+            "data": [
+                {
+                    "refId": "A",
+                    "relativeTimeRange": {
+                        "from": 60,
+                        "to": 0
+                    },
+                    "datasourceUid": "PBFA97CFB590B2093",
+                    "model": {
+                        "disableTextWrap": false,
+                        "editorMode": "builder",
+                        "expr": "backup",
+                        "fullMetaSearch": false,
+                        "includeNullMetadata": true,
+                        "instant": true,
+                        "intervalMs": 1000,
+                        "legendFormat": "__auto",
+                        "maxDataPoints": 43200,
+                        "range": false,
+                        "refId": "A",
+                        "useBackend": false
+                    }
+                },
+                {
+                    "refId": "B",
+                    "relativeTimeRange": {
+                        "from": 60,
+                        "to": 0
+                    },
+                    "datasourceUid": "__expr__",
+                    "model": {
+                        "conditions": [
+                            {
+                                "evaluator": {
+                                    "params": [
+                                        0
+                                    ],
+                                    "type": "gt"
+                                },
+                                "operator": {
+                                    "type": "and"
+                                },
+                                "query": {
+                                    "params": [
+                                        "C"
+                                    ]
+                                },
+                                "reducer": {
+                                    "params": [],
+                                    "type": "last"
+                                },
+                                "type": "query"
+                            }
+                        ],
+                        "datasource": {
+                            "type": "__expr__",
+                            "uid": "__expr__"
+                        },
+                        "expression": "A",
+                        "intervalMs": 1000,
+                        "maxDataPoints": 43200,
+                        "refId": "B",
+                        "type": "threshold"
+                    }
+                }
+            ],
+            "noDataState": "NoData",
+            "execErrState": "Error",
+            "for": "20m",
+            "annotations": {
+                "summary": "Backup did not finish within 20 minutes"
+            },
+            "labels": {},
+            "isPaused": false
+        }
+        {{ end }}
+      ]
+    },
+    {
+      "orgId": 1,
+      "name": "node",
+      "folder": "node",
+      "interval": "5m",
+      "rules": [
+        {{ if eq (env "ALERT_NODE_DISK_SPACE_ENABLED") "true" }}
+        {
+          "uid": "bds8bhxu97pxca",
+          "title": "Node Disk Space",
+          "condition": "C",
+          "data": [
+            {
+              "refId": "A",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "PBFA97CFB590B2093",
+              "model": {
+                "editorMode": "code",
+                "expr": "(node_filesystem_free_bytes{fstype=\"ext4\"} / node_filesystem_size_bytes{fstype=\"ext4\"}) * 100",
+                "instant": true,
+                "intervalMs": 1000,
+                "legendFormat": "__auto",
+                "maxDataPoints": 43200,
+                "range": false,
+                "refId": "A"
+              }
+            },
+            {
+              "refId": "C",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "__expr__",
+              "model": {
+                "conditions": [
+                  {
+                    "evaluator": { "params": [10], "type": "lt" },
+                    "operator": { "type": "and" },
+                    "query": { "params": ["C"] },
+                    "reducer": { "params": [], "type": "last" },
+                    "type": "query"
+                  }
+                ],
+                "datasource": { "type": "__expr__", "uid": "__expr__" },
+                "expression": "A",
+                "intervalMs": 1000,
+                "maxDataPoints": 43200,
+                "refId": "C",
+                "type": "threshold"
+              }
+            }
+          ],
+          "noDataState": "NoData",
+          "execErrState": "Error",
+          "for": "5m",
+          "annotations": {},
+          "labels": {},
+          "isPaused": false
+        },
+        {{ end }}
+        {{ if eq (env "ALERT_NODE_MEMORY_USAGE_ENABLED") "true" }}
+        {
+          "uid": "ads8cswmly96oa",
+          "title": "Node Memory Usage",
+          "condition": "C",
+          "data": [
+            {
+              "refId": "A",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "PBFA97CFB590B2093",
+              "model": {
+                "editorMode": "code",
+                "expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
+                "instant": true,
+                "intervalMs": 1000,
+                "legendFormat": "__auto",
+                "maxDataPoints": 43200,
+                "range": false,
+                "refId": "A"
+              }
+            },
+            {
+              "refId": "C",
+              "relativeTimeRange": { "from": 600, "to": 0 },
+              "datasourceUid": "__expr__",
+              "model": {
+                "conditions": [
+                  {
+                    "evaluator": { "params": [90], "type": "gt" },
+                    "operator": { "type": "and" },
+                    "query": { "params": ["C"] },
+                    "reducer": { "params": [], "type": "last" },
+                    "type": "query"
+                  }
+                ],
+                "datasource": { "type": "__expr__", "uid": "__expr__" },
+                "expression": "A",
+                "intervalMs": 1000,
+                "maxDataPoints": 43200,
+                "refId": "C",
+                "type": "threshold"
+              }
+            }
+          ],
+          "noDataState": "NoData",
+          "execErrState": "Error",
+          "for": "5m",
+          "annotations": {},
+          "labels": {},
+          "isPaused": false
+        }
+        {{ end }}
+      ]
+    }
+  ]
+}
+
--- a/grafana_custom.ini
+++ b/grafana_custom.ini
@ -21,7 +21,7 @@ tls_skip_verify_insecure = false
 allow_sign_up = true
 auto_login = true
 client_id = {{ env "OIDC_CLIENT_ID" }}
-client_secret = {{ secret "grafana_oidc_client_secret" }}
+client_secret = {{ secret "gf_oidc_secret" }}
 auth_url = {{ env "OIDC_AUTH_URL" }}
 token_url = {{ env "OIDC_TOKEN_URL" }}
 api_url = {{ env "OIDC_API_URL" }}
--- a/loki.yml.tmpl
+++ b/loki.yml.tmpl
@ -34,7 +34,6 @@ ingester:
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1MB (1048576 bytes), flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
-  max_transfer_retries: 0     # Chunk transfers disabled
  wal:
    dir: "/tmp/wal"

@ -53,7 +52,7 @@ schema_config:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
-      schema: v11
+      schema: v13
      index:
        prefix: index_
        period: 24h
@ -63,7 +62,6 @@ storage_config:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
-    shared_store: filesystem
  filesystem:
    directory: /loki/chunks
 {{ end }}
@ -72,7 +70,6 @@ schema_config:
  configs:
  - from: 2020-11-25
    store: boltdb-shipper
-    object_store: aws
    schema: v11
    index:
      prefix: index_
@ -89,7 +86,7 @@ storage_config:
   endpoint: {{ env "LOKI_AWS_ENDPOINT" }}
   region: {{ env "LOKI_AWS_REGION" }}
   access_key_id: {{ env "LOKI_ACCESS_KEY_ID" }}
-   secret_access_key: {{ secret "loki_aws_secret_access_key" }}
+   secret_access_key: {{ secret "loki_aws_key" }}
   bucketnames: {{ env "LOKI_BUCKET_NAMES" }}
   insecure: false
   sse_encryption: false
@ -103,19 +100,24 @@ storage_config:

 compactor:
  working_directory: /loki/boltdb-shipper-compactor
-  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
+{{ if eq (env "LOKI_STORAGE_FILESYSTEM") "1" }}
+  delete_request_store: filesystem
+{{ end }}
+{{ if eq (env "LOKI_STORAGE_S3") "1" }}
+  delete_request_store: aws
+{{ end }}

 limits_config:
-  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: {{ env "LOKI_RETENTION_PERIOD" }}
  split_queries_by_interval: 24h
  max_query_parallelism: 100
+  allow_structured_metadata: false

 query_scheduler:
  max_outstanding_requests_per_tenant: 4096
@ -123,9 +125,6 @@ query_scheduler:
 frontend:
  max_outstanding_per_tenant: 4096

-chunk_store_config:
-  max_look_back_period: 0s
-
 table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
--- a/node-exporter-entrypoint.sh
+++ b/node-exporter-entrypoint.sh
@ -1,11 +0,0 @@
-#!/bin/sh -e
-
-NODE_NAME=$(cat /etc/nodename)
-
-mkdir -p /etc/node-exporter
-
-echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /etc/node-exporter/node-meta.prom
-
-set -- /bin/node_exporter "$@"
-
-exec "$@"
--- a/promtail.yml.tmpl
+++ b/promtail.yml.tmpl
@ -1,37 +0,0 @@
-server:
-  http_listen_port: 9080
-  grpc_listen_port: 0
-
-positions:
-  filename: /tmp/positions.yaml
-
-clients:
-  - url: {{ env "LOKI_PUSH_URL" }}
-    basic_auth:
-      username: admin
-      password: {{ secret "basic_auth" }}
-    external_labels:
-      hostname: {{ env "DOMAIN" }}
-
-scrape_configs:
- job_name: system
-  static_configs:
-  - targets:
-      - localhost
-    labels:
-      job: varlogs
-      __path__: /var/log/*log
-
- job_name: "docker"
-  docker_sd_configs:
-    - host: "unix:///var/run/docker.sock"
-      refresh_interval: "10s"
-  relabel_configs:
-    - source_labels: ['__meta_docker_container_name']
-      target_label: "container_name"
-    - source_labels: ['__meta_docker_container_id']
-      target_label: "container_id"
-    - source_labels: ['__meta_docker_container_label_com_docker_stack_namespace']
-      target_label: "stack_namespace"
-    - source_labels: ['__meta_docker_container_label_com_docker_swarm_service_name']
-      target_label: "service_name"
--- a/release/next
+++ b/release/next
@ -6,4 +6,7 @@ COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-oidc.yml"

 COMPOSE_FILE="$COMPOSE_FILE:compose.grafana-smtp.yml"

-3. The scape-config.example.yml file and add_node() command were updated to use a secure endpoint for the traefik metrics instead http. This requires an updated Traefik recipe that publishes the metrics on https.
+3. The scrape-config.example.yml file and add_node() command were updated to use a secure endpoint for the traefik metrics instead of http. This requires an updated Traefik recipe that publishes the metrics on https.
+
+4. Secret and config names were shortened to a maximum of 14 characters so that the full name stays within Docker's 64-character limit once STACK_NAME and VERSION are added to it.
+When upgrading, you need to reinsert the secrets under their shorter names. Run `abra app secret list <domain>` to see which secrets have not been created on the server (because their name was shortened), then run `abra app secret insert <domain> <secret_name> v1 <value>` for each of them. Alternatively, use the migrate_secret_names function in abra.sh to reinsert all existing secrets under their shorter names automatically: `abra app cmd --local <domain> migrate_secret_names`
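
The name-length constraint in item 4 can be checked with a small shell sketch. It assumes the full Docker secret name is built as `<STACK_NAME>_<secret_name>_<VERSION>`; that pattern and the helper names `full_secret_name` and `fits_docker_limit` are illustrative assumptions, not part of this change.

```shell
#!/bin/sh
# Hypothetical helpers, not part of abra.sh.
# Assumption: the full name is "<STACK_NAME>_<secret_name>_<VERSION>".

# Print the full secret name for a stack name, secret name and version.
full_secret_name() {
  printf '%s_%s_%s' "$1" "$2" "$3"
}

# Succeed (exit status 0) only if the full name is at most 64 characters long.
fits_docker_limit() {
  name=$(full_secret_name "$1" "$2" "$3")
  [ "${#name}" -le 64 ]
}
```

For example, `fits_docker_limit "$STACK_NAME" basic_auth v1 && echo ok` prints `ok` only when the combined name is 64 characters or shorter.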
Author	SHA1	Message	Date
Danny Groenewegen	e247677433	feat: scrape metrics from containers via Docker label discovery Containers opt in with prometheus.io/scrape=true and optionally set prometheus.io/port, prometheus.io/path, and prometheus.io/auth=basic.	2026-06-14 21:38:43 +02:00
Danny Groenewegen	f2310f2b86	improve alloy config and match with main branch - Restrict Alloy UI to loopback - Narrow volume mounts: drop /dev, reduce /var/run to docker.sock:ro - Replace HTTP scrape of :12345 with prometheus.exporter.self Match with main branch (node-exporter / promtail / cadvisor): - Add docker_only and explicit enabled_metrics to cadvisor exporter - Match node-exporter collector config - Match promtail relabeling (container_name, container_id, stack_namespace, service_name) and external hostname label - Add SYSLOG_FILES option to tail /var/log/*log (matches promtail) - Fix journal path and syslog listener address	2026-06-13 22:07:55 +02:00
f	f2711fa16e	fix: upgrades	2026-06-03 00:19:41 -03:00
f	2870b9486c	fix: use the actual health check path	2026-06-02 21:06:05 -03:00
f	3a1fabe4f9	fix: prevent redirections on health check	2026-06-02 21:04:56 -03:00
f	a358837922	wip: relabel syslog according to docs	2026-06-02 21:00:34 -03:00
f	dd0a0c1bb0	fixup! feat: read syslog	2026-06-02 20:19:45 -03:00
f	31cabc36ae	fix: prevent traefik deprecation warnings	2026-06-02 19:16:49 -03:00
f	d25986d5cb	fix: README	2026-06-02 18:51:10 -03:00
f	f8f8004445	feat: read syslog	2026-06-02 18:50:41 -03:00
f	aa05d022da	feat: optionally push to prometheus and loki	2026-06-02 18:50:20 -03:00
f	fb52a76247	BREAKING CHANGE: deprecate node-exporter	2026-06-02 18:49:05 -03:00
f	2e2a52eae0	BREAKING CHANGE: deprecate promtail	2026-06-02 18:48:20 -03:00
f	48419d5afa	fixup! BREAKING CHANGE: no need to expose exporters	2026-06-02 18:46:02 -03:00
f	a0a6e2c509	fix: basic auth secret is always needed	2026-06-02 18:44:32 -03:00
f	024f2a8aec	feat: send docker logs to loki	2026-06-02 18:39:24 -03:00
f	38095e23fa	BREAKING CHANGE: no need to expose exporters	2026-06-02 18:37:56 -03:00
f	641161329e	fix: grafana alternate domain doesn't work. The variable is not expanded and the domain name label ends up as a literal "$DOMAIN".	2026-06-02 18:00:00 -03:00
f	cdacfd035e	fix: prometheus querying panel is accessible through basic auth	2026-06-02 17:52:25 -03:00
f	b2d3901f61	fix: bind mounts recommended by docs	2026-06-02 13:24:28 -03:00
f	8becf1c1d6	fixup! feat: node exporter	2026-05-29 16:16:37 -03:00
f	777b1355dd	fixup! feat: node exporter	2026-05-29 16:16:08 -03:00
f	e83433cebd	feat: node exporter	2026-05-29 16:04:19 -03:00
f	a713f98ffb	feat: instance name is domain	2026-05-29 16:03:59 -03:00
f	8dc84c591c	fixup! feat: enable prometheus remote write receiver	2026-05-29 15:38:52 -03:00
f	d9aa05a4b5	feat: send metrics to prometheus	2026-05-28 21:00:10 -03:00
f	349df12204	feat: enable prometheus remote write receiver	2026-05-28 20:44:00 -03:00
f	6c33089078	feat: cadvisor	2026-05-28 20:38:50 -03:00
f	4bedebfab1	BREAKING CHANGES: replace promtail and cadvisor for alloy	2026-05-28 20:33:36 -03:00
moritz	dd320e9f1c	fix: Shorten all secret and config names to max 14 characters (#13) Reviewed-on: #13 Reviewed-by: p4u1 <p4u1@noreply.git.coopcloud.tech> Reviewed-by: moritz <moritz@noreply.git.coopcloud.tech>	2026-05-11 15:38:15 +00:00
ammar	9cb997b25a	delete_request_store based on env variable	2026-04-09 04:36:03 +00:00
ammar	48d137d194	update loki config file	2026-04-09 04:36:03 +00:00
ammar	1acb5ebd6a	chore: update image tags	2026-04-09 04:36:03 +00:00
Danny Groenewegen	682f30cef1	Add migrate_secret_names() to abra.sh to reinsert all secrets with shortened names in docker	2026-03-25 16:11:37 +01:00
Danny Groenewegen	694c8a9875	Add instructions for shorter secret names to release notes	2026-03-25 16:11:28 +01:00
Danny Groenewegen	9dfa9cad2a	Shortened all the secret and config names to max 14 characters to prevent running into Docker's 64 character limit when STACK_NAME is appended to it.	2026-03-25 15:58:28 +01:00