coop-cloud/monitoring-ng

Philipp Rothmann 6e02886e51 fix scrape config

2023-05-23 12:22:22 +02:00

monitoring-lite

A centralised grafana/prometheus/loki stack. This is an alternative approach to coop-cloud/monitoring which does not include any of the services which actually gather metrics and/or logs. Instead, this is a useful recipe for folks who need to centralise their monitoring stack into a single grafana/prometheus/loki & several instances of node_exporter/cadvisor/promtail.

Category: Apps
Status: 2, beta
Image: grafana/grafana, 4, upstream
Healthcheck: 3
Backups: 1
Email: 3
Tests: No
SSO: 1

Set Up Metrics Gathering

Where gathering.org is the node you want to gather metrics from.

Configure DNS

monitoring.gathering.org
cadvisor.monitoring.gathering.org
node.monitoring.gathering.org
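
The three records above follow one pattern, so they can be derived from the node's base domain and checked before deploying. This is a minimal sketch for a POSIX shell, not part of the recipe: the helper names monitoring_hosts and check_dns are hypothetical, and it assumes getent is available (Linux/glibc).

```shell
#!/bin/sh
# Print the three hostnames the gathering setup expects for a base domain.
monitoring_hosts() {
    base="$1"
    printf '%s\n' \
        "monitoring.$base" \
        "cadvisor.monitoring.$base" \
        "node.monitoring.$base"
}

# Report which of those hostnames currently resolve.
# ASSUMPTION: getent exists on this machine; use dig or host elsewhere.
check_dns() {
    monitoring_hosts "$1" | while read -r host; do
        if getent hosts "$host" >/dev/null 2>&1; then
            echo "ok       $host"
        else
            echo "missing  $host"
        fi
    done
}

# Usage:
#   check_dns gathering.org
```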

Configure Traefik to use BasicAuth

Run abra app config traefik.gathering.org and uncomment:

# BASIC_AUTH
COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
BASIC_AUTH=1
SECRET_USERSFILE_VERSION=v1

Generate a userslist entry with an htpasswd-hashed password and insert it as a secret:

abra app secret insert traefik.gathering.org userslist v1 'admin:hashed-secret'

Make sure there is no whitespace anywhere in admin:hashed-secret; it seems to break things.

Redeploy Traefik (you might need to undeploy it first):

abra app deploy traefik
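
Because stray whitespace in the entry seems to break things, it can help to validate the string before inserting it. The following is a rough sketch, not part of the recipe: valid_userslist_entry is a hypothetical helper, and the usage comment assumes htpasswd (from apache2-utils or similar) is installed.

```shell
#!/bin/sh
# Return 0 if the entry looks like "user:hash" and contains no whitespace.
valid_userslist_entry() {
    # Strip all whitespace; if anything changed, the entry was not clean.
    stripped="$(printf '%s' "$1" | tr -d '[:space:]')"
    [ "$1" = "$stripped" ] || return 1
    case "$1" in
        ?*:?*) return 0 ;;   # non-empty user, colon, non-empty hash
        *)     return 1 ;;
    esac
}

# Usage (not run here):
#   # -n prints to stdout, -b takes the password as an argument, -B uses bcrypt
#   entry="$(htpasswd -nbB admin 'choose-a-password' | tr -d '[:space:]')"
#   valid_userslist_entry "$entry" && \
#       abra app secret insert traefik.gathering.org userslist v1 "$entry"
```

Stripping whitespace from the htpasswd output matters because htpasswd -n ends its output with newlines.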

abra app new monitoring-ng
abra app config monitoring.gathering.org (for gathering, only the main compose.yml is needed, nothing more)
abra app deploy monitoring.gathering.org
Check that the endpoints are up and that basic auth works:

cadvisor.monitoring.gathering.org
node.monitoring.gathering.org
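
One way to check both endpoints is to compare the HTTP status with and without credentials: an anonymous request should be rejected with 401, and an authenticated one should succeed or redirect. This is a sketch under those assumptions; the function names and the admin:password pair are placeholders, and curl is assumed to be installed.

```shell
#!/bin/sh
# Decide whether a pair of status codes indicates working basic auth:
# anonymous must be 401, authenticated must be 2xx or 3xx.
basic_auth_ok() {
    anon="$1"; authed="$2"
    [ "$anon" = "401" ] || return 1
    case "$authed" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Print only the HTTP status code of a URL; extra arguments go to curl.
status_code() {
    url="$1"; shift
    curl -s -o /dev/null -w '%{http_code}' "$@" "$url"
}

# Probe one endpoint without and with credentials ("user:password").
check_endpoint() {
    url="$1"; creds="$2"
    if basic_auth_ok "$(status_code "$url")" "$(status_code "$url" -u "$creds")"; then
        echo "ok    $url"
    else
        echo "FAIL  $url"
    fi
}

# Usage:
#   check_endpoint https://cadvisor.monitoring.gathering.org 'admin:password'
#   check_endpoint https://node.monitoring.gathering.org 'admin:password'
```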

Set Up Metrics Browser

Configure DNS
- monitoring.example.org
- prometheus.monitoring.example.org
- loki.monitoring.example.org
Set up the monitoring stack
- abra app new monitoring-ng
- abra app config monitoring.example.org

cp scrape-config.example.yml gathering.org.yml
# adjust domain
# mkdir scrape_configs
abra app cp monitoring.example.org gathering.org.yml prometheus:/prometheus/scrape_configs/
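
The "adjust domain" step can be scripted when adding several nodes. This sketch assumes that scrape-config.example.yml uses example.org as its placeholder domain, which should be checked against the actual file. render_scrape_config is a hypothetical helper.

```shell
#!/bin/sh
# Copy a scrape config template to <domain>.yml, replacing the placeholder.
# ASSUMPTION: the template uses "example.org" as its placeholder domain.
render_scrape_config() {
    template="$1"; domain="$2"
    sed "s|example\.org|$domain|g" "$template" > "$domain.yml"
}

# Usage:
#   render_scrape_config scrape-config.example.yml gathering.org
#   # then copy gathering.org.yml into the prometheus service with abra app cp
```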

check that all configured targets are up: https://prometheus.monitoring.example.org/targets
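
The same check can be done from a terminal through the Prometheus HTTP API endpoint /api/v1/targets. The sketch below counts targets reported as down with a plain text match, which is crude (jq would be more robust). count_down_targets is a hypothetical helper and admin:password is a placeholder for the basic-auth credentials.

```shell
#!/bin/sh
# Count targets that Prometheus reports as down.
# Reads the JSON body of GET /api/v1/targets on stdin.
count_down_targets() {
    grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage:
#   curl -s -u 'admin:password' \
#       https://prometheus.monitoring.example.org/api/v1/targets | count_down_targets
```

A result of 0 means no target is currently marked as down.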

Insert secrets for prometheus
add scrape config (see example) and run abra app cp to copy it
grafana sso secret


| Service | Authentication | Domain |
|---|---|---|
| Grafana | Email / SSO | monitoring.example.org |
| Prometheus | traefik basic-auth | prometheus.monitoring.example.org |
| Loki | traefik basic-auth | loki.monitoring.example.org |
| cAdvisor | traefik basic-auth | cadvisor.monitoring.example.org |
| Node Exporter | traefik basic-auth | node.monitoring.example.org |

TODO

todo:

metrics.compose.yml -> compose.yml
Grafana
- Test SSO
Loki
- s3 aws secret?
- understand config, make it sane
Promtail
- make it work
- test it with second server
prometheus retention / storage size limit
traefik metrics
document example scrape config prometheus

nice to have:

uptime-kuma, dashboard
authentik metrics?
improve prometheus discovery / security things -> multiple scrape_configs in prometheus service -> oauth / header? prometheus could do it, does promtail? does traefik?

This stack requires 3 domains, one each for Grafana, Prometheus and Loki. This is because the gathering tools, such as node_exporter, need a publicly accessible URL to connect to. We make use of the internal Prometheus HTTP basic auth & wire up an Nginx proxy with HTTP basic auth for Loki. Grafana uses Keycloak OpenID Connect sign-in. The Alertmanager setup remains internal and is only connected to Grafana. The stack also assumes that you are deploying the coop-cloud/gathering recipe on the machines that you want to gather metrics & logs from. Each instance of the gathering recipe will report back to and/or be scraped by your central install of monitoring-lite.

Post-setup guide

configure prometheus/loki/alertmanager as data sources in grafana under Configuration > Data sources
- for loki, you need to set a "Custom HTTP Header": X-Scope-OrgID: fake
configure the SMTP mailer under Alerting > Contact points
- edit the default contact point, choose "Alertmanager" as type & http://alertmanager:9093 as URL
- use the "Test" button to send a test mail. It should fire a request at the alertmanager & that should send a mail
abra app cp your scrape_configs: ... into /prometheus/scrape_configs & log into your prometheus web UI to ensure they're working
load your dashboards manually under Create > Dashboard
from your dashboard panels, choose Edit > Alert to create alerts based on those panels
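
To confirm that Loki accepts the X-Scope-OrgID header before wiring it into Grafana, the Loki HTTP API can be queried directly. This is a sketch, assuming Loki is reachable behind basic auth at the domain used in this README; the helper names and the admin:password pair are placeholders.

```shell
#!/bin/sh
# Build the tenant header Loki expects; "fake" is the tenant ID used above.
loki_tenant_header() {
    printf 'X-Scope-OrgID: %s' "${1:-fake}"
}

# Ask Loki for its known label names (GET /loki/api/v1/labels).
# creds is the basic-auth "user:password" pair.
loki_labels() {
    base_url="$1"; creds="$2"
    curl -s -u "$creds" -H "$(loki_tenant_header)" "$base_url/loki/api/v1/labels"
}

# Usage:
#   loki_labels https://loki.monitoring.example.org 'admin:password'
```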

THX to the previous work of @decentral1se @knooflok @3wc @cellarspoon @mirsal