# monitoring-lite

A centralised grafana/prometheus/loki stack. This is an alternative approach to [`coop-cloud/monitoring`](https://git.coopcloud.tech/coop-cloud/monitoring) which does include any of the services which actually gather metrics and/or logs. Instead, this is a useful recipe for folks who need to centralise their monitoring stack into a single grafana/prometheus/loki & several instances of node_exporter/cadvisor/promtail.

<!-- metadata -->

- **Category**: Apps
- **Status**: 2, beta
- **Image**: [`grafana/grafana`](https://hub.docker.com/r/grafana/grafana), 4, upstream
- **Healthcheck**: 3
- **Backups**: 1
- **Email**: 3
- **Tests**: No
- **SSO**: 1

<!-- endmetadata -->

## Set up Metrics Gathering

In the steps below, `gathering.org` is the node you want to gather metrics from.

1. Configure DNS
   * monitoring.gathering.org
   * cadvisor.monitoring.gathering.org
   * node.monitoring.gathering.org
1. Configure Traefik to use BasicAuth
   * Run `abra app config traefik.gathering.org` and uncomment:
     ```
     # BASIC_AUTH
     COMPOSE_FILE="$COMPOSE_FILE:compose.basicauth.yml"
     BASIC_AUTH=1
     SECRET_USERSFILE_VERSION=v1
     ```
   * Generate a `userslist` entry with an `htpasswd`-hashed password and insert it:
     `abra app secret insert traefik.gathering.org userslist v1 'admin:hashed-secret'`
     Make sure there is no whitespace anywhere in `admin:hashed-secret`; it seems to break things.
   * Run `abra app deploy traefik` (you might need to undeploy first)
1. `abra app new monitoring-ng`
1. `abra app config monitoring.gathering.org`
   For gathering only, this is required:
   `COMPOSE_FILE="$COMPOSE_FILE:compose.metrics.yml"`
1. `abra app deploy monitoring.gathering.org`
1. Check that the endpoints are up and that basic auth works:
   * cadvisor.monitoring.gathering.org
   * node.monitoring.gathering.org
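
The `userslist` entry from the Traefik step can also be produced without the `htpasswd` binary. The following is a minimal sketch, assuming `openssl` is installed and that the Apache MD5 (`apr1`) hash format is acceptable for your Traefik BasicAuth setup. The function name `make_userslist_entry` and the example password are placeholders, not part of this recipe.

```shell
# Hypothetical helper: prints one "user:hash" line for the userslist secret.
make_userslist_entry() {
  user="$1"
  password="$2"

  # -apr1 selects the Apache MD5 scheme, the same format `htpasswd -m` emits.
  hash="$(openssl passwd -apr1 "$password")" || return 1
  entry="${user}:${hash}"

  # Refuse entries containing whitespace, since that seems to break the secret.
  case "$entry" in
    *[[:space:]]*)
      echo "userslist entry must not contain whitespace" >&2
      return 1
      ;;
  esac

  printf '%s\n' "$entry"
}
```

For example, `abra app secret insert traefik.gathering.org userslist v1 "$(make_userslist_entry admin 'change-me')"` would insert the generated line. Passing the password as an argument exposes it to the local process list and shell history, so treat this as a convenience for test setups.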

## Set up the Metrics Browser

1. Configure DNS
   * monitoring.example.org
   * loki.monitoring.example.org
   * prometheus.monitoring.example.org


```
cp scrape-config.example.yml gathering.org.yml
# adjust domain
# mkdir scrape_configs
abra app cp monitoring.dev.local-it.cloud gathering.org.yml prometheus:/prometheus/scrape_configs/
```
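
For orientation, the file copied above might look roughly like the sketch below. This is an assumption rather than a copy of `scrape-config.example.yml`, whose contents remain authoritative: the job names, the `admin` user and the password are hypothetical, while the target hosts follow the DNS names used in the gathering section. The fields themselves (`scheme`, `basic_auth`, `static_configs`) are standard Prometheus scrape configuration.

```shell
# Write an illustrative scrape config for one gathering node.
# All values are placeholders; adjust them to match scrape-config.example.yml.
cat > gathering.org.yml <<'EOF'
scrape_configs:
  - job_name: "gathering.org-node"        # hypothetical job name
    scheme: https
    basic_auth:
      username: admin                      # user from the Traefik userslist
      password: "change-me"                # plain-text password, not the hash
    static_configs:
      - targets: ["node.monitoring.gathering.org"]

  - job_name: "gathering.org-cadvisor"    # hypothetical job name
    scheme: https
    basic_auth:
      username: admin
      password: "change-me"
    static_configs:
      - targets: ["cadvisor.monitoring.gathering.org"]
EOF
```

Prometheus needs the plain-text password here, whereas Traefik only stores the hash, so the file should be handled like any other secret.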

* check that all configured targets are up:
  https://prometheus.monitoring.example.org/targets
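
The same check can be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health` field for each active target. The sketch below is a rough text match, assuming the compact JSON that the API normally returns. `targets_summary` is a hypothetical helper name, and a JSON-aware tool such as `jq` would be more robust.

```shell
# Hypothetical helper: reads the JSON body of /api/v1/targets on stdin,
# prints "<up>/<total> targets up" and fails unless every target is up.
targets_summary() {
  body="$(cat)"

  # Count every health field, then only those reporting "up".
  total="$(printf '%s' "$body" | grep -o '"health":"[a-z]*"' | wc -l)"
  up="$(printf '%s' "$body" | grep -o '"health":"up"' | wc -l)"

  # Arithmetic expansion strips any padding that wc adds on some systems.
  total=$((total))
  up=$((up))

  printf '%s/%s targets up\n' "$up" "$total"
  [ "$total" -gt 0 ] && [ "$up" -eq "$total" ]
}
```

A possible invocation, with placeholder credentials, is `curl -fsS -u admin:change-me https://prometheus.monitoring.example.org/api/v1/targets | targets_summary`.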


###

1. Insert the secrets for Prometheus
1. Add a scrape config (see `scrape-config.example.yml`)
   and run `abra app cp` to copy it
1. Insert the Grafana SSO secret

| Service       | Authentication     | Domain                            |
| ------------- | ------------------ | --------------------------------- |
| Grafana       | Email / SSO        | monitoring.example.org            |
| Prometheus    | traefik basic-auth | prometheus.monitoring.example.org |
| Loki          | traefik basic-auth | loki.monitoring.example.org       |
| Cadvisor      | traefik basic-auth | cadvisor.monitoring.example.org   |
| Node Exporter | traefik basic-auth | node.monitoring.example.org       |


### TODO

* metrics.compose.yml -> compose.yml
* Grafana
  * [ ] Test SSO
* Loki
  * [ ] s3 aws secret?
  * [ ] understand config, make it sane
* [ ] Promtail
  * [ ] make it work
* prometheus retention!
* traefik metrics
* [uptime-kuma](https://github.com/louislam/uptime-kuma/wiki/Prometheus-Integration), [dashboard](https://grafana.com/grafana/dashboards/14847-uptime-kuma/)
* authentik metrics?
* cool alerts
* note: all gathering nodes will have the same htpasswd basic-auth secret ...
  -> this could be a use case to actually use docker swarm ...
      could use swarm_service_discovery then in prometheus
  -> multiple scrape_configs in the prometheus service
  -> oauth / header? prometheus could do it, does promtail? does traefik?


This stack requires three domains, one each for Grafana, Prometheus and Loki. This is because the gathering tools, such as node_exporter, need a publicly accessible URL for making connections. We make use of the internal Prometheus HTTP basic auth & wire up an Nginx proxy with HTTP basic auth for Loki. Grafana uses Keycloak OpenID Connect sign-in. The Alertmanager setup remains internal and is only connected to Grafana. The stack also assumes that you are deploying the [`coop-cloud/gathering`](https://git.coopcloud.tech/knoflook/gathering/) recipe on the machines that you want to gather metrics & logs from. Each instance of the gathering recipe will report back to and/or be scraped by your central install of monitoring-lite.

## Post-setup guide

- configure prometheus/loki/alertmanager as data sources in grafana under `Configuration > Data sources`

  - for loki, you need to set a "Custom HTTP Header": `X-Scope-OrgID: fake`

- configure the SMTP mailer under `Alerting > Contact points`

  - edit the default contact point, choose "Alertmanager" as type & `http://alertmanager:9093` as URL
  - use the "Test" button to send a test mail. It should fire a request at the alertmanager & that should send a mail

- `abra app cp` your `scrape_configs: ...` into `/prometheus/scrape_configs` & log into your prometheus web UI to ensure they're working

- load your dashboards in manually under `Create > Dashboard`

- from your dashboard panels, choose `Edit > Alert` to create alerts based on those panels

THX to the previous work of @decentral1se @knooflok @3wc @cellarspoon @mirsal