Best practice healthcheck conventions #539

Open
opened 2023-11-30 15:09:31 +00:00 by moritz · 1 comment
Member

This is a very general discussion. I wonder if we could find some good best practices or conventions for how to define good healthchecks in our recipes. This is a very critical topic for us, because recently it has caused us a lot of nasty problems. Here are the most annoying ones:

  • Nextcloud takes quite a long time for initialization and upgrading. Especially if I update multiple Nextcloud instances on the same server, it takes even more time per instance. If the container does not become healthy during the healthcheck's start period, the container gets restarted. So during a parallel upgrade of 6 Nextcloud instances, each of them got restarted and every instance was corrupted afterwards. Also, when I spin up new Nextcloud instances, they are sometimes broken because of too-early container restarts. I increased the `start_period` in the recipe again. I would propose setting quite large `start_period` values for each recipe, or are there any arguments against a very high `start_period`?
  • The healthchecks for WordPress check for a 200 status code. Some WordPress users set the start page of a WordPress site to private for maintenance reasons. After a while, the container was in a restart loop and couldn't be reached anymore. It took us some time to figure out that the private setting correlated with the container restarts: setting a page to private produces a 404 status code, so the healthcheck fails. This is really problematic, because for us it's not a solution to tell everyone not to set the start page to private. This raises the question of how to define reliable healthchecks that don't have any mysterious side effects.

I think we should also state this in the docs, as this can cause a lot of pain.
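To make the `start_period` proposal concrete, a recipe's healthcheck could look something like this (a sketch only: the values are illustrative, not tested recommendations, and the assumption is that Nextcloud's `status.php` endpoint is a reasonable probe target):

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/status.php"]
  interval: 30s
  timeout: 10s
  retries: 10
  # Generous grace period so slow initialization or upgrades
  # (especially parallel upgrades on a loaded server) don't
  # trigger a restart mid-migration.
  start_period: 15m
```

Failed probes during `start_period` don't count toward `retries`, so a large value here mostly costs detection latency for genuinely broken deployments, not false restarts.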

Owner

> The healthchecks for WordPress check for a 200 status code. Some WordPress users set the start page of a WordPress site to private for maintenance reasons.

I ran into the same issue on Discourse: if the forum is set to require login, then `GET /` serves a 403 instead of a 200 (`amusewiki` does something similar, but we haven't defined a healthcheck for that yet). [The solution was to find the `/srv/status` endpoint, which works regardless of that setting](https://git.coopcloud.tech/coop-cloud/discourse/src/branch/main/compose.yml#L48).
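The general pattern is to probe a dedicated status endpoint rather than the front page, so that user-facing settings (private pages, login walls) can't flip the check. Something along these lines (a sketch, not the actual recipe; timings are placeholders):

```yaml
healthcheck:
  # Probe a status endpoint that answers 200 regardless of
  # login-required / private-page settings, instead of GET /.
  test: ["CMD", "curl", "-f", "http://localhost/srv/status"]
  interval: 30s
  timeout: 10s
  retries: 3
```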

> Nextcloud takes quite a long time for initialization and upgrading. Especially if I update multiple Nextcloud instances on the same server, it takes even more time per instance. If the container does not become healthy during the healthcheck's start period, the container gets restarted. So during a parallel upgrade of 6 Nextcloud instances, each of them got restarted and every instance was corrupted afterwards.

Oh yeah, that sounds nightmarish. Tuning healthcheck timings is hard: too short and you run into problems like the ones you mention; too long and it increases the chance of walking away from a deployment, not noticing it failed, and then being confused later about why the app is still running an old version (or worse, mismatched versions between different services). I wonder if there's a way to make the values depend on server load? Otherwise, perhaps a little calculator for different combinations of `interval` / `retries` / `timeout` / `start_period` could help?
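Such a calculator could start from the rough worst-case bound Docker's semantics imply: probes during `start_period` don't count toward `retries`, and after that each of the `retries` consecutive failures can take up to `interval` (wait) plus `timeout` (a hanging probe). A minimal sketch (function name and example values are made up for illustration):

```python
def worst_case_unhealthy_seconds(interval: int, timeout: int,
                                 retries: int, start_period: int) -> int:
    """Rough upper bound (in seconds) on how long a failing container
    can keep running before it is marked unhealthy: the full grace
    period, then `retries` failed probes, each costing up to
    `interval` of waiting plus `timeout` of probe hang."""
    return start_period + retries * (interval + timeout)

# Example: generous Nextcloud-style values (illustrative only)
print(worst_case_unhealthy_seconds(interval=30, timeout=10,
                                   retries=10, start_period=900))  # → 1300
```

Playing with such numbers makes the trade-off explicit: the 15-minute `start_period` above means a genuinely broken deploy takes over 21 minutes to surface, which is exactly the "walked away and didn't notice" risk.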

decentral1se added the documentation label 2024-03-27 06:21:47 +00:00
Reference: coop-cloud/organising#539