Best practice healthcheck conventions #539
This is a very general discussion. I wonder if we could find some best practices or conventions for defining good healthchecks in our recipes. This is a critical topic for us, because it has recently caused us a lot of nasty problems. Here are the most annoying ones:
`start_period` in the recipe again. I would propose setting quite large `start_period` values for each recipe, or are there any arguments against a `start_period` that is too high? I think we should also state this in the docs, as this can cause a lot of pain.
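On the question of whether a very high `start_period` has a downside, a small simulation may help. The sketch below is a simplified model, assuming Docker's documented behaviour that probe failures during the start period do not count towards `retries`, and that the first successful probe marks the container healthy. The function name `simulate` and the parameter `ready_at` are made up for illustration, and probes are treated as instantaneous.

```python
def simulate(ready_at, interval, retries, start_period, max_time=3600):
    """Return (status, seconds) for an app whose probes succeed from `ready_at` on.

    Simplified model (an assumption, not Docker's exact scheduler): probes are
    instantaneous and fire every `interval` seconds, the first one `interval`
    seconds after the container starts.
    """
    failures = 0
    t = interval
    while t <= max_time:
        if t >= ready_at:
            # The first successful probe marks the container healthy.
            return ("healthy", t)
        if t >= start_period:
            # Only failures after the start period count towards `retries`.
            failures += 1
            if failures >= retries:
                return ("unhealthy", t)
        t += interval
    return ("starting", max_time)


never = float("inf")  # an app that never becomes ready

print(simulate(45, interval=10, retries=3, start_period=0))       # ('unhealthy', 30)
print(simulate(45, interval=10, retries=3, start_period=120))     # ('healthy', 50)
print(simulate(45, interval=10, retries=3, start_period=600))     # ('healthy', 50)
print(simulate(never, interval=10, retries=3, start_period=120))  # ('unhealthy', 140)
print(simulate(never, interval=10, retries=3, start_period=600))  # ('unhealthy', 620)
```

Under this model a large `start_period` does not delay a fast-starting app from becoming healthy. What it costs is slower detection of an app that never comes up (620 s instead of 140 s in the last two calls).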
I ran into the same issue on Discourse; if the forum is set to require login, then `GET /` serves a 403 instead of a 200 (`amusewiki` does something similar, but we didn't define a healthcheck for that yet). The solution was to find the `/srv/status` endpoint, which works regardless of that setting.

Oh yeah, that sounds nightmarish. Tuning healthcheck timings is hard: too short and you run into problems like the ones you mention, too long and it increases the chance of walking away from a deployment, not noticing it failed, and then being confused later about why the app is still running an old version (or worse, mismatched versions between different services). I wonder if there's a way to make the values depend on server load? Otherwise, perhaps a little calculator for different combinations of `interval`/`retries`/`timeout`/`start_period` could help?

The latest changes in `abra` make the healthcheck status visible in the deployment progress. This might bring back this (necessary) discussion.
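As a starting point for the calculator idea, here is a minimal sketch. It assumes Docker's documented scheduling, where the first probe runs `interval` seconds after container start and each following probe runs `interval` seconds after the previous one completes. The names `worst_case_detection` and `table` are hypothetical, and the formula is a conservative upper bound rather than an exact figure.

```python
from itertools import product


def worst_case_detection(interval, timeout, retries, start_period):
    """Conservative upper bound (seconds) until a never-healthy container
    is marked unhealthy.

    Assumes every probe hangs for the full `timeout`. One extra
    (interval + timeout) cycle is added because the first counted probe
    may start up to one full cycle after the start period ends.
    """
    return start_period + (retries + 1) * (interval + timeout)


def table(intervals, timeouts, retries_options, start_periods):
    """Return rows (interval, timeout, retries, start_period, bound) for
    every combination, sorted by the bound."""
    rows = [
        (i, t, r, s, worst_case_detection(i, t, r, s))
        for i, t, r, s in product(intervals, timeouts, retries_options, start_periods)
    ]
    return sorted(rows, key=lambda row: row[-1])


if __name__ == "__main__":
    print("interval timeout retries start_period  bound")
    for i, t, r, s, bound in table([10, 30], [5], [3], [60, 300]):
        print(f"{i:8} {t:7} {r:7} {s:12} {bound:6}")
```

For example, `interval=30`, `timeout=10`, `retries=3`, `start_period=60` gives a bound of 220 seconds, which is roughly how long one might have to watch a deployment before a broken app is reported as unhealthy.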