Can kadabra auto-restart apps that are down? #439


decentral1se · 2023-04-04T10:41:03Z

decentral1se commented

2023-04-04 10:41:03 +00:00

We're running into this on a deployment that has a custom app and also Peertube, which flakes a lot. The apps don't seem to recover well after the backup plan and often need to be restarted manually. We tried to write a script using `service scale`, but it doesn't seem to work due to some internal Docker issues (scale vs. undeploy/deploy).

Similarly to `ENABLE_AUTO_UPDATE=true` for `kadabra` auto-updates, we could implement an `ENABLE_AUTO_RESTART=true`. This would have `kadabra` check if an app is down and then schedule a restart (undeploy/sleep/deploy) automatically.
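
To make the opt-in concrete, here is a minimal sketch of how such a flag could be checked. It is an illustration only, not kadabra code (Python is used for brevity). It assumes the flag is visible as a plain `KEY=VALUE` pair in the app's env config; how kadabra actually reads `ENABLE_AUTO_UPDATE` is not described in this thread. The helper names `parse_env` and `auto_restart_enabled` and the sample values are hypothetical.

```python
from __future__ import annotations


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def auto_restart_enabled(env: dict[str, str]) -> bool:
    # Opt-in: anything other than "true" leaves the app alone.
    return env.get("ENABLE_AUTO_RESTART", "").lower() == "true"


# Hypothetical env config for one app.
sample = """
TYPE=peertube
ENABLE_AUTO_UPDATE=true
#ENABLE_AUTO_RESTART=false
ENABLE_AUTO_RESTART=true
"""

print(auto_restart_enabled(parse_env(sample)))
```

Treating a missing or non-`true` value as "off" keeps the behaviour opt-in, mirroring how the auto-update flag is described above.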

AFAIU kadabra would have all the information it needs to do this at the moment, if using the latest version of abra to run deployments. All the env vars are available and the secrets and other state are available locally to it via the daemon.

It could be kadabra restart?

I believe these are the only kadabra docs that we have atm:

https://docs.coopcloud.tech/operators/tutorial/#automatic-upgrades

Thoughts? Ideas on the approach?

/cc @knoflook @moritz


👍 1

decentral1se added the enhancement and kadabra labels 2023-04-04 10:41:03 +00:00

moritz commented

2023-04-04 15:59:41 +00:00

This is a great idea.
It could be `kadabra monitor`, which checks whether all containers are healthy and notifies if something went wrong, and `kadabra monitor --restart` to automatically restart unhealthy containers that have `ENABLE_AUTO_RESTART=true`.
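
A minimal sketch of the split between "notify" and "restart" suggested here, as an illustration only (Python for brevity, not kadabra code). The `AppStatus` type and the `plan` function are hypothetical names; the health and opt-in values are assumed to come from elsewhere.

```python
from dataclasses import dataclass


@dataclass
class AppStatus:
    name: str
    healthy: bool
    auto_restart: bool  # app has ENABLE_AUTO_RESTART=true


def plan(apps: list, restart_flag: bool) -> tuple:
    """Return (apps to notify about, apps to restart)."""
    # Every unhealthy app is reported, with or without --restart.
    notify = [a.name for a in apps if not a.healthy]
    # Restarts need both the --restart flag and the per-app opt-in.
    restart = [
        a.name
        for a in apps
        if not a.healthy and restart_flag and a.auto_restart
    ]
    return notify, restart


apps = [
    AppStatus("peertube", healthy=False, auto_restart=True),
    AppStatus("custom-app", healthy=False, auto_restart=False),
    AppStatus("wiki", healthy=True, auto_restart=True),
]
print(plan(apps, restart_flag=False))
print(plan(apps, restart_flag=True))
```

With this shape, plain `kadabra monitor` never restarts anything, and `--restart` only touches apps that opted in.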


❤️ 1

decentral1se commented

2023-04-10 11:07:32 +00:00

I love the `kadabra monitor [--restart]` UI. I will work on this today via co-funding from a project that Autonomic is hosting. I will try to spec it out and at least work towards a prototype implementation shortly.


decentral1se commented

2023-04-10 12:26:28 +00:00

Here's a proposal:

## kadabra app auto-restarter

Goal: `kadabra` is a server-side binary which can monitor a deployment and make decisions based on the state automagically. One major thing we'd like is to be able to undeploy/deploy apps if they fall down. This is a Very Common fix in practice, given the high entropy of server-side shenanigans. Yanno, just a turn-it-off-and-on-again job; it mostly works.

### Implementation

- run in daemon mode; invoking `kadabra monitor [--restart]` turns it on
- poll all deployed apps and create an in-memory cache to record them. re-run this polling every minute (configurable?) as we want to catch other apps that might not have been up when we first polled
- if something goes down, then record how and output something useful, possibly including "which service?" / "since when?" / "some logs from service x before it went down" / "whether it will restart because of --restart" / "number of restart attempts if --restart" / "is it a chaos deploy or not"
- this output can be written to the shell and then, using the ol' crontab / smtp stuff, you could wire up a mailer. I think Autonomic would be v interested in an "adapter" for Rocket.chat, so we could get notifications in our internal chat system
- if `--restart`, then undeploy, sleep some time to avoid concurrency issues and then deploy. We'd need to record the version to re-deploy the same one. We wouldn't be able to auto-restart `--chaos` deployed installs since we likely wouldn't have the changes from the local machine
- we'd have to do some bookkeeping when triggering a restart. we could signal if something is being considered for restarting via the in-memory cache to reduce noise
- if restarts are flapping, then chill out after some count (3 times to begin with?). if the app isn't coming up at that point then someone needs to intervene
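
The bookkeeping and flap limit in the list above can be sketched roughly as follows. This is an illustration only (Python for brevity, not kadabra code): the `Monitor` and `AppState` names are hypothetical, health results and timestamps are passed in by the caller, and the actual undeploy/sleep/deploy step is left out.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 3  # "3 times to begin with?" from the proposal


@dataclass
class AppState:
    attempts: int = 0
    down_since: Optional[float] = None


class Monitor:
    """In-memory cache of per-app state, updated on every poll."""

    def __init__(self, max_attempts: int = MAX_ATTEMPTS) -> None:
        self.cache = {}
        self.max_attempts = max_attempts

    def observe(self, app: str, healthy: bool, now: float) -> str:
        """Record one poll result and return the action to take."""
        state = self.cache.setdefault(app, AppState())
        if healthy:
            # Recovery resets the bookkeeping.
            state.attempts = 0
            state.down_since = None
            return "ok"
        if state.down_since is None:
            state.down_since = now  # answers "since when?"
        if state.attempts >= self.max_attempts:
            return "give-up"  # flapping: someone needs to intervene
        state.attempts += 1
        return "restart"


monitor = Monitor()
for tick in range(5):
    print(tick, monitor.observe("peertube", healthy=False, now=float(tick)))
```

Resetting the counter on recovery means an app that comes back and later falls over again gets a fresh set of attempts; whether that is the desired behaviour is a design choice the proposal leaves open.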

### Estimates

I think the minimal viable implementation would take somewhere in the region of 2-5 hours. That is the basic monitoring and restart logic without any notification "adapters" or configuration. This would hopefully include some docs. Testing costs could be covered by the client projects that deploy this? Annnnd some 1-3 hours to add a Rocket.chat adapter would be nice. I guess some 1-2 hours for configurable knobs with some UI/UX consultation with others. Sooo, the total estimate is 4 hrs min, 10 hrs max for everything.


decentral1se commented

2023-04-10 22:00:51 +00:00

Threading some feedback back in:

> * I really like the proposal 👍 I don't think there is a need to run `kadabra monitor [--restart]` as a daemon. The regular polling could be done via cron/systemd. This would save some extra logic and configuration inside `kadabra`, and wouldn't require an extra monitor for the case where the daemon crashes.
> * Restarting a `--chaos` deployed install would be possible as long as it is based on a published commit. Abra already stores the commit hash of the chaos deployment. This could be used for restarting.
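
One consequence of the cron/systemd approach is that an in-memory cache would not survive between runs, so the restart count would need to live somewhere else. A minimal sketch, assuming a small JSON state file (illustration only, Python for brevity, not kadabra code; `run_once`, `load_state` and the file name are hypothetical):

```python
import json
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 3


def load_state(path: Path) -> dict:
    """Read restart attempts from disk; an absent file means no history."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}


def run_once(path: Path, health: dict) -> list:
    """One cron-style run: health maps app name -> healthy?

    Returns the apps that should be restarted on this run.
    """
    attempts = load_state(path)
    to_restart = []
    for app, healthy in health.items():
        if healthy:
            attempts.pop(app, None)  # recovered: forget past attempts
        elif attempts.get(app, 0) < MAX_ATTEMPTS:
            attempts[app] = attempts.get(app, 0) + 1
            to_restart.append(app)
    path.write_text(json.dumps(attempts))
    return to_restart


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "restarts.json"
    for run in range(4):
        print(run, run_once(state_file, {"peertube": False}))
```

Each invocation is independent, so a crash between runs loses nothing except the run that crashed.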


👍 1

decentral1se commented

2023-04-20 07:21:48 +00:00

Update: this didn't go ahead just yet due to a change in how the project budget is getting used on the Autonomic side. However, I'll just submit a budget proposal via the Federation soon. If anyone else wants to pick it up tho, please let us know / go ahead 🤟

decentral1se added this to the Medium/large enhancements project 2023-06-08 09:25:53 +00:00


Reference: coop-cloud/organising#439