Can kadabra auto-restart apps that are down? #439
Reference: toolshed/organising#439
We're running into this for a deployment which has a custom app and also Peertube, which flakes a lot. The apps don't seem to recover well after the backup plan and often need to be manually restarted. We tried to write a script using `service scale`, but it doesn't seem to work due to some internal docker issues (scale vs. undeploy/deploy).

Similar to `ENABLE_AUTO_UPDATE=true` for `kadabra` auto-updates, we could implement an `ENABLE_AUTO_RESTART=true`. This would have `kadabra` check if an app is down and then schedule a restart (undeploy/sleep/deploy) automatically.

AFAIU, `kadabra` would have all the information it needs to do this at the moment, if using the latest version of `abra` to run deployments. All the env vars are available, and the secrets and other state are available locally to it via the daemon.

It could be `kadabra restart`?

I believe these are the only `kadabra` docs that we have atm:

Thoughts? Ideas on the approach?

/cc @knoflook @moritz
This is a great idea.

It could be `kadabra monitor`, which checks if all containers are healthy and notifies if something went wrong, and `kadabra monitor --restart` to automatically restart unhealthy containers that have `ENABLE_AUTO_RESTART=true`.

I love the `kadabra monitor [--restart]` UI. I will work on this today via co-funding from a project that Autonomic is hosting. I will try to spec it out and at least work towards a prototype implementation shortly.

Here's a proposal:
kadabra app auto-restarter

Goal:

`kadabra` is a server-side binary which can monitor a deployment and make decisions based on its state automagically. One major thing we'd like is to be able to undeploy/deploy apps if they fall down. This is a Very Common fix in practice, related to the high entropy of server-side shenanigans. Yanno, just a turn off/on again job, it mostly works.

Implementation
- run in daemon mode, invoking `kadabra monitor [--restart]` turns it on
- poll all deployed apps and create an in-memory cache to record them. re-run this polling every minute (configurable?) as we want to catch other apps that might not have been up when we first polled
- if something goes down, then record how and output something useful, possibly including "which service?" / "since when?" / "some logs from service x before it went down" / "it will restart because --restart" / "number of restart attempts if --restart" / "is chaos deploy or not"
- this output can be printed to the shell and then, using the ol' crontab / smtp stuff, you could wire up a mailer. I think Autonomic would be v interested in an "adapter" for Rocket.chat, so we could get notifications in our internal chat system
- if `--restart`, then undeploy, sleep some time to avoid concurrency issues and then deploy. We'd need to record the version to re-deploy the same one. We wouldn't be able to auto-restart `--chaos` deployed installs since we likely wouldn't have the changes from the local machine
- we'd have to do some book keeping when triggering a restart. we could signal if something is being considered for restarting via the in-memory cache to reduce noise
- if restarts are flapping, then chill out after some count (3 times to begin with?). if the app isn't coming up at that point then someone needs to intervene
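
To make the bookkeeping above a bit more concrete, here's a rough sketch of the decision logic only. It's written in Python for brevity and is not `kadabra` code: `parse_replicas`, `AppState`, `Monitor` and `MAX_RESTARTS` are all made-up names for illustration. It assumes each poll yields lines shaped like `name running/desired`, which is what `docker service ls --format '{{.Name}} {{.Replicas}}'` prints, and it leaves the actual undeploy/sleep/deploy and the notification adapters out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical knob: "3 times to begin with?" from the proposal.
MAX_RESTARTS = 3


def parse_replicas(line: str) -> tuple[str, int, int]:
    """Parse one 'name running/desired' line, e.g. 'peertube_app 0/1'."""
    name, replicas = line.split()[:2]
    running, desired = replicas.split("/")
    return name, int(running), int(desired)


@dataclass
class AppState:
    """Per-app bookkeeping kept in the in-memory cache."""
    restarts: int = 0
    gave_up: bool = False


@dataclass
class Monitor:
    restart: bool = False  # mirrors the proposed --restart flag
    apps: dict[str, AppState] = field(default_factory=dict)

    def observe(self, name: str, running: int, desired: int,
                chaos: bool = False) -> str:
        """Return what to do for one polled app: ok, notify, restart or give-up."""
        state = self.apps.setdefault(name, AppState())
        if running >= desired:
            return "ok"
        if not self.restart or chaos:
            # Without --restart we only report; chaos deploys can't be
            # re-deployed from the server, so we only report those too.
            return "notify"
        if state.restarts >= MAX_RESTARTS:
            # Flapping: stop trying, someone needs to intervene.
            state.gave_up = True
            return "give-up"
        state.restarts += 1
        return "restart"


if __name__ == "__main__":
    monitor = Monitor(restart=True)
    for line in ["peertube_app 0/1", "custom_app 1/1"]:
        print(line, "->", monitor.observe(*parse_replicas(line)))
```

The counter is deliberately never reset here, so an app that comes back and then falls over again still runs into the limit. A real implementation would probably want a time window or a manual reset instead.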
Estimates
I think the minimal viable implementation would take somewhere in the region of 2-5 hours. That is the basic monitoring and restart logic without any notification "adapters" or configuration. This would hopefully include some docs. Testing costs could be covered by the client projects that deploy this? Annnnd some 1-3 hours to add a Rocket.chat adapter would be nice. I guess some 1-2 hours for configurable knobs with some UI/UX consultation with others. Sooo, the total estimate is 4 hrs min, 10 hrs max for everything.
Threading some feedback back in:
Update: this didn't go ahead just yet due to a change in how the project budget is getting used on the Autonomic side. However, I'll just submit a budget proposal via the Federation soon. If anyone else wants to pick it up tho, please let us know / go ahead 🤟