YunoHost-esque fearless upgrade scripts? #682
@notplants can you set the stage for this one?
From our discussion I understand the wish for:
If there is anything we can do to bake in commands to make upgrades more stable, I'm all for it!
This + the recipe maintainers proposal finally getting implementation could really improve all our lives.
related to #578
I think this comment summarizes the essence — I also made a longer doc here titled “What Can Co-op Cloud Learn From Yunohost”: https://doc.commoninternet.net/RorkLLTWTWabFyr-rqO2bA?both#
I don’t personally know abra well enough yet to really have much to say about implementation details — but I can say that years of doing upgrades on apps with yunohost was significantly less stressful than doing upgrades on co-op cloud apps, and I’m curious if there are ways to even that out without totally going against the coop cloud model.
In particular, it feels to me like a “blessed automated upgrade path” has a lot of potential for getting benefit from the “configuration commons” ideal in a way that operators all independently figuring out updates does not.
Automating updates also goes hand-in-hand with seamless rollbacks (of data + code), because when the magic goes wrong it’s good to know it’s easy to revert.
Seems like code rollbacks are pretty straightforward right now with abra, and with data it’s sort of there with backupbot, but maybe not all quite seamlessly integrated yet into a single automated flow.
In my dream world, if the auto update doesn’t work, it safely rolls back, and then, when you have time to debug, you go back to doing things the current way (a bunch of `deploy --chaos`), and ideally merge your fixes back into the upgrade script of the recipe so others don’t run into the same issue.
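A hedged sketch of that backup/upgrade/rollback flow in pseudocode (none of these commands exist in abra today; the step names are purely illustrative):

```
backup app data                      # snapshot volumes via backupbot
deploy new version
wait for deploy to converge or time out
if deploy converged:
    keep new version
else:
    roll back code to previous version
    restore data from the pre-upgrade backup
    report failure so the operator can debug later (e.g. with `deploy --chaos`)
```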
It seems like a lot of these ideas could be implemented fairly cleanly as additional sets of labels which serve as hooks for different operations, similar to `backupbot.restore.pre-hook` and `post-hook`. While those hooks are triggered by backupbot, there could be labels for other operations (`pre-upgrade`, `post-deploy`, etc.) which are triggered by abra at the appropriate time.

That's very true @marlon! How did I forget about the labels.... that could be very nice indeed. The only remaining issue is to understand how to know exactly from the `docker stack deploy` internals when a deployment has updated, failed, rolled back, etc. We need to have some precision on this so we can trigger the labels correctly and that has somehow escaped me down through the years... @notplants is potentially calling in reinforcements from Docker staff themselves for more information on that one.

Not sure if labels are the best place for this information, since we need to have the information about each individual breaking version, so we can upgrade multiple versions at once. A better place would be the `abra.sh` file I think. This would also allow easy manual migration with `abra app cmd <domain> upgrade_3.9`. The naming could work similarly to our release note detection.

On the question of detecting failed deployments, my suggestion would be to not 🤠
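The `abra.sh` idea could be sketched like this (the function names and migration bodies are illustrative, not from any real recipe; shell function names can't contain dots, hence underscores):

```shell
#!/usr/bin/env bash
# Hypothetical abra.sh fragment: one function per breaking version,
# invocable manually via something like `abra app cmd <domain> upgrade_3_9`.

upgrade_3_9() {
  # e.g. rename a database, move a volume path, etc.
  echo "running 3.9 migration"
}

upgrade_4_2() {
  echo "running 4.2 migration"
}
```

abra could then discover these by naming convention, similarly to how release notes are detected, and run every `upgrade_*` between the deployed version and the target.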
If we can expose a rollback command and let the user invoke it when they've determined that a deployment failed, we'll be most of the way there, with a clear path to automating the rollback once we figure out the hairy swarm questions.
abra could even prompt after upgrading - "Keep the upgraded version or rollback?"
I guess I was imagining labels allowing us to trigger a different handler script for each service, for example one for `app`, one for `db`, which would allow for flexibility in recipes where optional compose files might add additional services that require their own upgrade path.

Either way your point is well taken that any script would need a way to know what version we're on, what version we're upgrading to, and then infer what upgrade tasks to run in order...
will send more info on the docker contact soon,
but in the meantime was looking through a couple yunohost app upgrade scripts to see how they handle keeping track of from_version and to_version, for some references:
nextcloud_ynh does this interesting/weird thing where they actually upgrade in a while loop, iteratively incrementing the major version and doing an upgrade until they get to the latest version... kind of interesting... relevant lines: f6d1bb0dec/scripts/upgrade (L174)

in the peertube_ynh update script, there are some parts where they don't keep track of versions, but just do things in an idempotent way, like doing little checks to see if things that need to be changed are in an old condition... like seeing if the database has an older name that needs to be changed to a new name... relevant lines: 4d4d2f116a/scripts/upgrade (L52)

...but then there are other parts in peertube_ynh where they are keeping track of what version they are upgrading from, and running specific migrations based on that in a big consecutive if statement... relevant lines: 4d4d2f116a/scripts/upgrade (L125)

personally I kind of like the "big consecutive if statement" approach in the peertube_ynh script (in combination with occasional idempotent checks if that's all that's needed)
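The "big consecutive if statement" pattern might look something like this in shell (`version_lt`, the version numbers, and the migration bodies are all illustrative, not taken from peertube_ynh):

```shell
#!/usr/bin/env bash
# Illustrative sketch: run every migration between the deployed ("from")
# version and the current one, in order. Not from any real recipe.

version_lt() {
  # true if $1 sorts strictly before $2 in version order (GNU `sort -V`)
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

run_migrations() {
  local from="$1"
  if version_lt "$from" "2.0.0"; then
    echo "2.0.0: rename the database"
  fi
  if version_lt "$from" "3.1.0"; then
    echo "3.1.0: move the uploads volume"
  fi
}
```

So `run_migrations 1.0.0` would run both steps, while `run_migrations 2.5.0` only runs the 3.1.0 one, which is what lets multiple breaking versions be crossed in a single upgrade.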
etherpad_ynh doing the idempotent checks approach also: 2e4b67aacc/scripts/upgrade (L28)

outline_ynh also has the "version number if statements": 3351011921/scripts/upgrade (L49)

here is what I got about docker swarm from "someone who is familiar with the matter" at docker:
I guess because "docker swarm" is declarative and based on state-reconciliation, what it does internally to reach a state is also maybe sort of opaque by design?
Maybe though we could just set a fairly long default timeout before declaring that a deploy failed, and for cases like @3wordchant mentioned, on particular hardware or recipes where things might be really slow, maybe set an easy way to override that (a timeout env var?). (As an operator this is essentially what I'm doing manually anyway: watching the deploy until it succeeds, fails, or I think it's stuck, and then ctrl+c... although maybe I'm missing something.)
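That manual watch-until-timeout loop could be sketched like this (`DEPLOY_TIMEOUT` and the `converged` stub are assumptions; a real version would inspect task states via `docker service ps` rather than a stub):

```shell
#!/usr/bin/env bash
# Sketch of waiting for a deploy to converge, with an operator-overridable
# timeout. `converged` is a stub: in practice it would check that all tasks
# of the stack report "Running".
DEPLOY_TIMEOUT="${DEPLOY_TIMEOUT:-300}"   # seconds; override per app/host

converged() {
  true  # stub standing in for a real docker state check
}

wait_for_convergence() {
  elapsed=0
  while [ "$elapsed" -lt "$DEPLOY_TIMEOUT" ]; do
    if converged; then
      echo "deploy succeeded"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "deploy timed out after ${DEPLOY_TIMEOUT}s" >&2
  return 1
}
```

On timeout, abra could then prompt for (or automate) a rollback, per the earlier suggestion.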
tysm @notplants, that's great input (also kinda scary haha).
Following the point of @marlon in #682 (comment): "On the question of detecting failed deployments, my suggestion would be to not 🤠" and the point of @notplants: "because "docker swarm" is declarative and based on state-reconciliation, what it does internally to reach a state is also maybe sort of opaque by design?" - it does make me wonder if we should really just ditch swarm. The real coop cloud was the compose spec we should have used more along the way?
If I check uncloud.run, they're using the latest compose spec library (source) to manipulate the configs and to deploy them, they seem to have come up with their own implementation 😱 https://github.com/psviderski/uncloud/blob/main/pkg/client/deploy/deploy.go
Checking the `README.md`, they seem to have some sort of control over when containers go up and down and what happens to them. Maybe we could rip out all the absolute chaos copy/pasta I did from docker swarm and use a legit library and implementation? Maybe there is also a chance to work on this together...

Does anyone have time to do a hands-on test of uncloud.run and report back?
I think we need to talk to uncloud.run next! I will try to schedule in a call and let you know!
They seem to be in the depths of working this out right now outside the context of swarm?
I've mailed the maintainers 😆
Brain dump with me: https://docs.coopcloud.tech/abra/swarm/
this link throws an ssl error!
i may be missing the point, but i'd prefer to have "long term support" for patch versions i know i can upgrade without much hassle, at least while that version is still maintained upstream. also i haven't had that bad of an experience upgrading minor and major versions, at least with the recipes we're using.
oops, my server had just crashed out for a minute. should be back now
so this week, I changed a few recipes I was maintaining to use the `pgautoupgrade/pgautoupgrade` image, and it seems to work great

I got inspired by how that works, + the yunohost examples I linked above, to try to do some other automatic updating like that

seems to be going well !
I have been making additional entrypoint scripts for containers, that do idempotent checks to see if a migration or initialization is needed, and then if it is, it runs it
here is the key line, where the backend container runs migrate.sh if AUTO_MIGRATE is true (which runs migrations on the database):
src
and then in `migrate.sh`, it only runs the migration if it's needed: src

and here is something similar for the minio container, which initializes buckets if they haven't already been initialized:
src
I currently have AUTO_MIGRATE set to true by default, but someone could turn it off if they really wanted to do it manually
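A minimal sketch of that entrypoint shape (`AUTO_MIGRATE` is from the recipe described above; the marker file here is a stand-in for the real idempotent checks, e.g. querying the database schema version):

```shell
#!/usr/bin/env bash
# Entrypoint sketch: run migrations only when needed and only if the
# operator hasn't opted out. The marker file is a stand-in for a real
# idempotent check against the app's actual state.
AUTO_MIGRATE="${AUTO_MIGRATE:-true}"
MARKER="${MARKER:-/tmp/automigrate-demo.$$}"

migration_needed() {
  [ ! -f "$MARKER" ]
}

run_migrations() {
  echo "running migrations"
  touch "$MARKER"
}

if [ "$AUTO_MIGRATE" = "true" ] && migration_needed; then
  run_migrations
fi
# ...then hand over to the container's normal command, e.g.: exec "$@"
```

Because the check is idempotent, redeploys and multi-version jumps are harmless: the migration simply doesn't re-run once the state is already current.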
I think this pattern might accommodate my biggest wish wrt updates without any further changes to coop cloud
also has the bonus benefit that there are now fewer steps on first deployment (all initialization happens automatically on deploy)
... having a little bit "constraints are the mother of invention" moment
maybe this pattern was already used elsewhere, but I wasn't aware of it
it still sounds like there would be some other nice features from getting off of swarm and having a less mysterious deployment run-of-show
but I think this entrypoint answers my biggest wish, which is to shift the main work of updates off of operators and onto recipe maintainers instead so the work can be shared
Seems like a limitation of the entrypoint approach is that the service the migration script is injected into won't be running at the time the script runs...
I wonder what about running a separate migration container which has the docker socket mounted, allowing it to run arbitrary commands in the context of the other containers in the deployment. Too sketchy? Does that even work in swarm?
in the minio initialization example (here), I had the initialization script start in the background using `command &` in the entrypoint, so that the container still starts up normally at the same time with the following command immediately, and then in the initialization script it waits to confirm the container has really started before doing the needed initialization (here)

so maybe kind of hard to read and confusing, but does seem to work (afaict)
here is a minimal bash example to reproduce the effect locally
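A sketch of that effect (a reconstruction of the idea, not the exact script from the recipe): the backgrounded initializer waits for the main process to signal readiness before doing its one-time setup.

```shell
#!/usr/bin/env bash
# Sketch of the `command &` entrypoint trick: the initializer runs in the
# background and waits for the "service" to be up before initializing.
set -eu
workdir="$(mktemp -d)"

init() {
  # wait until the main process reports ready (like waiting for minio to be up)
  until [ -f "$workdir/ready" ]; do sleep 0.1; done
  echo "buckets initialized" > "$workdir/init-done"
}
init &   # backgrounded, so the main command below starts immediately

# the "main" service: takes a moment to start, then signals readiness
sleep 0.3
touch "$workdir/ready"

wait     # in a real entrypoint this would instead be the long-running server
cat "$workdir/init-done"
```

Both "processes" start at once, but the initialization only happens after the service is actually reachable.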
I like that this would be a clear pattern, keeping migration logic all in one place.
on the other hand, with the entrypoint approach I used in lasuite-docs, one upside is all of the containers in the compose are actually running, so after deploy you don't see any services with "0/1" running -- if it worked they should all be 1/1. a small cosmetic thing, but I like that at a glance I can see everything is running. vs with a sidecar container for migrations I guess it would show up as 0/1 in `docker service ls` (unless there is some way to avoid this)

edit: I guess if there is a migration that requires different things to happen in multiple containers in a specific order, then the sidecar you are describing sounds perfect. I could imagine some migration needing that, although I haven't seen it yet.
another pro of a sidecar migration container, is it could have its own docker volume where it keeps data, to simply keep track of what migrations have been run so far (which sometimes could be simpler than doing independent checks like I was doing in the per-container-entrypoint-idempotent-migrations)
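One way such a ledger could work (paths and migration names are made up for illustration; in a real sidecar `STATE_DIR` would live on the sidecar's own docker volume):

```shell
#!/usr/bin/env bash
# Sketch of a sidecar's migration ledger: each migration runs at most once,
# and a file on the sidecar's volume records that it has been applied.
STATE_DIR="${STATE_DIR:-/tmp/migration-ledger.$$}"
mkdir -p "$STATE_DIR"

apply_once() {
  name="$1"; shift
  [ -f "$STATE_DIR/$name" ] && return 0   # already applied on an earlier deploy
  "$@" && touch "$STATE_DIR/$name"        # record success in the ledger
}

apply_once 001-rename-db    echo "renaming database"
apply_once 002-move-uploads echo "moving uploads volume"
```

Ordering falls out of the numbered names, and a failed migration leaves no ledger entry, so it is retried on the next deploy.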
Oh nice, I didn't realize you could background processes like that in the entrypoint, but it makes sense.
The multi-container ordered upgrade thing is mostly what I'm thinking about. I can imagine ways to get the same functionality with per-container scripts, but they're all somewhat convoluted.
The other reason this has been on my mind is: what about distroless containers, in which you can't run a custom entrypoint at all? A sidecar would be a necessity for those, I think.
The big downside to the sidecar proposal is that each deployment has the potential to tamper with any other container on the server, which does seem like bad design.
mm yes, maybe I'd cross that bridge when we come to it?
if we can get auto-migrations without giving a deployed app access to all the other apps, that seems safer when possible
#687
Already ran into a case where the entrypoints approach was not sufficient 🤦
for latest authentik upgrade, it seemed to require that a previous version was deployed before the latest, or else the built-in migrations break (https://docs.goauthentik.io/install-config/upgrade)
So I had to fallback again to release notes to tell operators to deploy an earlier version, and then the latest:
9d621ed1de/release/10.2.0+2026.2.1

Feel like this could be an interesting example to think through for designing abra hooks/signal/update-orchestrator to also be able to handle this case (cc #687)