fix(db): bump DB_ENTRYPOINT_VERSION to v3 so the entrypoint config reloads

The install-user fix changed the entrypoint content; swarm configs are immutable, so the config name (which embeds DB_ENTRYPOINT_VERSION) must change for a redeploy to pick up the new script.
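
As a rough illustration of why the bump is needed, the sketch below derives a versioned config name. The `config_name` helper, the `<stack>_db_entrypoint_<version>` pattern and the stack name are assumptions made for this example; the recipe's real name template is not part of this diff.

```shell
#!/bin/sh
# Hypothetical helper: build a swarm config name that embeds a version.
# The naming pattern here is an assumption, not taken from the recipe.
config_name() {
  stack=$1
  version=$2
  printf '%s_db_entrypoint_%s\n' "$stack" "$version"
}

# A different version gives a different name, so a redeploy creates a new
# config instead of trying to change the existing, immutable one.
old_name=$(config_name discourse_example_com v1)
new_name=$(config_name discourse_example_com v3)
echo "before: $old_name"
echo "after:  $new_name"
```

If the script content changes but the version (and so the name) stays the same, the redeploy keeps referring to the config that already exists, which is the situation this commit avoids.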
3 changed files with 31 additions and 13 deletions
--- a/abra.sh
+++ b/abra.sh
@@ -1,2 +1,2 @@
-export DB_ENTRYPOINT_VERSION=v1
+export DB_ENTRYPOINT_VERSION=v3
 export PG_BACKUP_VERSION=v2
--- a/compose.yml
+++ b/compose.yml
@@ -43,7 +43,7 @@ services:
        #- "traefik.http.routers.${STACK_NAME}.middlewares=${STACK_NAME}-redirect"
        #- "traefik.http.middlewares.${STACK_NAME}-redirect.headers.SSLForceHost=true"
        #- "traefik.http.middlewares.${STACK_NAME}-redirect.headers.SSLHost=${DOMAIN}"
-        - "coop-cloud.${STACK_NAME}.version=0.8.0+3.5.0"
+        - "coop-cloud.${STACK_NAME}.version=0.10.0+3.5.0"
    healthcheck:
      test: "ruby -e \"require 'uri'; require 'net/http'; uri = URI('http://localhost:3000/srv/status'); res = Net::HTTP.get_response(uri); if res.is_a?(Net::HTTPSuccess) then exit (0) else exit (1) end\""
      interval: 30s
@@ -80,7 +80,7 @@ services:
        backupbot.restore.post-hook: "/pg_backup.sh restore"

  redis:
-    image: redis:7.4-alpine
+    image: redis:8.8-alpine
    networks:
      - internal
    volumes:
--- a/entrypoint.postgres.sh.tmpl
+++ b/entrypoint.postgres.sh.tmpl
@@ -2,16 +2,23 @@

 set -e

-MIGRATION_MARKER=$PGDATA/migration_in_progress
 OLDDATA=$PGDATA/old_data
 NEWDATA=$PGDATA/new_data

 echo "Running as $(id)"

-if [ -e $MIGRATION_MARKER ]; then
-  echo "FATAL: migration was started but did not complete in a previous run. manual recovery necessary"
-  exit 1
-fi
+# The migration uses $OLDDATA/$NEWDATA as scratch and removes them when it
+# finishes; a leftover *empty* one means a run was interrupted before any data
+# moved (data still intact at $PGDATA) so we clear it and retry, while a
+# *non-empty* one means data may live only there, so we stop for manual recovery.
+for scratch in $OLDDATA $NEWDATA; do
+  if [ -d "$scratch" ] && [ -n "$(ls -A "$scratch")" ]; then
+    echo "FATAL: $scratch exists and is not empty - a previous migration did not"
+    echo "complete and the data may only exist there. manual recovery necessary."
+    exit 1
+  fi
+done
+rm -rf $OLDDATA $NEWDATA

 if [ -f $PGDATA/PG_VERSION ]; then
  DATA_VERSION=$(cat $PGDATA/PG_VERSION)
@@ -23,22 +30,33 @@ if [ -f $PGDATA/PG_VERSION ]; then
    apt-get update && apt-get install -y --no-install-recommends \
      postgresql-$DATA_VERSION \
      && rm -rf /var/lib/apt/lists/*
+    # pg_upgrade must run as the old cluster's bootstrap superuser (the "install
+    # user", oid 10), and the new cluster must be initialised with that same
+    # user. It is not necessarily $POSTGRES_USER (e.g. clusters created with the
+    # default "postgres" superuser and a separate app role), so read it from the
+    # old cluster: briefly start it and ask, connecting as the app role we know.
+    PGBIN=/usr/lib/postgresql/$DATA_VERSION/bin
+    gosu postgres $PGBIN/pg_ctl -D $PGDATA -w \
+      -o "-c listen_addresses= -c unix_socket_directories=/tmp" start
+    INSTALL_USER=$(gosu postgres psql -h /tmp -U "$POSTGRES_USER" -d postgres -tAc \
+      "select rolname from pg_roles where oid = 10")
+    gosu postgres $PGBIN/pg_ctl -D $PGDATA -w stop
+    echo "old cluster install user: $INSTALL_USER"
    echo "shuffling around"
    gosu postgres mkdir $OLDDATA $NEWDATA
    chmod 700 $OLDDATA $NEWDATA
    mv $PGDATA/* $OLDDATA/ || true
-    touch $MIGRATION_MARKER
    echo "running initdb"
-    # abuse entrypoint script for initdb by making server error out
-    gosu postgres bash -c "export PGDATA=$NEWDATA ; /usr/local/bin/docker-entrypoint.sh --invalid-arg || true"
+    # abuse entrypoint script for initdb by making server error out; initialise
+    # the new cluster with the same superuser as the old one so pg_upgrade matches
+    gosu postgres bash -c "export PGDATA=$NEWDATA POSTGRES_USER=$INSTALL_USER ; /usr/local/bin/docker-entrypoint.sh --invalid-arg || true"
    echo "running pg_upgrade"
    cd /tmp
-    gosu postgres pg_upgrade --link -b /usr/lib/postgresql/$DATA_VERSION/bin -d $OLDDATA -D $NEWDATA -U $POSTGRES_USER
+    gosu postgres pg_upgrade --link -b /usr/lib/postgresql/$DATA_VERSION/bin -d $OLDDATA -D $NEWDATA -U $INSTALL_USER
    cp $OLDDATA/pg_hba.conf $NEWDATA/
    mv $NEWDATA/* $PGDATA
    rm -rf $OLDDATA
    rmdir $NEWDATA
-    rm $MIGRATION_MARKER
    echo "migration complete"
  fi
 fi
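
The scratch-directory guard in the entrypoint diff above can be tried in isolation. This is a minimal sketch, not the entrypoint's code: `check_scratch` is a hypothetical wrapper, it returns a status instead of calling `exit`, and it uses `rmdir` rather than `rm -rf` so that a non-empty directory can never be deleted by mistake.

```shell
#!/bin/sh
# Hypothetical standalone version of the guard, for experimentation only.
# Returns 1 if any given directory exists and is non-empty (data may live
# only there); otherwise removes the empty leftovers and returns 0.
check_scratch() {
  for scratch in "$@"; do
    if [ -d "$scratch" ] && [ -n "$(ls -A "$scratch")" ]; then
      echo "FATAL: $scratch exists and is not empty, manual recovery necessary" >&2
      return 1
    fi
  done
  # Only reached when every scratch dir is absent or empty.
  for scratch in "$@"; do
    if [ -d "$scratch" ]; then
      rmdir "$scratch" || return 1
    fi
  done
  return 0
}

# Demo in a throwaway directory: empty leftovers are cleared and the
# migration could safely be retried.
demo=$(mktemp -d)
mkdir "$demo/old_data" "$demo/new_data"
if check_scratch "$demo/old_data" "$demo/new_data"; then
  echo "empty leftovers cleared, safe to retry"
fi
rm -rf "$demo"
```

Checking every directory before removing any of them mirrors the order in the diff: nothing is deleted unless all scratch directories are known to be empty or absent.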
Author	SHA1	Message	Date
notplants	bd5f181737	fix(db): bump DB_ENTRYPOINT_VERSION to v3 so the entrypoint config reloads The install-user fix changed the entrypoint content; swarm configs are immutable, so the config name (which embeds DB_ENTRYPOINT_VERSION) must change for a redeploy to pick up the new script.	2026-06-16 18:04:05 +00:00
notplants	57f5ee2531	fix(db): run pg_upgrade as the old cluster's real install user pg_upgrade must run as the old cluster's bootstrap superuser (oid 10), and the new cluster must be initialised with that same user, otherwise it fails the "database user is the install user" consistency check. The install user is not necessarily $POSTGRES_USER: clusters created with the default "postgres" superuser plus a separate app role (e.g. discourse) are common. Detect it from the old cluster by briefly starting it and reading pg_roles (oid = 10) as the known app role, then use it for both the new cluster's initdb and the pg_upgrade -U argument.	2026-06-16 17:59:26 +00:00
notplants	101ffe1964	fix(db): make pg_upgrade migration idempotent & crash-safe The postgres major-version migration in the db entrypoint was not safe to re-run. If the container was killed mid-migration it could crash-loop forever ("mkdir: cannot create directory .../old_data: File exists") or silently initdb a fresh empty cluster over the live data once PG_VERSION had been moved out of $PGDATA but before the in-progress marker was written. Replace the marker file with a state-driven guard keyed on the scratch dirs: empty old_data/new_data means the run was interrupted before any data moved, so discard and retry (idempotent); non-empty means data may only live there, so stop for manual recovery. Bump DB_ENTRYPOINT_VERSION v1->v2 so swarm picks up the new (immutable) config.	2026-06-16 17:00:16 +00:00
notplants	433ce12dbc	Merge pull request 'chore: upgrade to 0.10.0+3.5.0' (#2) from upgrade-0.8.0+3.5.0 into main Reviewed-on: https://git.autonomic.zone/recipe-maintainers/discourse/pulls/2	2026-06-15 17:37:14 +00:00
autonomic-bot	b7d8a244d7	chore: upgrade to 0.10.0+3.5.0 (redis 8.0->8.8-alpine)	2026-06-11 22:52:37 +00:00
autonomic-bot	7ae7b0f76e	chore: upgrade to 0.9.0+3.5.0	2026-06-05 02:03:34 +00:00
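
Commit 57f5ee2531 reads the install user from the old cluster and then passes it to both initdb and `pg_upgrade -U`. One possible hardening, sketched here as an assumption and not part of the commits, is to refuse an empty result before using it. `require_install_user` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: pass through a detected install user, or fail loudly
# if the lookup produced nothing. Not part of the entrypoint in this diff.
require_install_user() {
  if [ -z "$1" ]; then
    echo "FATAL: could not determine the old cluster's install user" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# In the entrypoint this could wrap the result of the existing psql lookup
# (the query that selects rolname from pg_roles where oid = 10), e.g.:
#   INSTALL_USER=$(require_install_user "$INSTALL_USER")
# Here a literal value stands in for that lookup.
INSTALL_USER=$(require_install_user "postgres")
echo "old cluster install user: $INSTALL_USER"
```

Because the entrypoint runs under `set -e`, a non-zero return from such a helper inside the assignment would stop the script before any data is moved.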