Switch to endpoint-mode dnsrr instead of vip #50
The default docker swarm endpoint mode (vip) introduces unnecessary
indirection in the communication between services, namely the
docker-proxy and a dynamic haproxy endpoint container. This PR
switches the socket-proxy service to endpoint_mode: dnsrr by default, and
does the same for the traefik service when it uses host-mode port publishing.
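As a rough illustration of the socket-proxy half of this change, a service can opt out of the virtual IP through deploy.endpoint_mode. This is only a sketch, not the actual diff from this PR: the image name and the network name are assumptions, and the real recipe may look different.

```yaml
# Sketch only: image and network names are hypothetical, not taken from the recipe.
version: "3.8"

services:
  socket-proxy:
    image: example/docker-socket-proxy:latest  # hypothetical image
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - internal
    deploy:
      # dnsrr: the service name resolves directly to the task IPs,
      # no virtual IP is allocated for this service
      endpoint_mode: dnsrr

networks:
  internal:
    driver: overlay
```

Note that the service publishes no ports here; it is only reached by name over the overlay network.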
I would strongly recommend considering switching to host-mode port
publishing by default, especially as most coop-cloud deployments are
single-server.
This switch has implications wrt traefik service labels, so we should discuss this before merging. I'm opening this PR not as a bug fix but as a way to start that discussion.
See: toolshed/organising#648
214b6eaba2 to ad0a3c2b41
@mirsal to clarify, by "...especially as most coop-cloud deployments are single-node." does node mean single server or apps per server?
single-server, sorry for the docker jargon ^^'
ad0a3c2b41 to e2ca964492
For reference, those considerations are documented there: https://docs.docker.com/engine/swarm/ingress/
Basically, we are using traefik as a load-balancer, so imho we should bypass the ingress routing-mesh as well as the (hidden) vip endpoint load balancer container, which will lead to better performance, out-of-the-box IPv6 ingress support and generally less complicated networking internally (fewer hidden containers, fewer networking endpoints and less reliance on docker-proxy)
Dayum, great work @mirsal as always! I'm all for it, altho I don't understand it all. #50 (comment) makes the effort sound very much worth it. Do we have any idea of any migration steps that will be required for others?
changing the endpoint mode to dnsrr for socket-proxy is kind of the low hanging fruit there, it shouldn't require any steps and it's a small easy win :)
for the bigger change of switching traefik to host-mode port publishing by default, we would need to find out about edge-cases and weird deployments (like is someone running multiple traefik instances on the same server for some reason? or hosting on a mobile phone or on a boat?) because those might require running abra app config traefik.example.com, changing a line in there and then redeploying traefik.
e2ca964492 to abbb3255f8
aight then this seems good to merge if someone can take on the recipe release labours!
i'd love to hear more reasoning on this. i think we originally thought that not-host-mode-by-default is "better" because not all services are exposed by default. you'd probably know better. what are the pros/cons here to switching the default? is it worth the migration / (potential) confusion cost?
@decentral1se thanks for stepping in!
that is especially confusing because in docker terms, host-mode networking and host-mode port publishing are completely different things, here we're talking about host-mode port publishing and not host-mode networking. It applies only to exposed ports.
To add to the confusion, the way port publishing works in docker swarm depends on the service endpoint_mode setting (under deploy in the compose format)
By default, published ports in a swarm use "ingress" mode (which requires services to run with endpoint_mode: vip). In this mode, swarm runs a hidden haproxy container acting as an internal load-balancer and uses docker-proxy in order to map exposed ports to running container endpoints through the ingress routing mesh. In practice, it means that in multi-server deployments, traffic reaching the exposed port on any server will be transparently routed to the loadbalancer endpoint and forwarded to one of the user containers that's part of the service exposing the port.
That makes for a lot of moving parts, breaks ipv6 ingress, and degrades performance, but that makes for "automagic" cluster ingress in which whatever server is reached by ingress traffic does not matter as docker will route and load-balance it transparently through the routing mesh.
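For reference, the default behaviour described above can be written out explicitly in the compose long syntax. This is a sketch under assumptions (the image tag and port number are illustrative, not copied from the recipe); leaving out both mode and endpoint_mode gives the same result.

```yaml
# Sketch only: spells out the defaults that apply when nothing is specified.
version: "3.8"

services:
  traefik:
    image: traefik:latest  # hypothetical tag
    ports:
      - target: 443        # port inside the container
        published: 443     # port opened on every swarm node
        protocol: tcp
        mode: ingress      # default: published through the routing mesh
    deploy:
      endpoint_mode: vip   # default: the service is reached via a virtual IP
```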
OTOH, host-mode port publishing bypasses the routing mesh and exposes the published ports directly on the host on which the endpoint is running. With endpoint_mode: dnsrr, docker does not provision that hidden haproxy container. The tradeoff with running host-mode port publishing is that it allows only one container to expose a specific port on a specific host, and it does not allow transparent routing from one server in the swarm to another when traffic reaches a host on which the exposed service is not running. I believe that's not a problem for us as long as we don't need to run multiple instances of traefik on the same host, and that we don't use the ingress routing mesh for server redundancy.
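A minimal sketch of what host-mode port publishing combined with dnsrr could look like for traefik, assuming the usual 80/443 ports and an illustrative image tag (neither is taken from the actual recipe):

```yaml
# Sketch only: ports and image tag are assumptions for illustration.
version: "3.8"

services:
  traefik:
    image: traefik:latest  # hypothetical tag
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host         # bind directly on the host running the task
      - target: 443
        published: 443
        protocol: tcp
        mode: host
    deploy:
      # dnsrr is only compatible with host-mode published ports,
      # ingress-mode ports need the default vip endpoint mode
      endpoint_mode: dnsrr
```

With this layout, only one task per host can hold ports 80 and 443, which is the tradeoff discussed above: a second service trying to publish the same port in host mode on that server would not be able to start there.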
Migration is where it gets a bit complicated, because that is potentially a breaking change in some situations, so we would need some knowledge about how people are using coop-cloud.
in simpler terms, traefik is our load-balancer, we don't need docker to put another load-balancer in front of it.
Thanks for simplifying it down. Some interesting stuff down that 🐰 🕳️ 👌
Thanks for the explanation!
Oh yeh, OK, I'm starting to get there. Still reading it over a few times. Seems like we deffo want this change!
Curious to understand what you would need to know. We can try to bring these points to discussion in the matrix/fedi/etc. channels.
I'm unsure but does this mean that some recipes which specify a ".port=XXX" need to be unique? Are there some conflicts we could see when switching over?
knowing of any deployment on which there are multiple traefik instances exposing the same ports on the same host (thinking maybe CI/CD or redundant traefik deployments) or deployments on which the ingress network interface changes over time (server alternating between wifi and ethernet for instance) would help figuring out the impact of migrating.
For the vast majority of deployments, I believe it can be done pretty smoothly by upgrading the traefik recipe
no, those ports are not exposed, they are bound by backend services on virtual interfaces and not required to be unique.
it's about the docker-compose.yml service ports definitions:
for example, if a recipe has:
then traefik can not bind port 1234 with publish mode host
(with the current default, that is mode: ingress, docker would load-balance between all services exposing the same port. I doubt there is any coop-cloud use case where that is relevant, but you never know... whereas if we switch to host-mode port publishing, traefik would simply refuse to start until all its exposed ports are available)
This branch has diverged from origin
@mirsal you can pull from my branch if you wanna save time basebuilder/traefik
Never heard of one, let's ask (again?) in Matrix but I think it's probably safe to assume nobody is doing this.
Shall we just merge it? 🤓
I'm gonna rebase / merge this / release this tmoro (or next days, at least), I guess.
Curious about improvements we might see from this one.
not much tbh, the important bit would be to switch traefik to host-mode port publishing, that would give us out-of-the box IPv6 ingress support.
acb4c6960a
Tested a few deployments, seems fine ✌️
i have to say that the release note wasn't of much help and i didn't get most of the discussion... so i'm not sure what to do, but the upgrade seems to be working.
are these messages on the log related to this?
@fauno yes, this is related: with this change, the docker socket proxy's internal IP address changes when the container is restarted. You can manually trigger it by running docker kill on the socket proxy and watching the traefik logs. As soon as the socket proxy is restarted and the service discovery TTL is hit (a few seconds), traefik should pick up the change. If that's not the case, then that's a problem and this change should be reverted.
Apologies for the confusing discussion, we were also discussing a different change than the one we implemented
@mirsal thanks! i don't have a socket proxy running that i know of, in fact i took this upgrade as an opportunity to disable gitea's ssh port since we aren't using it.
the socket proxy is enabled by default as part of the traefik recipe (traefik is configured to use the proxy to access the swarm control plane and read service labels)
see:
Pull request closed