It has observed defunct containerd processes accumulating over
time while dockerd was permanently failing to restart containerd.
Due to a bug in the runContainerdDaemon() function, dockerd does not clean up
its child process if containerd already exits very soon after the (re)start.
The reproducer and analysis below comes from docker 1.12.x but bug
still applies on latest master.
- from libcontainerd/remote_linux.go:
329 func (r *remote) runContainerdDaemon() error {
:
: // start the containerd child process
:
403 if err := cmd.Start(); err != nil {
404 return err
405 }
:
: // If containerd exits very soon after (re)start, it is
possible
: // that containerd is already in defunct state at the time
when
: // dockerd gets here. The setOOMScore() function tries to
write
: // to /proc/PID_OF_CONTAINERD/oom_score_adj. However, this
fails
: // with errno EINVAL because containerd is defunct. Please see
: // snippets of kernel source code and further explanation
below.
:
407 if err := setOOMScore(cmd.Process.Pid, r.oomScore); err != nil
{
408 utils.KillProcess(cmd.Process.Pid)
:
: // Due to the error from write() we return here. As
the
: // goroutine that would clean up the child has not
been
: // started yet, containerd remains in the defunct
state
: // and never gets reaped.
:
409 return err
410 }
:
417 go func() {
418 cmd.Wait()
419 close(r.daemonWaitCh)
420 }() // Reap our child when needed
:
423 }
This is the kernel function that gets invoked when dockerd tries to
write
to /proc/PID_OF_CONTAINERD/oom_score_adj.
- from fs/proc/base.c:
1197 static ssize_t oom_score_adj_write(struct file *file, ...
1198 size_t count, loff_t
*ppos)
1199 {
:
1223 task = get_proc_task(file_inode(file));
:
: // The defunct containerd process does not have a virtual
: // address space anymore, i.e. task->mm is NULL. Thus the
: // following code returns errno EINVAL to dockerd.
:
1230 if (!task->mm) {
1231 err = -EINVAL;
1232 goto err_task_lock;
1233 }
:
1253 err_task_lock:
:
1257 return err < 0 ? err : count;
1258 }
The purpose of the following program is to demonstrate the behavior of
the oom_score_adj_write() function in connection with a defunct process.
$ cat defunct_test.c
\#include <unistd.h>
main()
{
pid_t pid = fork();
if (pid == 0)
// child
_exit(0);
// parent
pause();
}
$ make defunct_test
cc defunct_test.c -o defunct_test
$ ./defunct_test &
[1] 3142
$ ps -f | grep defunct_test | grep -v grep
root 3142 2956 0 13:04 pts/0 00:00:00 ./defunct_test
root 3143 3142 0 13:04 pts/0 00:00:00 [defunct_test] <defunct>
$ echo "ps 3143" | crash -s
PID PPID CPU TASK ST %MEM VSZ RSS COMM
3143 3142 2 ffff880035def300 ZO 0.0 0 0
defunct_test
$ echo "px ((struct task_struct *)0xffff880035def300)->mm" | crash -s
$1 = (struct mm_struct *) 0x0
^^^ task->mm is NULL
$ cat /proc/3143/oom_score_adj
0
$ echo 0 > /proc/3143/oom_score_adj
-bash: echo: write error: Invalid argument"
---
This patch fixes the above issue by making sure we start the reaper
goroutine as soon as possible.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
(cherry picked from commit 27087eacbf96e6ef9d48a6d3dc89c7c1cff155b4)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
Signed-off-by: Daniel Nephin <dnephin@docker.com>
(cherry picked from commit f1ade82d82e6436971c6b7d08eb1da57ed9ba756)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
Signed-off-by: Ying Li <ying.li@docker.com>
(cherry picked from commit d60f18204978d438d1eb336512576d47991c8ac1)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
If a container mount the socket the daemon is listening on into
container while the daemon is being shutdown, the socket will
not exist on the host, then daemon will assume it's a directory
and create it on the host, this will cause the daemon can't start
next time.
fix issue https://github.com/moby/moby/issues/30348
To reproduce this issue, you can add following code
```
--- a/daemon/oci_linux.go
+++ b/daemon/oci_linux.go
@@ -8,6 +8,7 @@ import (
"sort"
"strconv"
"strings"
+ "time"
"github.com/Sirupsen/logrus"
"github.com/docker/docker/container"
@@ -666,7 +667,8 @@ func (daemon *Daemon) createSpec(c *container.Container) (*libcontainerd.Spec, e
if err := daemon.setupIpcDirs(c); err != nil {
return nil, err
}
-
+ fmt.Printf("===please stop the daemon===\n")
+ time.Sleep(time.Second * 2)
ms, err := daemon.setupMounts(c)
if err != nil {
return nil, err
```
step1 run a container which has `--restart always` and `-v /var/run/docker.sock:/sock`
```
$ docker run -ti --restart always -v /var/run/docker.sock:/sock busybox
/ #
```
step2 exit the the container
```
/ # exit
```
and kill the daemon when you see
```
===please stop the daemon===
```
in the daemon log
The daemon can't restart again and fail with `can't create unix socket /var/run/docker.sock: is a directory`.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit 7318eba5b2f8bb4b867ca943c3229260ca98a3bc)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
The health check process doesn't have all the environment
varialbes in the container or has them set incorrectly.
This patch should fix that problem.
Signed-off-by: Boaz Shuster <ripcurld.github@gmail.com>
(cherry picked from commit 5836d86ac4d617e837d94010aa60384648ab59ea)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
If a service alias is copied to task, then the DNS resolution on the
service name will resolve to service VIP and all of Task-IPs and that
will break the concept of vip based load-balancing resulting in all the
dns-rr caching issues.
This is a regression introduced in #33130
Signed-off-by: Madhu Venugopal <madhu@docker.com>
(cherry picked from commit 38c15531501578b96d34be5ce7f33a0be6be078f)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
the source was missing from the second dispatch
Signed-off-by: Daniel Nephin <dnephin@docker.com>
(cherry picked from commit 3f2604157790408acf5ad05c74cebe105f2b6979)
Note that go1.8.2 contains a security fix (CVE-2017-8932).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 0c7c900e9e66335a6bd486be008af43ae83a5a37)
However, do clear the directory if init or join fails, because we don't
want to leave it in a half-finished state.
Signed-off-by: Ying Li <ying.li@docker.com>
(cherry picked from commit bf3e9293a66c77a2fddf4e691222898846b4af9f)
DeviceMapper tasks in go use SetFinalizer to clean up C construct
counterparts in the C LVM library. While thats well and good, it relies
heavily on the exact interpretation of when the golang garbage collector
determines that an object is unreachable is subject to reclaimation.
While common sense would assert that for stack variables (which these DM
tasks always are), are unreachable when the stack frame in which they
are declared returns, thats not the case. According to this:
https://golang.org/pkg/runtime/#SetFinalizer
The garbage collector decides that, if a function calls into a
systemcall (which task.run() always will in LVM), and there are no
subsequent references to the task variable within that stack frame, then
it can be reclaimed. Those conditions are met in several devmapper.go
routines, and if the garbage collector runs in the middle of a
deviceMapper operation, then the task can be destroyed while the
operation is in progress, leading to crashes, failed operations and
other unpredictable behavior.
The fix is to use the KeepAlive interface:
https://golang.org/pkg/runtime/#KeepAlive
The KeepAlive method is effectively an empy reference that fools the
garbage collector into thinking that a variable is still reachable. By
adding a call to KeepAlive in the task.run() method, we can ensure that
the garbage collector won't reclaim a task object until its execution
within the deviceMapper C library is complete.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
(cherry picked from commit d764d8b16624e4924b3949273089f851efa0f717)
This was mistakenly unmounting everything under `plugins/*` instead of
just `plugins/<id>/*` anytime a plugin is removed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit db5f31732a9868c1e9e4f9a49be70b794ff82d4f)
Commit abd72d4008dde7ee8249170d49eb4bc963c51e24 added
a "FIXME" comment to the container "State", mentioning
that a container cannot be both "Running" and "Paused".
This comment was incorrect, because containers on
Linux actually _must_ be running in order to be
paused.
This patch adds additional information both in a
comment, and in the API documentation to clarify
that these booleans are not mutually exclusive.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: b654b6244d6d63a0758029488a95feb446e089bd
Component: engine