It has observed defunct containerd processes accumulating over
time while dockerd was permanently failing to restart containerd.
Due to a bug in the runContainerdDaemon() function, dockerd does not clean up
its child process if containerd already exits very soon after the (re)start.
The reproducer and analysis below comes from docker 1.12.x but bug
still applies on latest master.
- from libcontainerd/remote_linux.go:
329 func (r *remote) runContainerdDaemon() error {
:
: // start the containerd child process
:
403 if err := cmd.Start(); err != nil {
404 return err
405 }
:
: // If containerd exits very soon after (re)start, it is
possible
: // that containerd is already in defunct state at the time
when
: // dockerd gets here. The setOOMScore() function tries to
write
: // to /proc/PID_OF_CONTAINERD/oom_score_adj. However, this
fails
: // with errno EINVAL because containerd is defunct. Please see
: // snippets of kernel source code and further explanation
below.
:
407 if err := setOOMScore(cmd.Process.Pid, r.oomScore); err != nil
{
408 utils.KillProcess(cmd.Process.Pid)
:
: // Due to the error from write() we return here. As
the
: // goroutine that would clean up the child has not
been
: // started yet, containerd remains in the defunct
state
: // and never gets reaped.
:
409 return err
410 }
:
417 go func() {
418 cmd.Wait()
419 close(r.daemonWaitCh)
420 }() // Reap our child when needed
:
423 }
This is the kernel function that gets invoked when dockerd tries to
write
to /proc/PID_OF_CONTAINERD/oom_score_adj.
- from fs/proc/base.c:
1197 static ssize_t oom_score_adj_write(struct file *file, ...
1198 size_t count, loff_t
*ppos)
1199 {
:
1223 task = get_proc_task(file_inode(file));
:
: // The defunct containerd process does not have a virtual
: // address space anymore, i.e. task->mm is NULL. Thus the
: // following code returns errno EINVAL to dockerd.
:
1230 if (!task->mm) {
1231 err = -EINVAL;
1232 goto err_task_lock;
1233 }
:
1253 err_task_lock:
:
1257 return err < 0 ? err : count;
1258 }
The purpose of the following program is to demonstrate the behavior of
the oom_score_adj_write() function in connection with a defunct process.
$ cat defunct_test.c
\#include <unistd.h>
main()
{
pid_t pid = fork();
if (pid == 0)
// child
_exit(0);
// parent
pause();
}
$ make defunct_test
cc defunct_test.c -o defunct_test
$ ./defunct_test &
[1] 3142
$ ps -f | grep defunct_test | grep -v grep
root 3142 2956 0 13:04 pts/0 00:00:00 ./defunct_test
root 3143 3142 0 13:04 pts/0 00:00:00 [defunct_test] <defunct>
$ echo "ps 3143" | crash -s
PID PPID CPU TASK ST %MEM VSZ RSS COMM
3143 3142 2 ffff880035def300 ZO 0.0 0 0
defunct_test
$ echo "px ((struct task_struct *)0xffff880035def300)->mm" | crash -s
$1 = (struct mm_struct *) 0x0
^^^ task->mm is NULL
$ cat /proc/3143/oom_score_adj
0
$ echo 0 > /proc/3143/oom_score_adj
-bash: echo: write error: Invalid argument"
---
This patch fixes the above issue by making sure we start the reaper
goroutine as soon as possible.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
(cherry picked from commit 27087eacbf96e6ef9d48a6d3dc89c7c1cff155b4)
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
Signed-off-by: Eli Uriegas <eli.uriegas@docker.com>
the source was missing from the second dispatch
Signed-off-by: Daniel Nephin <dnephin@docker.com>
(cherry picked from commit 3f2604157790408acf5ad05c74cebe105f2b6979)
Note that go1.8.2 contains a security fix (CVE-2017-8932).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 0c7c900e9e66335a6bd486be008af43ae83a5a37)
However, do clear the directory if init or join fails, because we don't
want to leave it in a half-finished state.
Signed-off-by: Ying Li <ying.li@docker.com>
(cherry picked from commit bf3e9293a66c77a2fddf4e691222898846b4af9f)
DeviceMapper tasks in go use SetFinalizer to clean up C construct
counterparts in the C LVM library. While thats well and good, it relies
heavily on the exact interpretation of when the golang garbage collector
determines that an object is unreachable is subject to reclaimation.
While common sense would assert that for stack variables (which these DM
tasks always are), are unreachable when the stack frame in which they
are declared returns, thats not the case. According to this:
https://golang.org/pkg/runtime/#SetFinalizer
The garbage collector decides that, if a function calls into a
systemcall (which task.run() always will in LVM), and there are no
subsequent references to the task variable within that stack frame, then
it can be reclaimed. Those conditions are met in several devmapper.go
routines, and if the garbage collector runs in the middle of a
deviceMapper operation, then the task can be destroyed while the
operation is in progress, leading to crashes, failed operations and
other unpredictable behavior.
The fix is to use the KeepAlive interface:
https://golang.org/pkg/runtime/#KeepAlive
The KeepAlive method is effectively an empy reference that fools the
garbage collector into thinking that a variable is still reachable. By
adding a call to KeepAlive in the task.run() method, we can ensure that
the garbage collector won't reclaim a task object until its execution
within the deviceMapper C library is complete.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
(cherry picked from commit d764d8b16624e4924b3949273089f851efa0f717)
This was mistakenly unmounting everything under `plugins/*` instead of
just `plugins/<id>/*` anytime a plugin is removed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit db5f31732a9868c1e9e4f9a49be70b794ff82d4f)
Commit abd72d4008dde7ee8249170d49eb4bc963c51e24 added
a "FIXME" comment to the container "State", mentioning
that a container cannot be both "Running" and "Paused".
This comment was incorrect, because containers on
Linux actually _must_ be running in order to be
paused.
This patch adds additional information both in a
comment, and in the API documentation to clarify
that these booleans are not mutually exclusive.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: b654b6244d6d63a0758029488a95feb446e089bd
Component: engine
These are already present in `docker/cli` 👼
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
Upstream-commit: 61e527f16ccbbbca137c4db1b13b558ee4a51213
Component: engine
when `cli.post(...)` fails `errC <- err` blocks because `errC` is unbufferd.
Signed-off-by: Simon Menke <simon.menke@gmail.com>
Upstream-commit: 4d2d2ea39336aade783c5c415b83d129bdd339bb
Component: engine
OverlayFS is supported on top of btrfs as of Linux Kernel 4.7.
Skip the hard enforcement when on kernel 4.7 or newer and
respect the kernel check override flag on older kernels.
https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
Upstream-commit: f64a4ad008e68996afcec3ab34a869887716f944
Component: engine
ineffectual assignment to isCanonical, delete it, and make the "if" sentence to fit the golang usage
Upstream-commit: b0dd3dfc1184b210854b62c231b7b074dd6dbd26
Component: engine
Since this new version of the CLI resolves image digests for swarm
services by default, and we do not want integration tests to talk to
Docker Hub, update CLI tests to suppress this behavior.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Upstream-commit: d012569b78c27e6e52bc5006d9a1d7a2099b1c2b
Component: engine