This is required to address a race condition described in #5553,
where a container can be partially deleted -- for example, the
root filesystem but not the init filesystem -- which makes
it impossible to delete the container without re-adding the
missing filesystems manually.
This behavior has been witnessed when rebooting boxes that
are configured to remove containers on shutdown in parallel
with stopping the Docker daemon.
Docker-DCO-1.1-Signed-off-by: Gabriel Monroy <gabriel@opdemand.com> (github: gabrtv)
Upstream-commit: 9f152aacf8427cbd20a70d52d633f8a6d624aff5
Component: engine
We don't have the flexibility to do extra things with lxc because it is
a black box and most fo the magic happens before we get a chance to
interact with it in dockerinit.
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: 59fe77bfa638001cbe9af386f350d6e0dbb23398
Component: engine
This also cleans up some of the left over restriction paths code from
before.
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: f5139233b930e436707a65cc032aa2952edd6e4a
Component: engine
It has been pointed out that some files in /proc and /sys can be used
to break out of containers. However, if those filesystems are mounted
read-only, most of the known exploits are mitigated, since they rely
on writing some file in those filesystems.
This does not replace security modules (like SELinux or AppArmor), it
is just another layer of security. Likewise, it doesn't mean that the
other mitigations (shadowing parts of /proc or /sys with bind mounts)
are useless. Those measures are still useful. As such, the shadowing
of /proc/kcore is still enabled with both LXC and native drivers.
Special care has to be taken with /proc/1/attr, which still needs to
be mounted read-write in order to enable the AppArmor profile. It is
bind-mounted from a private read-write mount of procfs.
All that enforcement is done in dockerinit. The code doing the real
work is in libcontainer. The init function for the LXC driver calls
the function from libcontainer to avoid code duplication.
Docker-DCO-1.1-Signed-off-by: Jérôme Petazzoni <jerome@docker.com> (github: jpetazzo)
Upstream-commit: 1c4202a6142d238d41f10deff1f0548f7591350b
Component: engine
Kernel capabilities for privileged syslog operations are currently splitted into
CAP_SYS_ADMIN and CAP_SYSLOG since the following commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ce6ada35bdf710d16582cc4869c26722547e6f11
This patch drops CAP_SYSLOG to prevent containers from messing with
host's syslog (e.g. `dmesg -c` clears up host's printk ring buffer).
Closes#5491
Docker-DCO-1.1-Signed-off-by: Eiichi Tsukata <devel@etsukata.com> (github: Etsukata)
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: cac0cea03f85191b3d92cdaeae827fdd93fb1b29
Component: engine
This duplicates some of the Exec code but I think it it worth it because
the native driver is more straight forward and does not have the
complexity have handling the type issues for now.
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: 60e4276f5af360dd3292e22993c0c132a86edc2e
Component: engine
This temp. expands the Exec method's signature but adds a more robust
way to know when the container's process is actually released and begins
to run. The network interfaces are not guaranteed to be up yet but this
provides a more accurate view with a single callback at this time.
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: f1104014372e71e1f8ae7a63d17e18de5e2fa93a
Component: engine
Without this patch, containers inherit the open file descriptors of the daemon, so my "exec 42>&2" allows us to "echo >&42 some nasty error with some bad advice" directly into the daemon log. :)
Also, "hack/dind" was already doing this due to issues caused by the inheritance, so I'm removing that hack too since this patch obsoletes it by generalizing it for all containers.
Docker-DCO-1.1-Signed-off-by: Andrew Page <admwiggin@gmail.com> (github: tianon)
Upstream-commit: d5d62ff95574a48816890d8d6e0785a79f559c3c
Component: engine
Added --selinux-enable switch to daemon to enable SELinux labeling.
The daemon will now generate a new unique random SELinux label when a
container starts, and remove it when the container is removed. The MCS
labels will be stored in the daemon memory. The labels of containers will
be stored in the container.json file.
When the daemon restarts on boot or if done by an admin, it will read all containers json files and reserve the MCS labels.
A potential problem would be conflicts if you setup thousands of containers,
current scheme would handle ~500,000 containers.
Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: crosbymichael)
Upstream-commit: b7942ec2ca7c7568df0c3b7eb554b05e2c3a3081
Component: engine
This has every container using the docker daemon's pid for the processes
label so it does not work correctly.
Docker-DCO-1.1-Signed-off-by: Michael Crosby <michael@crosbymichael.com> (github: crosbymichael)
Upstream-commit: f0e6e135a8d733af173bf0b8732c704c9ec716d7
Component: engine
This will allow for these to be set independently. Keep the current Docker behavior where Memory and MemoryReservation are set to the value of Memory.
Docker-DCO-1.1-Signed-off-by: Victor Marmol <vmarmol@google.com> (github: vmarmol)
Upstream-commit: f188b9f623e23ee624aca8654bf00f49ee3bae29
Component: engine
container.Kill() might read a pid of 0 from
container.State.Pid due to losing a race with
container.monitor() calling
container.State.SetStopped(). Sending a SIGKILL to
pid 0 is undesirable as "If pid equals 0, then sig
is sent to every process in the process group of
the calling process."
Docker-DCO-1.1-Signed-off-by: Daniel Norberg <daniel.norberg@gmail.com> (github: danielnorberg)
Upstream-commit: b3ddc31b9581665eb15dedd0aa45bd37c1eb6815
Component: engine
We used to mount in Create() to be able to create a few files that
needs to be in each device. However, this mount is problematic for
selinux, as we need to set the mount label at mount-time, and it
is not known at the time of Create().
This change just moves the file creation to first Get() call and
drops the mount from Create(). Additionally, this lets us remove
some complexities we had to avoid an extra unmount+mount cycle.
Docker-DCO-1.1-Signed-off-by: Alexander Larsson <alexl@redhat.com> (github: alexlarsson)
Upstream-commit: 73d9ede12c9328c44e38699dbe3a04479d3926e6
Component: engine
container.Register() checks both IsRunning() and IsGhost(), but at
this point IsGhost() is always true if IsRunning() is true. For a
newly created container both are false, and for a restored-from-disk
container Daemon.load() sets Ghost to true if IsRunning is true. So we
just drop the IsGhost check.
This was the last call to IsGhost, so we remove It and all other
traces of the ghost state.
Docker-DCO-1.1-Signed-off-by: Alexander Larsson <alexl@redhat.com> (github: alexlarsson)
Upstream-commit: cf997aa905c5c6f5a29fa3658d904ffc81a1a4a1
Component: engine