On Ubuntu and Debian there is a sysctl which allows to block
clone(CLONE_NEWUSER) via "sysctl kernel.unprivileged_userns_clone=0"
for unprivileged users that do not have CAP_SYS_ADMIN.
See: https://lists.ubuntu.com/archives/kernel-team/2016-January/067926.html
The DockerSuite.TestRunSeccompUnconfinedCloneUserns testcase fails if
"kernel.unprivileged_userns_clone" is set to 0:
docker_cli_run_unix_test.go:1040:
c.Fatalf("expected clone userns with --security-opt seccomp=unconfined
to succeed, got %s: %v", out, err)
... Error: expected clone userns with --security-opt seccomp=unconfined
to succeed, got clone failed: Operation not permitted
: exit status 1
So add a check and skip the testcase if kernel.unprivileged_userns_clone is 0.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Upstream-commit: 87e4e3af68741afcebf11499d1dcbc91b655b349
Component: engine
While testing #24510 I noticed that 32 bit syscalls were incorrectly being
blocked and we did not have a test for this, so adding one.
This is only tested on amd64 as it is the only architecture that
reliably supports 32 bit code execution, others only do sometimes.
There is no 32 bit libc in the buildpack-deps so we cannot build
32 bit C code easily so use the simplest assembly program which
just calls the exit syscall.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Upstream-commit: 93bbc76ee53240e0862c6f1ff409e7a4ee0883dc
Component: engine
To implement seccomp for s390x the following changes are required:
1) seccomp_default: Add s390 compat mode
On s390x (64 bit) we can run s390 (32 bit) programs in 32 bit
compat mode. Therefore add this information to arches().
2) seccomp_default: Use correct flags parameter for sys_clone on s390x
On s390x the second parameter for the clone system call is the flags
parameter. On all other architectures it is the first one.
See kernel code kernel/fork.c:
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
int __user *, parent_tidptr,
So fix the docker default seccomp rule and check for the second
parameter on s390/s390x.
3) seccomp_default: Add s390 specific syscalls
For s390 we currently have three additional system calls that should
be added to the seccomp whitelist:
- Other architectures can read/write unprivileged from/to PCI MMIO memory.
On s390 the instructions are privileged and therefore we need system
calls for that purpose:
* s390_pci_mmio_write()
* s390_pci_mmio_read()
- Runtime instrumentation:
* s390_runtime_instr()
4) test_integration: Do not run seccomp default profile test on s390x
The generated profile that we check in is for amd64 and i386
architectures and does not work correctly on s390x.
See also: 75385dc216e ("Do not run the seccomp tests that use
default.json on non x86 architectures")
5) Dockerfile.s390x: Add "seccomp" to DOCKER_BUILDTAGS
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Upstream-commit: bf2a577c131d8998eb6ecac986d80e1289e6c801
Component: engine
Return the correct status code on flag parsins errors.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
Upstream-commit: 5ab24342258c70438ab8edf708ebc466b1677f38
Component: engine
If we attach to a running container and stream is closed afterwards, we
can never be sure if the container is stopped or detached. Adding a new
type of `detach` event can explicitly notify client that container is
detached, so client will know that there's no need to wait for its exit
code and it can move forward to next step now.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Upstream-commit: 83ad006d4724929ccbde4bdf768374fad0eeab44
Component: engine
The Seccomp tests ran 11 tests in parallel and this appears to be
hitting some sort of bug on CI. Splitting into two tests means that
I can no longer repeoduce the failure on the slow laptop where I could
reproduce the failures before.
Obviously this does not fix the underlying issue, which I will
continue to investigate, but not having the tests failing a lot
before the freeze for 1.12 would be rather helpful.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Upstream-commit: cfca3255a83c7cbaeaa623617bf71688723b21aa
Component: engine
This fix tries to address the issue raised in #22420. When
`--tmpfs` is specified with `/tmp`, the default value is
`rw,nosuid,nodev,noexec,relatime,size=65536k`. When `--tmpfs`
is specified with `/tmp:rw`, then the value changed to
`rw,nosuid,nodev,noexec,relatime`.
The reason for such an inconsistency is because docker tries
to add `size=65536k` option only when user provides no option.
This fix tries to address this issue by always pre-progating
`size=65536k` along with `rw,nosuid,nodev,noexec,relatime`.
If user provides a different value (e.g., `size=8192k`), it
will override the `size=65536k` anyway since the combined
options will be parsed and merged to remove any duplicates.
Additional test cases have been added to cover the changes
in this fix.
This fix fixes#22420.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Upstream-commit: 397a6fefadf9ac91a5c9de2447f4dea607296470
Component: engine
The generated profile that we check in is for amd64 and i386 architectures
and does not work correctly on arm as it is missing required syscalls,
and also specifies the architectures that are supported. It works on
ppc64le at the moment but better to skip the test as it is likely to
break in future.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Upstream-commit: 75385dc216e784d24535326376352de03eaeb059
Component: engine
Currently, using a custom detach key with an invalid sequence, eats a
part of the sequence, making it weird and difficult to enter some key
sequence.
This fixes by keeping the input read when trying to see if it's the key
sequence or not, and "writing" then is the key sequence is not the right
one, preserving the initial input.
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
Upstream-commit: 0fb6190243d6101f96283e487cd4911142a05483
Component: engine
This was not changed when the additional tests were added.
It may be the reason for occasional test failures.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Upstream-commit: 3598f2e33198686f0afa08aca640dbda8697fcb2
Component: engine
Currently the default seccomp profile is fixed. This changes it
so that it varies depending on the Linux capabilities selected with
the --cap-add and --cap-drop options. Without this, if a user adds
privileges, eg to allow ptrace with --cap-add sys_ptrace then still
cannot actually use ptrace as it is still blocked by seccomp, so
they will probably disable seccomp or use --privileged. With this
change the syscalls that are needed for the capability are also
allowed by the seccomp profile based on the selected capabilities.
While this patch makes it easier to do things with for example
cap_sys_admin enabled, as it will now allow creating new namespaces
and use of mount, it still allows less than --cap-add cap_sys_admin
--security-opt seccomp:unconfined would have previously. It is not
recommended that users run containers with cap_sys_admin as this does
give full access to the host machine.
It also cleans up some architecture specific system calls to be
only selected when needed.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Upstream-commit: a01c4dc8f85827f32d88522e5153dddc02f11806
Component: engine
Add the swapMemorySupport requirement to all tests related to the OOM killer. The --memory option has the subtle side effect of defaulting --memory-swap to double the value of --memory. The OOM killer doesn't kick in until the container exhausts memory+swap, and so without the memory swap cgroup the tests will timeout due to swap being effectively unlimited.
Document the default behavior of --memory-swap in the docker run man page.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Upstream-commit: adabb51311ecac031bd72378c5ed1669d1835d40
Component: engine
This fix tries to address the issue raised in #22271 where
relative symlinks don't work with --device argument.
Previously, the symlinks in --device was implemneted (#20684)
with `os.Readlink()` which does not resolve if the linked
target is a relative path. In this fix, `filepath.EvalSymlinks()`
has been used which will reolve correctly with relative
paths.
An additional test case has been added to the existing
`TestRunDeviceSymlink` to cover changes in this fix.
This fix is related to #13840 and #20684, #22271.
This fix fixes#22271.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Upstream-commit: 632b314b239d1cd5e2498f198503a2983233a9f4
Component: engine
There was an error in validation logic before, should use period
instead of quota, and also add check for negative
number here, if not with that, it would had cpu.cfs_period_us: invalid argument
which is not good for users.
Signed-off-by: Kai Qiang Wu(Kennan) <wkqwu@cn.ibm.com>
Upstream-commit: 62cb06a6c1db5599f1f5b9b95b298be83c509860
Component: engine
This patch will allow users to specify namespace specific "kernel parameters"
for running inside of a container.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Upstream-commit: 9caf7aeefd23263a209c26c8439d26c147972d81
Component: engine
Kernel has no limit for memory reservation, but in different
kernel versions, the default behavior is different.
On kernel 3.13,
docker run --rm --memory-reservation 1k busybox cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes
the output would be 4096, but on kernel 4.1, the output is 0.
Since we have minimum limit for memory and kernel memory, we
can have this limit for memory reservation as well, to make
the behavior consistent.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Upstream-commit: 50a61810056a421fb94acf26277995f2c1f31ede
Component: engine
All other options we have use `=` as separator, labels,
log configurations, graph configurations and so on.
We should be consistent and use `=` for the security
options too.
Signed-off-by: David Calavera <david.calavera@gmail.com>
Upstream-commit: cb9aeb0413ca75bb3af7fa723a1f2e6b2bdbcb0e
Component: engine
Progress toward being able to run integration-cli campaign using a
client hitting a remote host.
Most of these fixes imply flagging tests that assume they are running on
the same host than the Daemon. Also fixes the `contrib/httpserver` image
that couldn't run because of a dynamically linked Go binary inside the
busybox image.
Signed-off-by: Arnaud Porterie <arnaud.porterie@docker.com>
Upstream-commit: a943c401509e7994ae5c574a4b7e23354e44a105
Component: engine
1. Replace raw `docker inspect -f xxx` with `inspectField`, to make code
cleaner and more consistent
2. assert the error in function `inspectField*` so we don't need to
assert the return value of it every time, this will make inspect easier.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Upstream-commit: 62a856e9129c9d5cf7db9ea6322c9073d68e3ea4
Component: engine