| Commit message (Collapse) | Author | Age | Files | Lines |
| |
So far, we supported two modes:
1. when running unpriv we'd get the mounts from mountfsd, and the userns
from nsresourced
2. when running priv we'd do the mounts/userns ourselves
This untangles this a bit, so that we can also use mountfsd/nsresourced
when running privileged.
I think this is generally a bit nicer, and probably something we should
switch to entirely one day, as it reduces the variety of codepaths.
With this patch the default behaviour remains unchanged, but by
selecting the new "managed" option for --private-users= the codepaths
via mountfsd/nsresourced can be explicitly requested even when running
with privs.
This mostly just reworks a number of codepaths so that we check for
arg_userns_mode != USER_NAMESPACE_MANAGED rather than arg_privileged,
but requires more fixes, too. The devil is in the details.
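A minimal sketch of the kind of check described above. Only the names arg_userns_mode, arg_privileged and USER_NAMESPACE_MANAGED come from this message; the other enum values and the helper name are hypothetical and exist for illustration only.

```c
#include <stdbool.h>

/* Illustrative subset of user namespace modes. Only USER_NAMESPACE_MANAGED
 * is named in the commit message, the other values are assumptions. */
typedef enum UserNamespaceMode {
        USER_NAMESPACE_NO,
        USER_NAMESPACE_FIXED,
        USER_NAMESPACE_PICK,
        USER_NAMESPACE_MANAGED,
} UserNamespaceMode;

/* Hypothetical helper: true if the caller sets up mounts and the userns
 * itself, false if it should ask mountfsd/nsresourced for them.
 * Before the rework, the equivalent decision was keyed off "privileged". */
static bool sets_up_namespaces_itself(UserNamespaceMode mode) {
        return mode != USER_NAMESPACE_MANAGED;
}
```

Keying the decision off the mode rather than off privileges is what would let a privileged invocation opt into the managed path.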
| |
This adds a new "foreign" value to --private-users-ownership= which is a
lot like "map", but maps from the host's foreign UID range rather than from the
host's 0.
(This has nothing much to do with making unprivileged directory-based
containers work, it's just very handy that we can run privileged
containers with such a mapping too, with an easy switch)
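To illustrate what "mapping from a range base rather than from 0" means, here is a rough sketch. The helper name and the 64K range size are assumptions made for this example, and the base is left as a parameter because this message does not state the actual value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of one ownership range (64K UIDs), illustration only. */
#define RANGE_SIZE UINT32_C(0x10000)

/* Hypothetical helper: translate an on-disk owner UID into an offset
 * within the range starting at range_base. With range_base == 0 this
 * behaves like a mapping from the host's 0; with a foreign base it
 * behaves like a mapping from that range instead.
 * Returns false if the UID lies outside the range. */
static bool uid_to_range_offset(uint32_t range_base, uint32_t uid, uint32_t *ret) {
        if (uid < range_base || uid - range_base >= RANGE_SIZE)
                return false;

        *ret = uid - range_base;
        return true;
}
```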
| |
This simply calls into mountfsd to acquire the root mount and uses it as
root for the container.
Note that this also makes one more change: previously we ran containers
directly off their backing directory. Except when we didn't, and there
were a variety of exceptions: if we had no privs, if we ran off a disk
image, if the directory was the host's root dir, and some others.
This simplifies the logic a bit: we now simply always create a temporary
directory in /tmp/ and bind mount everything there, in all code paths.
After all, in order to control propagation we need to turn the root into
a mount point anyway, hence we might as well do it in one place for all
cases.
|
| |
|
| |
|
| |
|
|\
| |
| |
| |
| |
| |
| |
| | |
PIDFD_GET_*_NAMESPACE (#35242)
Supersedes #35308 (cherry-picked one commit and replaced the rest)
(I left a few comments that are folded by GitHub. Please make sure to
check them too.)
|
| |
| |
| |
| |
| |
| |
| |
| | |
- Make fd_is_namespace() take NamespaceType
- Drop support for kernel without NS_GET_NSTYPE (< 4.11)
- Port is_our_namespace() to namespace_open_by_type()
(preparation for later commits, where the latter
would go by pidfd if available, avoiding procfs)
|
|/ |
|
| |
|
|\ |
|
| | |
|
| | |
|
|/ |
|
| |
We nowadays support unprivileged invocation of systemd-nspawn +
systemd-vmspawn, but there was no support for discovering suitable disk
images (i.e. no per-user counterpart of /var/lib/machines). Add this
now, and hook it up everywhere.
Instead of hardcoding machined's, importd's, portabled's, sysupdated's
image discovery to RUNTIME_SCOPE_SYSTEM I introduced a field that makes
the scope variable, even if this field is always initialized to
RUNTIME_SCOPE_SYSTEM for now. I think these four services should
eventually be updated to support a per-user concept too, this is
preparation for that, even though it doesn't outright add support for
this.
This is for the most part not user visible, except in nspawn,
vmspawn and the dissect tool. For the latter I added a pair of
--user/--system switches to select the discovery scope.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
This function pins the *API* FS, i.e. /proc/ + /sys/, not just any fs.
Hence clarify this in the name.
(At least we call these two fs "API (V)FS" in our codebase, hence
continue to do so here)
|
| |
|
| |
|
|
|
|
|
|
| |
Then, the user of the PTY forwarder no longer needs to drain it
manually. Also, it is no longer necessary to explicitly free the PTY
forwarder early in order to disconnect it.
|
|
|
|
|
| |
Currently we do that in the user of PTY forwarder, e.g. nspawn.
But, let's do that unconditionally in the PTY forwarder.
|
|
|
|
|
|
|
|
| |
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.
This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
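For reference, a minimal sketch of querying the ambient capability set through prctl(). The helper name is hypothetical; PR_CAP_AMBIENT and PR_CAP_AMBIENT_IS_SET are the kernel interface, and on kernels without ambient capability support the call is expected to fail rather than return 0 or 1.

```c
#include <errno.h>
#include <sys/prctl.h>
#include <linux/capability.h>  /* CAP_* constants */

/* Returns 1 if the capability is in the calling thread's ambient set,
 * 0 if it is not, and a negative errno if the kernel rejects the
 * request (for example because PR_CAP_AMBIENT is not supported). */
static int ambient_capability_is_set(unsigned long cap) {
        int r;

        r = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0UL, 0UL);
        if (r < 0)
                return -errno;

        return r;
}
```

With the baseline raised, callers would no longer need a code path for the "not supported" case.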
|
|
|
|
|
|
|
| |
Then, switch the default value to "0600", due to general security
concerns about terminals being written to by other users.
Closing #35599.
|
|
|
|
| |
to perms
|
|
|
|
|
|
| |
When registering we condition this on "arg_register". Let's do the same
when unregistering, otherwise we might end up trying to unregister a
machine we never registered.
|
|
|
|
| |
It's the PID that is wrong, not the UID/GID, be precise.
|
|
|
|
|
|
|
|
|
| |
The wrong error code was logged.
But actually given that userns_mkdir() is fine with existing dirs, let's
drop the redundant conditionalization.
Follow-up for: a1fcaa1549d86098d0ba75254b6afc96c786b3b6
|
| |
|
|
|
|
|
|
|
| |
unspecified
Follow-up for efedb6b0f3cff37950112fd37cb750c16d599bc7.
Closes #35116.
|
|
|
|
|
|
|
|
|
| |
copy_devnodes()
While doing that, even if mknod() failed, we now try to fall back to
a bind mount anyway if arg_uid_shift == 0.
Mostly no functional change, just refactoring and preparation for a later commit.
|
|
|
|
|
|
|
|
| |
Follow-up for dc3223919f663b7c8b8d8d1d6072b4487df7709b.
If nspawn is invoked with DevicePolicy= but DeviceAllow= does not
contain /dev/fuse, nspawn will fail to get fuse version with -EPERM.
Let's silence the warning in that case.
|
| |
|
|
|
|
| |
(#34893)
|
| |
|
|\
| |
| | |
Change systemd-nspawn man page to strongly recommend private users
|
| |
| |
| |
| |
| | |
Both spellings were used, but the dictionary says that "lightweight"
is the standard spelling.
|
|/ |
|
| |
|
|
|
|
| |
In order to distinguish it from libc function naming.
|
|
|
|
| |
Follow-up for d7a6bb9891ecc38a1bedef9689d00671bb0001ff.
|
| |
This tries to get rid of most manual sigprocmask() changes, in favour
of:
1. The SD_EVENT_SIGNAL_PROCMASK flag to sd_event_add_signal()
2. The sd_event_set_signal_exit() call for handling SIGTERM/SIGINT
3. Move masking of SIGWINCH into ptyfwd, out of nspawn/vmspawn/run
And while we are at it get rid of a bunch of event source fields whose
lifetime is bound to the sd_event object they belong to anyway, and make
use of the "floating" event source feature of sd-event instead.
|
|
|
|
|
|
|
|
| |
Follow-up for dc3223919f663b7c8b8d8d1d6072b4487df7709b.
Addresses https://github.com/systemd/systemd/pull/34067#discussion_r1748061156.
Error codes other than ENOSYS should not show up here, but if one does,
there is still nothing we can do about it, so let's not log the failure loudly.
|
|\
| |
| | |
nspawn: make --volatile work with -U
|
| | |
The root directory is already mounted with a picked UID shift, hence
it is not necessary to remount with idmap. However, /usr/ is a bind-mount,
hence it must be remounted with idmap.
With this change, now '-U --volatile=yes' works fine.
Fixes #34254.
|
| | |
Previously, remount_idmap() failed as /var/ was already mounted, thus
remounting (strictly speaking, unmounting old root directory) failed
with -EBUSY.
As the tmpfs on /var/ is mounted with the picked UID shift, it should not
be remounted with idmap, but needs to be mounted after the root directory
has been remounted.
This makes '-U --volatile=state' work as expected.
|
| |
| |
| |
| |
| |
| | |
remount_idmap()
No functional change, just refactoring and preparation for later change.
|
| | |
Linux kernel v4.18 (2018-08-12) added user-namespace support to FUSE, and
bumped the FUSE version to 7.27 (see: da315f6e0398 (Merge tag
'fuse-update-4.18' of
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse, Linus Torvalds,
2018-06-07)). This means that on such kernels it is safe to enable FUSE in
nspawn containers.
In outer_child(), before calling copy_devnodes(), check the FUSE version to
decide whether to enable (>=7.27) or disable (<7.27) FUSE in the container. We
look at the FUSE version instead of the kernel version in order to enable FUSE
support on older-versioned kernels that may have the mentioned patchset
backported ([as requested by @poettering][1]). However, I am not sure that
this is safe; user-namespace support is not a documented part of the FUSE
protocol, which is what FUSE_KERNEL_VERSION/FUSE_KERNEL_MINOR_VERSION are meant
to capture. While the same patchset
- added FUSE_ABORT_ERROR (which is all that the 7.27 version bump
is documented as including),
- bumped FUSE_KERNEL_MINOR_VERSION from 26 to 27, and
- added user-namespace support
these 3 things are not inseparable; it is conceivable to me that a backport
could include the first 2 of those things and exclude the 3rd; perhaps it would
be safer to check the kernel version.
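A minimal sketch of the version comparison described above, assuming the major and minor numbers have already been obtained (for example via get_fuse_version()). The helper name is hypothetical, and treating any major version above 7 as sufficient is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the reported FUSE protocol version is at
 * least 7.27, the version that came with user-namespace support in the
 * upstream patchset. As noted above, a backport could in principle bump
 * the version without carrying that support. */
static bool fuse_version_allows_userns(unsigned major, unsigned minor) {
        if (major != 7)
                return major > 7;

        return minor >= 27;
}
```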
Do note that our get_fuse_version() function uses the fsopen() family of
syscalls, which were not added until Linux kernel v5.2 (2019-07-07); so if
nothing has been backported, then the minimum kernel version for FUSE-in-nspawn
is actually v5.2, not v4.18.
Pass whether or not to enable FUSE to copy_devnodes(); have copy_devnodes()
copy in /dev/fuse if enabled.
Pass whether or not to enable FUSE back over fd_outer_socket to run_container()
so that it can pass that to append_machine_properties() (via either
register_machine() or allocate_scope()); have append_machine_properties()
append "DeviceAllow=/dev/fuse rw" if enabled.
For testing, simply check that /dev/fuse can be opened for reading and writing,
but that actually reading from it fails with EPERM. The test assumes that if
FUSE is supported (/dev/fuse exists), then the testsuite is running on a kernel
with FUSE >= 7.27; I am unsure how to go about writing a test that validates
that the version check disables FUSE on old kernels.
[1]: https://github.com/systemd/systemd/issues/17607#issuecomment-745418835
Closes #17607
|
| | |
|
| | |
|