linux/fs
Christian Brauner 7a54947e72
Merge patch series "fs: allow changing idmappings"
Christian Brauner <brauner@kernel.org> says:

Currently, it isn't possible to change the idmapping of an idmapped
mount. This is becoming an obstacle for various use-cases.

  /* idmapped home directories with systemd-homed */

  On newer systems /home can be an idmapped mount such that each file
  on disk is owned by 65536 and a subfolder exists for foreign id ranges
  such as containers. For example, a home directory might look like this
  (using an arbitrary folder as an example):

  user1@localhost:~/data/mount-idmapped$ ls -al /data/
  total 16
  drwxrwxrwx 1      65536      65536  36 Jan 27 12:15 .
  drwxrwxr-x 1      root       root  184 Jan 27 12:06 ..
  -rw-r--r-- 1      65536      65536   0 Jan 27 12:07 aaa
  -rw-r--r-- 1      65536      65536   0 Jan 27 12:07 bbb
  -rw-r--r-- 1      65536      65536   0 Jan 27 12:07 cc
  drwxr-xr-x 1 2147352576 2147352576   0 Jan 27 19:06 containers

  On login, /home is mounted as an idmapped mount with the following
  idmappings:

  65536:$(id -u):1            // uid mapping
  65536:$(id -g):1            // gid mapping
  2147352576:2147352576:65536 // uid mapping
  2147352576:2147352576:65536 // gid mapping

  So for a user with uid/gid 1000 an idmapped /home would look like this:

  user1@localhost:~/data/mount-idmapped$ ls -aln /mnt/
  total 16
  drwxrwxrwx 1       1000       1000  36 Jan 27 12:15 .
  drwxrwxr-x 1          0          0 184 Jan 27 12:06 ..
  -rw-r--r-- 1       1000       1000   0 Jan 27 12:07 aaa
  -rw-r--r-- 1       1000       1000   0 Jan 27 12:07 bbb
  -rw-r--r-- 1       1000       1000   0 Jan 27 12:07 cc
  drwxr-xr-x 1 2147352576 2147352576   0 Jan 27 19:06 containers

  In other words, 65536 is mapped to the user's uid/gid and the range
  2147352576 up to 2147352576 + 65536 is an identity mapping for
  containers.
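
  The translation described above can be sketched in plain C. This is a
  userspace illustration only, not the kernel's implementation: the
  struct, the function name and the ID_UNMAPPED marker are hypothetical,
  and the extents assume a user with uid 1000 as in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no extent covers this id". */
#define ID_UNMAPPED UINT32_MAX

/*
 * One idmapping extent: ids [disk_first, disk_first + count) on disk
 * appear as [mnt_first, mnt_first + count) through the mount.
 */
struct idmap_extent {
	uint32_t disk_first;
	uint32_t mnt_first;
	uint32_t count;
};

/* The uid mapping of the example /home for a user with uid 1000. */
const struct idmap_extent home_uid_map[] = {
	{ 65536u, 1000u, 1u },                /* 65536:$(id -u):1 */
	{ 2147352576u, 2147352576u, 65536u }, /* identity range for containers */
};

/* Translate an on-disk id into the id visible through the mount. */
uint32_t map_disk_to_mount(const struct idmap_extent *ext, size_t n,
			   uint32_t disk_id)
{
	assert(ext != NULL);

	for (size_t i = 0; i < n; i++) {
		if (disk_id >= ext[i].disk_first &&
		    disk_id - ext[i].disk_first < ext[i].count)
			return ext[i].mnt_first + (disk_id - ext[i].disk_first);
	}
	return ID_UNMAPPED;
}
```

  With these two extents, on-disk id 65536 is reported as 1000, ids in
  the container range are reported unchanged, and every other id has no
  mapping.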

  When a container is started a transient uid/gid range is allocated
  outside of both mappings of the idmapped mount. For example, the
  container might get the idmapping:

  $ cat /proc/1742611/uid_map
           0  537985024      65536

  This container will be allowed to write to disk within the allocated
  foreign id range 2147352576 to 2147352576 + 65536. To do this an
  idmapped mount must be created from an already idmapped mount such that:

  - The mappings for the user's uid/gid must be dropped, i.e., the
    following mappings are removed:

    65536:$(id -u):1            // uid mapping
    65536:$(id -g):1            // gid mapping

  - A mapping for the transient uid/gid range to the foreign uid/gid range
    is added:

    2147352576:537985024:65536

  In combination this will mean that the container will write to disk
  within the foreign id range 2147352576 to 2147352576 + 65536.

  /* nested containers */

  When the outer container makes use of idmapped mounts it isn't possible
  to create an idmapped mount for the inner container with a different
  idmapping from the outer container's idmapped mount.

There are other use-cases and the two above just serve as an illustration
of the problem.

This patchset makes it possible to create a new idmapped mount from an
already idmapped mount. It aims to adhere to current performance
constraints and requirements:

- Idmapped mounts aim to have near zero performance implications for
  path lookup. That is why no reference counting, locking or any other
  mechanism can be required that would impact performance.

  This works by ensuring that a regular mount transitions to an idmapped
  mount once going from a static nop_mnt_idmap mapping to a non-static
  idmapping.

- The idmapping of a mount can't change anymore for the lifetime of the
  mount afterwards. This not only avoids UAF issues, it also avoids
  pitfalls such as generating non-matching uid/gid values.
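
The "set once, never change" property from the two points above can be
modelled in userspace with C11 atomics. This is only an analogy under
assumed names: every identifier below is hypothetical, and the kernel
does not necessarily use a compare-and-exchange for this transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an idmapping. */
struct idmap_sketch {
	int id;
};

/* Static placeholder that every non-idmapped mount points to. */
struct idmap_sketch nop_idmap_sketch;

struct mount_sketch {
	_Atomic(struct idmap_sketch *) idmap;
};

void mount_sketch_init(struct mount_sketch *m)
{
	atomic_init(&m->idmap, &nop_idmap_sketch);
}

/*
 * Attach @new_idmap only if the mount still points to the static
 * placeholder. Once this succeeds the pointer never changes again, so
 * readers need neither a lock nor a reference count.
 */
bool mount_sketch_set_idmap_once(struct mount_sketch *m,
				 struct idmap_sketch *new_idmap)
{
	struct idmap_sketch *expected = &nop_idmap_sketch;

	assert(new_idmap != NULL && new_idmap != &nop_idmap_sketch);
	return atomic_compare_exchange_strong(&m->idmap, &expected, new_idmap);
}

struct idmap_sketch *mount_sketch_idmap(struct mount_sketch *m)
{
	return atomic_load(&m->idmap);
}
```

A second attempt to attach an idmapping fails and leaves the first one
in place, which is exactly the restriction this patchset has to work
around without giving up the lockless read side.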

Changing idmappings could be solved by:

- Idmappings could simply be reference counted (above the simple
  reference count when sharing them across multiple mounts).

  This would require pairing mnt_idmap_get() with mnt_idmap_put() which
  would end up being sprinkled everywhere into the VFS and some
  filesystems that access idmappings directly.

  It wouldn't just be quite ugly and introduce new complexity, it would
  also have a noticeable performance impact.

- Idmappings could gain RCU protection. This would help the LOOKUP_RCU
  case and avoid taking reference counts under RCU.

  When not under LOOKUP_RCU reference counts need to be acquired on each
  idmapping. This would require pairing mnt_idmap_get() with
  mnt_idmap_put() which would end up being sprinkled everywhere into the
  VFS and some filesystems that access idmappings directly.

  This would have the same downsides as mentioned earlier.

- The earlier solutions work by updating the mnt->mnt_idmap pointer with
  the new idmapping. Instead of this it would be possible to change the
  idmapping itself to avoid UAF issues.

  To do this a sequence counter would have to be added to struct mount.
  When retrieving the idmapping to generate uid/gid values the sequence
  counter would need to be sampled and the generation of the uid/gid
  would spin until the update of the idmap is finished.

  This has problems as well but the biggest issue is that it can lead
  to inconsistent permission checking and inconsistent uid/gid pairs,
  to a greater degree than is already possible today. Specifically,
  during creation it could happen that:

  idmap = mnt_idmap(mnt);
  inode_permission(idmap, ...);
  may_create(idmap);
  // create file with uid/gid based on @idmap

  in between the permission checking and the generation of the uid/gid
  value the idmapping could change leading to the permission checking
  and uid/gid value that is actually used to create a file on disk being
  out of sync.

  Similarly if two values are generated like:

  idmap = mnt_idmap(mnt);
  vfsgid = make_vfsgid(idmap);
  // idmapping gets updated concurrently
  vfsuid = make_vfsuid(idmap);

  @vfsgid and @vfsuid could be out of sync if the idmapping was changed
  in between. The generation of vfsgid/vfsuid could span many lines of
  code so to guard against this a sequence count would have to be
  passed around.

  The performance impact of this solution is less clear but very likely
  not zero.

- Using SRCU, which can sleep, similar to fanotify. I find that not just
  ugly, it would also have memory consumption implications.
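
For illustration, the sequence-counter variant described above could
look roughly like this in userspace C11. It is a sketch of the rejected
approach, not of anything in this series: all names are hypothetical, a
single writer is assumed, and sequentially consistent atomics are used
for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* An idmapping that is updated in place, guarded by a sequence counter. */
struct seq_idmap {
	atomic_uint seq;      /* odd while an update is in progress */
	_Atomic uint32_t uid; /* uid the mount currently maps to */
	_Atomic uint32_t gid; /* gid the mount currently maps to */
};

struct id_pair {
	uint32_t uid;
	uint32_t gid;
};

void seq_idmap_init(struct seq_idmap *m, uint32_t uid, uint32_t gid)
{
	atomic_init(&m->seq, 0u);
	atomic_init(&m->uid, uid);
	atomic_init(&m->gid, gid);
}

/* Single writer: make the counter odd, update, make it even again. */
void seq_idmap_update(struct seq_idmap *m, uint32_t uid, uint32_t gid)
{
	assert((atomic_load(&m->seq) & 1u) == 0u);

	atomic_fetch_add(&m->seq, 1u);
	atomic_store(&m->uid, uid);
	atomic_store(&m->gid, gid);
	atomic_fetch_add(&m->seq, 1u);
}

/* Reader: retry until uid and gid were read within one stable sequence. */
struct id_pair seq_idmap_read(struct seq_idmap *m)
{
	struct id_pair pair;
	unsigned int start;

	do {
		start = atomic_load(&m->seq);
		pair.uid = atomic_load(&m->uid);
		pair.gid = atomic_load(&m->gid);
	} while ((start & 1u) || start != atomic_load(&m->seq));

	return pair;
}
```

A single call returns a consistent pair, but nothing ties that pair to
an earlier permission check. Closing that gap is what would force the
sampled sequence count to be passed around, as noted above.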

/* solution */

So, to avoid all of these pitfalls creating an idmapped mount from an
already idmapped mount will be done atomically, i.e., a new detached
mount is created and a new set of mount properties applied to it without
it ever having been exposed to userspace at all.

This can be done in two ways. One option is to add a new open_tree()
flag, OPEN_TREE_CLEAR_IDMAP, that clears the old idmapping and returns a
mount that isn't idmapped. It is then possible to set mount attributes
on it again, including the creation of an idmapped mount.

This has the consequence that a file descriptor must exist in userspace
that doesn't have any idmapping applied and it will thus never work in
unprivileged scenarios, as a container would be able to remove the
idmapping of the mount it has been given. That should be avoided.

Instead, we add open_tree_attr() which works just like open_tree() but
takes an optional struct mount_attr parameter. This is useful beyond
idmappings as it fills a gap where a mount never exists in userspace
without the necessary mount properties applied.

This is particularly useful for mount options such as
MOUNT_ATTR_{RDONLY,NOSUID,NODEV,NOEXEC}.

To create a new idmapped mount the following works:

// Create a first idmapped mount
struct mount_attr attr = {
        .attr_set = MOUNT_ATTR_IDMAP,
        .userns_fd = fd_userns,
};

fd_tree = open_tree_attr(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr));
move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

// Create a second idmapped mount from the first idmapped mount
attr.attr_set = MOUNT_ATTR_IDMAP;
attr.userns_fd = fd_userns2;
fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr));

// Create a second non-idmapped mount from the first idmapped mount:
memset(&attr, 0, sizeof(attr));
attr.attr_clr = MOUNT_ATTR_IDMAP;
fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr));

* patches from https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org:
  fs: allow changing idmappings
  fs: add kflags member to struct mount_kattr
  fs: add open_tree_attr()
  fs: add copy_mount_setattr() helper
  fs: add vfs_open_tree() helper

Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12 12:12:34 +01:00
..
9p Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
adfs Merge patch series "adfs, affs, befs, hfs, hfsplus: convert to new mount api" 2024-10-08 14:41:53 +02:00
affs Merge patch series "adfs, affs, befs, hfs, hfsplus: convert to new mount api" 2024-10-08 14:41:53 +02:00
afs Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
autofs fs: support O_PATH fds with FSCONFIG_SET_FD 2025-02-12 10:02:10 +01:00
bcachefs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
befs befs: convert befs to use the new mount api 2024-09-18 11:44:43 +02:00
bfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
btrfs The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
cachefiles treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
ceph A fix for a memory leak from Antoine (marked for stable) and two 2025-01-31 10:30:34 -08:00
coda Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
configfs configfs: improve item creation performance 2024-11-14 07:45:20 +01:00
cramfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
crypto fscrypt_d_revalidate(): use stable parent inode passed by caller 2025-01-27 19:25:23 -05:00
debugfs debugfs: Fix the missing initializations in __debugfs_file_get() 2025-01-30 08:22:31 +01:00
devpts treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
dlm dlm: return -ENOENT if no comm was found 2024-12-19 13:11:24 -06:00
ecryptfs Pass parent directory inode and expected name to ->d_revalidate() 2025-01-27 19:25:23 -05:00
efivarfs efivarfs: add variable resync after hibernation 2025-01-22 13:28:17 +01:00
efs efs: fix the efs new mount api implementation 2024-10-15 15:58:36 +02:00
erofs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
exfat Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
exportfs fs: prepare for "explicit connectable" file handles 2024-11-15 11:34:57 +01:00
ext2 vfs-6.12.file 2024-09-16 09:14:02 +02:00
ext4 Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
f2fs f2fs-for-6.14-rc1 2025-01-27 20:58:58 -08:00
fat vfat_revalidate{,_ci}(): use stable parent inode passed by caller 2025-01-27 19:25:24 -05:00
freevxfs freevxfs: Replace one-element array with flexible array member 2024-11-06 10:42:06 +01:00
fuse Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
gfs2 Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
hfs Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
hfsplus vfs-6.13.misc 2024-11-18 09:35:30 -08:00
hostfs hostfs __dentry_name() fix 2025-01-31 09:33:54 -08:00
hpfs hpfs: convert hpfs to use the new mount api 2024-10-08 14:41:53 +02:00
hugetlbfs mm/hugetlb: rename avoid_reserve to cow_from_owner 2025-01-25 20:22:30 -08:00
iomap The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
isofs isofs: Partially convert zisofs_read_folio to use a folio 2024-12-05 13:52:37 +01:00
jbd2 CRC updates for 6.14 2025-01-22 19:55:08 -08:00
jffs2 jffs2: Fix rtime decompressor 2024-12-05 12:31:40 +01:00
jfs Pass parent directory inode and expected name to ->d_revalidate() 2025-01-27 19:25:23 -05:00
kernfs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
lockd treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
minix buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
netfs vfs-6.14-rc1.netfs 2025-01-20 09:29:11 -08:00
nfs Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
nfs_common nfs: fix incorrect error handling in LOCALIO 2025-01-21 11:34:43 -05:00
nfsd NFS Client Updates for Linux 6.14 2025-01-28 14:23:46 -08:00
nilfs2 nilfs2: fix possible int overflows in nilfs_fiemap() 2025-02-01 03:53:26 -08:00
nls move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
notify fanotify: notify on mount attach and detach 2025-02-05 17:21:07 +01:00
ntfs3 fs/ntfs3: Unify inode corruption marking with _ntfs_bad_inode() 2024-12-30 11:37:40 +03:00
ocfs2 ocfs2: fix incorrect CPU endianness conversion causing mount failure 2025-02-01 03:53:24 -08:00
omfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
openpromfs
orangefs orangefs: fix a oob in orangefs_debug_write 2025-01-31 09:40:31 -08:00
overlayfs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
proc Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
pstore pstore updates for v6.14-rc1 2025-01-20 13:37:14 -08:00
qnx4
qnx6 fs/qnx6: Fix building with GCC 15 2024-12-03 10:40:36 +01:00
quota treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
ramfs
romfs romfs: fix romfs_read_folio() 2024-08-21 22:32:58 +02:00
smb twenty one cifs/smb3 client fixes, many for special file type handling 2025-02-01 11:30:41 -08:00
squashfs squashfs: convert squashfs_fill_page() to take a folio 2025-01-24 22:47:22 -08:00
sysfs sysfs: constify bin_attribute argument of sysfs_bin_attr_simple_read() 2025-01-09 10:43:58 +01:00
sysv buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
tests execve: Move KUnit tests to tests/ subdirectory 2024-07-22 18:25:47 -07:00
tracefs Pass parent directory inode and expected name to ->d_revalidate() 2025-01-27 19:25:23 -05:00
ubifs ubifs: skip dumping tnc tree when zroot is null 2025-01-18 15:31:35 +01:00
udf udf: Verify inode link counts before performing rename 2024-11-26 22:54:24 +01:00
ufs ufs: ufs_sb_private_info: remove unused s_{2,3}apb fields 2024-11-12 19:02:12 -05:00
unicode Revert "unicode: Don't special case ignorable code points" 2024-12-11 14:11:23 -08:00
vboxsf Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
verity treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
xfs treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
zonefs zonefs fixes for 6.12-rc2 2024-10-02 12:02:15 -07:00
Kconfig reiserfs: The last commit 2024-10-21 16:29:38 +02:00
Kconfig.binfmt
Makefile reiserfs: The last commit 2024-10-21 16:29:38 +02:00
aio.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
anon_inodes.c add a string-to-qstr constructor 2025-01-27 19:25:45 -05:00
attr.c fs: handle delegated timestamps in setattr_copy_mgtime 2024-10-10 10:20:51 +02:00
backing-file.c tree-wide: s/revert_creds_light()/revert_creds()/g 2024-12-02 11:25:09 +01:00
bad_inode.c
binfmt_elf.c fs: don't block write during exec on pre-content watched files 2024-12-11 17:45:18 +01:00
binfmt_elf_fdpic.c fs: don't block write during exec on pre-content watched files 2024-12-11 17:45:18 +01:00
binfmt_flat.c binfmt_flat: Fix integer overflow bug on 32 bit systems 2025-01-10 08:49:05 -08:00
binfmt_misc.c execve updates for v6.14-rc1 2025-01-20 13:27:58 -08:00
binfmt_script.c
bpf_fs_kfuncs.c bpf: Add kfunc bpf_get_dentry_xattr() to read xattr from dentry 2024-08-07 11:26:54 -07:00
buffer.c - The series "zram: optimal post-processing target selection" from 2024-11-23 09:58:07 -08:00
char_dev.c fs: Reorganize kerneldoc parameter names 2024-10-22 11:16:57 +02:00
compat_binfmt_elf.c binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 2024-10-17 18:38:49 +01:00
coredump.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
d_path.c
dax.c fsdax: dax_unshare_iter needs to copy entire blocks 2024-10-07 13:51:47 +02:00
dcache.c Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
direct-io.c fs/direct-io: Remove linux/prefetch.h include 2024-08-19 13:45:02 +02:00
drop_caches.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
eventfd.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
eventpoll.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
exec.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
fcntl.c fs: get rid of __FMODE_NONOTIFY kludge 2024-12-09 11:34:29 +01:00
fhandle.c exportfs: add permission method 2024-12-17 09:16:11 +01:00
file.c vfs-6.14-rc1.misc 2025-01-20 09:40:49 -08:00
file_table.c assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
filesystems.c
fs-writeback.c Merge patch series "two little writeback cleanups v2" 2024-11-13 14:08:34 +01:00
fs_context.c fs: fc_log replace magic number 7 with ARRAY_SIZE() 2024-12-22 11:29:52 +01:00
fs_parser.c bcachefs: add support for true/false & yes/no in bool-type options 2024-12-21 01:36:17 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c fs: support O_PATH fds with FSCONFIG_SET_FD 2025-02-12 10:02:10 +01:00
init.c
inode.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
internal.h statmount: allow to retrieve idmappings 2025-02-12 12:12:27 +01:00
ioctl.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
kernel_read_file.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
libfs.c Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
locks.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
mbcache.c
mnt_idmapping.c statmount: allow to retrieve idmappings 2025-02-12 12:12:27 +01:00
mount.h vfs: add notifications for mount attach and detach 2025-02-05 17:21:11 +01:00
mpage.c fs/writeback: convert wbc_account_cgroup_owner to take a folio 2024-10-28 13:26:54 +01:00
namei.c Provide stable parent and name to ->d_revalidate() instances 2025-01-30 09:13:35 -08:00
namespace.c Merge patch series "fs: allow changing idmappings" 2025-02-12 12:12:34 +01:00
nsfs.c fs: lockless mntns lookup for nsfs 2025-01-09 16:58:52 +01:00
open.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
pidfs.c pidfs: allow bind-mounts 2024-12-22 11:03:10 +01:00
pipe.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
pnode.c vfs: add notifications for mount attach and detach 2025-02-05 17:21:11 +01:00
pnode.h
posix_acl.c acl: Annotate struct posix_acl with __counted_by() 2024-10-22 11:16:59 +02:00
proc_namespace.c
read_write.c the bulk of struct fd memory safety stuff 2024-11-18 12:24:06 -08:00
readdir.c introduce "fd_pos" class, convert fdget_pos() users to it. 2024-11-03 01:28:06 -05:00
remap_range.c convert vfs_dedupe_file_range(). 2024-11-03 01:28:07 -05:00
select.c select: Fix unbalanced user_access_end() 2025-01-13 16:24:16 +01:00
seq_file.c fs: Reorganize kerneldoc parameter names 2024-10-22 11:16:57 +02:00
signalfd.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
splice.c mm: alloc_pages_bulk: rename API 2025-01-25 20:22:31 -08:00
stack.c
stat.c fs: add STATX_DIO_READ_ALIGN 2025-01-09 16:23:17 +01:00
statfs.c fdget_raw() users: switch to CLASS(fd_raw) 2024-11-03 01:28:06 -05:00
super.c lib/list_debug.c: add object information in case of invalid object 2025-01-25 20:22:23 -08:00
sync.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
sysctls.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
timerfd.c A rather large update for timekeeping and timers: 2024-11-19 16:35:06 -08:00
userfaultfd.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
utimes.c fdget(), more trivial conversions 2024-11-03 01:28:06 -05:00
xattr.c xattr: remove redundant check on variable err 2024-11-06 13:00:01 -05:00