linux/Documentation/filesystems
David Hildenbrand 9dc21bbd62 prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.

This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system.  This has
been extensively discussed on the mailing list and is summarized very well
by David in the first patch, which also includes links to the
alternatives.  Please refer to the first patch's commit message for the
motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.

Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).

Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.

Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling them when advised
(PR_THP_DISABLE_EXCEPT_ADVISED).


This patch (of 7):

People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "never" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "never" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

One can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs.  It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant, as people started enabling THPs, we added
PR_SET_THP_DISABLE, so relevant workloads that were not ready yet (e.g.,
Redis) were able to just disable THPs completely.  Redis still implements
the option to use this interface to disable THPs completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.  That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising.  Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE.  Apparently, that
poses a problem for relevant workloads, because "no THPs" is certainly
worse than "THPs only when advised".
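
As a minimal sketch of what "explicitly advised" means from user space,
assuming a helper name (alloc_thp_advised) that is purely illustrative and
not part of this patch: a process maps a region and calls
madvise(MADV_HUGEPAGE) on it, which is what flags the region as
VM_HUGEPAGE.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch): map `len` bytes of anonymous
 * memory and advise the kernel that THPs are wanted for it. The advice is
 * what marks the mapping as VM_HUGEPAGE.
 * Returns NULL if the mapping itself fails.
 */
void *alloc_thp_advised(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/*
	 * Advice only: if madvise() fails (for example on a kernel built
	 * without THP support), the mapping stays usable with base pages.
	 */
	(void)madvise(p, len, MADV_HUGEPAGE);
	return p;
}
```

With THPs disabled completely for the process, such a region would not
get THPs despite the advice; that is the step backwards described above.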

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"?  *maybe*, but this
would change the documented semantics quite a bit, and limit the
versatility to use it for debugging purposes, so I am not 100% sure that is
what we want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *fewer* THPs for a process.
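
To make the intended outcome concrete, here is a deliberately simplified
user-space model of the decision, not the kernel's code.  All enum and
function names are hypothetical, and the model ignores per-size mTHP
settings, khugepaged, shmem and MADV_COLLAPSE; it only covers how the
system policy, the per-process setting and the per-region advice combine.

```c
#include <stdbool.h>

/* System-wide policy: "never", "madvise" or "always". */
enum thp_sys_policy { THP_SYS_NEVER, THP_SYS_MADVISE, THP_SYS_ALWAYS };

/* Hypothetical names for the per-process state set via PR_SET_THP_DISABLE. */
enum thp_proc_mode {
	THP_PROC_DEFAULT,		/* prctl not used */
	THP_PROC_EXCEPT_ADVISED,	/* disabled except for VM_HUGEPAGE regions */
	THP_PROC_COMPLETELY		/* disabled completely (old behaviour) */
};

/* Per-region advice: VM_NOHUGEPAGE, no advice, or VM_HUGEPAGE. */
enum thp_advice { THP_ADVICE_NOHUGEPAGE, THP_ADVICE_NONE, THP_ADVICE_HUGEPAGE };

/* Simplified model: may a region get THPs under the given settings? */
bool thp_allowed(enum thp_sys_policy sys, enum thp_proc_mode proc,
		 enum thp_advice advice)
{
	/* An explicit opt-out for the region always wins. */
	if (advice == THP_ADVICE_NOHUGEPAGE)
		return false;

	/* The per-process setting can only restrict, never widen. */
	if (proc == THP_PROC_COMPLETELY)
		return false;
	if (proc == THP_PROC_EXCEPT_ADVISED && advice != THP_ADVICE_HUGEPAGE)
		return false;

	/* What remains is decided by the system policy. */
	switch (sys) {
	case THP_SYS_ALWAYS:
		return true;
	case THP_SYS_MADVISE:
		return advice == THP_ADVICE_HUGEPAGE;
	default:
		return false;
	}
}
```

In this model, a process in THP_PROC_EXCEPT_ADVISED mode on an "always"
system behaves like a process on a "madvise" system, which is scenario (3).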

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to obtain whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.
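
A user-space sketch of (A) and (B), under stated assumptions: the helper
names are hypothetical, and the fallback value for
PR_THP_DISABLE_EXCEPT_ADVISED is inferred from the return value of 3
described above (1 | 2), for building against uapi headers that predate
this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/prctl.h>

/* Assumed value (bit 1), inferred from PR_GET_THP_DISABLE returning 3. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/*
 * (A) Disable THPs for the calling process, except where advised.
 * Returns 0 on success or a negative errno value. Kernels without this
 * patch reject a non-zero arg3 with EINVAL.
 */
int thp_opt_out_except_advised(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0))
		return -errno;
	return 0;
}

/* (B) Query the current state: 0, 1 or 3 per the description above. */
int thp_query_state(void)
{
	return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}

/* Decode a PR_GET_THP_DISABLE return value. */
bool thp_state_completely(int state)
{
	return state == 1;
}

bool thp_state_except_advised(int state)
{
	return state > 0 && (state & PR_THP_DISABLE_EXCEPT_ADVISED);
}
```

A caller would typically try thp_opt_out_except_advised() first and, on
-EINVAL, decide whether falling back to the old "disable completely" mode
is acceptable for the workload.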

The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained.  This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).
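
Because the setting survives fork() and execve(), a launcher can apply it
before starting the actual workload.  The following is a rough sketch of
such a launcher helper; the mode names ("default", "completely",
"except-advised") and function names are made up for illustration, and the
fallback define assumes the flag is bit 1.

```c
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Assumed value (bit 1) for uapi headers that predate this patch. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/*
 * Translate a hypothetical mode name into arg2/arg3 for
 * prctl(PR_SET_THP_DISABLE). Returns 0 on success, -1 on unknown mode.
 */
int thp_mode_to_args(const char *mode, unsigned long *arg2,
		     unsigned long *arg3)
{
	if (strcmp(mode, "default") == 0) {
		*arg2 = 0;
		*arg3 = 0;
	} else if (strcmp(mode, "completely") == 0) {
		*arg2 = 1;
		*arg3 = 0;
	} else if (strcmp(mode, "except-advised") == 0) {
		*arg2 = 1;
		*arg3 = PR_THP_DISABLE_EXCEPT_ADVISED;
	} else {
		return -1;
	}
	return 0;
}

/*
 * Apply the mode to the calling process, then replace it with argv[0].
 * The new program inherits the setting. Only returns on failure.
 */
int thp_exec_with_mode(const char *mode, char *const argv[])
{
	unsigned long arg2, arg3;

	if (thp_mode_to_args(mode, &arg2, &arg3))
		return -1;
	if (prctl(PR_SET_THP_DISABLE, arg2, arg3, 0, 0)) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return -1;
	}
	execvp(argv[0], argv);
	perror("execvp");
	return -1;
}
```

A service manager would call something like thp_exec_with_mode() in the
forked helper process, so that the launcher itself keeps its own setting.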

For now, MADV_COLLAPSE will *fail* in regions that have neither
VM_HUGEPAGE nor VM_NOHUGEPAGE set.  As MADV_COLLAPSE is clear advice that
user space thinks a THP is a good idea, we'll enable that separately next
(requiring a bit of cleanup first).

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THPs.  There are no known users
that re-enable them, and doing so is against the purpose of the original
interface.  So if ever required, we could investigate just forbidding
re-enabling them, or making this somehow configurable.

Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:55:05 -07:00
bcachefs docs: bcachefs: add casefolding reference 2025-05-21 20:14:39 -04:00
caching doc: correcting the debug path for cachefiles 2024-10-24 13:50:27 +02:00
ext4 Documentation: ext4: Move inode table short docs into its own file 2025-07-02 16:57:20 -06:00
iomap iomap: add read_folio_range() handler for buffered writes 2025-07-14 10:51:33 +02:00
nfs nfsd: disallow file locking and delegations for NFSv4 reexport 2025-03-10 09:11:08 -04:00
smb cifs: add documentation for smbdirect setup 2025-06-05 10:20:48 -05:00
spufs Documentation: spufs: correct a duplicate word typo 2022-09-27 13:21:44 -06:00
xfs Documentation: Remove repeated word in docs 2025-02-10 10:54:50 -07:00
9p.rst 9p update for 6.15-rc1 2025-04-03 15:35:46 -07:00
adfs.rst
affs.rst
afs.rst afs: Documentation: correct reference to CONFIG_AFS_FS 2023-07-21 13:46:02 -06:00
api-summary.rst doc: split buffer.rst out of api-summary.rst 2024-05-05 17:53:40 -07:00
autofs-mount-control.rst autofs: use flexible array in ioctl structure 2023-05-30 16:42:00 -07:00
autofs.rst Documentation: filesystems: update filename extensions 2024-11-22 10:31:04 -07:00
automount-support.rst
befs.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
bfs.rst
btrfs.rst MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org 2023-09-08 14:21:27 +02:00
buffer.rst doc: split buffer.rst out of api-summary.rst 2024-05-05 17:53:40 -07:00
ceph.rst doc: ceph: update userspace command to get CephFS metadata 2024-05-23 10:35:47 +02:00
coda.rst documentation/filesystems: fix spelling mistakes 2025-02-10 10:42:28 -07:00
configfs.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
cramfs.rst
dax.rst doc: Remove misleading reference to brd in dax.rst 2025-06-25 12:49:29 -06:00
debugfs.rst docs: debugfs: do not recommend debugfs_remove_recursive 2025-04-30 19:11:04 +02:00
devpts.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
directory-locking.rst Docs: typos/spelling 2024-05-02 10:02:29 -06:00
dlmfs.rst Documentation: filesystems: update filename extensions 2024-11-22 10:31:04 -07:00
dnotify.rst
ecryptfs.rst
efivarfs.rst Documentation: Mark the 'efivars' sysfs interface as removed 2024-04-13 10:33:02 +02:00
erofs.rst erofs: add 'fsoffset' mount option to specify filesystem offset 2025-05-22 11:57:57 +08:00
ext2.rst ext2: remove nobh support 2022-08-02 12:34:04 -04:00
ext3.rst
f2fs.rst f2fs-for-6.17-rc1 2025-08-04 16:27:21 -07:00
fiemap.rst fiemap: use kernel-doc includes in fiemap docbook 2024-12-22 11:29:50 +01:00
files.rst docs: filesystems: fix typo in docs 2024-02-09 10:37:20 +01:00
fscrypt.rst fscrypt: Don't use problematic non-inline crypto engines 2025-07-04 10:25:26 -07:00
fsverity.rst fsverity: Switch from crypto_shash to SHA-2 library 2025-07-14 11:29:32 -07:00
fuse-io-uring.rst fuse: Add fuse-io-uring design documentation 2025-01-24 11:53:56 +01:00
fuse-io.rst docs/fuse-io: Document the usage of DIRECT_IO_ALLOW_MMAP 2023-12-04 10:16:53 +01:00
fuse-passthrough.rst docs: filesystems: add fuse-passthrough.rst 2025-05-12 10:02:08 +02:00
fuse.rst fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other 2022-07-21 16:06:19 +02:00
gfs2-glocks.rst gfs2: Get rid of demote_ok checks 2024-05-29 15:34:55 +02:00
gfs2-uevents.rst
gfs2.rst
hfs.rst
hfsplus.rst
hpfs.rst
idmappings.rst doc: correcting two prefix errors in idmappings.rst 2025-03-05 11:54:18 +01:00
index.rst fuse update for 6.16 2025-06-02 15:31:05 -07:00
inotify.rst
isofs.rst
journalling.rst jbd2: remove unused transaction->t_private_list 2025-02-10 07:48:24 -05:00
locking.rst vfs-6.17-rc1.fileattr 2025-07-28 15:24:14 -07:00
locks.rst docs: fs: locks.rst: update comment about mandatory file locking 2021-10-19 06:48:21 -04:00
mount_api.rst fs/fs_parse: Remove unused and problematic validate_constant_table() 2025-04-21 10:27:59 +02:00
multigrain-ts.rst Documentation: add a new file documenting multigrain timestamps 2024-10-10 10:20:52 +02:00
netfs_library.rst fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ 2025-05-21 14:34:37 +02:00
nilfs2.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
ntfs3.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
ocfs2-online-filecheck.rst
ocfs2.rst docs: update ocfs2-devel mailing list address 2023-07-08 09:29:29 -07:00
omfs.rst
orangefs.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
overlayfs.rst overlayfs.rst: fix typos 2025-07-15 13:53:46 -06:00
path-lookup.rst Documentation: filesystems: update filename extensions 2024-11-22 10:31:04 -07:00
path-lookup.txt Documentation: filesystems: update filename extensions 2024-11-22 10:31:04 -07:00
porting.rst vfs-6.17-rc1.mmap_prepare 2025-07-28 13:43:25 -07:00
proc.rst prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE 2025-09-13 16:55:05 -07:00
propagate_umount.txt mount: separate the flags accessed only under namespace_sem 2025-06-29 19:03:29 -04:00
qnx6.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
quota.rst
ramfs-rootfs-initramfs.rst Documentation: filesystems: update filename extensions 2024-11-22 10:31:04 -07:00
relay.rst - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
resctrl.rst x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl 2025-05-16 14:36:09 +02:00
romfs.rst
seq_file.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
sharedsubtree.rst Documentation/filesystems: sharedsubtree: add section headings 2023-05-16 12:50:05 -06:00
splice.rst
squashfs.rst Documentation: update the Squashfs filesystem documentation 2025-01-24 22:47:21 -08:00
sysfs.rst driver core: bus: mark the struct bus_type for sysfs callbacks as constant 2023-03-23 13:20:40 +01:00
tmpfs.rst docs: tmpfs: Add casefold options 2024-10-28 13:36:55 +01:00
ubifs-authentication.rst Documentation: treewide: Replace remaining spinics links with lore 2025-06-21 14:20:51 -06:00
ubifs.rst Documentation: ubifs: Fix compression idiom 2022-10-10 13:01:10 -06:00
udf.rst
vfat.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00
vfs.rst vfs-6.17-rc1.fileattr 2025-07-28 15:24:14 -07:00
virtiofs.rst
zonefs.rst Documentation: Fix typos 2023-08-18 11:29:03 -06:00