linux/Documentation
Dan Schatzberg 68cd9050d8 mm: add swappiness= arg to memory.reclaim
Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim.  This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim.  The memory.reclaim interface does not allow for any way to
effect the balance of file vs anon during proactive reclaim.  The only
approach is to adjust the vm.swappiness setting.  However, there are a few
reasons we look to control the balance of file vs anon during proactive
reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance.  In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs.  As these
  are typically rare events, they have relatively little impact on write
  endurance.  However, proactive reclaim runs continuously and so its
  impact on SSD write endurance is more significant.  Therefore it is
  desireable to control swap-out for proactive reclaim separately from
  reactive reclaim

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
  swap exhaustion.  This makes sense if the swap exhaustion is triggered
  due to reactive reclaim but less so if it is triggered due to proactive
  reclaim (e.g.  one could see OOMs when free memory is ample but anon is
  just particularly cold).  Therefore, it's desireable to have proactive
  reclaim reduce or stop swap-out before the threshold at which OOM
  killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
before writes to memory.reclaim[2].  This has been in production for
nearly two years and has addressed our needs to control proactive vs
reactive reclaim behavior but is still not ideal for a number of reasons:

* vm.swappiness is a global setting, adjusting it can race/interfere
  with other system administration that wishes to control vm.swappiness. 
  In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
  a misconfigured setting.  This requires some additional management to
  record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yue Zhao <findns94@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 18:05:55 -07:00
..
ABI mm/damon/sysfs-schemes: add target_nid on sysfs-schemes 2024-07-03 19:30:12 -07:00
PCI Merge branch 'pci/enumeration' 2024-05-16 18:14:10 -05:00
RCU doc: Remove references to arrayRCU.rst 2024-04-09 15:13:05 +02:00
accel
accounting
admin-guide mm: add swappiness= arg to memory.reclaim 2024-07-04 18:05:55 -07:00
arch Documentation: RISC-V: uabi: Only scalar misaligned loads are supported 2024-05-30 09:42:53 -07:00
block
bpf bpf, docs: Fix the description of 'src' in ALU instructions 2024-05-15 09:34:54 -07:00
cdrom scsi: sr: Fix unintentional arithmetic wraparound 2024-05-15 10:05:24 -04:00
core-api mm: remove page_mkclean() 2024-07-03 19:30:17 -07:00
cpu-freq
crypto
dev-tools kmsan: allow disabling KMSAN checks for the current task 2024-07-03 19:30:22 -07:00
devicetree Including fixes from can, bpf and netfilter. 2024-06-27 10:05:35 -07:00
doc-guide doc: fix spelling about ReStructured Text 2024-04-02 10:07:51 -06:00
driver-api Char/Misc and other driver subsystem changes for 6.10-rc1 2024-05-22 12:26:46 -07:00
fault-injection
fb
features
filesystems /proc/pid/smaps: add mseal info for vma 2024-06-24 20:52:09 -07:00
firmware-guide Documentation: firmware-guide: ACPI: Fix namespace typo 2024-04-26 18:58:13 +02:00
firmware_class
fpga
gpu Merge tag 'drm-misc-next-2024-04-19' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next 2024-04-22 12:29:18 +10:00
hid Merge branch 'for-6.10/intel-ish' into for-linus 2024-05-14 13:53:15 +02:00
hwmon hwmon: (emc1403) Add support for EMC1428 and EMC1438. 2024-05-12 09:02:00 -07:00
i2c docs: i2c: summary: be clearer with 'controller/target' and 'adapter/client' pairs 2024-06-22 10:11:56 +02:00
iio docs: iio: ad7944: add documentation for chain mode 2024-04-29 20:53:25 +01:00
images
infiniband
input
isdn
kbuild kbuild: doc: Update default INSTALL_MOD_DIR from extra to updates 2024-06-26 00:18:57 +09:00
kernel-hacking
leds
litmus-tests Documentation/litmus-tests: Make cmpxchg() tests safe for klitmus 2024-05-06 14:29:21 -07:00
livepatch
locking
maintainer
mhi
misc-devices
mm Docs/mm/damon/maintainer-profile: document DAMON community meetups 2024-07-03 19:30:20 -07:00
netlabel
netlink Including fixes from can, bpf and netfilter. 2024-06-27 10:05:35 -07:00
networking Revert "xsk: Document ability to redirect to any socket bound to the same umem" 2024-06-05 09:43:05 +02:00
nvdimm
nvme
pcmcia
peci
power
process docs: netdev: Fix typo in Signed-off-by tag 2024-05-27 17:15:22 -07:00
rust RISC-V Patches for the 6.10 Merge Window, Part 1 2024-05-22 09:56:00 -07:00
scheduler
scsi scsi: documentation: Clean up scsi_mid_low_api.rst 2024-04-08 22:10:05 -04:00
security Another not-too-busy cycle for documentation, including: 2024-05-13 10:51:53 -07:00
sound Documentation: sound: Fix trailing whitespaces 2024-05-16 16:00:30 +02:00
sphinx docs: kernel_include.py: Cope with docutils 0.21 2024-05-02 09:50:59 -06:00
sphinx-static
spi spi: pxa2xx: Drop the stale entry in documentation TOC 2024-05-07 23:53:21 +09:00
staging
target
tee Documentation: tee: Add TS-TEE driver 2024-04-03 14:03:09 +02:00
timers sched/isolation: Prevent boot crash when the boot CPU is nohz_full 2024-04-28 10:07:12 +02:00
tools rtla: Documentation: Fix -t, --trace 2024-05-16 16:52:16 +02:00
trace Char/Misc and other driver subsystem changes for 6.10-rc1 2024-05-22 12:26:46 -07:00
translations pci-v6.10-changes 2024-05-21 10:09:28 -07:00
usb
userspace-api mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC 2024-06-15 10:43:07 -07:00
virt Documentation: hyperv: Improve synic and interrupt handling description 2024-05-28 05:28:42 +00:00
w1
watchdog
wmi platform/x86: wmi: Add MSI WMI Platform driver 2024-04-29 12:06:21 +02:00
.gitignore
Changes
CodingStyle
Kconfig
Makefile Kbuild updates for v6.10 2024-05-18 12:39:20 -07:00
SubmittingPatches
atomic_bitops.txt
atomic_t.txt Documentation/atomic_t: Emphasize that failed atomic operations give no ordering 2024-05-06 14:29:04 -07:00
conf.py compiler_types: add Endianness-dependent __counted_by_{le,be} 2024-03-28 18:50:47 -07:00
docutils.conf
dontdiff
index.rst doc: fix spelling about ReStructured Text 2024-04-02 10:07:51 -06:00
memory-barriers.txt
subsystem-apis.rst