linux/kernel
Christian Brauner 0fb482728b
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20 15:32:43 +01:00
..
bpf Summary: 2025-01-29 10:35:40 -08:00
cgroup drm next for 6.14-rc1 2025-01-21 16:09:47 -08:00
configs configs/debug: make sure PROVE_RCU_LIST=y takes effect 2024-10-28 10:21:09 -07:00
debug kdb: Remove unused flags stack 2025-01-25 08:22:26 +00:00
dma dma-debug: fix physical address calculation for struct dma_debug_entry 2024-11-28 10:19:16 +01:00
entry sched: Add TIF_NEED_RESCHED_LAZY infrastructure 2024-11-05 12:55:37 +01:00
events kernel: be more careful about dup_mmap() failures and uprobe registering 2025-02-01 03:53:25 -08:00
futex Mainly individually changelogged singleton patches. The patch series in 2025-01-26 17:50:53 -08:00
gcov gcov: clang: use correct function param names 2025-01-24 22:47:27 -08:00
irq Updates for the interrupt subsystem: 2025-01-21 13:51:07 -08:00
kcsan kcsan: Remove redundant call of kallsyms_lookup_name() 2024-10-14 16:44:56 +02:00
livepatch livepatch: Add stack_order sysfs attribute 2024-12-09 11:44:03 +01:00
locking treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
module Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
power More power management updates for 6.14-rc1 2025-01-30 15:10:34 -08:00
printk treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
rcu The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
sched More power management updates for 6.14-rc1 2025-01-30 15:10:34 -08:00
time treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
trace treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 2024-11-14 22:43:48 -08:00
Kconfig.locks
Kconfig.preempt sched: No PREEMPT_RT=y for all{yes,mod}config 2024-11-07 15:25:05 +01:00
Makefile mm: move kernel/numa.c to mm/ 2024-09-03 21:15:26 -07:00
acct.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
async.c
audit.c audit: Initialize lsmctx to avoid memory allocation error 2025-01-29 20:02:04 -05:00
audit.h audit: change context data from secid to lsm_prop 2024-10-11 14:34:16 -04:00
audit_fsnotify.c
audit_tree.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
audit_watch.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
auditfilter.c audit: fix suffixed '/' filename matching 2024-12-05 19:22:38 -05:00
auditsc.c lsm/stable-6.14 PR 20250121 2025-01-21 20:03:04 -08:00
backtracetest.c backtracetest: add MODULE_DESCRIPTION() 2024-06-24 22:24:55 -07:00
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-04-29 08:29:29 -07:00
capability.c kernel: remove get_task_comm() and print task comm directly 2025-01-12 20:21:15 -08:00
cfi.c
compat.c
configs.c
context_tracking.c context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching 2024-08-15 21:30:43 +05:30
cpu.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
cpu_pm.c
crash_core.c kexec/crash: no crash update when kexec in progress 2024-11-05 17:12:27 -08:00
crash_reserve.c crash: fix crash memory reserve exceed system memory bug 2024-09-01 20:43:30 -07:00
cred.c cred: remove old {override,revert}_creds() helpers 2024-12-02 11:25:09 +01:00
delayacct.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c pidfs: improve multi-threaded exec and premature thread-group leader exit polling 2025-03-20 15:32:43 +01:00
exit.h
extable.c
fail_function.c
fork.c pidfs: ensure that PIDFS_INFO_EXIT is available 2025-03-19 14:40:18 +01:00
freezer.c sched/fair: Fix external p->on_rq users 2024-10-14 09:14:35 +02:00
gen_kheaders.sh Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
groups.c
hung_task.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
iomem.c
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Fix static_key_slow_dec() yet again 2024-09-10 11:57:27 +02:00
kallsyms.c kallsyms: Match symbols exactly with CONFIG_LTO_CLANG 2024-08-15 09:33:35 -07:00
kallsyms_internal.h kallsyms: get rid of code for absolute kallsyms 2024-07-20 16:33:21 +09:00
kallsyms_selftest.c kallsyms: Use kthread_run_on_cpu() 2025-01-02 22:12:12 +01:00
kallsyms_selftest.h
kcmp.c get rid of ...lookup...fdget_rcu() family 2024-10-07 13:34:41 -04:00
kcov.c kcov: mark in_softirq_really() as __always_inline 2024-12-30 17:59:08 -08:00
kexec.c crash: add a new kexec flag for hotplug support 2024-04-23 14:59:01 +10:00
kexec_core.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kexec_elf.c
kexec_file.c kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y 2024-09-01 17:59:01 -07:00
kexec_internal.h kexec: use atomic_try_cmpxchg_acquire() in kexec_trylock() 2024-09-01 20:43:23 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
ksyms_common.c
ksysfs.c kernel/ksysfs.c: simplify bin_attribute definition 2025-01-07 16:59:15 +01:00
kthread.c Mainly individually changelogged singleton patches. The patch series in 2025-01-26 17:50:53 -08:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
module_signature.c
notifier.c reboot: move reboot_notifier_list to kernel/reboot.c 2024-11-05 17:12:31 -08:00
nsproxy.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
padata.c padata: avoid UAF for reorder_work 2025-01-19 12:44:28 +08:00
panic.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
params.c module: Constify 'struct module_attribute' 2025-01-26 13:05:23 +01:00
pid.c pidfd: add PIDFD_SELF* sentinels to refer to own thread/process 2025-02-05 15:14:37 +01:00
pid_namespace.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
profile.c profiling: remove profile=sleep support 2024-08-04 13:36:28 -07:00
ptrace.c
range.c
reboot.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
regset.c regset: use kvzalloc() for regset_get_alloc() 2024-04-25 21:07:03 -07:00
relay.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
resource.c kernel/resource: simplify API __devm_release_region() implementation 2025-01-12 20:20:58 -08:00
resource_kunit.c resource, kunit: fix user-after-free in resource_test_region_intersects() 2024-10-09 12:47:19 -07:00
rseq.c rseq: Fix rseq unregistration regression 2025-01-21 08:10:51 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c
seccomp.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
signal.c pidfs: improve multi-threaded exec and premature thread-group leader exit polling 2025-03-20 15:32:43 +01:00
smp.c CSD-lock pull request for v6.14 2025-01-28 11:34:03 -08:00
smpboot.c
smpboot.h
softirq.c softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel 2024-12-02 12:01:27 +01:00
stackleak.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
stacktrace.c
static_call.c
static_call_inline.c x86/static-call: provide a way to do very early static-call updates 2024-12-13 09:28:32 +01:00
stop_machine.c stop_machine: Fix rcu_momentary_eqs() call in multi_cpu_stop() 2024-12-11 20:50:47 -08:00
sys.c tracing: Add task_prctl_unknown tracepoint 2024-12-22 20:28:11 -08:00
sys_ni.c Probes updates for v6.11: 2024-07-18 12:19:20 -07:00
sysctl-test.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
task_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
taskstats.c fdget(), more trivial conversions 2024-11-03 01:28:06 -05:00
torture.c torture: Add MODULE_DESCRIPTION() 2024-05-30 15:31:38 -07:00
tracepoint.c tracing: Fix syscall tracepoint use-after-free 2024-11-01 14:37:31 -04:00
tsacct.c tsacct: replace strncpy() with strscpy() 2024-07-12 16:39:53 -07:00
ucount.c ucounts: move kfree() out of critical zone protected by ucounts_lock 2025-01-12 20:21:00 -08:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user-return-notifier.c
user.c uidgid: make sure we fit into one cacheline 2024-09-12 12:16:09 +02:00
user_namespace.c user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation 2024-09-09 16:47:42 -07:00
usermode_driver.c
utsname.c
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
vhost_task.c vhost_task: Handle SIGKILL by flushing work and exiting 2024-05-22 08:31:15 -04:00
vmcore_info.c mm: support only one page_type per page 2024-09-03 21:15:43 -07:00
watch_queue.c watch_queue: Use page->private instead of page->index 2024-12-22 11:29:51 +01:00
watchdog.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
watchdog_buddy.c
watchdog_perf.c watchdog/perf: properly initialize the turbo mode timestamp and rearm counter 2024-07-17 21:11:34 -07:00
workqueue.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
workqueue_internal.h