Commit Graph

51037 Commits

Author SHA1 Message Date
Linus Torvalds d5273fd3ca bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnAGisACgkQ6rmadz2v
 bTqjsw/9GfHT/fdnjfA/q27TQH28ZdrZfq90BpI3m5BfTO8/l+Kt+g1HDGpku+C/
 iWh66rg9t/P9nMvtdzvPsdT833UbwbY6fPEK3r7ANgf7SBb1DNvaGHBM6XNefvZV
 j+VcykKUaEo8U1GeG+gI4TyAALSqvvMeBPYpAPZDUYguYLyE+YIl2Pl6tWt+A7yf
 9V3JjCSz63t75qqnhY2SIBZv2pqWiMaCI8uPgaF7drhQM5Xc0l/R75CMPGeF9BrT
 GRtTVJhY+6UyI2Q0ZRSRSVHZ1j2kYHI/eK3Kamxwal5hNh37BYHm3pT5TSHbZTe1
 xO7c1AB0vds8kznRkclQfsMdjVwuBQj03ukLVNqnnaaE4Ir7JlXlXYgeG0KJbbfW
 kQG8UyDD7tMWZkvaA0Z51FC88WJNLJoNAku519alcMtgAf1CrxzG9aUAYEWE4erh
 E/FKKvFqQ6T0mOFSXlk1NFeMjNXcg5Tu2KKKKOjAWT6goUc4hw80IWydTyxMy32m
 8/eLmdTZpAQovc2rS+5LSTigQ3DT082J950sxdQ3yRaLTWBGNC06gkA/WcRq2ZI+
 hBdW6GI1XFwkXGw5+F9fN9Bt5FmE42v44i+RrlNZV1R5bVr0Za/ofkWP3dm1/SOg
 QRSJk30hx9JveR9gD/xWawycYFuwmha/BL0tur2T32M67MneJpo=
 =Ye1S
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix how linked registers track zero extension of subregisters (Daniel
   Borkmann)

 - Fix unsound scalar fork for OR instructions (Daniel Wade)

 - Fix exception exit lock check for subprogs (Ihor Solodrai)

 - Fix undefined behavior in interpreter for SDIV/SMOD instructions
   (Jenny Guanni Qu)

 - Release module's BTF when module is unloaded (Kumar Kartikeya
   Dwivedi)

 - Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)

 - Reset register ID for END instructions to prevent incorrect value
   tracking (Yazhou Tang)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
  bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
  selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
  bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
  selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
  bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
  selftests/bpf: Add tests for bpf_throw lock leak from subprogs
  bpf: Fix exception exit lock checking for subprogs
  bpf: Release module BTF IDR before module unload
  selftests/bpf: Fix pkg-config call on static builds
  bpf: Fix constant blinding for PROBE_MEM32 stores
  selftests/bpf: Add test for BPF_END register ID reset
  bpf: Reset register ID for BPF_END value tracking
2026-03-22 11:16:06 -07:00
Linus Torvalds ac57fa9faf tracing fixes for 7.0:
- Revert "tracing: Remove pid in task_rename tracing output"
 
   A change was made to remove the pid field from the task_rename event
   because it was thought that it was always done for the current task and
   recording the pid would be redundant. This turned out to be incorrect and
   there are a few corner case where this is not true and caused some
   regressions in tooling.
 
 - Fix the reading from user space for migration
 
   The reading of user space uses a seq lock type of logic where it uses a
   per-cpu temporary buffer and disables migration, then enables preemption,
   does the copy from user space, disables preemption, enables migration and
   checks if there was any schedule switches while preemption was enabled. If
   there was a context switch, then it is considered that the per-cpu buffer
   could be corrupted and it tries again. There's a protection check that
   tests if it takes a hundred tries, it issues a warning and exits out to
   prevent a live lock.
 
   This was triggered because the task was selected by the load balancer to
   be migrated to another CPU, every time preemption is enabled the migration
   task would schedule in try to migrate the task but can't because migration
   is disabled and let it run again. This caused the scheduler to schedule out
   the task every time it enabled preemption and made the loop never exit
   (until the 100 iteration test triggered).
 
   Fix this by enabling and disabling preemption and keeping migration
   enabled if the reading from user space needs to be done again. This will
   let the migration thread migrate the task and the copy from user space
   will likely pass on the next iteration.
 
 - Fix trace_marker copy option freeing
 
   The "copy_trace_marker" option allows a tracing instance to get a copy of
   a write to the trace_marker file of the top level instance. This is
   managed by a link list protected by RCU. When an instance is removed, a
   check is made if the option is set, and if so synchronized_rcu() is
   called. The problem is that an iteration is made to reset all the flags to
   what they were when the instance was created (to perform clean ups) was
   done before the check of the copy_trace_marker option and that option was
   cleared, so the synchronize_rcu() was never called.
 
   Move the clearing of all the flags after the check of copy_trace_marker to
   do synchronize_rcu() so that the option is still set if it was before and
   the synchronization is performed.
 
 - Fix entries setting when validating the persistent ring buffer
 
   When validating the persistent ring buffer on boot up, the number of
   events per sub-buffer is added to the sub-buffer meta page. The validator
   was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu
   buffer) and not the "head_page" variable that was iterating the
   sub-buffers. This was causing the first sub-buffer to be assigned the
   entries for each sub-buffer and not the sub-buffer that was supposed to be
   updated.
 
 - Use "hash" value to update the direct callers
 
   When updating the ftrace direct callers, it assigned a temporary callback
   to all the callback functions of the ftrace ops and not just the
   functions represented by the passed in hash. This causes an unnecessary
   slow down of the functions of the ftrace_ops that is not being modified.
   Only update the functions that are going to be modified to call the
   ftrace loop function so that the update can be made on those functions.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCacAMahQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr0sAQCoI4L3iAR5HU1z8dw2GWhOz9fTnzfw
 9VPRZAsga9J5xgEA1Y0bvKBM0UPHFAL2POkaILYV1aT00lZ7aIVHPqfdYgA=
 =OoGW
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Revert "tracing: Remove pid in task_rename tracing output"

   A change was made to remove the pid field from the task_rename event
   because it was thought that it was always done for the current task
   and recording the pid would be redundant. This turned out to be
   incorrect and there are a few corner case where this is not true and
   caused some regressions in tooling.

 - Fix the reading from user space for migration

   The reading of user space uses a seq lock type of logic where it uses
   a per-cpu temporary buffer and disables migration, then enables
   preemption, does the copy from user space, disables preemption,
   enables migration and checks if there was any schedule switches while
   preemption was enabled. If there was a context switch, then it is
   considered that the per-cpu buffer could be corrupted and it tries
   again. There's a protection check that tests if it takes a hundred
   tries, it issues a warning and exits out to prevent a live lock.

   This was triggered because the task was selected by the load balancer
   to be migrated to another CPU, every time preemption is enabled the
   migration task would schedule in try to migrate the task but can't
   because migration is disabled and let it run again. This caused the
   scheduler to schedule out the task every time it enabled preemption
   and made the loop never exit (until the 100 iteration test
   triggered).

   Fix this by enabling and disabling preemption and keeping migration
   enabled if the reading from user space needs to be done again. This
   will let the migration thread migrate the task and the copy from user
   space will likely pass on the next iteration.

 - Fix trace_marker copy option freeing

   The "copy_trace_marker" option allows a tracing instance to get a
   copy of a write to the trace_marker file of the top level instance.
   This is managed by a link list protected by RCU. When an instance is
   removed, a check is made if the option is set, and if so
   synchronized_rcu() is called.

   The problem is that an iteration is made to reset all the flags to
   what they were when the instance was created (to perform clean ups)
   was done before the check of the copy_trace_marker option and that
   option was cleared, so the synchronize_rcu() was never called.

   Move the clearing of all the flags after the check of
   copy_trace_marker to do synchronize_rcu() so that the option is still
   set if it was before and the synchronization is performed.

 - Fix entries setting when validating the persistent ring buffer

   When validating the persistent ring buffer on boot up, the number of
   events per sub-buffer is added to the sub-buffer meta page. The
   validator was updating cpu_buffer->head_page (the first sub-buffer of
   the per-cpu buffer) and not the "head_page" variable that was
   iterating the sub-buffers. This was causing the first sub-buffer to
   be assigned the entries for each sub-buffer and not the sub-buffer
   that was supposed to be updated.

 - Use "hash" value to update the direct callers

   When updating the ftrace direct callers, it assigned a temporary
   callback to all the callback functions of the ftrace ops and not just
   the functions represented by the passed in hash. This causes an
   unnecessary slow down of the functions of the ftrace_ops that is not
   being modified. Only update the functions that are going to be
   modified to call the ftrace loop function so that the update can be
   made on those functions.

* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
  ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
  tracing: Fix trace_marker copy link list updates
  tracing: Fix failure to read user space from system call trace events
  tracing: Revert "tracing: Remove pid in task_rename tracing output"
2026-03-22 11:10:31 -07:00
Linus Torvalds ebfd9b7af2 Miscellaneous perf fixes:
- Fix a PMU driver crash on AMD EPYC systems, caused by
    a race condition in x86_pmu_enable()
 
  - Fix a possible counter-initialization bug in x86_pmu_enable()
 
  - Fix a counter inheritance bug in inherit_event() and
    __perf_event_read()
 
  - Fix an Intel PMU driver branch constraints handling bug
    found by UBSAN
 
  - Fix the Intel PMU driver's new Off-Module Response (OMR)
    support code for Diamond Rapids / Nova lake, to fix a snoop
    information parsing bug
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm/ptcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i8yBAAsbAbTzyJOxiE29beCiW3r9V4ELtrSlpG
 syUtZ4Y1cozK1w6/8S2oWHJkuWa+ToO6bn9rKxNemWIrnsxe5B4iKfeNWRFTz24g
 41xTXaxUB9c7vBgv4BDvr/ykkYGQybkn0Bf/U5rufzvIlst9bx7zKVAnIT9Qws37
 UTMY96XGYY5HNzGSZbQkpQ4cs8n72U+00OBHMTWtH8NJT+fRmaM312Q8F6wNKgH2
 YtaAjwb55BU5+hQUz5YN96xQGYaoj0s8UtOk4a3tS/t0F8mOodDVTxzMdHKToQmD
 SbscGvfC+bg5zjoYGFEU+cXoBMkkZlBPqZdLVQAGEy3YZT+JIdmyJhn9FD6HVuVw
 OzyG9VuY+TxOFrFQdMs3Xfa0vZ7AO1c90HB08P4T7nWaMioR1iFAF31MSVEMXuzd
 ZROHplWNIqDeOzmerXgZq4JWy3Bpaam8fH1B5/qN450oAxaym3lCOoCZidJYgy3g
 CVBF/6BO7DlpKiy9lXknRscItwIiRmZ9Xr+sOmOMQRGqQkC6Ykk/Hj2sA1qKPuQ5
 ruRqqaL9cznttAoJR2jZ0Ijyu7usgxB/y066nR1zXKdvEdNcntaic+QybHxbQoYx
 kyYNoR1dg+AWLb5juT68abkP4trZ8EUa7Q29OX1SzTk+0U7M0fO3/rq9gt71HiHR
 WSjRDQPfWrI=
 =jp5H
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:

 - Fix a PMU driver crash on AMD EPYC systems, caused by
   a race condition in x86_pmu_enable()

 - Fix a possible counter-initialization bug in x86_pmu_enable()

 - Fix a counter inheritance bug in inherit_event() and
   __perf_event_read()

 - Fix an Intel PMU driver branch constraints handling bug
   found by UBSAN

 - Fix the Intel PMU driver's new Off-Module Response (OMR)
   support code for Diamond Rapids / Nova lake, to fix a snoop
   information parsing bug

* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix OMR snoop information parsing issues
  perf/x86/intel: Add missing branch counters constraint apply
  perf: Make sure to use pmu_ctx->pmu for groups
  x86/perf: Make sure to program the counter value for stopped events on migration
  perf/x86: Move event pointer setup earlier in x86_pmu_enable()
2026-03-22 10:31:51 -07:00
Jiri Olsa 50b35c9e50 ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger
the slow path for all direct callers to be able to safely modify attached
addresses.

At the moment we use ops->func_hash for tmp_ops filter, which represents all
the systems attachments. It's faster to use just the passed hash filter, which
contains only the modified sites and is always a subset of the ops->func_hash.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:51:04 -04:00
Masami Hiramatsu (Google) f35dbac694 ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
Since the validation loop in rb_meta_validate_events() updates the same
cpu_buffer->head_page->entries, the other subbuf entries are not updated.
Fix to use head_page to update the entries field, since it is the cursor
in this loop.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:47:28 -04:00
Steven Rostedt 07183aac4a tracing: Fix trace_marker copy link list updates
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations uses an RCU protected list traversal.

When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.

The issue is after the flags are called, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that its removed from the list.

But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.

Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.

Also use the flag for checking the state in update_marker_trace() instead
of looking at if the list is empty.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:43:53 -04:00
Steven Rostedt edca33a562 tracing: Fix failure to read user space from system call trace events
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.

There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:

 pi_mutex_hammer-1375    [006] d..21   138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
 pi_mutex_hammer-1375    [006] d..21   138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
     migration/6-47      [006] d..2.   138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95

What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.

Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.

To solve this, when the loop fails to read user space without being
scheduled out, enabled and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-21 16:42:36 -04:00
Daniel Borkmann bc308be380 bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).

Example case with reg:

  1. r6 = bpf_get_prandom_u32()
  2. r7 = r6 (linked, same id)
  3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
  4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
  5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
  6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]

Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.

The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.

Example case with known_reg (exercised also by scalars_alu32_wrap):

  1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
  2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)

Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.

Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.

Example:

  r6 = bpf_get_prandom_u32()     // fully unknown
  if r6 >= 0x100000000 goto out  // constrain r6 to [0, U32_MAX]
  r7 = r6
  w7 += 1                        // alu32: r7.id = N | BPF_ADD_CONST32
  r8 = r6
  r8 += 2                        // alu64: r8.id = N | BPF_ADD_CONST64
  if r7 < 0xFFFFFFFF goto out    // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]

At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:

  r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000

Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.

Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).

Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.

Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-21 13:19:40 -07:00
Daniel Wade c845894ebd bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the
source operand is a constant.  When dst has signed range [-1, 0], it
forks the verifier state: the pushed path gets dst = 0, the current
path gets dst = -1.

For BPF_AND this is correct: 0 & K == 0.
For BPF_OR this is wrong:    0 | K == K, not 0.

The pushed path therefore tracks dst as 0 when the runtime value is K,
producing an exploitable verifier/runtime divergence that allows
out-of-bounds map access.

Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to
push_stack(), so the pushed path re-executes the ALU instruction with
dst = 0 and naturally computes the correct result for any opcode.

Fixes: bffacdb80b ("bpf: Recognize special arithmetic shift in the verifier")
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-21 13:14:28 -07:00
Jenny Guanni Qu c77b30bd1d bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
The BPF interpreter's signed 32-bit division and modulo handlers use
the kernel abs() macro on s32 operands. The abs() macro documentation
(include/linux/math.h) explicitly states the result is undefined when
the input is the type minimum. When DST contains S32_MIN (0x80000000),
abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged
on arm64/x86. This value is then sign-extended to u64 as
0xFFFFFFFF80000000, causing do_div() to compute the wrong result.

The verifier's abstract interpretation (scalar32_min_max_sdiv) computes
the mathematically correct result for range tracking, creating a
verifier/interpreter mismatch that can be exploited for out-of-bounds
map value access.

Introduce abs_s32() which handles S32_MIN correctly by casting to u32
before negating, avoiding signed overflow entirely. Replace all 8
abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers.

s32 is the only affected case -- the s64 division/modulo handlers do
not use abs().

Fixes: ec0e2da95f ("bpf: Support new signed div/mod instructions.")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-21 13:12:16 -07:00
Ihor Solodrai 6c2128505f bpf: Fix exception exit lock checking for subprogs
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() to skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.

At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as the result.

Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.

Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. Test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.

The spin_lock case is not affected because they are already checked [1]
at the call site in do_check_insn() before bpf_throw can run.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098

Assisted-by: Claude:claude-opus-4-6
Fixes: f18b03faba ("bpf: Implement BPF exceptions")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-21 12:51:44 -07:00
Linus Torvalds e9825d1c79 Power management fixes for 7.0-rc5
- Consolidate the handling of two special cases in the idle loop that
    occur when only one CPU idle state is present (Rafael Wysocki)
 
  - Fix a race condition related to device removal in the runtime PM
    core code that may cause a stale device object pointer to be
    dereferenced (Bart Van Assche)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmm8Ad4SHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1yFYIAKaPfbV2uWEbB0l+g7QBk4lEGc+CUxN+
 jpBut/hWepoLDzQyqaROtTd8R5dFmKASsLXpNVgDjEDX+HCkFzdmlUlDZar0B8zZ
 lZTcQypFPWtKeTD5OuelZK/trT09SdsB5Yeav1O6KXM5bT0ISn/f729gkPGIXXm5
 +L8R2SvrEYu1dMZ3ithRq+4/bx1Pqav/g+0aD/tQMBDxjSE0G+d3hPFxvPueDMZX
 L8CNC5CcH3vwNrI0NR9EH/z491q1Ddo6XUBW+ja2UGztdBTVmm5fEktDGxVjN8B8
 A+ZT0RX1itcV+ngjtNYpXbRlMLwCkTvM8rVLpOANcAmsbNOCYGyM80s=
 =aptB
 -----END PGP SIGNATURE-----

Merge tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an idle loop issue exposed by recent changes and a race
  condition related to device removal in the runtime PM core code:

   - Consolidate the handling of two special cases in the idle loop that
     occur when only one CPU idle state is present (Rafael Wysocki)

   - Fix a race condition related to device removal in the runtime PM
     core code that may cause a stale device object pointer to be
     dereferenced (Bart Van Assche)"

* tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: runtime: Fix a race condition related to device removal
  sched: idle: Consolidate the handling of two special cases
2026-03-19 08:45:34 -07:00
Kumar Kartikeya Dwivedi 146bd2a87a bpf: Release module BTF IDR before module unload
Gregory reported in [0] that the global_map_resize test when run in
repeatedly ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.

Logically, once a module is unloaded, it's associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long its reference counts are alive, but it should no longer be
discoverable.

To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.

Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.

To avoid a special case for btf_is_module() case, set btf->id to zero to
make btf_free_id() idempotent, such that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR, the idr_remove() should be a noop.

Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.

  [0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com

Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Reported-by: Gregory Bell <grbell@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-18 17:26:40 -07:00
Rafael J. Wysocki f4c31b07b1 sched: idle: Consolidate the handling of two special cases
There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.

The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().

The second one is when a cpuidle driver is present, but there is only
one idle state in its table.  In that case, the scheduler tick is never
stopped at all.

Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic.  Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes: ed98c34919 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-03-16 20:29:47 +01:00
Linus Torvalds 8a91ebb337 6 hotfixes. 4 are cc:stable. 3 are for MM.
All are singletons - please see the changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCabhW5gAKCRDdBJ7gKXxA
 jiy+AQDq0x5J94Ezavp0//wqpQ3gwCbo+E8cOE4qb3ig20gBFAEA2hr8qvWYeGCV
 r+FhX6PAmY6fxL8e/akqR3h7fR7iPAk=
 =g8d7
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "6 hotfixes.  4 are cc:stable.  3 are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: update email address for Ignat Korchagin
  mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP
  mm/rmap: fix incorrect pte restoration for lazyfree folios
  mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()
  build_bug.h: correct function parameters names in kernel-doc
  crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
2026-03-16 12:21:00 -07:00
Linus Torvalds d9bf296c39 Probes fixes for v7.0-rc3
- kprobes: avoid crash when rmmod/insmod after ftrace killed
   This fixes a kernel crash caused by kprobes on the symbol in a
   module which is unloaded after ftrace_kill() is called.
 - kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
   Remove unneeded WARN messages which can be kicked if the kprobe is
   using ftrace and it fails to enable the ftrace. Since kprobes
   correctly handle such failure, we don't need to warn it.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmm2hk8bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bBZkH+gP3OllhdIU3AUB+vXEb
 UEE3VE5IZRufSgtjbbJnYI3b8U2dWXw7wmb+fBJ0i0Zf6F+2IUr3hUg1pHNARvlL
 MMu1YW7PG8gKGUsNc7jpHBVXrnefA4XpzXe7wtxaGvqAV16nL/6xhZlanGgL50Gv
 +F9cETMIGZ8duF6XVgEMmUUCg88Iwpp9MjzDQOjpRK7z41LND3ccJ2V88ODzDevj
 idnRbnk12Q9b7xZwGL5P5Ab163kHPFExsXQQnPmy/0fLcyGi8U3hY5EwYhOT85J0
 qRZVHpR1yrXqBpaxD/7as5/pfuUluxdzwmDkyRuQGYW6y7xkhlVNEGWqsWZeYLuW
 IZU=
 =Zkm3
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - Avoid crash when rmmod/insmod after ftrace killed

   This fixes a kernel crash caused by kprobes on the symbol in a module
   which is unloaded after ftrace_kill() is called.

 - Remove unneeded warnings from __arm_kprobe_ftrace()

   Remove unneeded WARN messages which can be triggered if the kprobe is
   using ftrace and it fails to enable the ftrace. Since kprobes
   correctly handle such failure, we don't need to warn it.

* tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
  kprobes: avoid crash when rmmod/insmod after ftrace killed
2026-03-15 13:08:05 -07:00
Linus Torvalds 164cb546e9 Fix function tracer recursion bug by marking jiffies_64_to_clock_t()
notrace.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm2J6MRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hQ9g//Qwq0Esh58yP0UkySArlf7xrOEBrfIF2R
 W7aLDRevhTCklN5z4BlkCaT7XyfZL19CF2AXW1PZ9fSJKlM/vK0xkFToSiKuG6Pc
 2BvfYOvePrmT+8AMnhFe2ekzJclUi9s+ars3H0i59Xu6D+CCOqDvvdGoVkdfuL5+
 eYlxVZK2TtrZEDkdXkBD0vPvcrXqYxDnZD3Pkpys9NQBxSyIDwVdCr+uNVv3hCsf
 L6TAnOzUbI7jWp805izAgdQ4E0P2ywb83nK8op++p4J8PpVaGSnW6LmfDmBsKvRG
 V+cyqXdMToZjEayBdjDgk3GH6M4K2lFWsg1TVfqE1K05fwEdNDYJY7G7eKvhrLH1
 2oNvsEMKLXNVSTgn7gSr63IQBC7+J0eKkmBp1Lv9WWjqYC4QYacrFm4Fc5OCmec0
 GI3QQwPvNa+i7GByvLbXe8+8q/BLIWOIcCutlU8GRdYaghA1lOynI8dQmKfSdM+j
 O1TMIqI56ymffsiBe4d9iUyzCWUdB1LXR0Hm36wm+SJmydc6/Ael1gUlFjqXpGKq
 Ui4h5rRtNvZgsqmsGGgjMvdiZ9dN/e7tz4tdyp4oCcpKhfDT7qNLdHeoyvDxhDs/
 uTvpAK5iH6Wxa91DXt4M+17gf97Ws1dIHU6G9og4B1l5bptyXmrMuVPmK8L/wxDu
 POJGsxuIStA=
 =Obpb
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix function tracer recursion bug by marking jiffies_64_to_clock_t()
  notrace"

* tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/jiffies: Mark jiffies_64_to_clock_t() notrace
2026-03-15 11:14:09 -07:00
Linus Torvalds 63724e9519 More MM-CID fixes, mostly fixing hangs/races:
- Fix CID hangs due to a race between concurrent forks
 
  - Fix vfork()/CLONE_VM MMCID bug causing hangs
 
  - Remove pointless preemption guard
 
  - Fix CID task list walk performance regression on large systems
    by removing the known-flaky and slow counting logic using
    for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(),
    and implementing a simple sched_mm_cid::node list instead
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm2JvIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hTqg/+K7b4LDOi3nVblmoj6q+mQj2i8DFPbi10
 zeAWJJnamYWPvUi+Wxq30JjZJ9v+15Ddcmbhea9m/3u1YO6nAL5TbGeQcJ2LU/7p
 Ynu9cznv9PfqO4X7WQc3gJC9xx8PbcM00E3JzGxDX/3NDmDBaTOwwuTp41ymcbhm
 cGfnUQWGt81sMummVzqehszfIRMZHnWflYDJ2gC66rcGXMNBlEX125F8jybOm66n
 Ez6gO7e9EGn28+hZIufySsxaeeK/3NFVKj1UjGP/FMuBwQFAjHPv61nic33nOKXT
 yrw7U8DIaYUqFN4d1lplTG72j2YSUj7snn3Q+ubxpzFmOt7RmouVqwlVGEoey5fh
 cEe2VYSQFoZKQioWWyms1LP1hTOa2JkNVhdjBfRZ8IM+Wp47OaDiw1h1+zwwMDbJ
 xpDAXEuU+sBZiv2SeBLFQgrGj58gb8pdjN4o47X89mx8TKYWtStrCMsD+MF10LBm
 dz780Eiinbw5D8JBsxU/ehETpgrAAVmo1KbFx2Q2grAgkJs7jSqBN2KF8NpmH/ZS
 Jk8SpQOn4Vp8iO32TbpsV/GErG9EQgixQxnkTukv2Qd9kguhmjwbi/blN3rLBlBb
 XbmR9rRAMfAjlPrk84tn9ecXNWO0NV83IYheAwjip36alSbOs+OcxdhrZ78nxh8C
 EsKqGl3PeOk=
 =ce5G
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "More MM-CID fixes, mostly fixing hangs/races:

   - Fix CID hangs due to a race between concurrent forks

   - Fix vfork()/CLONE_VM MMCID bug causing hangs

   - Remove pointless preemption guard

   - Fix CID task list walk performance regression on large systems
     by removing the known-flaky and slow counting logic using
     for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and
     implementing a simple sched_mm_cid::node list instead"

* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Avoid full tasklist walks
  sched/mmcid: Remove pointless preempt guard
  sched/mmcid: Handle vfork()/CLONE_VM correctly
  sched/mmcid: Prevent CID stalls due to concurrent forks
2026-03-15 10:49:47 -07:00
Linus Torvalds 9abff5748e workqueue: Fixes for v7.0-rc3
- Improve workqueue stall diagnostics: dump all busy workers (not just
   running ones), show wall-clock duration of in-flight work items, and
   add a sample module for reproducing stalls.
 
 - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id().
 
 - Rename pool->watchdog_ts to pool->last_progress_ts and related
   functions for clarity.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyOg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaAEAP9xJvKzVtyXBMWxIQNJKqN58VaT/5bNk3tYfm3O
 ZOuBrQD+Lsxjvv/pSSKFscZJ/x0dthdYncYI8DF3/G6Lnf+LDAc=
 =MT3y
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - Improve workqueue stall diagnostics: dump all busy workers (not just
   running ones), show wall-clock duration of in-flight work items, and
   add a sample module for reproducing stalls

 - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id()

 - Rename pool->watchdog_ts to pool->last_progress_ts and related
   functions for clarity

* tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
  workqueue: Add stall detector sample module
  workqueue: Show all busy workers in stall diagnostics
  workqueue: Show in-flight work item duration in stall diagnostics
  workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
  workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
2026-03-13 15:11:05 -07:00
Linus Torvalds b073bcb8d4 cgroup: Fixes for v7.0-rc3
- Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks
   that haven't been removed yet, fixing a systemd timeout issue on
   PREEMPT_RT.
 
 - Call rebuild_sched_domains() directly in CPU hotplug instead of
   deferring to a workqueue, fixing a race where online/offline CPUs
   could briefly appear in stale sched domains.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyLQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGXk4APwKw2HtGyI3OQAHfDBL+wlblPtf8acz0zpDwGCT
 9+TWFQD/Rhmtvkb/X/LTwT5PKJksoHOfkD4MqmVMGStKGdxtBAY=
 =DJ17
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks
   that haven't been removed yet, fixing a systemd timeout issue on
   PREEMPT_RT

 - Call rebuild_sched_domains() directly in CPU hotplug instead of
   deferring to a workqueue, fixing a race where online/offline CPUs
   could briefly appear in stale sched domains

* tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Don't expose dead tasks in cgroup
  cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
2026-03-13 15:06:31 -07:00
Linus Torvalds 8369b2e97d sched_ext: Fixes for v7.0-rc3
- Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE()
   annotations for lock-free accesses to module parameters and dsq->seq.
 
 - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and
   above) when passed through the int sched_class interface.
 
 - Documentation updates: scheduling class precedence, task ownership
   state machine, example scheduler descriptions, config list cleanup.
 
 - Selftest fix for format specifier and buffer length in
   file_write_long().
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyHg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGZiWAQCmUOHiGAk73p9DDn6Zyrm+o/iQm/iOinchBeUs
 ZiG0bgEAn15giAnLCA5Zs6cG7PemxBH1v7ctyzTjh1VsBds0rwo=
 =zXix
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE()
   annotations for lock-free accesses to module parameters and dsq->seq

 - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and
   above) when passed through the int sched_class interface

 - Documentation updates: scheduling class precedence, task ownership
   state machine, example scheduler descriptions, config list cleanup

 - Selftest fix for format specifier and buffer length in
   file_write_long()

* tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
  sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
  sched_ext: Documentation: Update sched-ext.rst
  sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()
  sched_ext: Documentation: Mention scheduling class precedence
  sched_ext: Document task ownership state machine
  sched_ext: Use READ_ONCE() for lock-free reads of module param variables
  sched_ext/selftests: Fix format specifier and buffer length in file_write_long()
  sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
2026-03-13 14:54:56 -07:00
Masami Hiramatsu (Google) 5ef268cb7a kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
Remove unneeded warnings for handled errors from __arm_kprobe_ftrace()
because all caller handled the error correctly.

Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/

Reported-by: Zw Tang <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/
Fixes: 9c89bb8e32 ("kprobes: treewide: Cleanup the error messages for kprobes")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-13 23:15:26 +09:00
Masami Hiramatsu (Google) e113f0b46d kprobes: avoid crash when rmmod/insmod after ftrace killed
After we hit ftrace is killed by some errors, the kernel crash if
we remove modules in which kprobe probes.

BUG: unable to handle page fault for address: fffffbfff805000d
PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G        W  OE
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
RIP: 0010:kprobes_module_callback+0x89/0x790
RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02
RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90
RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a
R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002
R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040
FS:  00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 notifier_call_chain+0xc6/0x280
 blocking_notifier_call_chain+0x60/0x90
 __do_sys_delete_module.constprop.0+0x32a/0x4e0
 do_syscall_64+0x5d/0xfa0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is because the kprobe on ftrace does not correctly handles
the kprobe_ftrace_disabled flag set by ftrace_kill().

To prevent this error, check kprobe_ftrace_disabled in
__disarm_kprobe_ftrace() and skip all ftrace related operations.

Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/

Reported-by: Ye Bin <yebin10@huawei.com>
Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/
Fixes: ae6aa16fdc ("kprobes: introduce ftrace based optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-13 23:14:14 +09:00
Peter Zijlstra 4b9ce67196 perf: Make sure to use pmu_ctx->pmu for groups
Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access
when group_sched_in() fails and needs to roll back.

This *should* be handled by the transaction callbacks, but he found that when
the group leader is a software event, the transaction handlers of the wrong PMU
are used. Despite the move_group case in perf_event_open() and group_sched_in()
using pmu_ctx->pmu.

Turns out, inherit uses event->pmu to clone the events, effectively undoing the
move_group case for all inherited contexts. Fix this by also making inherit use
pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context.

Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the
group case.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net
2026-03-12 11:29:16 +01:00
Thomas Gleixner 192d852129 sched/mmcid: Avoid full tasklist walks
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full
task list walk, which is obviously expensive on large systems.

Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid
and walk this list instead. This removes the proven to be flaky counting
logic and avoids a full task list walk in the case of vfork()'ed tasks.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
2026-03-11 12:01:07 +01:00
Thomas Gleixner 7574ac6e49 sched/mmcid: Remove pointless preempt guard
This is a leftover from the early versions of this function where it could
be invoked without mm::mm_cid::lock held.

Remove it and add lockdep asserts instead.

Fixes: 653fda7ae7 ("sched/mmcid: Switch over to the new mechanism")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.116363613@kernel.org
2026-03-11 12:01:06 +01:00
Thomas Gleixner 28b5a13950 sched/mmcid: Handle vfork()/CLONE_VM correctly
Matthieu and Jiri reported stalls where a task endlessly loops in
mm_get_cid() when scheduling in.

It turned out that the logic which handles vfork()'ed tasks is broken. It
is invoked when the number of tasks associated to a process is smaller than
the number of MMCID users. It then walks the task list to find the
vfork()'ed task, but accounts all the already processed tasks as well.

If that double processing brings the number of to be handled tasks to 0,
the walk stops and the vfork()'ed task's CID is not fixed up. As a
consequence a subsequent schedule in fails to acquire a (transitional) CID
and the machine stalls.

Cure this by removing the accounting condition and make the fixup always
walk the full task list if it could not find the exact number of users in
the process' thread list.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org
Reported-by: Matthieu Baerts <matttbe@kernel.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.048657665@kernel.org
2026-03-11 12:01:06 +01:00
Thomas Gleixner b2e48c429e sched/mmcid: Prevent CID stalls due to concurrent forks
A newly forked task is accounted as MMCID user before the task is visible
in the process' thread list and the global task list. This creates the
following problem:

 CPU1			CPU2
 fork()
   sched_mm_cid_fork(tnew1)
     tnew1->mm.mm_cid_users++;
     tnew1->mm_cid.cid = getcid()
-> preemption
			fork()
			  sched_mm_cid_fork(tnew2)
			    tnew2->mm.mm_cid_users++;
                            // Reaches the per CPU threshold
			    mm_cid_fixup_tasks_to_cpus()
			    for_each_other(current, p)
			         ....

As tnew1 is not visible yet, this fails to fix up the already allocated CID
of tnew1. As a consequence a subsequent schedule in might fail to acquire a
(transitional) CID and the machine stalls.

Move the invocation of sched_mm_cid_fork() after the new task becomes
visible in the thread and the task list to prevent this.

This also makes it symmetrical vs. exit() where the task is removed as CID
user before the task is removed from the thread and task lists.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202525.969061974@kernel.org
2026-03-11 12:01:06 +01:00
Steven Rostedt 755a648e78 time/jiffies: Mark jiffies_64_to_clock_t() notrace
The trace_clock_jiffies() function that handles the "uptime" clock for
tracing calls jiffies_64_to_clock_t(). This causes the function tracer to
constantly recurse when the tracing clock is set to "uptime". Mark it
notrace to prevent unnecessary recursion when using the "uptime" clock.

Fixes: 58d4e21e50 ("tracing: Fix wraparound problems in "uptime" trace clock")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260306212403.72270bb2@robin
2026-03-11 10:33:12 +01:00
Thorsten Blum 36f46b0e36 crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
When debug logging is enabled, read_key_from_user_keying() logs the first
8 bytes of the key payload and partially exposes the dm-crypt key.  Stop
logging any key bytes.

Link: https://lkml.kernel.org/r/20260227230008.858641-2-thorsten.blum@linux.dev
Fixes: 479e58549b ("crash_dump: store dm crypt keys in kdump reserved memory")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-10 16:01:48 -07:00
Sachin Kumar 2321a9596d bpf: Fix constant blinding for PROBE_MEM32 stores
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.

The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.

Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).

The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Sachin Kumar <xcyfun@protonmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-10 11:58:17 -07:00
Yazhou Tang a3125bc018 bpf: Reset register ID for BPF_END value tracking
When a register undergoes a BPF_END (byte swap) operation, its scalar
value is mutated in-place. If this register previously shared a scalar ID
with another register (e.g., after an `r1 = r0` assignment), this tie must
be broken.

Currently, the verifier misses resetting `dst_reg->id` to 0 for BPF_END.
Consequently, if a conditional jump checks the swapped register, the
verifier incorrectly propagates the learned bounds to the linked register,
leading to false confidence in the linked register's value and potentially
allowing out-of-bounds memory accesses.

Fix this by explicitly resetting `dst_reg->id` to 0 in the BPF_END case
to break the scalar tie, similar to how BPF_NEG handles it via
`__mark_reg_known`.

Fixes: 9d21199842 ("bpf: Add bitwise tracking for BPF_END")
Closes: https://lore.kernel.org/bpf/AMBPR06MB108683CFEB1CB8D9E02FC95ECF17EA@AMBPR06MB10868.eurprd06.prod.outlook.com/
Link: https://lore.kernel.org/bpf/4be25f7442a52244d0dd1abb47bc6750e57984c9.camel@gmail.com/
Reported-by: Guillaume Laporte <glapt.pro@outlook.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304083228.142016-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-10 11:46:31 -07:00
Rafael J. Wysocki d557640e4c sched: idle: Make skipping governor callbacks more consistent
If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.

Fixes: e5c9ffc6ae ("cpuidle: Skip governor when only one idle state is available")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
2026-03-10 16:03:02 +01:00
zhidao su 2fcfe5951e sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
scx_enable() uses double-checked locking to lazily initialize a static
kthread_worker pointer. The fast path reads helper locklessly:

    if (!READ_ONCE(helper)) {          // lockless read -- no helper_mutex

The write side initializes helper under helper_mutex, but previously
used a plain assignment:

        helper = kthread_run_worker(0, "scx_enable_helper");
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 plain write -- KCSAN data race with READ_ONCE() above

Since READ_ONCE() on the fast path and the plain write on the
initialization path access the same variable without a common lock,
they constitute a data race. KCSAN requires that all sides of a
lock-free access use READ_ONCE()/WRITE_ONCE() consistently.

Use a temporary variable to stage the result of kthread_run_worker(),
and only WRITE_ONCE() into helper after confirming the pointer is
valid. This avoids a window where a concurrent caller on the fast path
could observe an ERR pointer via READ_ONCE(helper) before the error
check completes.

Fixes: b06ccbabe2 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09 06:08:26 -10:00
Linus Torvalds 6ff1020c2f Make clock_adjtime() syscall timex validation slightly more
permissive for auxiliary clocks, to not reject syscalls
 based on the status field that do not try to modify the
 status field. This makes ABI behavior in clock_adjtime()
 consistent with CLOCK_REALTIME.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmsxzkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hq8g//fRTp9p2pVfmRWUoxWELrT/bMK1r+D6F3
 6BYkwp68peRhchVrFxkI/Y37rjAIC8CXZSPuvkubqIROrH3gA7SCCQYCcZKdss+t
 i3lbpQF8IbagPIS5btpOAN2KRCu2S7aqjDdH0rWb9VhQdlW7fI71Z72Uz07YEA+q
 TWpy3gE531P/dgAqcvIAyMHnFZDCb1S6z8wZvT3SV4r4GkczfXpTFyNHHtETSu0V
 7isuOBfloM4HpDU50oUotlqBiwigH27J2Ad6aIrnCA7iaQPrzREysG+8E96ShhaB
 g6+qaQS5gTgFryA1bggA6LzGveLOI8bjy2kZ2SnZWuFPj46OReGIuwK4kyY07jz2
 xk0sd37alN16ETKhGVLfAgjmzVGoKVNnp4ak9J3VmMbxWEmXeObuOC8SmF9VImc1
 4bRaG9+Tlfd4DtOOz2+E4VcPE1D9A2tMw4esgUaXRrrp4GlEcKOJ5PRlWj0uGvrh
 xLPLbL0XIiWsjMsHdVs4Gq9Z0MvfRHc4VLOviIqLFtHox2DscZypPkyjKAv5inp0
 /VWyUYJkkr07RMQQ3nqHnP+lzAfO2aSeZ72D9NnHStL3RPbGC4jYvpoi8dnH0/TT
 PKJgj2jb7u3h+1cxKBi1RM0JbxUYD5+4N8zfJISa9uqkHZ3XY3VyuuT+2RHO6CQp
 d1BdX0V4oDA=
 =zjov
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Make clock_adjtime() syscall timex validation slightly more permissive
  for auxiliary clocks, to not reject syscalls based on the status field
  that do not try to modify the status field.

  This makes the ABI behavior in clock_adjtime() consistent with
  CLOCK_REALTIME"

* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix timex status validation for auxiliary clocks
2026-03-07 17:09:15 -08:00
Linus Torvalds b1b9a9d0b5 Fix a DL scheduler bug that may corrupt internal metrics during
PI and setscheduler() syscalls, resulting in kernel warnings
 and misbehavior. Found during stress-testing.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmsxW0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jzXRAAjqcTwaC72cd+6cnh+tE9/fcjXf1JtK5e
 TxdTygsgBAbXh63rD4y4cRPueqBR1ne52TAV0lI8Z1pBM/XthnaF4MJBue6B8EdX
 SQIE7hpOh6R81I6hnuhNoNsAy95jQvYXN5SFaKMuNacWNVX8k3vPzN5XPxa7yHLN
 MVUL+O9c7Xwg4v30Nz/QIv0mFoPosbh4PIdeVpD/ghJAXtXhsCg7EYOivEk9UsSy
 TAcq3qRnfDyroIOc5/dnSglEwX12LQqVFBba97nI/TCjaH23PsUIt2Dg2rpJbJ+k
 bLh4hGpOoyQvgE/PSEdoMl1F9pXw3XiUOzAGrFJdqn0iKL+7WzuTEQH+vAToGZQv
 4hF5BtMjLrAYY/MVsD8qJGm/pne5nTIo2gSsG7LZPwCmMj0rDUGXfO4G8N8LHhT7
 ExQ/t2+z0BczsKdvF3VKX+RweT51AOYOWcmLIdA9h1jdAy858GVmTzSWDveAEJ0L
 yToPQ0UMCz985g9il6Rdb5cIphD7DjuUeFNnYTCm63cVpZdA4j8Da74r4KfP2jNY
 tRcbiUy+A7MwqW5aERgwBtI6XCz6QZqW3svJW9yYghf40lgNGAcDCTTdf2r7g0Ho
 Q0pQVxEk9mXD5N1otjzSS4piLbzoMaPH1L4W6ceHN1RzBjfSJED3tmfGUHZUDqNE
 w33GhhQAFpA=
 =vP5l
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a DL scheduler bug that may corrupt internal metrics during PI and
  setscheduler() syscalls, resulting in kernel warnings and misbehavior.

  Found during stress-testing"

* tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
2026-03-07 17:07:13 -08:00
Linus Torvalds 8b7f4cd3ac bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmsahoACgkQ6rmadz2v
 bToD4Q/9Hr2lAELsTRZIENBTYYTfk48Rzrr+rJbZWLfX4nOu9RwhVboTw0gvJr8r
 QEteYO8+gP5hrIuifcC+vdzW12CpgziBK7/Qj2NB7MGSX0ZCLK0ShQK6TAbNE8uZ
 xA4qMcJp0BWJpgfU51z4Ur5nMzG0Th2XlY+SGwHYZ/53qRMmP7Tr8R9wPltKuEll
 J+tS6BE6bFKMQe1CuIYlYfYj9kO3j4yx9S48j1e8x9LXc2/2arfsOIZQ6L11rnxZ
 YTa9jbcRL0YdLy7d4hMLyPQquF9hiHDrJLJih+YNmbyOCPGD3HP0ib2c2g0ip+qk
 WlNFLc2K9w0eS02uuAytaxwArdOpDUHG+skWe+dGt95Bk3Xld0hW/JSI5hrohM1E
 0YKjuZeGGvTkI5FJ5kk+BWoApy+FveiN1w7VsHZgr1tix5NRZRgJP/5swM+0eSp/
 neTzp08fC10gi+GOnUpdF/lI4wEayoGMpo68BTGiM/hP6Qh268wNWgTHaau3Sk/z
 wfJb90VTcDHk+N8mnbi4970RZ6OdqX2n7aBy8D7hBW31kHhP27zofzYX51CFD8Ln
 5NQDU3wX/hUw9tNwoghCNe12//gfDBi6rO/GoEztNFfVVef4QuzsfwKKdxzUMc9a
 jZvyDyGh87rGLauDWylq4s23vgxgIG/p114uv+eFp3AMkOfT/Dk=
 =knWl
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard
   Zingerman)

 - Fix precision backtracking with linked registers (Eduard Zingerman)

 - Fix linker flags detection for resolve_btfids (Ihor Solodrai)

 - Fix race in update_ftrace_direct_add/del (Jiri Olsa)

 - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  resolve_btfids: Fix linker flags detection
  selftests/bpf: add reproducer for spurious precision propagation through calls
  bpf: collect only live registers in linked regs
  Revert "selftests/bpf: Update reg_bound range refinement logic"
  selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
  bpf: Fix u32/s32 bounds when ranges cross min/max boundary
  bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
  ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
2026-03-07 12:20:37 -08:00
Linus Torvalds aed0af05a8 tracing fixes for 7.0:
- Fix possible NULL pointer dereference in trace_data_alloc()
 
   On the error path in trace_data_alloc(), it can call trigger_data_free()
   with a NULL pointer. This use to be a kfree() but was changed to
   trigger_data_free() to clean up any partial initialization. The issue is
   that trigger_data_free() does not expect a NULL pointer. Have
   trigger_data_free() return safely on NULL pointer.
 
 - Fix multiple events on the command line and bootconfig
 
   If multiple events are enabled on the command line separately and not
   grouped, only the last event gets enabled. That is:
 
     trace_event=sched_switch trace_event=sched_waking
 
   Will only enable sched_waking where as:
 
     trace_event=sched_switch,sched_waking
 
   Will enable both.
 
   The bootconfig makes it even worse as the second way is the more common
   method.
 
   The issue is that a temporary buffer is used to store the events to enable
   later in boot. Each time the cmdline callback is called, it overwrites
   what was previously there.
 
   Have the callback append the next value (delimited by a comma) if the
   temporary buffer already has content.
 
 - Fix command line trace_buffer_size if >= 2G
 
   The logic to allocate the trace buffer uses "int" for the size parameter
   in the command line code causing overflow issues if more that 2G is
   specified.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaaxEIRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qn+QAQCM6aJm0ZqDD2dM262M1mQpkU7sW3Dz
 hZfBpo3YlH55fQEAklsaD96+yKN7PLl1Vh4c0zCelMHZA7kgck/3GqaFAgA=
 =rn/Z
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix possible NULL pointer dereference in trace_data_alloc()

   On the trace_data_alloc() error path, it can call trigger_data_free()
   with a NULL pointer. This used to be a kfree() but was changed to
   trigger_data_free() to clean up any partial initialization. The issue
   is that trigger_data_free() does not expect a NULL pointer. Have
   trigger_data_free() return safely on NULL pointer.

 - Fix multiple events on the command line and bootconfig

   If multiple events are enabled on the command line separately and not
   grouped, only the last event gets enabled. That is:

      trace_event=sched_switch trace_event=sched_waking

   will only enable sched_waking whereas:

      trace_event=sched_switch,sched_waking

   will enable both.

   The bootconfig makes it even worse as the second way is the more
   common method.

   The issue is that a temporary buffer is used to store the events to
   enable later in boot. Each time the cmdline callback is called, it
   overwrites what was previously there.

   Have the callback append the next value (delimited by a comma) if the
   temporary buffer already has content.

 - Fix command line trace_buffer_size if >= 2G

   The logic to allocate the trace buffer uses "int" for the size
   parameter in the command line code causing overflow issues if more
   that 2G is specified.

* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
  tracing: Fix enabling multiple events on the kernel command line and bootconfig
  tracing: Add NULL pointer check to trigger_data_free()
2026-03-07 09:50:54 -08:00
Tejun Heo 57ccf5ccdc sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
enqueue_task_scx() takes int enq_flags from the sched_class interface.
SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are
silently truncated when passed through activate_task(). extra_enq_flags
was added as a workaround - storing high bits in rq->scx.extra_enq_flags
and OR-ing them back in enqueue_task_scx(). However, the OR target is
still the int parameter, so the high bits are lost anyway.

The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT
which is informational to the BPF scheduler - its loss means the scheduler
doesn't know about preemption but doesn't cause incorrect behavior.

Fix by renaming the int parameter to core_enq_flags and introducing a
u64 enq_flags local that merges both sources. All downstream functions
already take u64 enq_flags.

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07 04:53:32 -10:00
Eduard Zingerman 2658a1720a bpf: collect only live registers in linked regs
Fix an inconsistency between func_states_equal() and
collect_linked_regs():
- regsafe() uses check_ids() to verify that cached and current states
  have identical register id mapping.
- func_states_equal() calls regsafe() only for registers computed as
  live by compute_live_registers().
- clean_live_states() is supposed to remove dead registers from cached
  states, but it can skip states belonging to an iterator-based loop.
- collect_linked_regs() collects all registers sharing the same id,
  ignoring the marks computed by compute_live_registers().
  Linked registers are stored in the state's jump history.
- backtrack_insn() marks all linked registers for an instruction
  as precise whenever one of the linked registers is precise.

The above might lead to a scenario:
- There is an instruction I with register rY known to be dead at I.
- Instruction I is reached via two paths: first A, then B.
- On path A:
  - There is an id link between registers rX and rY.
  - Checkpoint C is created at I.
  - Linked register set {rX, rY} is saved to the jump history.
  - rX is marked as precise at I, causing both rX and rY
    to be marked precise at C.
- On path B:
  - There is no id link between registers rX and rY,
    otherwise register states are sub-states of those in C.
  - Because rY is dead at I, check_ids() returns true.
  - Current state is considered equal to checkpoint C,
    propagate_precision() propagates spurious precision
    mark for register rY along the path B.
  - Depending on a program, this might hit verifier_bug()
    in the backtrack_insn(), e.g. if rY ∈  [r1..r5]
    and backtrack_insn() spots a function call.

The reproducer program is in the next patch.
This was hit by sched_ext scx_lavd scheduler code.

Changes in tests:
- verifier_scalar_ids.c selftests need modification to preserve
  some registers as live for __msg() checks.
- exceptions_assert.c adjusted to match changes in the verifier log,
  R0 is dead after conditional instruction and thus does not get
  range.
- precise.c adjusted to match changes in the verifier log, register r9
  is dead after comparison and it's range is not important for test.

Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Fixes: 0fb3cf6110 ("bpf: use register liveness information for func_states_equal")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-06 21:49:40 -08:00
Calvin Owens d008ba8be8 tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
Some of the sizing logic through tracer_alloc_buffers() uses int
internally, causing unexpected behavior if the user passes a value that
does not fit in an int (on my x86 machine, the result is uselessly tiny
buffers).

Fix by plumbing the parameter's real type (unsigned long) through to the
ring buffer allocation functions, which already use unsigned long.

It has always been possible to create larger ring buffers via the sysfs
interface: this only affects the cmdline parameter.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes: 73c5162aa3 ("tracing: keep ring buffer to minimum size till used")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06 22:25:53 -05:00
Eduard Zingerman fbc7aef517 bpf: Fix u32/s32 bounds when ranges cross min/max boundary
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges
in __reg32_deduce_bounds() in the following situations:

- s32 range crosses U32_MAX/0 boundary, positive part of the s32 range
  overlaps with u32 range:

  0                                                   U32_MAX
  |  [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]              |
  |----------------------------|----------------------------|
  |xxxxx s32 range xxxxxxxxx]                       [xxxxxxx|
  0                     S32_MAX S32_MIN                    -1

- s32 range crosses U32_MAX/0 boundary, negative part of the s32 range
  overlaps with u32 range:

  0                                                   U32_MAX
  |              [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]  |
  |----------------------------|----------------------------|
  |xxxxxxxxx]                       [xxxxxxxxxxxx s32 range |
  0                     S32_MAX S32_MIN                    -1

- No refinement if ranges overlap in two intervals.

This helps for e.g. consider the following program:

   call %[bpf_get_prandom_u32];
   w0 &= 0xffffffff;
   if w0 < 0x3 goto 1f;    // on fall-through u32 range [3..U32_MAX]
   if w0 s> 0x1 goto 1f;   // on fall-through s32 range [S32_MIN..1]
   if w0 s< 0x0 goto 1f;   // range can be narrowed to  [S32_MIN..-1]
   r10 = 0;
1: ...;

The reg_bounds.c selftest is updated to incorporate identical logic,
refinement based on non-overflowing range halves:

  ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪
  ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1]))

Reported-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-06 18:16:06 -08:00
Sebastian Andrzej Siewior a72f73c4dd cgroup: Don't expose dead tasks in cgroup
Once a task exits it has its state set to TASK_DEAD and then it is
removed from the cgroup it belonged to. The last step happens on the task
gets out of its last schedule() invocation and is delayed on PREEMPT_RT
due to locking constraints.

As a result it is possible to receive a pid via waitpid() of a task
which is still listed in cgroup.procs for the cgroup it belonged
to. This is something that systemd does not expect and as a result it
waits for its exit until a time out occurs.
This can also be reproduced on !PREEMPT_RT kernel with a significant
delay in do_exit() after exit_notify().

Hide the task from the output which have PF_EXITING set which is done
before the parent is notified. Keeping zombies with live threads
shouldn't break anything (suggested by Tejun).

Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Cc: stable@vger.kernel.org # v6.19+
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06 12:43:25 -10:00
Andrei-Alexandru Tachici 3b1679e086 tracing: Fix enabling multiple events on the kernel command line and bootconfig
Multiple events can be enabled on the kernel command line via a comma
separator. But if the are specified one at a time, then only the last
event is enabled. This is because the event names are saved in a temporary
buffer, and each call by the init cmdline code will reset that buffer.

This also affects names in the boot config file, as it may call the
callback multiple times with an example of:

  kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss"

Change the cmdline callback function to append a comma and the next value
if the temporary buffer already has content.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com
Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06 16:54:34 -05:00
Guenter Roeck 457965c13f tracing: Add NULL pointer check to trigger_data_free()
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse()
jumps to the out_free error path. While kfree() safely handles a NULL
pointer, trigger_data_free() does not. This causes a NULL pointer
dereference in trigger_data_free() when evaluating
data->cmd_ops->set_filter.

Fix the problem by adding a NULL pointer check to trigger_data_free().

The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.

Cc: Miaoqian Lin <linmq006@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net
Fixes: 0550069cc2 ("tracing: Properly process error handling in event_hist_trigger_parse()")
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06 13:04:30 -05:00
Waiman Long ca174c705d cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
Besides deferring the call to housekeeping_update(), commit 6df415aa46
("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
to workqueue") also defers the rebuild_sched_domains() call to
the workqueue. So a new offline CPU may still be in a sched domain
or new online CPU not showing up in the sched domains for a short
transition period. That could be a problem in some corner cases and
can be the cause of a reported test failure[1]. Fix it by calling
rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
isolated partition invalidation or recreation is being done, the
housekeeping_update() call to update the housekeeping cpumasks will
still be deferred to a workqueue.

In commit 3bfe479671 ("cgroup/cpuset: Move
housekeeping_update()/rebuild_sched_domains() together"),
housekeeping_update() is called before rebuild_sched_domains() because
it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
cpumask is now changeable at run time.  As a result, we can move the
rebuild_sched_domains() call before housekeeping_update() with
the slight advantage that it will be done in the same cpus_read_lock
critical section without the possibility of interference by a concurrent
cpu hot add/remove operation.

As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
calling housekeeping_update() and immediately release them again, move
the cpuset_full_unlock() operation inside update_hk_sched_domains()
and rename it to cpuset_update_sd_hk_unlock() to signify that it will
release the full set of locks.

[1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com

Fixes: 6df415aa46 ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06 06:58:25 -10:00
David Carlier 1dde502587 sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()
Commit 0927780c90 ("sched_ext: Use READ_ONCE() for lock-free reads
of module param variables") annotated the plain reads of
scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but
missed a third site in scx_bypass():

  WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC);

scx_slice_bypass_us is a module parameter writable via sysfs in
process context through set_slice_us() -> param_set_uint_minmax(),
which performs a plain store without holding bypass_lock. scx_bypass()
reads the variable under bypass_lock, but since the writer does not
take that lock, the two accesses are concurrent.

WRITE_ONCE() only applies volatile semantics to the store of
scx_slice_dfl -- the val expression containing scx_slice_bypass_us is
evaluated as a plain read, providing no protection against concurrent
writes.

Wrap the read with READ_ONCE() to complete the annotation started by
commit 0927780c90 and make the access KCSAN-clean, consistent with
the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu().

Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06 06:57:23 -10:00
Breno Leitao 98c790b100 workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45 ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.

Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.

Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.

No functional change.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06 06:38:16 -10:00
Linus Torvalds a028739a43 block-7.0-20260305
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmqPRMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplf5D/9uOsBr+OGXtkLUJtD6MiwoJUsYgYF2dMIx
 epcp+8RdMaOGtigtx69QXzTP5aPjA+AvBLAMYM+QDQDAPMWbRPsD7LaCYHy7ekwA
 OL68R3QRTMYPPgpuf7pKyhif7olozAvoWAnRaoWlo67rbK+mTzZsTIsgTwF4zUu6
 T0dL9thbWqtJMxKSuUk+DywggvGyNZWICJ3rAZ6os2htruH0fPhsJNGVFgNXMnpe
 Cy2OvWxBWRQkZnpDEocZUdYyCRVhHr7hu311j6nSLNXufqpgFmWLGO4C3vetOlgx
 ulEHfGNINcSLcw9R8pNWRxU14V6iw8Oy4nU9RtZhUpF32Iasvxb4H0w76Dp9Ukq1
 /DuoSkWg/Ahn24xSYxJwwZpOEE8L92pn0M2ukCfC6h7ytmDjjEL1AQ2kyFHV4mR3
 nc/3FkQ0abe3HHk8Rit6+txe3sSQo5no1z8kFlb9yp2MwAmonxCCQ9N1s7pxeeP+
 iLaPbGMaZ7Ra1GswD/vzxFQtkglsxLuM5D0JkjHe99a54ZnF0vF3y9jeDVOQbV1C
 H6/bU/2DI3SQ8xqv6tIXQ22reyRen3ao5VKLSrmrT/tDQVoEBV5SMnJFO1J8jBP4
 QST03wiu8ShHSyZ98KefwlsndrTX02V9UVD4FVj+TZXwCWltulnIR4dVYFdySWwW
 d613iUsWJw==
 =NNcQ
 -----END PGP SIGNATURE-----

Merge tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Improve quirk visibility and configurability (Maurizio)
      - Fix runtime user modification to queue setup (Keith)
      - Fix multipath leak on try_module_get failure (Keith)
      - Ignore ambiguous spec definitions for better atomics support
        (John)
      - Fix admin queue leak on controller reset (Ming)
      - Fix large allocation in persistent reservation read keys
        (Sungwoo Kim)
      - Fix fcloop callback handling (Justin)
      - Securely free DHCHAP secrets (Daniel)
      - Various cleanups and typo fixes (John, Wilfred)

 - Avoid a circular lock dependency issue in the sysfs nr_requests or
   scheduler store handling

 - Fix a circular lock dependency with the pcpu mutex and the queue
   freeze lock

 - Cleanup for bio_copy_kern(), using __bio_add_page() rather than the
   bio_add_page(), as adding a page here cannot fail. The exiting code
   had broken cleanup for the error condition, so make it clear that the
   error condition cannot happen

 - Fix for a __this_cpu_read() in preemptible context splat

* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: use trylock to avoid lockdep circular dependency in sysfs
  nvme: fix memory allocation in nvme_pr_read_keys()
  block: use __bio_add_page in bio_copy_kern
  block: break pcpu_alloc_mutex dependency on freeze_lock
  blktrace: fix __this_cpu_read/write in preemptible context
  nvme-multipath: fix leak on try_module_get failure
  nvmet-fcloop: Check remoteport port_state before calling done callback
  nvme-pci: do not try to add queue maps at runtime
  nvme-pci: cap queue creation to used queues
  nvme-pci: ensure we're polling a polled queue
  nvme: fix memory leak in quirks_param_set()
  nvme: correct comment about nvme_ns_remove()
  nvme: stop setting namespace gendisk device driver data
  nvme: add support for dynamic quirk configuration via module parameter
  nvme: fix admin queue leak on controller reset
  nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
  nvme: stop using AWUPF
  nvme: expose active quirks in sysfs
  nvme/host: fixup some typos
2026-03-06 08:36:18 -08:00
Christian Loehle 7fe44c4388 bpf: drop kthread_exit from noreturn_deny
kthread_exit became a macro to do_exit in commit 28aaa9c399
("kthread: consolidate kthread exit paths to prevent use-after-free"),
so there is no kthread_exit function BTF ID to resolve. Remove it from
noreturn_deny to avoid resolve_btfids unresolved symbol warnings.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-03-06 08:25:54 -08:00