linux

Commit Graph

Author	SHA1	Message	Date
Fernando Fernandez Mancera	e5e8906305	net: bridge: fix nd_tbl NULL dereference when IPv6 is disabled When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never initialized because inet6_init() exits before ndisc_init() is called which initializes it. Then, if neigh_suppress is enabled and an ICMPv6 Neighbor Discovery packet reaches the bridge, br_do_suppress_nd() will dereference ipv6_stub->nd_tbl which is NULL, passing it to neigh_lookup(). This causes a kernel NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000268 Oops: 0000 [#1] PREEMPT SMP NOPTI [...] RIP: 0010:neigh_lookup+0x16/0xe0 [...] Call Trace: <IRQ> ? neigh_lookup+0x16/0xe0 br_do_suppress_nd+0x160/0x290 [bridge] br_handle_frame_finish+0x500/0x620 [bridge] br_handle_frame+0x353/0x440 [bridge] __netif_receive_skb_core.constprop.0+0x298/0x1110 __netif_receive_skb_one_core+0x3d/0xa0 process_backlog+0xa0/0x140 __napi_poll+0x2c/0x170 net_rx_action+0x2c4/0x3a0 handle_softirqs+0xd0/0x270 do_softirq+0x3f/0x60 Fix this by replacing IS_ENABLED(IPV6) call with ipv6_mod_enabled() in the callers. This is in essence disabling NS/NA suppression when IPv6 is disabled. Fixes: `ed842faeb2` ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports") Reported-by: Guruprasad C P <gurucp2005@gmail.com> Closes: https://lore.kernel.org/netdev/CAHXs0ORzd62QOG-Fttqa2Cx_A_VFp=utE2H2VTX5nqfgs7LDxQ@mail.gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260304120357.9778-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 07:52:56 -08:00
Florian Westphal	9df95785d3	netfilter: nft_set_pipapo: split gc into unlink and reclaim phase Yiming Qian reports a use-after-free in the pipapo set type: Under a large number of expired elements, commit-time GC can run for a very long time in a non-preemptible context, triggering soft lockup warnings and RCU stall reports (local denial of service). We must split GC into an unlink and a reclaim phase. We cannot queue elements for freeing until pointers have been swapped. Expired elements are still exposed to both the packet path and userspace dumpers via the live copy of the data structure. call_rcu() does not protect us: dump operations or element lookups starting after call_rcu has fired can still observe the freed element, unless the commit phase has made enough progress to swap the clone and live pointers before any new reader has picked up the old version. This is a similar approach to the one taken recently for the rbtree backend in commit `35f83a7552` ("netfilter: nft_set_rbtree: don't gc elements on insert"). Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-05 13:22:37 +01:00
Pablo Neira Ayuso	fb7fb40163	netfilter: nf_tables: clone set on flush only Syzbot with fault injection triggered a failing memory allocation with GFP_KERNEL which results in a WARN splat: iter.err WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992 Modules linked in: CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845 Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43 +80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9 RSP: 0018:ffffc900045af780 EFLAGS: 00010293 RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40 RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000 RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0 R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920 FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0 Call Trace: <TASK> __nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115 nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380 netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761 __sock_release net/socket.c:662 [inline] sock_close+0xc3/0x240 net/socket.c:1455 Restrict set clone to the flush set command in the preparation phase. Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree and pipapo backends to only clone the set when this iteration type is used. 
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to use the existing set clone if available, otherwise use the existing set representation. After this update, there is no need to clone a set that is being deleted, this includes bound anonymous set. An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone interface and call it from the flush set path. Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com Fixes: `3f1d886cc7` ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-05 13:22:37 +01:00
Pablo Neira Ayuso	def602e498	netfilter: nf_tables: unconditionally bump set->nelems before insertion In case that the set is full, a new element gets published then removed without waiting for the RCU grace period, while RCU reader can be walking over it already. To address this issue, add the element transaction even if set is full, but toggle the set_full flag to report -ENFILE so the abort path safely unwinds the set to its previous state. As for element updates, decrement set->nelems to restore it. A simpler fix is to call synchronize_rcu() in the error path. However, with a large batch adding elements to already maxed-out set, this could cause noticeable slowdown of such batches. Fixes: `35d0ac9070` ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL") Reported-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-05 13:22:37 +01:00
Sebastian Andrzej Siewior	b824c3e16c	net: Provide a PREEMPT_RT specific check for netdev_queue::_xmit_lock After acquiring netdev_queue::_xmit_lock the number of the CPU owning the lock is recorded in netdev_queue::xmit_lock_owner. This works as long as the BH context is not preemptible. On PREEMPT_RT the softirq context is preemptible and without the softirq-lock it is possible to have multiple users in __dev_queue_xmit() submitting a skb on the same CPU. This is fine in general but this means also that the current CPU is recorded as netdev_queue::xmit_lock_owner. This in turn leads to the recursion alert and the skb is dropped. Instead of checking for the CPU number that owns the lock, PREEMPT_RT can check if the lock owner matches the current task. Add netif_tx_owned() which returns true if the current context owns the lock by comparing the provided CPU number with the recorded number. This resembles the current check by negating the condition (the current check returns true if the lock is not owned). On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare the current task against it. Use the new helper in __dev_queue_xmit() and netif_local_xmit_active() which provides a similar check. Update comments regarding pairing READ_ONCE(). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de Fixes: `3253cb49cb` ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-05 12:14:21 +01:00
Matthieu Baerts (NGI0)	579a752464	mptcp: pm: in-kernel: always mark signal+subflow endp as used Syzkaller managed to find a combination of actions that was generating this warning: msk->pm.local_addr_used == 0 WARNING: net/mptcp/pm_kernel.c:1071 at __mark_subflow_endp_available net/mptcp/pm_kernel.c:1071 [inline], CPU#1: syz.2.17/961 WARNING: net/mptcp/pm_kernel.c:1071 at mptcp_nl_remove_subflow_and_signal_addr net/mptcp/pm_kernel.c:1103 [inline], CPU#1: syz.2.17/961 WARNING: net/mptcp/pm_kernel.c:1071 at mptcp_pm_nl_del_addr_doit+0x81d/0x8f0 net/mptcp/pm_kernel.c:1210, CPU#1: syz.2.17/961 Modules linked in: CPU: 1 UID: 0 PID: 961 Comm: syz.2.17 Not tainted 6.19.0-08368-gfafda3b4b06b #22 PREEMPT(full) Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.17.0-debian-1.17.0-1build1 04/01/2014 RIP: 0010:__mark_subflow_endp_available net/mptcp/pm_kernel.c:1071 [inline] RIP: 0010:mptcp_nl_remove_subflow_and_signal_addr net/mptcp/pm_kernel.c:1103 [inline] RIP: 0010:mptcp_pm_nl_del_addr_doit+0x81d/0x8f0 net/mptcp/pm_kernel.c:1210 Code: 89 c5 e8 46 30 6f fe e9 21 fd ff ff 49 83 ed 80 e8 38 30 6f fe 4c 89 ef be 03 00 00 00 e8 db 49 df fe eb ac e8 24 30 6f fe 90 <0f> 0b 90 e9 1d ff ff ff e8 16 30 6f fe eb 05 e8 0f 30 6f fe e8 9a RSP: 0018:ffffc90001663880 EFLAGS: 00010293 RAX: ffffffff82de1a6c RBX: 0000000000000000 RCX: ffff88800722b500 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff8880158b22d0 R08: 0000000000010425 R09: ffffffffffffffff R10: ffffffff82de18ba R11: 0000000000000000 R12: ffff88800641a640 R13: ffff8880158b1880 R14: ffff88801ec3c900 R15: ffff88800641a650 FS: 00005555722c3500(0000) GS:ffff8880f909d000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f66346e0f60 CR3: 000000001607c000 CR4: 0000000000350ef0 Call Trace: <TASK> genl_family_rcv_msg_doit+0x117/0x180 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] 
genl_rcv_msg+0x3a8/0x3f0 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x16d/0x240 net/netlink/af_netlink.c:2550 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x3e9/0x4c0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x4aa/0x5b0 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0xc9/0xf0 net/socket.c:742 ____sys_sendmsg+0x272/0x3b0 net/socket.c:2592 ___sys_sendmsg+0x2de/0x320 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x110/0x1a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x143/0x440 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f66346f826d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc83d8bdc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f6634985fa0 RCX: 00007f66346f826d RDX: 00000000040000b0 RSI: 0000200000000740 RDI: 0000000000000007 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6634985fa8 R13: 00007f6634985fac R14: 0000000000000000 R15: 0000000000001770 </TASK> The actions that caused that seem to be: - Set the MPTCP subflows limit to 0 - Create an MPTCP endpoint with both the 'signal' and 'subflow' flags - Create a new MPTCP connection from a different address: an ADD_ADDR linked to the MPTCP endpoint will be sent ('signal' flag), but no subflows is initiated ('subflow' flag) - Remove the MPTCP endpoint In this case, msk->pm.local_addr_used has been kept to 0 -- because no subflows have been created -- but the corresponding bit in msk->pm.id_avail_bitmap has been 
cleared when the ADD_ADDR has been sent. This later causes a splat when removing the MPTCP endpoint because msk->pm.local_addr_used has been kept to 0. Now, if an endpoint has both the signal and subflow flags, but it is not possible to create subflows because of the limits or the c-flag case, then the local endpoint counter is still incremented: the endpoint is used at the end. This avoids issues later when removing the endpoint and calling __mark_subflow_endp_available(), which expects msk->pm.local_addr_used to have been previously incremented if the endpoint was marked as used according to msk->pm.id_avail_bitmap. Note that signal_and_subflow variable is reset to false when the limits and the c-flag case allows subflows creation. Also, local_addr_used is only incremented for non ID0 subflows. Fixes: `85df533a78` ("mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/613 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-4-4b5462b6f016@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:21:13 -08:00
Matthieu Baerts (NGI0)	fb8d0bccb2	mptcp: pm: avoid sending RM_ADDR over same subflow RM_ADDR are sent over an active subflow, the first one in the subflows list. There is then a high chance the initial subflow is picked. With the in-kernel PM, when an endpoint is removed, a RM_ADDR is sent, then linked subflows are closed. This is done for each active MPTCP connection. MPTCP endpoints are likely removed because the attached network is no longer available or usable. In this case, it is better to avoid sending this RM_ADDR over the subflow that is going to be removed, but prefer sending it over another active and non stale subflow, if any. This modification avoids situations where the other end is not notified when a subflow is no longer usable: typically when the endpoint linked to the initial subflow is removed, especially on the server side. Fixes: `8dd5efb1f9` ("mptcp: send ack for rm_addr") Cc: stable@vger.kernel.org Reported-by: Frank Lorenz <lorenz-frank@web.de> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/612 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-2-4b5462b6f016@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:21:12 -08:00
Jakub Kicinski	d793458c45	nfc: rawsock: cancel tx_work before socket teardown In rawsock_release(), cancel any pending tx_work and purge the write queue before orphaning the socket. rawsock_tx_work runs on the system workqueue and calls nfc_data_exchange which dereferences the NCI device. Without synchronization, tx_work can race with socket and device teardown when a process is killed (e.g. by SIGKILL), leading to use-after-free or leaked references. Set SEND_SHUTDOWN first so that if tx_work is already running it will see the flag and skip transmitting, then use cancel_work_sync to wait for any in-progress execution to finish, and finally purge any remaining queued skbs. Fixes: `23b7869c0f` ("NFC: add the NFC socket raw protocol") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:18:57 -08:00
Jakub Kicinski	0efdc02f4f	nfc: nci: clear NCI_DATA_EXCHANGE before calling completion callback Move clear_bit(NCI_DATA_EXCHANGE) before invoking the data exchange callback in nci_data_exchange_complete(). The callback (e.g. rawsock_data_exchange_complete) may immediately schedule another data exchange via schedule_work(tx_work). On a multi-CPU system, tx_work can run and reach nci_transceive() before the current nci_data_exchange_complete() clears the flag, causing test_and_set_bit(NCI_DATA_EXCHANGE) to return -EBUSY and the new transfer to fail. This causes intermittent flakes in nci/nci_dev in NIPA: # # RUN NCI.NCI1_0.t4t_tag_read ... # # t4t_tag_read: Test terminated by timeout # # FAIL NCI.NCI1_0.t4t_tag_read # not ok 3 NCI.NCI1_0.t4t_tag_read Fixes: `38f04c6b1b` ("NFC: protect nci_data_exchange transactions") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:18:57 -08:00
Jakub Kicinski	6608358194	nfc: nci: complete pending data exchange on device close In nci_close_device(), complete any pending data exchange before closing. The data exchange callback (e.g. rawsock_data_exchange_complete) holds a socket reference. NIPA occasionally hits this leak: unreferenced object 0xff1100000f435000 (size 2048): comm "nci_dev", pid 3954, jiffies 4295441245 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 27 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 '..@............ backtrace (crc ec2b3c5): __kmalloc_noprof+0x4db/0x730 sk_prot_alloc.isra.0+0xe4/0x1d0 sk_alloc+0x36/0x760 rawsock_create+0xd1/0x540 nfc_sock_create+0x11f/0x280 __sock_create+0x22d/0x630 __sys_socket+0x115/0x1d0 __x64_sys_socket+0x72/0xd0 do_syscall_64+0x117/0xfc0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: `38f04c6b1b` ("NFC: protect nci_data_exchange transactions") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:18:57 -08:00
Jakub Kicinski	d42449d2c1	nfc: digital: free skb on digital_in_send error paths digital_in_send() takes ownership of the skb passed by the caller (nfc_data_exchange), make sure it's freed on all error paths. Found looking around the real driver for similar bugs to the one just fixed in nci. Fixes: `2c66daecc4` ("NFC Digital: Add NFC-A technology support") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:16:10 -08:00
Jakub Kicinski	7bd4b0c477	nfc: nci: free skb on nci_transceive early error paths nci_transceive() takes ownership of the skb passed by the caller, but the -EPROTO, -EINVAL, and -EBUSY error paths return without freeing it. Due to issues clearing NCI_DATA_EXCHANGE fixed by subsequent changes the nci/nci_dev selftest hits the error path occasionally in NIPA, and kmemleak detects leaks: unreferenced object 0xff11000015ce6a40 (size 640): comm "nci_dev", pid 3954, jiffies 4295441246 hex dump (first 32 bytes): 6b 6b 6b 6b 00 a4 00 0c 02 e1 03 6b 6b 6b 6b 6b kkkk.......kkkkk 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk backtrace (crc 7c40cc2a): kmem_cache_alloc_node_noprof+0x492/0x630 __alloc_skb+0x11e/0x5f0 alloc_skb_with_frags+0xc6/0x8f0 sock_alloc_send_pskb+0x326/0x3f0 nfc_alloc_send_skb+0x94/0x1d0 rawsock_sendmsg+0x162/0x4c0 do_syscall_64+0x117/0xfc0 Fixes: `6a2968aaf5` ("NFC: basic NCI protocol implementation") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 18:14:29 -08:00
Bobby Eshleman	40bf00ec2e	net: devmem: use READ_ONCE/WRITE_ONCE on binding->dev binding->dev is protected on the write-side in mp_dmabuf_devmem_uninstall() against concurrent writes, but due to the concurrent bare reads in net_devmem_get_binding() and validate_xmit_unreadable_skb() it should be wrapped in a READ_ONCE/WRITE_ONCE pair to make sure no compiler optimizations play with the underlying register in unforeseen ways. Doesn't present a critical bug because the known compiler optimizations don't result in bad behavior. There is no tearing on u64, and load omissions/invented loads would only break if additional binding->dev references were inlined together (they aren't right now). This just more strictly follows the linux memory model (i.e., "Lock-Protected Writes With Lockless Reads" in tools/memory-model/Documentation/access-marking.txt). Fixes: `bd61848900` ("net: devmem: Implement TX path") Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260302-devmem-membar-fix-v2-1-5b33c9cbc28b@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:59:27 -08:00
Eric Dumazet	a4c2b8be2e	net_sched: sch_fq: clear q->band_pkt_count[] in fq_reset() When/if a NIC resets, queues are deactivated by dev_deactivate_many(), then reactivated when the reset operation completes. fq_reset() removes all the skbs from various queues. If we do not clear q->band_pkt_count[], these counters keep growing and can eventually reach sch->limit, preventing new packets from being queued. Many thanks to Praveen for discovering the root cause. Fixes: `29f834aa32` ("net_sched: sch_fq: add 3 bands and WRR scheduling") Diagnosed-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260304015640.961780-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:54:22 -08:00
Ian Ray	f7d92f11bd	net: nfc: nci: Fix zero-length proprietary notifications NCI NFC controllers may have proprietary OIDs with zero-length payload. One example is: drivers/nfc/nxp-nci/core.c, NXP_NCI_RF_TXLDO_ERROR_NTF. Allow a zero length payload in proprietary notifications only. Before: -- >8 -- kernel: nci: nci_recv_frame: len 3 -- >8 -- After: -- >8 -- kernel: nci: nci_recv_frame: len 3 kernel: nci: nci_ntf_packet: NCI RX: MT=ntf, PBF=0, GID=0x1, OID=0x23, plen=0 kernel: nci: nci_ntf_packet: unknown ntf opcode 0x123 kernel: nfc nfc0: NFC: RF transmitter couldn't start. Bad power and/or configuration? -- >8 -- After fixing the hardware: -- >8 -- kernel: nci: nci_recv_frame: len 27 kernel: nci: nci_ntf_packet: NCI RX: MT=ntf, PBF=0, GID=0x1, OID=0x5, plen=24 kernel: nci: nci_rf_intf_activated_ntf_packet: rf_discovery_id 1 -- >8 -- Fixes: `d24b03535e` ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Signed-off-by: Ian Ray <ian.ray@gehealthcare.com> Link: https://patch.msgid.link/20260302163238.140576-1-ian.ray@gehealthcare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:48:12 -08:00
Eric Dumazet	165573e41f	tcp: secure_seq: add back ports to TS offset This reverts `28ee1b746f` ("secure_seq: downgrade to per-host timestamp offsets") tcp_tw_recycle went away in 2017. Zhouyan Deng reported off-path TCP source port leakage via SYN cookie side-channel that can be fixed in multiple ways. One of them is to bring back TCP ports in TS offset randomization. As a bonus, we perform a single siphash() computation to provide both an ISN and a TS offset. Fixes: `28ee1b746f` ("secure_seq: downgrade to per-host timestamp offsets") Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:44:35 -08:00
Jakub Kicinski	c649e99764	linux-can-fixes-for-7.0-20260302 Merge tag 'linux-can-fixes-for-7.0-20260302' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2026-03-02 The first 2 patches are by Oliver Hartkopp. The first fixes the locking for CAN Broadcast Manager op runtime updates, the second fixes the packet statistics for the CAN dummy driver. Alban Bedel's patch fixes a potential problem in the error path of the mcp251x's ndo_open callback. A patch by Ziyi Guo adds USB endpoint type validation to the esd_usb driver. The next 6 patches are by Greg Kroah-Hartman and fix URB data parsing for the ems_usb and ucan driver, fix URB anchoring in the etas_es58x, and in the f81604 driver fix URB data parsing, add URB error handling and fix URB anchoring. A patch by me targets the gs_usb driver and fixes interoperability with the CANable-2.5 firmware by always configuring the bit rate before starting the device. The last patch is by Frank Li and fixes a CHECK_DTBS warning for the nxp,sja1000 dt-binding.
* tag 'linux-can-fixes-for-7.0-20260302' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: dt-bindings: net: can: nxp,sja1000: add reference to mc-peripheral-props.yaml can: gs_usb: gs_can_open(): always configure bitrates before starting device can: usb: f81604: correctly anchor the urb in the read bulk callback can: usb: f81604: handle bulk write errors properly can: usb: f81604: handle short interrupt urb messages properly can: usb: etas_es58x: correctly anchor the urb in the read bulk callback can: ucan: Fix infinite loop from zero-length messages can: ems_usb: ems_usb_read_bulk_callback(): check the proper length of a message can: esd_usb: add endpoint type validation can: mcp251x: fix deadlock in error path of mcp251x_open can: dummy_can: dummy_can_init(): fix packet statistics can: bcm: fix locking for bcm_op runtime updates ==================== Link: https://patch.msgid.link/20260302152755.1700177-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:47:46 -08:00
Jakub Kicinski	2697c45a48	Some more fixes: - mt76 gets three almost identical new length checks - cw1200 & ti: locking fixes - mac80211 has a fix for the recent EML frame handling - rsi driver no longer oddly responds to config, which had triggered a warning in mac80211 - ath12k has two fixes for station statistics handling Merge tag 'wireless-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Some more fixes: - mt76 gets three almost identical new length checks - cw1200 & ti: locking fixes - mac80211 has a fix for the recent EML frame handling - rsi driver no longer oddly responds to config, which had triggered a warning in mac80211 - ath12k has two fixes for station statistics handling * tag 'wireless-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mt76: Fix possible oob access in mt76_connac2_mac_write_txwi_80211() wifi: mt76: mt7925: Fix possible oob access in mt7925_mac_write_txwi_80211() wifi: mt76: mt7996: Fix possible oob access in mt7996_mac_write_txwi_80211() wifi: wlcore: Fix a locking bug wifi: cw1200: Fix locking in error paths
wifi: mac80211: fix missing ieee80211_eml_params member initialization wifi: rsi: Don't default to -EOPNOTSUPP in rsi_mac80211_config wifi: ath12k: fix station lookup failure when disconnecting from AP wifi: ath12k: use correct pdev id when requesting firmware stats ==================== Link: https://patch.msgid.link/20260304112500.169639-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 15:29:56 -08:00
Eric Biggers	46d0d6f50d	net/tcp-md5: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `cfb6eeb4c8` ("[TCP]: MD5 Signature Option (RFC2385) support.") Fixes: `658ddaaf66` ("tcp: md5: RST: getting md5 key from listener") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260302203409.13388-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 18:39:43 -08:00
Yung Chih Su	4ee7fa6cf7	net: ipv4: fix ARM64 alignment fault in multipath hash seed `struct sysctl_fib_multipath_hash_seed` contains two u32 fields (user_seed and mp_seed), making it an 8-byte structure with a 4-byte alignment requirement. In `fib_multipath_hash_from_keys()`, the code evaluates the entire struct atomically via `READ_ONCE()`: mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; While this silently works on GCC by falling back to unaligned regular loads which the ARM64 kernel tolerates, it causes a fatal kernel panic when compiled with Clang and LTO enabled. Commit `e35123d83e` ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs under Clang LTO. Since the macro evaluates the full 8-byte struct, Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly requires `ldar` to be naturally aligned, thus executing it on a 4-byte aligned address triggers a strict Alignment Fault (FSC = 0x21). Fix the read side by moving the `READ_ONCE()` directly to the `u32` member, which emits a safe 32-bit `ldar Wn`. Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis shows that Clang splits this 8-byte write into two separate 32-bit `str` instructions. While this avoids an alignment fault, it destroys atomicity and exposes a tear-write vulnerability. Fix this by explicitly splitting the write into two 32-bit `WRITE_ONCE()` operations. Finally, add the missing `READ_ONCE()` when reading `user_seed` in `proc_fib_multipath_hash_seed()` to ensure proper pairing and concurrency safety. 
Fixes: `4ee2a8cace` ("net: ipv4: Add a sysctl to set multipath hash seed") Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:20:37 -08:00
Eric Biggers	67edfec516	net/tcp-ao: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `0a3a809089` ("net/tcp: Verify inbound TCP-AO signed segments") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20260302203600.13561-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:16:54 -08:00
Jakub Kicinski	2ffb4f5c2c	ipv6: fix NULL pointer deref in ip6_rt_get_dev_rcu() l3mdev_master_dev_rcu() can return NULL when the slave device is being un-slaved from a VRF. All other callers deal with this, but we lost the fallback to loopback in ip6_rt_pcpu_alloc() -> ip6_rt_get_dev_rcu() with commit `4832c30d54` ("net: ipv6: put host and anycast routes on device with address"). KASAN: null-ptr-deref in range [0x0000000000000108-0x000000000000010f] RIP: 0010:ip6_rt_pcpu_alloc (net/ipv6/route.c:1418) Call Trace: ip6_pol_route (net/ipv6/route.c:2318) fib6_rule_lookup (net/ipv6/fib6_rules.c:115) ip6_route_output_flags (net/ipv6/route.c:2607) vrf_process_v6_outbound (drivers/net/vrf.c:437) I was tempted to rework the un-slaving code to clear the flag first and insert synchronize_rcu() before we remove the upper. But looks like the explicit fallback to loopback_dev is an established pattern. And I guess avoiding the synchronize_rcu() is nice, too. Fixes: `4832c30d54` ("net: ipv6: put host and anycast routes on device with address") Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260301194548.927324-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:14:48 -08:00
YiFei Zhu	1a86a1f7d8	net: Fix rcu_tasks stall in threaded busypoll I was debugging a NIC driver when I noticed that when I enable threaded busypoll, bpftrace hangs when starting up. dmesg showed: rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 10658 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 40793 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 131273 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 402058 jiffies old. INFO: rcu_tasks detected stalls on tasks: 00000000769f52cd: .N nvcsw: 2/2 holdout: 1 idle_cpu: -1/64 task:napi/eth2-8265 state:R running task stack:0 pid:48300 tgid:48300 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> ? napi_threaded_poll_loop+0x27c/0x2c0 ? __pfx_napi_threaded_poll+0x10/0x10 ? napi_threaded_poll+0x26/0x80 ? kthread+0xfa/0x240 ? __pfx_kthread+0x10/0x10 ? ret_from_fork+0x31/0x50 ? __pfx_kthread+0x10/0x10 ? ret_from_fork_asm+0x1a/0x30 </TASK> The cause is that in threaded busypoll, the main loop is in napi_threaded_poll rather than napi_threaded_poll_loop, where the latter rarely iterates more than once within its loop. For rcu_softirq_qs_periodic inside napi_threaded_poll_loop to report its qs state, the last_qs must be 100ms behind, and this can't happen because napi_threaded_poll_loop rarely iterates in threaded busypoll, and each time napi_threaded_poll_loop is called last_qs is reset to latest jiffies. This patch changes so that in threaded busypoll, last_qs is saved in the outer napi_threaded_poll, and whether busy_poll_last_qs is NULL indicates whether napi_threaded_poll_loop is called for busypoll. This way last_qs would not reset to latest jiffies on each invocation of napi_threaded_poll_loop. 
Fixes: `c18d4b190a` ("net: Extend NAPI threaded polling to allow kthread based busy polling") Cc: stable@vger.kernel.org Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20260227221937.1060857-1-zhuyifei@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 13:44:28 +01:00
Allison Henderson	6a877ececd	net/rds: Fix circular locking dependency in rds_tcp_tune syzbot reported a circular locking dependency in rds_tcp_tune() where sk_net_refcnt_upgrade() is called while holding the socket lock: ====================================================== WARNING: possible circular locking dependency detected ====================================================== kworker/u10:8/15040 is trying to acquire lock: ffffffff8e9aaf80 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x4b/0x6f0 but task is already holding lock: ffff88805a3c1ce0 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: rds_tcp_tune+0xd7/0x930 The issue occurs because sk_net_refcnt_upgrade() performs memory allocation (via get_net_track() -> ref_tracker_alloc()) while the socket lock is held, creating a circular dependency with fs_reclaim. Fix this by moving sk_net_refcnt_upgrade() outside the socket lock critical section. This is safe because the fields modified by the sk_net_refcnt_upgrade() call (sk_net_refcnt, ns_tracker) are not accessed by any concurrent code path at this point. v2: - Corrected fixes tag - check patch line wrap nits - ai commentary nits Reported-by: syzbot+2e2cf5331207053b8106@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e2cf5331207053b8106 Fixes: `3a58f13a88` ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260227202336.167757-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 12:57:06 +01:00
MeiChia Chiu	8fb54c7307	wifi: mac80211: fix missing ieee80211_eml_params member initialization The missing initialization causes driver to misinterpret the EML control bitmap, resulting in incorrect link bitmap handling. Fixes: `0d95280a2d` ("wifi: mac80211: Add eMLSR/eMLMR action frame parsing support") Signed-off-by: MeiChia Chiu <MeiChia.Chiu@mediatek.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260303054725.471548-1-MeiChia.Chiu@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-03 08:37:29 +01:00
Oliver Hartkopp	c35636e91e	can: bcm: fix locking for bcm_op runtime updates Commit `c2aba69d0c` ("can: bcm: add locking for bcm_op runtime updates") added a locking for some variables that can be modified at runtime when updating the sending bcm_op with a new TX_SETUP command in bcm_tx_setup(). Usually the RX_SETUP only handles and filters incoming traffic with one exception: When the RX_RTR_FRAME flag is set a predefined CAN frame is sent when a specific RTR frame is received. Therefore the rx bcm_op uses bcm_can_tx() which uses the bcm_tx_lock that was only initialized in bcm_tx_setup(). Add the missing spin_lock_init() when allocating the bcm_op in bcm_rx_setup() to handle the RTR case properly. Fixes: `c2aba69d0c` ("can: bcm: add locking for bcm_op runtime updates") Reported-by: syzbot+5b11eccc403dd1cea9f8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-can/699466e4.a70a0220.2c38d7.00ff.GAE@google.com/ Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260218-bcm_spin_lock_init-v1-1-592634c8a5b5@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2026-03-02 10:24:40 +01:00
Roshan Kumar	0d10393d5e	xfrm: iptfs: validate inner IPv4 header length in IPTFS payload Add validation of the inner IPv4 packet tot_len and ihl fields parsed from decrypted IPTFS payloads in __input_process_payload(). A crafted ESP packet containing an inner IPv4 header with tot_len=0 causes an infinite loop: iplen=0 leads to capturelen=min(0, remaining)=0, so the data offset never advances and the while(data < tail) loop never terminates, spinning forever in softirq context. Reject inner IPv4 packets where tot_len < ihl4 or ihl4 < sizeof(struct iphdr), which catches both the tot_len=0 case and malformed ihl values. The normal IP stack performs this validation in ip_rcv_core(), but IPTFS extracts and processes inner packets before they reach that layer. Reported-by: Roshan Kumar <roshaen09@gmail.com> Fixes: `6c82d24336` ("xfrm: iptfs: add basic receive packet (tunnel egress) handling") Cc: stable@vger.kernel.org Signed-off-by: Roshan Kumar <roshaen09@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-03-02 08:53:00 +01:00
Jiayuan Chen	101bacb303	atm: lec: fix null-ptr-deref in lec_arp_clear_vccs syzkaller reported a null-ptr-deref in lec_arp_clear_vccs(). This issue can be easily reproduced using the syzkaller reproducer. In the ATM LANE (LAN Emulation) module, the same atm_vcc can be shared by multiple lec_arp_table entries (e.g., via entry->vcc or entry->recv_vcc). When the underlying VCC is closed, lec_vcc_close() iterates over all ARP entries and calls lec_arp_clear_vccs() for each matched entry. For example, when lec_vcc_close() iterates through the hlists in priv->lec_arp_empty_ones or other ARP tables: 1. In the first iteration, for the first matched ARP entry sharing the VCC, lec_arp_clear_vccs() frees the associated vpriv (which is vcc->user_back) and sets vcc->user_back to NULL. 2. In the second iteration, for the next matched ARP entry sharing the same VCC, lec_arp_clear_vccs() is called again. It obtains a NULL vpriv from vcc->user_back (via LEC_VCC_PRIV(vcc)) and then attempts to dereference it via `vcc->pop = vpriv->old_pop`, leading to a null-ptr-deref crash. Fix this by adding a null check for vpriv before dereferencing it. If vpriv is already NULL, it means the VCC has been cleared by a previous call, so we can safely skip the cleanup and just clear the entry's vcc/recv_vcc pointers. The entire cleanup block (including vcc_release_async()) is placed inside the vpriv guard because a NULL vpriv indicates the VCC has already been fully released by a prior iteration — repeating the teardown would redundantly set flags and trigger callbacks on an already-closing socket. The Fixes tag points to the initial commit because the entry->vcc path has been vulnerable since the original code. The entry->recv_vcc path was later added by commit `8d9f73c0ad` ("atm: fix a memory leak of vcc->user_back") with the same pattern, and both paths are fixed here. 
Reported-by: syzbot+72e3ea390c305de0e259@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68c95a83.050a0220.3c6139.0e5c.GAE@google.com/T/ Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260225123250.189289-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 09:33:26 -08:00
Nikhil P. Rao	f7387d6579	xsk: Fix zero-copy AF_XDP fragment drop AF_XDP should ensure that only a complete packet is sent to application. In the zero-copy case, if the Rx queue gets full as fragments are being enqueued, the remaining fragments are dropped. For the multi-buffer case, add a check to ensure that the Rx queue has enough space for all fragments of a packet before starting to enqueue them. Fixes: `24ea50127e` ("xsk: support mbuf on ZC RX") Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-3-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
Nikhil P. Rao	60abb0ac11	xsk: Fix fragment node deletion to prevent buffer leak After commit `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"), the list_node field is reused for both the xskb pool list and the buffer free list, which causes a buffer leak as described below. xp_free() checks if a buffer is already on the free list using list_empty(&xskb->list_node). When list_del() is used to remove a node from the xskb pool list, it doesn't reinitialize the node pointers. This means list_empty() will return false even after the node has been removed, causing xp_free() to incorrectly skip adding the buffer to the free list. Fix this by using list_del_init() instead of list_del() in all fragment handling paths. This ensures the list node is reinitialized after removal, allowing the list_empty() check to work correctly. Fixes: `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
Jakub Kicinski	026dfef287	tcp: give up on stronger sk_rcvbuf checks (for now) We hit another corner case which leads to TcpExtTCPRcvQDrop. Connections which send RPCs in the 20-80kB range over loopback experience spurious drops. The exact conditions for most of the drops I investigated are that: - socket exchanged >1MB of data so it's not completely fresh - rcvbuf is around 128kB (default, hasn't grown) - there is ~60kB of data in rcvq - skb > 64kB arrives The sum of skb->len (!) of both of the skbs (the one already in rcvq and the arriving one) is larger than rwnd. My suspicion is that this happens because __tcp_select_window() rounds the rwnd up to (1 << wscale) if less than half of the rwnd has been consumed. Eric suggests that given the number of Fixes we already have pointing to `1d2fbaad7c` it's probably time to give up on it, until a bigger revamp of rmem management. Also while we could risk tweaking the rwnd math, there are other drops on workloads I investigated, after the commit in question, not explained by this phenomenon. Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/20260225122355.585fd57b@kernel.org Fixes: `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227003359.2391017-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:55:39 -08:00
Kuniyuki Iwashima	6996a2d2d0	udp: Unhash auto-bound connected sk from 4-tuple hash table when disconnected. Let's say we bind() a UDP socket to the wildcard address with a non-zero port, connect() it to an address, and disconnect it from the address. bind() sets SOCK_BINDPORT_LOCK on sk->sk_userlocks (but not SOCK_BINDADDR_LOCK), and connect() calls udp_lib_hash4() to put the socket into the 4-tuple hash table. Then, __udp_disconnect() calls sk->sk_prot->rehash(sk). It computes a new hash based on the wildcard address and moves the socket to a new slot in the 4-tuple hash table, leaving garbage in the chain that no packet hits. Let's remove such a socket from the 4-tuple hash table when disconnected. Note that udp_sk(sk)->udp_portaddr_hash needs to be updated after udp_hash4_dec(hslot2) in udp_unhash4(). Fixes: `78c91ae2c6` ("ipv4/udp: Add 4-tuple hash for connected socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227035547.3321327-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:46:24 -08:00
Victor Nogueira	11cb63b0d1	net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocks As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to tuch again such packet." act_ct was never meant to be used in the egress path, however some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: `3f14b377d0` ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Jonas Köppeler	15c2715a52	net/sched: sch_cake: fixup cake_mq rate adjustment for diffserv config cake_mq's rate adjustment during the sync periods did not adjust the rates for every tin in a diffserv config. This led to inconsistencies of rates between the tins. Fix this by setting the rates for all tins during synchronization. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-2-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Jonas Köppeler	0b3cd139be	net/sched: sch_cake: avoid sync overhead when unlimited Skip inter-instance sync when no rate limit is configured, as it serves no purpose and only adds overhead. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-1-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Eric Dumazet	29252397bc	inet: annotate data-races around isk->inet_num UDP/TCP lookups are using RCU, thus isk->inet_num accesses should use READ_ONCE() and WRITE_ONCE() where needed. Fixes: `3ab5aee7fe` ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:16:59 -08:00
Paul Moses	62413a9c3c	net/sched: act_gate: snapshot parameters with RCU on replace The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: `a51c328df3` ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 16:10:36 -08:00
Eric Badger	7b6275c80a	xprtrdma: Decrement re_receiving on the early exit paths In the event that rpcrdma_post_recvs() fails to create a work request (due to memory allocation failure, say) or otherwise exits early, we should decrement ep->re_receiving before returning. Otherwise we will hang in rpcrdma_xprt_drain() as re_receiving will never reach zero and the completion will never be triggered. On a system with high memory pressure, this can appear as the following hung task: INFO: task kworker/u385:17:8393 blocked for more than 122 seconds. Tainted: G S E 6.19.0 #3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u385:17 state:D stack:0 pid:8393 tgid:8393 ppid:2 task_flags:0x4248060 flags:0x00080000 Workqueue: xprtiod xprt_autoclose [sunrpc] Call Trace: <TASK> __schedule+0x48b/0x18b0 ? ib_post_send_mad+0x247/0xae0 [ib_core] schedule+0x27/0xf0 schedule_timeout+0x104/0x110 __wait_for_common+0x98/0x180 ? __pfx_schedule_timeout+0x10/0x10 wait_for_completion+0x24/0x40 rpcrdma_xprt_disconnect+0x444/0x460 [rpcrdma] xprt_rdma_close+0x12/0x40 [rpcrdma] xprt_autoclose+0x5f/0x120 [sunrpc] process_one_work+0x191/0x3e0 worker_thread+0x2e3/0x420 ? __pfx_worker_thread+0x10/0x10 kthread+0x10d/0x230 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x273/0x2b0 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Fixes: `15788d1d10` ("xprtrdma: Do not refresh Receive Queue while it is draining") Signed-off-by: Eric Badger <ebadger@purestorage.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-27 15:42:14 -05:00
Danielle Ratson	93c9475c04	bridge: Check relevant per-VLAN options in VLAN range grouping The br_vlan_opts_eq_range() function determines if consecutive VLANs can be grouped together in a range for compact netlink notifications. It currently checks state, tunnel info, and multicast router configuration, but misses two categories of per-VLAN options that affect the output: 1. User-visible priv_flags (neigh_suppress, mcast_enabled) 2. Port multicast context (mcast_max_groups, mcast_n_groups) When VLANs have different settings for these options, they are incorrectly grouped into ranges, causing netlink notifications to report only one VLAN's settings for the entire range. Fix by checking priv_flags equality, but only for flags that affect netlink output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED), and comparing multicast context (mcast_max_groups and mcast_n_groups). Example showing the bugs before the fix: $ bridge vlan set vid 10 dev dummy1 neigh_suppress on $ bridge vlan set vid 11 dev dummy1 neigh_suppress off $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... neigh_suppress on $ bridge vlan set vid 10 dev dummy1 mcast_max_groups 100 $ bridge vlan set vid 11 dev dummy1 mcast_max_groups 200 $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... mcast_max_groups 100 After the fix, VLANs 10 and 11 are shown as separate entries with their correct individual settings. Fixes: `a1aee20d5d` ("net: bridge: Add netlink knobs for number / maximum MDB entries") Fixes: `83f6d60079` ("bridge: vlan: Allow setting VLAN neighbor suppression state") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260225143956.3995415-2-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:29 -08:00
Eric Dumazet	2ef2b20cf4	net: annotate data-races around sk->sk_{data_ready,write_space} skmsg (and probably other layers) are changing these pointers while other cpus might read them concurrently. Add corresponding READ_ONCE()/WRITE_ONCE() annotations for UDP, TCP and AF_UNIX. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+87f770387a9e5dc6b79b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/699ee9fc.050a0220.1cd54b.0009.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225131547.1085509-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:23:03 -08:00
Jakub Kicinski	754a3d081a	Here is a batman-adv bugfix: - Avoid double-rtnl_lock ELP metric worker, by Sven Eckelmann Merge tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - Avoid double-rtnl_lock ELP metric worker, by Sven Eckelmann * tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge: batman-adv: Avoid double-rtnl_lock ELP metric worker ==================== Link: https://patch.msgid.link/20260225084614.229077-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:15:09 -08:00
Davide Caratti	e35626f610	net/sched: ets: fix divide by zero in the offload path Offloading ETS requires computing each class' WRR weight: this is done by averaging over the sums of quanta as 'q_sum' and 'q_psum'. Using unsigned int, the same integer size as the individual DRR quanta, can overflow and even cause division by zero, like it happened in the following splat: Oops: divide error: 0000 [#1] SMP PTI CPU: 13 UID: 0 PID: 487 Comm: tc Tainted: G E 6.19.0-virtme #45 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Call Trace: <TASK> ets_qdisc_change+0x870/0xf40 [sch_ets] qdisc_create+0x12b/0x540 tc_modify_qdisc+0x6d7/0xbd0 rtnetlink_rcv_msg+0x168/0x6b0 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x1d6/0x2b0 netlink_sendmsg+0x22e/0x470 ____sys_sendmsg+0x38a/0x3c0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x111/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f440b81c77e Code: 4d 89 d8 e8 d4 bc 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 
002b:00007fff951e4c10 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000481820 RCX: 00007f440b81c77e RDX: 0000000000000000 RSI: 00007fff951e4cd0 RDI: 0000000000000003 RBP: 00007fff951e4c20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff951f4fa8 R13: 00000000699ddede R14: 00007f440bb01000 R15: 0000000000486980 </TASK> Modules linked in: sch_ets(E) netdevsim(E) ---[ end trace 0000000000000000 ]--- RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this using 64-bit integers for 'q_sum' and 'q_psum'. Cc: stable@vger.kernel.org Fixes: `d35eb52bd2` ("net: sch_ets: Make the ETS qdisc offloadable") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/28504887df314588c7255e9911769c36f751edee.1771964872.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 18:28:47 -08:00
Linus Torvalds	b9c8fc2cae	Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: bnxt_en: fix deleting of Ntuple filters - eth: wan: farsync: fix use-after-free bugs caused by unfinished tasklets - eth: xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - eth: gve: fix incorrect buffer cleanup for QPL - eth: team: avoid NETDEV_CHANGEMTU event when unregistering slave - eth: usb: validate USB endpoints Signed-off-by: Paolo Abeni <pabeni@redhat.com> 
Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). 
- wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN 
packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Vahagn Vardanian	baed0d9ba9	netfilter: nf_conntrack_h323: fix OOB read in decode_choice() In decode_choice(), the boundary check before get_len() uses the variable `len`, which is still 0 from its initialization at the top of the function: unsigned int type, ext, len = 0; ... if (ext || (son->attr & OPEN)) { BYTE_ALIGN(bs); if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */ return H323_ERROR_BOUND; len = get_len(bs); /* OOB read */ When the bitstream is exactly consumed (bs->cur == bs->end), the check nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end), which is false. The subsequent get_len() call then dereferences bs->cur++, reading 1 byte past the end of the buffer. If that byte has bit 7 set, get_len() reads a second byte as well. This can be triggered remotely by sending a crafted Q.931 SETUP message with a User-User Information Element containing exactly 2 bytes of PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with the nf_conntrack_h323 helper active. The decoder fully consumes the PER buffer before reaching this code path, resulting in a 1-2 byte heap-buffer-overflow read confirmed by AddressSanitizer. Fix this by checking for 2 bytes (the maximum that get_len() may read) instead of the uninitialized `len`. This matches the pattern used at every other get_len() call site in the same file, where the caller checks for 2 bytes of available data before calling get_len(). Fixes: `ec8a8f3c31` ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well") Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:50:42 +01:00
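The bounds-check problem described in baed0d9ba9 can be reproduced in a few lines of userspace C. This is a minimal sketch under stated assumptions, not the kernel code: `struct bitstr_sketch`, `would_overrun()`, `read_len()` and `checked_read_len()` are hypothetical names, and `read_len()` only models what the message says about get_len() (one byte, plus a second byte when bit 7 of the first is set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the decoder's bitstream state. */
struct bitstr_sketch {
	const unsigned char *cur;	/* next byte to read */
	const unsigned char *end;	/* one past the last valid byte */
};

/* True when reading `bytes` more bytes would run past the end. */
static bool would_overrun(const struct bitstr_sketch *bs, size_t bytes)
{
	assert(bs->cur <= bs->end);
	return bytes > (size_t)(bs->end - bs->cur);
}

/*
 * Length decoder modelled on the description of get_len(): it always
 * reads one byte and reads a second one when bit 7 is set. How the two
 * bytes are combined is an assumption of this sketch.
 */
static unsigned int read_len(struct bitstr_sketch *bs)
{
	unsigned int v = *bs->cur++;

	if (v & 0x80) {
		v &= 0x3f;
		v <<= 8;
		v += *bs->cur++;
	}
	return v;
}

/*
 * Check for the maximum read_len() may consume (2 bytes) before calling
 * it. Checking for 0 bytes, as the buggy code effectively did, can
 * never fail and therefore protects nothing.
 */
static int checked_read_len(struct bitstr_sketch *bs, unsigned int *len)
{
	if (would_overrun(bs, 2))
		return -1;
	*len = read_len(bs);
	return 0;
}
```

With a fully consumed buffer (`cur == end`), `would_overrun(bs, 0)` is false while `would_overrun(bs, 2)` is true, which is the whole difference between the buggy and the fixed check.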
Jakub Kicinski	7aa767d0d3	net: consume xmit errors of GSO frames udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this: ------------------------------------------------- | GSO super frame 1 | GSO super frame 2 | |-----------------------------------------------| | seg | seg | seg | seg | seg | seg | seg | seg | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ------------------------------------------------- x ok ok <ok>| ok ok ok <x> \\ snd_nxt "x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all.
Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). 
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: `1f59533f9c` ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:35:00 +01:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
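The write-once rule from 102eab95f0 can be modelled with a small state machine. The sketch below is a single-threaded illustration only: `struct child_mode_sketch` and `child_mode_write()` are hypothetical names, a real sysctl handler would need an atomic update to be race-free, and accepting a repeated write of the same value is an assumption drawn from the wording "writing a different value ... returns -EBUSY".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum ns_mode_sketch { MODE_GLOBAL, MODE_LOCAL };

/* Hypothetical per-namespace state: the mode plus a "was written" latch. */
struct child_mode_sketch {
	enum ns_mode_sketch mode;
	bool written;
};

/*
 * The first write latches the value. Later writes succeed only if they
 * repeat the latched value; anything else is refused with -EBUSY.
 */
static int child_mode_write(struct child_mode_sketch *s,
			    enum ns_mode_sketch new_mode)
{
	assert(s != NULL);
	if (s->written && s->mode != new_mode)
		return -EBUSY;
	s->mode = new_mode;
	s->written = true;
	return 0;
}
```

Under this model a namespace manager that writes "local" once can create namespaces afterwards without re-reading the mode, because no other writer can flip it to "global".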
Jakub Kicinski	6668c6f2dd	A good number of fixes: - cfg80211: - cancel rfkill work appropriately - fix radiotap parsing to correctly reject field 18 - fix wext (yes...) off-by-one for IGTK key ID - mac80211: - fix for mesh NULL pointer dereference - fix for stack out-of-bounds (2 bytes) write on specific multi-link action frames - set default WMM parameters for all links - mwifiex: check dev_alloc_name() return value correctly - libertas: fix potential timer use-after-free - brcmfmac: fix crash on probe failure -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmme3O0ACgkQ10qiO8sP aAAhBA//UhqBeXsJd7dfSfGcz4ztzw/m4BDDxwWhJd0wq/ZHVwGvLfOXN1lXG1yR OsMaSQkT8UGv4NI0V/+7vcKlTvCe0oF0RPyzNtGL8CCYASyM0WbD6EqqpaLKdBIE Qg/PQ3n7mtPiKHYz9fmL/Yku8uNvHaYJ18HIki9Zn1kgcKvJegf4VqYoMa4m5zK3 ShaNERSsrks2cgBQGwRMxNDfmbn2lr/YnyavFd+RoOdlIjN4FiU7zelgeCKapL6B URkn/NTp92ga3zcb5b57K3fjHucSKc7Lvf7l/ie5m8tw+Omr7zooBzjvtUzd6lfy gIFaPUuiKe3Zzq8fUKqgdSivyVOv6VdX6ieKi+mS0CkhfURqQUwNTZPM1Cn5MAkt lOPwaBpO7iZ2pP56jr29sEXz2komhTZLDv4bssrPvH6si6zToSd+wY10b6hESfTw wQBxdZl/YqnzngaojQhKTwlQRYATp1h60yEj2SKXpx+DMCtNkAmfxDhAzBCuIaDI eggswVy97Fn11WuDF3d8nthgyULrAzaK9LIGDCGObHZQYqROJmXtyNyeCmJJHvM7 5/4l61H2nfMIymcSItVo/0ZQKmgiaSeU3t7Arp13uX6jbiWEbmGcdV35fmorwq+u p9Y3ay8o5yWfpb/XKx7mdurFBrYXTwry7xlaOkUzqCuEhRNRbTU= =VLWl -----END PGP SIGNATURE----- Merge tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== A good number of fixes: - cfg80211: - cancel rfkill work appropriately - fix radiotap parsing to correctly reject field 18 - fix wext (yes...) 
off-by-one for IGTK key ID - mac80211: - fix for mesh NULL pointer dereference - fix for stack out-of-bounds (2 bytes) write on specific multi-link action frames - set default WMM parameters for all links - mwifiex: check dev_alloc_name() return value correctly - libertas: fix potential timer use-after-free - brcmfmac: fix crash on probe failure * tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame() wifi: mac80211: bounds-check link_id in ieee80211_ml_reconfiguration wifi: mac80211: set default WMM parameters on all links wifi: libertas: fix use-after-free in lbs_free_adapter() wifi: mwifiex: Fix dev_alloc_name() return value check wifi: brcmfmac: Fix potential kernel oops when probe fails wifi: radiotap: reject radiotap with unknown bits wifi: cfg80211: cancel rfkill_block work in wiphy_unregister() wifi: cfg80211: wext: fix IGTK key ID off-by-one ==================== Link: https://patch.msgid.link/20260225113159.360574-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:54:28 -08:00
Simon Baatz	1e3bb184e9	tcp: re-enable acceptance of FIN packets when RWIN is 0 Commit `2bd99aef1b` ("tcp: accept bare FIN packets under memory pressure") allowed accepting FIN packets in tcp_data_queue() even when the receive window was closed, to prevent ACK/FIN loops with broken clients. Such a FIN packet is in sequence, but because the FIN consumes a sequence number, it extends beyond the window. Before commit `9ca48d616e` ("tcp: do not accept packets beyond window"), tcp_sequence() only required the seq to be within the window. After that change, the entire packet (including the FIN) must fit within the window. As a result, such FIN packets are now dropped and the handling path is no longer reached. Be more lenient by not counting the sequence number consumed by the FIN when calling tcp_sequence(), restoring the previous behavior for cases where only the FIN extends beyond the window. Fixes: `9ca48d616e` ("tcp: do not accept packets beyond window") Signed-off-by: Simon Baatz <gmbnomis@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260224-fix_zero_wnd_fin-v2-1-a16677ea7cea@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:07:02 -08:00
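The window check relaxed by 1e3bb184e9 can be sketched as plain sequence-number arithmetic. This is an illustrative model rather than the kernel's tcp_sequence(): `seq_after()` and `fits_window()` are hypothetical helpers, and the only behavior taken from the message is that the sequence number consumed by the FIN is not counted when comparing against the right edge of the window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Returns true when the segment [seq, end_seq) ends at or before the
 * right edge of the receive window (rcv_nxt + rcv_wnd). A FIN occupies
 * one sequence number; that number is left out of the comparison so a
 * bare in-sequence FIN still passes when the window is zero.
 */
static bool fits_window(uint32_t seq, uint32_t end_seq, bool fin,
			uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	assert(!seq_after(seq, end_seq));
	if (fin)
		end_seq -= 1;
	return !seq_after(end_seq, rcv_nxt + rcv_wnd);
}
```

For example, with `rcv_nxt = 1000` and a zero window, a bare FIN at sequence 1000 is accepted, while one byte of data at the same sequence, with or without a FIN, is still rejected.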
Greg Kroah-Hartman	5cc619583c	vsock: Use container_of() to get net namespace in sysctl handlers current->nsproxy should not be accessed directly as syzbot has found that it could be NULL at times, causing crashes. Fix up the af_vsock sysctl handlers to use container_of() to deal with the current net namespace instead of attempting to rely on current. This is the same type of change done in commit `7f5611cbc4` ("rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Link: https://patch.msgid.link/2026022318-rearview-gallery-ae13@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 18:59:18 -08:00
Sabrina Dubroca	0c0eef8ccd	esp: fix skb leak with espintcp and async crypto When the TX queue for espintcp is full, esp_output_tail_tcp will return an error and not free the skb, because with synchronous crypto, the common xfrm output code will drop the packet for us. With async crypto (esp_output_done), we need to drop the skb when esp_output_tail_tcp returns an error. Fixes: `e27cca96cd` ("xfrm: add espintcp (RFC 8229)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-02-25 09:11:40 +01:00
Sabrina Dubroca	7d2fc41f91	xfrm: call xdo_dev_state_delete during state update When we update an SA, we construct a new state and call xdo_dev_state_add, but never insert it. The existing state is updated, then we immediately destroy the new state. Since we haven't added it, we don't go through the standard state delete code, and we're skipping removing it from the device (but xdo_dev_state_free will get called when we destroy the temporary state). This is similar to commit `c5d4d7d831` ("xfrm: Fix deletion of offloaded SAs on failure."). Fixes: `d77e38e612` ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-02-25 09:11:33 +01:00
Sabrina Dubroca	b57defcf8f	xfrm: fix the condition on x->pcpu_num in xfrm_sa_len pcpu_num = 0 is a valid value. The marker for "unset pcpu_num" which makes copy_to_user_state_extra not add the XFRMA_SA_PCPU attribute is UINT_MAX. Fixes: `1ddf9916ac` ("xfrm: Add support for per cpu xfrm state handling.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-02-25 09:11:26 +01:00
Sabrina Dubroca	aa8a3f3c67	xfrm: add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi We're returning an error caused by invalid user input without setting an extack. Add one. Fixes: `1ddf9916ac` ("xfrm: Add support for per cpu xfrm state handling.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-02-25 09:11:04 +01:00
Paolo Abeni	1348659dc9	bluetooth pull request for net: - purge error queues in socket destructors - hci_sync: Fix CIS host feature condition - L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ - L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short - L2CAP: Fix response to L2CAP_ECRED_CONN_REQ - L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ - L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ - hci_qca: Cleanup on all setup failures -----BEGIN PGP SIGNATURE----- iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmmcw1EZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKUTyD/4jtQwDrveC19zamF5n7lFY Oils6eftANcLFzLwTrMqGO7IxESga4qdNOf2vc/UgVSUfNqsPIUJ5El+LzpXZXAa sYBP/KudEX53CfU3fEVyPTUaWkZ4CdMRZeiCmgXqW7GxYbGw92SFuaSIHAP6Ep4s Z7Ryd1H0xhX9QPMc4g4IgoMiBiKzNs4GtlLSbDJcivAtbC/34nkMOxK9g+1DbU0F qzW+oPfYCpPzXTf20I1QIAMt5smnSM3Tuvo9u2pZRuEGpKjENxeY4hdAejfjeKA6 RLWXm6JvMP2lUBT68plMQQdYyQ8DxG75sVjgSoQYIu2YTVnsX76t/kD2hhiHXH/Y nQoy4dtA1/5V7Ka0cfMhcvino4Rb9Gh3dsFKJOuWRT+aTY+gNhpyr56SuJh24Y3C 7tUeEDI4fBkJGaRAbreVbaI5vw4kbSfi7IDOM/ccWDSLaG8HGaLOtn0IU8q4AgMa IkYzB5zwtiyM/zaSTO1k0HkpjR0wwftnTd+Fj2mUWdTwSeek64R9enmKYmg5UJrv 14yhfLHFsbAQo+o1B3ZslnCdYQJpgFmyAInV6Jpunc78IE9+g/YA55K22JbDDSzI t9Zy25OWLyYZyuD1PzDkMlYU5OARNYeyRXbJ3w037LrpqRoEuFsK0qTmgi+kR9C7 VR9IpCqgf4SJbL7ge83H8g== =JBaa -----END PGP SIGNATURE----- Merge tag 'for-net-2026-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - purge error queues in socket destructors - hci_sync: Fix CIS host feature condition - L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ - L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short - L2CAP: Fix response to L2CAP_ECRED_CONN_REQ - L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ - L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ - hci_qca: Cleanup on all setup failures * tag 'for-net-2026-02-23' of 
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ Bluetooth: L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ Bluetooth: Fix CIS host feature condition Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQ Bluetooth: hci_qca: Cleanup on all setup failures Bluetooth: purge error queues in socket destructors Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ ==================== Link: https://patch.msgid.link/20260223211634.3800315-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 15:03:08 +01:00
Sebastian Andrzej Siewior	983512f3a8	net: Drop the lock in skb_may_tx_timestamp() skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must not be taken in IRQ context, only softirq is okay. A few drivers receive the timestamp via a dedicated interrupt and complete the TX timestamp from that handler. This will lead to a deadlock if the lock is already write-locked on the same CPU. Taking the lock can be avoided. The socket (pointed to by the skb) will remain valid until the skb is released. The ->sk_socket and ->file members will be set to NULL once the user closes the socket which may happen before the timestamp arrives. If we happen to observe the pointer while the socket is closing but before the pointer is set to NULL then we may use it because both pointers (and the file's cred member) are RCU freed. Drop the lock. Use READ_ONCE() to obtain the individual pointer. Add a matching WRITE_ONCE() where the pointers are cleared. Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de Fixes: `b245be1f4d` ("net-timestamp: no-payload only sysctl") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 11:27:29 +01:00
Fernando Fernandez Mancera	021fd0f870	net/rds: fix recursive lock in rds_tcp_conn_slots_available syzbot reported a recursive lock warning in rds_tcp_get_peer_sport() as it calls inet6_getname() which acquires the socket lock that was already held by __release_sock(). kworker/u8:6/2985 is trying to acquire lock: ffff88807a07aa20 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline] ffff88807a07aa20 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: inet6_getname+0x15d/0x650 net/ipv6/af_inet6.c:533 but task is already holding lock: ffff88807a07aa20 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline] ffff88807a07aa20 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_sock_set_cork+0x2c/0x2e0 net/ipv4/tcp.c:3694 lock_sock_nested+0x48/0x100 net/core/sock.c:3780 lock_sock include/net/sock.h:1709 [inline] inet6_getname+0x15d/0x650 net/ipv6/af_inet6.c:533 rds_tcp_get_peer_sport net/rds/tcp_listen.c:70 [inline] rds_tcp_conn_slots_available+0x288/0x470 net/rds/tcp_listen.c:149 rds_recv_hs_exthdrs+0x60f/0x7c0 net/rds/recv.c:265 rds_recv_incoming+0x9f6/0x12d0 net/rds/recv.c:389 rds_tcp_data_recv+0x7f1/0xa40 net/rds/tcp_recv.c:243 __tcp_read_sock+0x196/0x970 net/ipv4/tcp.c:1702 rds_tcp_read_sock net/rds/tcp_recv.c:277 [inline] rds_tcp_data_ready+0x369/0x950 net/rds/tcp_recv.c:331 tcp_rcv_established+0x19e9/0x2670 net/ipv4/tcp_input.c:6675 tcp_v6_do_rcv+0x8eb/0x1ba0 net/ipv6/tcp_ipv6.c:1609 sk_backlog_rcv include/net/sock.h:1185 [inline] __release_sock+0x1b8/0x3a0 net/core/sock.c:3213 Reading from the socket struct directly is safe from possible paths. For rds_tcp_accept_one(), the socket has just been accepted and is not yet exposed to concurrent access. For rds_tcp_conn_slots_available(), direct access avoids the recursive deadlock seen during backlog processing where the socket lock is already held from the __release_sock(). 
However, rds_tcp_conn_slots_available() is also called from the normal softirq path via tcp_data_ready() where the lock is not held. This is also safe because inet_dport is a stable 16-bit field. A READ_ONCE() annotation is added, as the value might be accessed locklessly in a concurrent access context. Note that it is also safe to call rds_tcp_conn_slots_available() from rds_conn_shutdown() because the fan-out is disabled. Fixes: `9d27a0fb12` ("net/rds: Trigger rds_send_ping() more than once") Reported-by: syzbot+5efae91f60932839f0a5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5efae91f60932839f0a5 Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260219075738.4403-1-fmancera@suse.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 10:11:04 +01:00
Vahagn Vardanian	017c179252	wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame() In mesh_rx_csa_frame(), elems->mesh_chansw_params_ie is dereferenced at lines 1638 and 1642 without a prior NULL check: ifmsh->chsw_ttl = elems->mesh_chansw_params_ie->mesh_ttl; ... pre_value = le16_to_cpu(elems->mesh_chansw_params_ie->mesh_pre_value); The mesh_matches_local() check above only validates the Mesh ID, Mesh Configuration, and Supported Rates IEs. It does not verify the presence of the Mesh Channel Switch Parameters IE (element ID 118). When a received CSA action frame omits that IE, ieee802_11_parse_elems() leaves elems->mesh_chansw_params_ie as NULL, and the unconditional dereference causes a kernel NULL pointer dereference. A remote mesh peer with an established peer link (PLINK_ESTAB) can trigger this by sending a crafted SPECTRUM_MGMT/CHL_SWITCH action frame that includes a matching Mesh ID and Mesh Configuration IE but omits the Mesh Channel Switch Parameters IE. No authentication beyond the default open mesh peering is required. Crash confirmed on kernel 6.17.0-5-generic via mac80211_hwsim: BUG: kernel NULL pointer dereference, address: 0000000000000000 Oops: Oops: 0000 [#1] SMP NOPTI RIP: 0010:ieee80211_mesh_rx_queued_mgmt+0x143/0x2a0 [mac80211] CR2: 0000000000000000 Fix by adding a NULL check for mesh_chansw_params_ie after mesh_matches_local() returns, consistent with how other optional IEs are guarded throughout the mesh code. The bug has been present since v3.13 (released 2014-01-19). Fixes: `8f2535b92d` ("mac80211: process the CSA frame for mesh accordingly") Cc: stable@vger.kernel.org Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-24 10:03:10 +01:00
Tung Nguyen	3aa677625c	tipc: fix duplicate publication key in tipc_service_insert_publ() TIPC uses named table to store TIPC services represented by type and instance. Each time an application calls TIPC API bind() to bind a type/instance to a socket, an entry is created and inserted into the named table. It looks like this: named table: key1, entry1 (type, instance ...) key2, entry2 (type, instance ...) In the above table, each entry represents a route for sending data from one socket to the other. For all publications originated from the same node, the key is UNIQUE to identify each entry. It is calculated by this formula: key = socket portid + number of bindings + 1 (1) where: - socket portid: unique and calculated by using linux kernel function get_random_u32_below(). So, the value is randomized. - number of bindings: the number of times a type/instance pair is bound to a socket. This number is linearly increased, starting from 0. While the socket portid is unique and randomized by linux kernel, the linear increment of "number of bindings" in formula (1) makes "key" not unique anymore. For example: - Socket 1 is created with its associated port number 20062001. Type 1000, instance 1 is bound to socket 1: key1: 20062001 + 0 + 1 = 20062002 Then, bind() is called a second time on Socket 1 to by the same type 1000, instance 1: key2: 20062001 + 1 + 1 = 20062003 Named table: key1 (20062002), entry1 (1000, 1 ...) key2 (20062003), entry2 (1000, 1 ...) - Socket 2 is created with its associated port number 20062002. Type 1000, instance 1 is bound to socket 2: key3: 20062002 + 0 + 1 = 20062003 TIPC looks up the named table and finds out that key2 with the same value already exists and rejects the insertion into the named table. This leads to failure of bind() call from application on Socket 2 with error message EINVAL "Invalid argument". 
This commit fixes this issue by adding more port id checking to make sure that the key is unique to publications originated from the same port id and node. Fixes: `218527fe27` ("tipc: replace name table service range array with rb tree") Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260220050541.237962-1-tung.quang.nguyen@est.tech Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-23 17:40:52 -08:00
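The key collision walked through in 3aa677625c follows directly from formula (1) and can be checked with a few lines of arithmetic. The sketch below is an illustration, not TIPC code: `struct publ_sketch`, `make_publ()`, `dup_by_key()` and `dup_by_key_and_port()` are hypothetical names, and comparing the port id in addition to the key is only a model of the "more port id checking" the message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced publication record: originating port and key. */
struct publ_sketch {
	uint32_t port;
	uint32_t key;
};

/* Formula (1): key = socket portid + number of bindings + 1. */
static struct publ_sketch make_publ(uint32_t port, uint32_t bindings)
{
	struct publ_sketch p = { port, port + bindings + 1 };

	return p;
}

/* Duplicate test that looks at the key alone (the failing behavior). */
static bool dup_by_key(const struct publ_sketch *a,
		       const struct publ_sketch *b)
{
	assert(a != NULL && b != NULL);
	return a->key == b->key;
}

/* Duplicate test that also requires the same originating port. */
static bool dup_by_key_and_port(const struct publ_sketch *a,
				const struct publ_sketch *b)
{
	assert(a != NULL && b != NULL);
	return a->key == b->key && a->port == b->port;
}
```

Using the numbers from the message, the second binding on port 20062001 and the first binding on port 20062002 both produce key 20062003, so a key-only comparison reports a duplicate while the key-plus-port comparison does not.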
Jiayuan Chen	ca220141fa	kcm: fix zero-frag skb in frag_list on partial sendmsg error Syzkaller reported a warning in kcm_write_msgs() when processing a message with a zero-fragment skb in the frag_list. When kcm_sendmsg() fills MAX_SKB_FRAGS fragments in the current skb, it allocates a new skb (tskb) and links it into the frag_list before copying data. If the copy subsequently fails (e.g. -EFAULT from user memory), tskb remains in the frag_list with zero fragments: head skb (msg being assembled, NOT yet in sk_write_queue) +-----------+ | frags[17] | (MAX_SKB_FRAGS, all filled with data) | frag_list-+--> tskb +-----------+ +----------+ | frags[0] | (empty! copy failed before filling) +----------+ For SOCK_SEQPACKET with partial data already copied, the error path saves this message via partial_message for later completion. For SOCK_SEQPACKET, sock_write_iter() automatically sets MSG_EOR, so a subsequent zero-length write(fd, NULL, 0) completes the message and queues it to sk_write_queue. kcm_write_msgs() then walks the frag_list and hits: WARN_ON(!skb_shinfo(skb)->nr_frags) TCP has a similar pattern where skbs are enqueued before data copy and cleaned up on failure via tcp_remove_empty_skb(). KCM was missing the equivalent cleanup. Fix this by tracking the predecessor skb (frag_prev) when allocating a new frag_list entry. On error, if the tail skb has zero frags, use frag_prev to unlink and free it in O(1) without walking the singly-linked frag_list. frag_prev is safe to dereference because the entire message chain is only held locally (or in kcm->seq_skb) and is not added to sk_write_queue until MSG_EOR, so the send path cannot free it underneath us. Also change the WARN_ON to WARN_ON_ONCE to avoid flooding the log if the condition is somehow hit repeatedly. There are currently no KCM selftests in the kernel tree; a simple reproducer is available at [1].
[1] https://gist.github.com/mrpre/a94d431c757e8d6f168f4dd1a3749daa Reported-by: syzbot+52624bdfbf2746d37d70@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000269a1405a12fdc77@google.com/T/ Fixes: `ab7ac4eb98` ("kcm: Kernel Connection Multiplexor module") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260219014256.370092-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-23 17:26:55 -08:00
Hyunwoo Kim	7bb09315f9	tls: Fix race condition in tls_sw_cancel_work_tx() This issue was discovered during a code audit. After cancel_delayed_work_sync() is called from tls_sk_proto_close(), tx_work_handler() can still be scheduled from paths such as the Delayed ACK handler or ksoftirqd. As a result, the tx_work_handler() worker may dereference a freed TLS object. The following is a simple race scenario: cpu0 cpu1 tls_sk_proto_close() tls_sw_cancel_work_tx() tls_write_space() tls_sw_write_space() if (!test_and_set_bit(BIT_TX_SCHEDULED, &tx_ctx->tx_bitmask)) set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask); cancel_delayed_work_sync(&ctx->tx_work.work); schedule_delayed_work(&tx_ctx->tx_work.work, 0); To prevent this race condition, cancel_delayed_work_sync() is replaced with disable_delayed_work_sync(). Fixes: `f87e62d45e` ("net/tls: remove close callback sock unlock/lock around TX work flush") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/aZgsFO6nfylfvLE7@v4bel Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-23 17:08:14 -08:00
Eric Dumazet	8a8a9fac9e	net: do not pass flow_id to set_rps_cpu() Blamed commit made the assumption that the RPS table for each receive queue would have the same size, and that it would not change. Compute flow_id in set_rps_cpu(), do not assume we can use the value computed by get_rps_cpu(). Otherwise we risk out-of-bound access and/or crashes. Fixes: `48aa30443e` ("net: Cache hash and flow_id to avoid recalculation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Krishna Kumar <krikku@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260220222605.3468081-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-23 17:07:34 -08:00
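Why a cached flow_id is unsafe once tables differ in size (8a8a9fac9e) can be shown with two table sizes and one hash. This is a sketch under loud assumptions: `flow_index()` is a hypothetical helper, and indexing a power-of-two table with `hash & (size - 1)` is assumed for illustration; the message only states that per-queue RPS tables may have different sizes and may change.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical index computation for a flow table whose size is a
 * power of two. The index is only valid for a table of this size.
 */
static uint32_t flow_index(uint32_t hash, uint32_t table_size)
{
	assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
	return hash & (table_size - 1);
}
```

For hash 0x12345, a 4096-entry table yields index 837 and a 256-entry table yields index 69. Reusing the first index against the smaller table would land outside its 256 entries, which is why the index has to be recomputed against the table that is actually being accessed.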
Luiz Augusto von Dentz	138d7eca44	Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ This adds a check for encryption key size upon receiving L2CAP_LE_CONN_REQ which is required by L2CAP/LE/CFC/BV-15-C which expects L2CAP_CR_LE_BAD_KEY_SIZE. Link: https://lore.kernel.org/linux-bluetooth/5782243.rdbgypaU67@n9w6sw14/ Fixes: `27e2d4c8d2` ("Bluetooth: Add basic LE L2CAP connect request receiving support") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Christian Eggers <ceggers@arri.de>	2026-02-23 16:08:15 -05:00
Luiz Augusto von Dentz	a8d1d73c81	Bluetooth: L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ Upon receiving L2CAP_ECRED_CONN_REQ the given MTU shall be checked against the suggested MTU of the listening socket as that is required by the likes of PTS L2CAP/ECFC/BV-27-C test which expects L2CAP_CR_LE_UNACCEPT_PARAMS if the MTU is lower than the socket omtu. In order to be able to set chan->omtu the code now allows setting setsockopt(BT_SNDMTU), but it is only allowed when the connection has not been established since there is no procedure to reconfigure the output MTU. Link: https://github.com/bluez/bluez/issues/1895 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 16:07:55 -05:00
Mariusz Skamra	7cff9a40c6	Bluetooth: Fix CIS host feature condition This fixes the condition for sending the LE Set Host Feature command. The command is sent to indicate host support for Connected Isochronous Streams in this case. It has been observed that the system could not initialize BIS-only capable controllers because the controllers do not support the command. As per Core v6.2 | Vol 4, Part E, Table 3.1 the command shall be supported if CIS Central or CIS Peripheral is supported; otherwise, the command is optional. Fixes: `709788b154` ("Bluetooth: hci_core: Fix using {cis,bis}_capable for current settings") Cc: stable@vger.kernel.org Signed-off-by: Mariusz Skamra <mariusz.skamra@codecoup.pl> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 16:07:37 -05:00
Luiz Augusto von Dentz	05761c2c2b	Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQ Similar to `03dba9cea7` ("Bluetooth: L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION") the result code L2CAP_CR_LE_ENCRYPTION shall be used when BT_SECURITY_MEDIUM is set, since that means security mode 2, which does not require authentication. Using the wrong result code causes qualification test L2CAP/ECFC/BV-32-C to fail. Link: https://github.com/bluez/bluez/issues/1871 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:31:10 -05:00
Heitor Alves de Siqueira	21e4271e65	Bluetooth: purge error queues in socket destructors When TX timestamping is enabled via SO_TIMESTAMPING, SKBs may be queued into sk_error_queue and will stay there until consumed. If userspace never gets to read the timestamps, or if the controller is removed unexpectedly, these SKBs will leak. Fix by adding skb_queue_purge() calls for sk_error_queue in affected bluetooth destructors. RFCOMM does not currently use sk_error_queue. Fixes: `134f4b39df` ("Bluetooth: add support for skb TX SND/COMPLETION timestamping") Reported-by: syzbot+7ff4013eabad1407b70a@syzkaller.appspotmail.com Closes: https://syzbot.org/bug?extid=7ff4013eabad1407b70a Cc: stable@vger.kernel.org Signed-off-by: Heitor Alves de Siqueira <halves@igalia.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:30:16 -05:00
Luiz Augusto von Dentz	c28d2bff70	Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short Test L2CAP/ECFC/BV-26-C expects the response to L2CAP_ECRED_CONN_REQ with an MTU value < L2CAP_ECRED_MIN_MTU (64) to be L2CAP_CR_LE_INVALID_PARAMS rather than L2CAP_CR_LE_UNACCEPT_PARAMS. Also fix not including the correct number of CIDs in the response since the spec requires all CIDs being rejected to be included in the response. Link: https://github.com/bluez/bluez/issues/1868 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:28:56 -05:00
Luiz Augusto von Dentz	7accb1c432	Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ This fixes responding with an invalid result caused by checking the wrong size of CID which should have been (cmd_len - sizeof(*req)) and on top of it the wrong result was used, L2CAP_CR_LE_INVALID_PARAMS, which is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C: > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 64 MPS: 64 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reserved (0x000c) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002) when more than one channel gets its MPS reduced: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8 MTU: 264 MPS: 99 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 65 MPS: 64 ! Source CID: 85 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 672 !
MPS: 63 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Result: Reconfiguration failed - other unacceptable parameters (0x0004) Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8 MTU: 84 ! MPS: 71 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Link: https://github.com/bluez/bluez/issues/1865 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:23:37 -05:00
Ariel Silver	162d331d83	wifi: mac80211: bounds-check link_id in ieee80211_ml_reconfiguration link_id is taken from the ML Reconfiguration element (control & 0x000f), so it can be 0..15. link_removal_timeout[] has IEEE80211_MLD_MAX_NUM_LINKS (15) elements, so index 15 is out-of-bounds. Skip subelements with link_id >= IEEE80211_MLD_MAX_NUM_LINKS to avoid a stack out-of-bounds write. Fixes: `8eb8dd2ffb` ("wifi: mac80211: Support link removal using Reconfiguration ML element") Reported-by: Ariel Silver <arielsilver77@gmail.com> Signed-off-by: Ariel Silver <arielsilver77@gmail.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260220101129.1202657-1-Ariel.Silver@cybereason.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-23 12:35:34 +01:00
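The bounds check described in this commit can be illustrated with a small user-space sketch. It is not the mac80211 code: `DEMO_MAX_NUM_LINKS` and `demo_store_link_timeout()` are hypothetical names, and the only facts taken from the commit text are the 4-bit mask (`control & 0x000f`) and the array size of 15.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value the commit quotes for IEEE80211_MLD_MAX_NUM_LINKS. */
#define DEMO_MAX_NUM_LINKS 15

/*
 * Hypothetical helper: store a removal timeout for the link encoded in
 * the low 4 bits of `control`. Returns 1 if stored, 0 if the subelement
 * is skipped because the link ID does not fit the array.
 */
static int demo_store_link_timeout(uint16_t timeouts[DEMO_MAX_NUM_LINKS],
                                   uint16_t control, uint16_t value)
{
    unsigned int link_id = control & 0x000f; /* can be 0..15 */

    assert(timeouts != NULL);

    /* Index 15 would be one past the end of a 15-element array. */
    if (link_id >= DEMO_MAX_NUM_LINKS)
        return 0;

    timeouts[link_id] = value;
    return 1;
}
```

The point of the sketch is that a 4-bit field can express one more value than the array has slots, so the mask alone is not a sufficient bound.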
Ramanathan Choodamani	2259d14499	wifi: mac80211: set default WMM parameters on all links Currently, mac80211 only initializes default WMM parameters on the deflink during do_open(). For MLO cases, this leaves the additional links without proper WMM defaults if hostapd does not supply per-link WMM parameters, leading to inconsistent QoS behavior across links. Set default WMM parameters for each link during ieee80211_vif_update_links(), because this ensures all individual links in an MLD have valid WMM settings during bring-up and behave consistently across different BSS. Signed-off-by: Ramanathan Choodamani <quic_rchoodam@quicinc.com> Signed-off-by: Aishwarya R <aishwarya.r@oss.qualcomm.com> Link: https://patch.msgid.link/20260205094216.3093542-1-aishwarya.r@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-23 09:30:20 +01:00
Johannes Berg	c854758abe	wifi: radiotap: reject radiotap with unknown bits The radiotap parser is currently only used with the radiotap namespace (not with vendor namespaces), but if the undefined field 18 is used, the alignment/size is unknown as well. In this case, iterator->_next_ns_data isn't initialized (it's only set for skipping vendor namespaces), and syzbot points out that we later compare against this uninitialized value. Fix this by moving the rejection of unknown radiotap fields down to after the in-namespace lookup, so it will really use iterator->_next_ns_data only for vendor namespaces, even in case undefined fields are present. Cc: stable@vger.kernel.org Fixes: `33e5a2f776` ("wireless: update radiotap parser") Reported-by: syzbot+b09c1af8764c0097bb19@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/69944a91.a70a0220.2c38d7.00fc.GAE@google.com Link: https://patch.msgid.link/20260217120526.162647-2-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-23 09:23:44 +01:00
Daniil Dulov	767d23ade7	wifi: cfg80211: cancel rfkill_block work in wiphy_unregister() There is a use-after-free error in cfg80211_shutdown_all_interfaces found by syzkaller: BUG: KASAN: use-after-free in cfg80211_shutdown_all_interfaces+0x213/0x220 Read of size 8 at addr ffff888112a78d98 by task kworker/0:5/5326 CPU: 0 UID: 0 PID: 5326 Comm: kworker/0:5 Not tainted 6.19.0-rc2 #2 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: events cfg80211_rfkill_block_work Call Trace: <TASK> dump_stack_lvl+0x116/0x1f0 print_report+0xcd/0x630 kasan_report+0xe0/0x110 cfg80211_shutdown_all_interfaces+0x213/0x220 cfg80211_rfkill_block_work+0x1e/0x30 process_one_work+0x9cf/0x1b70 worker_thread+0x6c8/0xf10 kthread+0x3c5/0x780 ret_from_fork+0x56d/0x700 ret_from_fork_asm+0x1a/0x30 </TASK> The problem arises because the rfkill_block work is not cancelled when the wiphy is being unregistered. To fix the issue, cancel the corresponding work in wiphy_unregister(). Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: `1f87f7d3a3` ("cfg80211: add rfkill support") Cc: stable@vger.kernel.org Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru> Link: https://patch.msgid.link/20260211082024.1967588-1-d.dulov@aladdin.ru Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-23 09:21:55 +01:00
Johannes Berg	c8d7f21ead	wifi: cfg80211: wext: fix IGTK key ID off-by-one The IGTK key ID must be 4 or 5, but the code checks against key ID + 1, so must check against 5/6 rather than 4/5. Fix that. Reported-by: Jouni Malinen <j@w1.fi> Fixes: `08645126dd` ("cfg80211: implement wext key handling") Link: https://patch.msgid.link/20260209181220.362205-2-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-23 09:18:59 +01:00
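The off-by-one in this commit is easier to see in a minimal sketch. It assumes, as the commit text says, that the value being checked is the key ID plus one; the helper names `demo_wext_idx_to_key_id()` and `demo_wext_idx_is_igtk()` are hypothetical and are not the cfg80211 functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Convert the 1-based index (key ID + 1) back to the real key ID. */
static int demo_wext_idx_to_key_id(int wext_idx)
{
    assert(wext_idx >= 1);
    return wext_idx - 1;
}

/*
 * IGTK key IDs are 4 and 5. Because the index being tested is key ID + 1,
 * the comparison has to be against 5 and 6, not 4 and 5.
 */
static bool demo_wext_idx_is_igtk(int wext_idx)
{
    return wext_idx == 5 || wext_idx == 6;
}
```

With the old comparison (4/5), index 4 would be treated as an IGTK even though it maps to key ID 3, and index 6 (key ID 5) would be missed.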
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that for a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Sven Eckelmann	cfc83a3c71	batman-adv: Avoid double-rtnl_lock ELP metric worker batadv_v_elp_get_throughput() might be called when the RTNL lock is already held. This could be problematic when the work queue item is cancelled via cancel_delayed_work_sync() in batadv_v_elp_iface_disable(). In this case, an rtnl_lock() would cause a deadlock. To avoid this, rtnl_trylock() was used in this function to skip the retrieval of the ethtool information in case the RTNL lock was already held. But for cfg80211 interfaces, batadv_get_real_netdev() was called - which also uses rtnl_lock(). The approach for __ethtool_get_link_ksettings() must also be used instead and the lockless version __batadv_get_real_netdev() has to be called. Cc: stable@vger.kernel.org Fixes: `8c8ecc98f5` ("batman-adv: Drop unmanaged ELP metric worker") Reported-by: Christian Schmidbauer <github@grische.xyz> Signed-off-by: Sven Eckelmann <sven@narfation.org> Tested-by: Sören Skaarup <freifunk_nordm4nn@gmx.de> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2026-02-21 13:01:55 +01:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
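To make the three conversion shapes concrete, here is a user-space analogy built on `malloc()`/`calloc()`. The `demo_alloc_*` macros and the `demo_*` structs are hypothetical stand-ins and are not the kernel's `kmalloc_obj()` family: they take no GFP argument, accept only a type (not a variable), and omit the overflow handling that `struct_size()` provides. What they do show is the idea named in the commit, that the allocation returns a typed pointer instead of `void *`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_point { int x, y; };

struct demo_pkt {
    size_t len;
    unsigned char data[]; /* flexible array member */
};

/* Single object: analogous in shape to kmalloc(sizeof(TYPE), ...). */
#define demo_alloc_obj(TYPE) ((TYPE *)malloc(sizeof(TYPE)))

/* Array of objects; calloc() also zeroes, so this resembles a kzalloc variant. */
#define demo_alloc_objs(TYPE, COUNT) ((TYPE *)calloc((COUNT), sizeof(TYPE)))

/* Struct with COUNT trailing elements in flexible array member FAM. */
#define demo_alloc_flex(TYPE, FAM, COUNT)                     \
    ((TYPE *)malloc(offsetof(TYPE, FAM) +                     \
                    (size_t)(COUNT) * sizeof(((TYPE *)0)->FAM[0])))

static struct demo_pkt *demo_pkt_new(size_t len)
{
    struct demo_pkt *p = demo_alloc_flex(struct demo_pkt, data, len);

    if (p)
        p->len = len;
    return p;
}

static void demo_pkt_fill(struct demo_pkt *p, unsigned char byte)
{
    assert(p != NULL);
    memset(p->data, byte, p->len);
}
```

Because each macro casts to `TYPE *`, assigning the result to a pointer of a different struct type produces a compiler diagnostic, which a plain `void *` return would not.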
Kuniyuki Iwashima	470c7ca2b4	udplite: Fix null-ptr-deref in __udp_enqueue_schedule_skb(). syzbot reported null-ptr-deref of udp_sk(sk)->udp_prod_queue. [0] Since the cited commit, udp_lib_init_sock() can fail, as can udp_init_sock() and udpv6_init_sock(). Let's handle the error in udplite_sk_init() and udplitev6_sk_init(). [0]: BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:82 [inline] BUG: KASAN: null-ptr-deref in atomic_read include/linux/atomic/atomic-instrumented.h:32 [inline] BUG: KASAN: null-ptr-deref in __udp_enqueue_schedule_skb+0x151/0x1480 net/ipv4/udp.c:1719 Read of size 4 at addr 0000000000000008 by task syz.2.18/2944 CPU: 1 UID: 0 PID: 2944 Comm: syz.2.18 Not tainted syzkaller #0 PREEMPTLAZY Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Call Trace: <IRQ> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 kasan_report+0xa2/0xe0 mm/kasan/report.c:595 check_region_inline mm/kasan/generic.c:-1 [inline] kasan_check_range+0x264/0x2c0 mm/kasan/generic.c:200 instrument_atomic_read include/linux/instrumented.h:82 [inline] atomic_read include/linux/atomic/atomic-instrumented.h:32 [inline] __udp_enqueue_schedule_skb+0x151/0x1480 net/ipv4/udp.c:1719 __udpv6_queue_rcv_skb net/ipv6/udp.c:795 [inline] udpv6_queue_rcv_one_skb+0xa2e/0x1ad0 net/ipv6/udp.c:906 udp6_unicast_rcv_skb+0x227/0x380 net/ipv6/udp.c:1064 ip6_protocol_deliver_rcu+0xe17/0x1540 net/ipv6/ip6_input.c:438 ip6_input_finish+0x191/0x350 net/ipv6/ip6_input.c:489 NF_HOOK+0x354/0x3f0 include/linux/netfilter.h:318 ip6_input+0x16c/0x2b0 net/ipv6/ip6_input.c:500 NF_HOOK+0x354/0x3f0 include/linux/netfilter.h:318 __netif_receive_skb_one_core net/core/dev.c:6149 [inline] __netif_receive_skb+0xd3/0x370 net/core/dev.c:6262 process_backlog+0x4d6/0x1160 net/core/dev.c:6614 __napi_poll+0xae/0x320 net/core/dev.c:7678 napi_poll net/core/dev.c:7741 [inline] net_rx_action+0x60d/0xdc0 net/core/dev.c:7893 handle_softirqs+0x209/0x8d0 
kernel/softirq.c:622 do_softirq+0x52/0x90 kernel/softirq.c:523 </IRQ> <TASK> __local_bh_enable_ip+0xe7/0x120 kernel/softirq.c:450 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:924 [inline] __dev_queue_xmit+0x109c/0x2dc0 net/core/dev.c:4856 __ip6_finish_output net/ipv6/ip6_output.c:-1 [inline] ip6_finish_output+0x158/0x4e0 net/ipv6/ip6_output.c:219 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip6_output+0x342/0x580 net/ipv6/ip6_output.c:246 ip6_send_skb+0x1d7/0x3c0 net/ipv6/ip6_output.c:1984 udp_v6_send_skb+0x9a5/0x1770 net/ipv6/udp.c:1442 udp_v6_push_pending_frames+0xa2/0x140 net/ipv6/udp.c:1469 udpv6_sendmsg+0xfe0/0x2830 net/ipv6/udp.c:1759 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0xe5/0x270 net/socket.c:742 __sys_sendto+0x3eb/0x580 net/socket.c:2206 __do_sys_sendto net/socket.c:2213 [inline] __se_sys_sendto net/socket.c:2209 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2209 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xd2/0xf20 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f67b4d9c629 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f67b5c98028 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f67b5015fa0 RCX: 00007f67b4d9c629 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007f67b4e32b39 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000040000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f67b5016038 R14: 00007f67b5015fa0 R15: 00007ffe3cb66dd8 </TASK> Fixes: `b650bf0977` ("udp: remove busylock and add per NUMA queues") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: 
https://patch.msgid.link/20260219173142.310741-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-20 16:14:10 -08:00
Jakub Kicinski	0aebd81fa8	ipsec-2026-02-20 Merge tag 'ipsec-2026-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-02-20 1) Check the value of ipv6_dev_get_saddr() to fix an uninitialized saddr in xfrm6_get_saddr(). From Jiayuan Chen. 2) Skip the templates check for packet offload in tunnel mode. It was already done by the hardware and causes an unexpected XfrmInTmplMismatch increase. From Leon Romanovsky. 3) Fix an unregister_netdevice stall due to not dropped refcounts by always flushing xfrm state and policy on a NETDEV_UNREGISTER event. From Tetsuo Handa. * tag 'ipsec-2026-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: always flush state and policy upon NETDEV_UNREGISTER event xfrm: skip templates check for packet offload tunnel mode xfrm6: fix uninitialized saddr in xfrm6_get_saddr() ==================== Link: https://patch.msgid.link/20260220094133.14219-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-20 15:57:55 -08:00
Hyunwoo Kim	e1512c1db9	espintcp: Fix race condition in espintcp_close() This issue was discovered during a code audit. After cancel_work_sync() is called from espintcp_close(), espintcp_tx_work() can still be scheduled from paths such as the Delayed ACK handler or ksoftirqd. As a result, the espintcp_tx_work() worker may dereference a freed espintcp ctx or sk. The following is a simple race scenario: cpu0 cpu1 espintcp_close() cancel_work_sync(&ctx->work); espintcp_write_space() schedule_work(&ctx->work); To prevent this race condition, cancel_work_sync() is replaced with disable_work_sync(). Fixes: `e27cca96cd` ("xfrm: add espintcp (RFC 8229)") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/aZSie7rEdh9Nu0eM@v4bel Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 14:27:40 -08:00
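The difference between cancelling and disabling a work item, which is the core of this fix, can be modelled in a single-threaded sketch. This is a toy model and not the kernel workqueue implementation: `struct demo_work` and the `demo_*` functions are hypothetical, and the only behaviour assumed is that a disabled work item refuses later scheduling attempts while a merely cancelled one does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_work {
    bool pending;      /* queued and waiting to run */
    int disable_depth; /* > 0 means scheduling is refused */
};

/* Models schedule_work(): returns false if the work could not be queued. */
static bool demo_schedule_work(struct demo_work *w)
{
    assert(w != NULL);
    if (w->disable_depth > 0)
        return false;
    w->pending = true;
    return true;
}

/* Models cancel_work_sync(): drops a pending run, but allows re-queueing. */
static void demo_cancel_work(struct demo_work *w)
{
    assert(w != NULL);
    w->pending = false;
}

/* Models disable_work_sync(): drops a pending run and blocks re-queueing. */
static void demo_disable_work(struct demo_work *w)
{
    assert(w != NULL);
    w->disable_depth++;
    w->pending = false;
}
```

In the race from the commit message, the late `schedule_work()` from the write-space path corresponds to calling `demo_schedule_work()` after the close path has run: after a cancel it succeeds and the work would run against freed state, after a disable it is rejected.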
Eric Dumazet	f891007ab1	psp: use sk->sk_hash in psp_write_headers() udp_flow_src_port() is indirectly using sk->sk_txhash as a base, because __tcp_transmit_skb() uses skb_set_hash_from_sk(). This is problematic because this field can change over the lifetime of a TCP flow, thanks to calls to sk_rethink_txhash(). Problem is that some NIC might (ab)use the PSP UDP source port in their RSS computation, and PSP packets for a given flow could jump from one queue to another. In order to avoid surprises, it is safer to let Protective Load Balancing (PLB) get its entropy from the IPv6 flowlabel, and change psp_write_headers() to use sk->sk_hash which does not change for the duration of the flow. We might add a sysctl to select the behavior, if there is a need for it. Fixes: `fc72451574` ("psp: provide encapsulation helper for drivers") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-By: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260218141337.999945-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 14:04:23 -08:00
Eric Dumazet	858d2a4f67	tcp: fix potential race in tcp_v6_syn_recv_sock() Code in tcp_v6_syn_recv_sock() after the call to tcp_v4_syn_recv_sock() is done too late. After tcp_v4_syn_recv_sock(), the child socket is already visible from TCP ehash table and other cpus might use it. Since newinet->pinet6 is still pointing to the listener ipv6_pinfo bad things can happen as syzbot found. Move the problematic code in tcp_v6_mapped_child_init() and call this new helper from tcp_v4_syn_recv_sock() before the ehash insertion. This allows the removal of one tcp_sync_mss(), since tcp_v4_syn_recv_sock() will call it with the correct context. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+937b5bbb6a815b3e5d0b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69949275.050a0220.2eeac1.0145.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260217161205.2079883-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 14:02:19 -08:00
Linus Torvalds	8bf22c33e7	Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...	2026-02-19 10:39:08 -08:00
Michael Thalmeier	571dcbeb8e	net: nfc: nci: Fix parameter validation for packet data Since commit `9c328f5474` ("net: nfc: nci: Add parameter validation for packet data") communication with nci nfc chips is not working any more. The mentioned commit tries to fix access of uninitialized data, but failed to understand that in some cases the data packet is of variable length and can therefore not be compared to the maximum packet length given by the sizeof(struct). Fixes: `9c328f5474` ("net: nfc: nci: Add parameter validation for packet data") Cc: stable@vger.kernel.org Signed-off-by: Michael Thalmeier <michael.thalmeier@hale.at> Reported-by: syzbot+740e04c2a93467a0f8c8@syzkaller.appspotmail.com Link: https://patch.msgid.link/20260218083000.301354-1-michael.thalmeier@hale.at Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 09:32:51 -08:00
Tabrez Ahmed	7b821da55b	rds: tcp: fix uninit-value in __inet_bind KMSAN reported an uninit-value access in __inet_bind() when binding an RDS TCP socket. The uninitialized memory originates from rds_tcp_conn_alloc(), which uses kmem_cache_alloc() to allocate the rds_tcp_connection structure. Specifically, the field 't_client_port_group' is incremented in rds_tcp_conn_path_connect() without being initialized first: if (++tc->t_client_port_group >= port_groups) Since kmem_cache_alloc() does not zero the memory, this field contains garbage, leading to the KMSAN report. Fix this by using kmem_cache_zalloc() to ensure the structure is zero-initialized upon allocation. Reported-by: syzbot+aae646f09192f72a68dc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=aae646f09192f72a68dc Tested-by: syzbot+aae646f09192f72a68dc@syzkaller.appspotmail.com Fixes: `a20a699255` ("net/rds: Encode cp_index in TCP source port") Signed-off-by: Tabrez Ahmed <tabreztalks@gmail.com> Reviewed-by: Charalampos Mitrodimas <charmitro@posteo.net> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260217135350.33641-1-tabreztalks@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-19 16:05:56 +01:00
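The effect of switching to a zeroing allocator can be sketched in user space, with `calloc()` standing in for `kmem_cache_zalloc()`. The struct, the function names and the wrap-to-zero branch are assumptions made for illustration; the commit only quotes the condition `++tc->t_client_port_group >= port_groups`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_conn {
    unsigned int client_port_group;
};

/* Zeroing allocation: every field starts at a defined value. */
static struct demo_conn *demo_conn_alloc(void)
{
    return calloc(1, sizeof(struct demo_conn));
}

/*
 * Pre-increment then compare, as in the quoted condition. Reading the
 * counter before it was ever written is only safe because the allocator
 * zeroed it. The reset to 0 is an assumed body for the `if`.
 */
static unsigned int demo_next_port_group(struct demo_conn *tc,
                                         unsigned int port_groups)
{
    assert(tc != NULL && port_groups > 0);
    if (++tc->client_port_group >= port_groups)
        tc->client_port_group = 0;
    return tc->client_port_group;
}
```

Had `demo_conn_alloc()` used `malloc()`, the first `++` would read an indeterminate value, which is the user-space counterpart of the uninit-value report described above.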
Allison Henderson	6bf45704a9	net/rds: Fix NULL pointer dereference in rds_tcp_accept_one Save a local pointer to new_sock->sk and hold a reference before installing callbacks in rds_tcp_accept_one. After rds_tcp_set_callbacks() or rds_tcp_reset_callbacks(), tc->t_sock is set to new_sock which may race with the shutdown path. A concurrent rds_tcp_conn_path_shutdown() may call sock_release(), which sets new_sock->sk = NULL and may eventually free sk when the refcount reaches zero. Subsequent accesses to new_sock->sk->sk_state would dereference NULL, causing the crash. The fix saves a local sk pointer before callbacks are installed so that sk_state can be accessed safely even after new_sock->sk is nulled, and uses sock_hold()/sock_put() to ensure sk itself remains valid for the duration. Fixes: `826c1004d4` ("net/rds: rds_tcp_conn_path_shutdown must not discard messages") Reported-by: syzbot+96046021045ffe6d7709@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=96046021045ffe6d7709 Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260216222643.2391390-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-19 15:57:56 +01:00
Jakub Kicinski	284f1f176f	netfilter pull request nf-26-02-17 Merge tag 'nf-26-02-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net The following patchset contains Netfilter fixes for net: 1) Add missing __rcu annotations to NAT helper hook pointers in Amanda, FTP, IRC, SNMP and TFTP helpers. From Sun Jian. 2-4): - Add global spinlock to serialize nft_counter fetch+reset operations. - Use atomic64_xchg() for nft_quota reset instead of read+subtract pattern. Note AI review detects a race in this change but it isn't new. The 'racing' bit only exists to prevent constant stream of 'quota expired' notifications. - Revert commit_mutex usage in nf_tables reset path, it caused circular lock dependency. All from Brian Witte. 5) Fix uninitialized l3num value in nf_conntrack_h323 helper. 6) Fix musl libc compatibility in netfilter_bridge.h UAPI header. This change isn't nice (UAPI headers should not include libc headers), but as-is musl builds may fail due to redefinition of struct ethhdr.
7) Fix protocol checksum validation in IPVS for IPv6 with extension headers, from Julian Anastasov. 8) Fix device reference leak in IPVS when netdev goes down. Also from Julian. 9) Remove WARN_ON_ONCE when accessing forward path array, this can trigger with sufficiently long forward paths. From Pablo Neira Ayuso. 10) Fix use-after-free in nf_tables_addchain() error path, from Inseo An. * tag 'nf-26-02-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: fix use-after-free in nf_tables_addchain() net: remove WARN_ON_ONCE when accessing forward path array ipvs: do not keep dest_dst if dev is going down ipvs: skip ipv6 extension headers for csum checks include: uapi: netfilter_bridge.h: Cover for musl libc netfilter: nf_conntrack_h323: don't pass uninitialised l3num value netfilter: nf_tables: revert commit_mutex usage in reset path netfilter: nft_quota: use atomic64_xchg for reset netfilter: nft_counter: serialize reset with spinlock netfilter: annotate NAT helper hook pointers with __rcu ==================== Link: https://patch.msgid.link/20260217163233.31455-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 17:09:31 -08:00
Eric Dumazet	9395b1bb1f	ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero If net.ipv6.icmp.ratelimit is zero we do not have to call inet_getpeer_v6() and inet_peer_xrlim_allow(). Both can be very expensive under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:37 -08:00
Eric Dumazet	d8d9ef2988	ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero If net.ipv4.icmp_ratelimit is zero, we do not have to call inet_getpeer_v4() and inet_peer_xrlim_allow(). Both can be very expensive under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
Eric Dumazet	0201eedb69	ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() The following part was needed before the blamed commit, because the second argument of inet_getpeer_v6() was the prefix. /* Give more bandwidth to wider prefixes. */ if (rt->rt6i_dst.plen < 128) tmo >>= ((128 - rt->rt6i_dst.plen)>>5); Now that inet_getpeer_v6() retrieves hosts, we need to remove the @tmo adjustment, or wider prefixes like /24 allow 8x more ICMP to be sent for a given ratelimit. As we had this issue for a while, this patch changes net.ipv6.icmp.ratelimit default value from 1000ms to 100ms to avoid potential regressions. Also add a READ_ONCE() when reading net->ipv6.sysctl.icmpv6_time. Fixes: `fd0273d793` ("ipv6: Remove external dependency on rt6i_dst and rt6i_src") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260216142832.3834174-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
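The "8x" figure in the entry above follows directly from the quoted shift. As a rough illustration only, the snippet below copies that expression into a stand-alone userspace function; the name `adjusted_tmo` and the plain `unsigned int` types are assumptions made for this sketch and are not taken from the kernel source.

```c
#include <assert.h>

/*
 * Hypothetical stand-alone copy of the adjustment quoted in the commit
 * message. tmo is the rate-limit interval, plen the destination prefix
 * length in bits (0..128).
 */
static unsigned int adjusted_tmo(unsigned int tmo, unsigned int plen)
{
	assert(plen <= 128);

	/* Give more bandwidth to wider prefixes (the code being removed). */
	if (plen < 128)
		tmo >>= ((128 - plen) >> 5);

	return tmo;
}
```

For a /24, `(128 - 24) >> 5` is 3, so the interval is divided by 2^3 = 8 and eight times as many ICMP messages pass for the same ratelimit; a /128 host route is left untouched.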
Eric Dumazet	034bbd8062	icmp: prevent possible overflow in icmp_global_allow() Following expression can overflow if sysctl_icmp_msgs_per_sec is big enough. sysctl_icmp_msgs_per_sec * delta / HZ; Fixes: `4cdf507d54` ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
Ruitong Liu	be054cc66f	net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash() Commit `38a6f08657` ("net: sched: support hash selecting tx queue") added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is computed as: mapping_mod = queue_mapping_max - queue_mapping + 1; The range size can be 65536 when the requested range covers all possible u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX). That value cannot be represented in a u16 and previously wrapped to 0, so tcf_skbedit_hash() could trigger a divide-by-zero: queue_mapping += skb_get_hash(skb) % params->mapping_mod; Compute mapping_mod in a wider type and reject ranges larger than U16_MAX to prevent params->mapping_mod from becoming 0 and avoid the crash. Fixes: `38a6f08657` ("net: sched: support hash selecting tx queue") Cc: stable@vger.kernel.org # 6.12+ Signed-off-by: Ruitong Liu <cnitlrt@gmail.com> Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:27:39 -08:00
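The wrap described in the act_skbedit entry above can be reproduced in userspace. This is a minimal sketch and not the kernel patch: the helper names `range_size_u16` and `range_size_checked` are hypothetical, `uint16_t` stands in for the kernel's `u16`, and the `last < first` rejection is an assumption added so the sketch is self-consistent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper showing the pre-fix arithmetic: the inclusive range
 * size is stored in 16 bits, so a size of 65536 wraps to 0.
 */
static uint16_t range_size_u16(uint16_t first, uint16_t last)
{
	assert(first <= last);
	return (uint16_t)(last - first + 1);
}

/*
 * Hypothetical helper showing the idea of the fix: compute the size in a
 * wider type and reject anything that does not fit in 16 bits.
 * Returns 0 on success and -1 when the range is rejected.
 */
static int range_size_checked(uint16_t first, uint16_t last, uint16_t *size)
{
	uint32_t wide;

	if (last < first)
		return -1;

	wide = (uint32_t)last - first + 1;
	if (wide > UINT16_MAX)
		return -1;

	*size = (uint16_t)wide;
	return 0;
}
```

With `first = 0` and `last = UINT16_MAX` the narrow helper yields 0, which is exactly the value that would later be used as a modulus; the checked helper rejects that range instead.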
Eric Dumazet	ad5dfde2a5	ping: annotate data-races in ping_lookup() isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if are read locklessly in ping_lookup(). Add READ_ONCE()/WRITE_ONCE() annotations. The race on isk->inet_rcv_saddr is probably coming from IPv6 support, but does not deserve a specific backport. Fixes: `dbca1596bb` ("ping: convert to RCU lookups, get rid of rwlock") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:11:08 -08:00
Eric Dumazet	0943404b1f	net: do not delay zero-copy skbs in skb_attempt_defer_free() After the blamed commit, TCP tx zero copy notifications could be arbitrarily delayed and cause regressions in applications waiting for them. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `e20dfbad8a` ("net: fix napi_consume_skb() with alien skbs") Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:06:18 -08:00
Arnd Bergmann	6e980df452	net: psp: select CONFIG_SKB_EXTENSIONS psp now uses skb extensions, failing to build when that is disabled: In file included from include/net/psp.h:7, from net/psp/psp_sock.c:9: include/net/psp/functions.h: In function '__psp_skb_coalesce_diff': include/net/psp/functions.h:60:13: error: implicit declaration of function 'skb_ext_find'; did you mean 'skb_ext_copy'? [-Wimplicit-function-declaration] 60 \| a = skb_ext_find(one, SKB_EXT_PSP); \| ^~~~~~~~~~~~ \| skb_ext_copy include/net/psp/functions.h:60:31: error: 'SKB_EXT_PSP' undeclared (first use in this function) 60 \| a = skb_ext_find(one, SKB_EXT_PSP); \| ^~~~~~~~~~~ include/net/psp/functions.h:60:31: note: each undeclared identifier is reported only once for each function it appears in include/net/psp/functions.h: In function '__psp_sk_rx_policy_check': include/net/psp/functions.h:94:53: error: 'SKB_EXT_PSP' undeclared (first use in this function) 94 \| struct psp_skb_ext *pse = skb_ext_find(skb, SKB_EXT_PSP); \| ^~~~~~~~~~~ net/psp/psp_sock.c: In function 'psp_sock_recv_queue_check': net/psp/psp_sock.c:164:41: error: 'SKB_EXT_PSP' undeclared (first use in this function) 164 \| pse = skb_ext_find(skb, SKB_EXT_PSP); \| ^~~~~~~~~~~ Select the Kconfig symbol as we do from its other users. Fixes: `6b46ca260e` ("net: psp: add socket security association code") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260216105500.2382181-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:05:29 -08:00
Linus Torvalds	87a367f1bf	This adds support for the upcoming aes256k key type in CephX that is based on Kerberos 5 and brings a bunch of assorted CephFS fixes from Ethan and Sam. One of Sam's patches in particular undoes a change in the fscrypt area that had an inadvertent side effect of making CephFS behave as if mounted with wsize=4096 and leading to the corresponding degradation in performance, especially for sequential writes. Merge tag 'ceph-for-7.0-rc1' of https://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "This adds support for the upcoming aes256k key type in CephX that is based on Kerberos 5 and brings a bunch of assorted CephFS fixes from Ethan and Sam. 
One of Sam's patches in particular undoes a change in the fscrypt area that had an inadvertent side effect of making CephFS behave as if mounted with wsize=4096 and leading to the corresponding degradation in performance, especially for sequential writes" * tag 'ceph-for-7.0-rc1' of https://github.com/ceph/ceph-client: ceph: assert loop invariants in ceph_writepages_start() ceph: remove error return from ceph_process_folio_batch() ceph: fix write storm on fscrypted files ceph: do not propagate page array emplacement errors as batch errors ceph: supply snapshot context in ceph_uninline_data() ceph: supply snapshot context in ceph_zero_partial_object() libceph: adapt ceph_x_challenge_blob hashing and msgr1 message signing libceph: add support for CEPH_CRYPTO_AES256KRB5 libceph: introduce ceph_crypto_key_prepare() libceph: generalize ceph_x_encrypt_offset() and ceph_x_encrypt_buflen() libceph: define and enforce CEPH_MAX_KEY_LEN	2026-02-17 15:18:51 -08:00
Linus Torvalds	505d195b0f	Char/Misc/IIO driver changes for 7.0-rc1 Here is the big set of char/misc/iio and other smaller driver subsystem changes for 7.0-rc1. Lots of little things in here, including: - Loads of iio driver changes and updates and additions - gpib driver updates - interconnect driver updates - i3c driver updates - hwtracing (coresight and intel) driver updates - deletion of the obsolete mwave driver - binder driver updates (rust and c versions) - mhi driver updates (causing a merge conflict, see below) - mei driver updates - fsi driver updates - eeprom driver updates - lots of other small char and misc driver updates and cleanups All of these have been in linux-next for a while, with no reported issues except for a merge conflict with your tree due to the mhi driver changes in the drivers/net/wireless/ath/ath12k/mhi.c file. To fix that up, just delete the "auto_queue" structure fields being set, see this message for the full change needed: https://lore.kernel.org/r/aXD6X23btw8s-RZP@sirena.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'char-misc-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc/IIO driver updates from Greg KH: "Here is the big set of char/misc/iio and other smaller driver subsystem changes for 7.0-rc1. 
Lots of little things in here, including: - Loads of iio driver changes and updates and additions - gpib driver updates - interconnect driver updates - i3c driver updates - hwtracing (coresight and intel) driver updates - deletion of the obsolete mwave driver - binder driver updates (rust and c versions) - mhi driver updates (causing a merge conflict, see below) - mei driver updates - fsi driver updates - eeprom driver updates - lots of other small char and misc driver updates and cleanups All of these have been in linux-next for a while, with no reported issues" * tag 'char-misc-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (297 commits) mux: mmio: fix regmap leak on probe failure rust_binder: return p from rust_binder_transaction_target_node() drivers: android: binder: Update ARef imports from sync::aref rust_binder: fix needless borrow in context.rs iio: magn: mmc5633: Fix Kconfig for combination of I3C as module and driver builtin iio: sca3000: Fix a resource leak in sca3000_probe() iio: proximity: rfd77402: Add interrupt handling support iio: proximity: rfd77402: Document device private data structure iio: proximity: rfd77402: Use devm-managed mutex initialization iio: proximity: rfd77402: Use kernel helper for result polling iio: proximity: rfd77402: Align polling timeout with datasheet iio: cros_ec: Allow enabling/disabling calibration mode iio: frequency: ad9523: correct kernel-doc bad line warning iio: buffer: buffer_impl.h: fix kernel-doc warnings iio: gyro: itg3200: Fix unchecked return value in read_raw MAINTAINERS: add entry for ADE9000 driver iio: accel: sca3000: remove unused last_timestamp field iio: accel: adxl372: remove unused int2_bitmask field iio: adc: ad7766: Use iio_trigger_generic_data_rdy_poll() iio: magnetometer: Remove IRQF_ONESHOT ...	2026-02-17 09:11:04 -08:00
Inseo An	71e99ee20f	netfilter: nf_tables: fix use-after-free in nf_tables_addchain() nf_tables_addchain() publishes the chain to table->chains via list_add_tail_rcu() (in nft_chain_add()) before registering hooks. If nf_tables_register_hook() then fails, the error path calls nft_chain_del() (list_del_rcu()) followed by nf_tables_chain_destroy() with no RCU grace period in between. This creates two use-after-free conditions: 1) Control-plane: nf_tables_dump_chains() traverses table->chains under rcu_read_lock(). A concurrent dump can still be walking the chain when the error path frees it. 2) Packet path: for NFPROTO_INET, nf_register_net_hook() briefly installs the IPv4 hook before IPv6 registration fails. Packets entering nft_do_chain() via the transient IPv4 hook can still be dereferencing chain->blob_gen_X when the error path frees the chain. Add synchronize_rcu() between nft_chain_del() and the chain destroy so that all RCU readers -- both dump threads and in-flight packet evaluation -- have finished before the chain is freed. Fixes: `91c7b38dc9` ("netfilter: nf_tables: use new transaction infrastructure to handle chain") Signed-off-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Pablo Neira Ayuso	008e7a7c29	net: remove WARN_ON_ONCE when accessing forward path array Although unlikely, recent support for IPIP tunnels increases chances of reaching this WARN_ON_ONCE if userspace manages to build a sufficiently long forward path. Remove it. Fixes: `ddb94eafab` ("net: resolve forwarding path from virtual netdevice and HW destination address") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Julian Anastasov	8fde939b02	ipvs: do not keep dest_dst if dev is going down There is a race between the netdev notifier ip_vs_dst_event() and the code that caches dst with a dev that is going down. As the FIB can be notified for the closed device after our handler finishes, it is possible for a valid route to be returned and cached, resulting in a leaked dev reference until the dest is removed. To prevent a new dest_dst from being attached to the dest just after the handler dropped the old one, add a netif_running() check to make sure the notifier handler is not currently running for a device that is closing. Fixes: `7a4f0761fc` ("IPVS: init and cleanup restructuring") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Julian Anastasov	05cfe9863e	ipvs: skip ipv6 extension headers for csum checks Protocol checksum validation fails for IPv6 if there are extension headers before the protocol header. iph->len already contains its offset, so use it to fix the problem. Fixes: `2906f66a56` ("ipvs: SCTP Trasport Loadbalancing Support") Fixes: `0bbdd42b7e` ("IPVS: Extend protocol DNAT/SNAT and state handlers") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Florian Westphal	a6d28eb8ef	netfilter: nf_conntrack_h323: don't pass uninitialised l3num value Mihail Milev reports: Error: UNINIT (CWE-457): net/netfilter/nf_conntrack_h323_main.c:1189:2: var_decl: Declaring variable "tuple" without initializer. net/netfilter/nf_conntrack_h323_main.c:1197:2: uninit_use_in_call: Using uninitialized value "tuple.src.l3num" when calling "__nf_ct_expect_find". net/netfilter/nf_conntrack_expect.c:142:2: read_value: Reading value "tuple->src.l3num" when calling "nf_ct_expect_dst_hash". 1195\| tuple.dst.protonum = IPPROTO_TCP; 1196\| 1197\|-> exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple); 1198\| if (exp && exp->master == ct) 1199\| return exp; Switch this to a C99 initialiser and set the l3num value. Fixes: `f587de0e2f` ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Brian Witte	7f261bb906	netfilter: nf_tables: revert commit_mutex usage in reset path It causes circular lock dependency between commit_mutex, nfnl_subsys_ipset and nlk_cb_mutex when nft reset, ipset list, and iptables-nft with '-m set' rule run at the same time. Previous patches made it safe to run individual reset handlers concurrently so commit_mutex is no longer required to prevent this. Fixes: `bd662c4218` ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: `3d483faa66` ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: `3cb03edb4d` ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Link: https://lore.kernel.org/all/aUh_3mVRV8OrGsVo@strlen.de/ Reported-by: <syzbot+ff16b505ec9152e5f448@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=ff16b505ec9152e5f448 Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Brian Witte	30c4d7fb59	netfilter: nft_quota: use atomic64_xchg for reset Use atomic64_xchg() to atomically read and zero the consumed value on reset, which is simpler than the previous read+sub pattern and doesn't require lock serialization. Fixes: `bd662c4218` ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: `3d483faa66` ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: `3cb03edb4d` ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
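The difference between the two reset patterns named in the nft_quota entry above can be sketched with C11 atomics in userspace. This is an illustration under stated assumptions, not the kernel code: `atomic_exchange()` from `<stdatomic.h>` stands in for the kernel's `atomic64_xchg()`, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical read+sub reset: two steps. If two resetters run
 * concurrently, both may load the same value and both subtract it,
 * driving the counter below zero (it wraps, being unsigned).
 */
static uint64_t reset_read_then_sub(_Atomic uint64_t *consumed)
{
	uint64_t seen;

	assert(consumed);
	seen = atomic_load(consumed);
	atomic_fetch_sub(consumed, seen);
	return seen;
}

/*
 * Hypothetical exchange-based reset: one atomic step that returns the old
 * value and stores zero, so only one caller can ever observe a given
 * amount of consumed quota.
 */
static uint64_t reset_xchg(_Atomic uint64_t *consumed)
{
	assert(consumed);
	return atomic_exchange(consumed, 0);
}
```

Run single-threaded, both helpers return the same result; the distinction only matters when resets race, which is why the exchange form needs no additional lock.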
Brian Witte	779c60a519	netfilter: nft_counter: serialize reset with spinlock Add a global static spinlock to serialize counter fetch+reset operations, preventing concurrent dump-and-reset from underrunning values. The lock is taken before fetching the total so that two parallel resets cannot both read the same counter values and then both subtract them. A global lock is used for simplicity since resets are infrequent. If this becomes a bottleneck, it can be replaced with a per-net lock later. Fixes: `bd662c4218` ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: `3d483faa66` ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: `3cb03edb4d` ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Sun Jian	07919126ec	netfilter: annotate NAT helper hook pointers with __rcu The NAT helper hook pointers are updated and dereferenced under RCU rules, but lack the proper __rcu annotation. This makes sparse report address space mismatches when the hooks are used with rcu_dereference(). Add the missing __rcu annotations to the global hook pointer declarations and definitions in Amanda, FTP, IRC, SNMP and TFTP. No functional change intended. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-17 15:04:20 +01:00
Eric Dumazet	26f29b1491	net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT CONFIG_PREEMPT_RT is special, make this clear in backlog_lock_irq_save() and backlog_unlock_irq_restore(). The issue shows up with CONFIG_DEBUG_IRQFLAGS=y raw_local_irq_restore() called with IRQs enabled WARNING: kernel/locking/irqflag-debug.c:10 at warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10, CPU#1: aoe_tx0/1321 Modules linked in: CPU: 1 UID: 0 PID: 1321 Comm: aoe_tx0 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10 Call Trace: <TASK> backlog_unlock_irq_restore net/core/dev.c:253 [inline] enqueue_to_backlog+0x525/0xcf0 net/core/dev.c:5347 netif_rx_internal+0x120/0x550 net/core/dev.c:5659 __netif_rx+0xa9/0x110 net/core/dev.c:5679 loopback_xmit+0x43a/0x660 drivers/net/loopback.c:90 __netdev_start_xmit include/linux/netdevice.h:5275 [inline] netdev_start_xmit include/linux/netdevice.h:5284 [inline] xmit_one net/core/dev.c:3864 [inline] dev_hard_start_xmit+0x2df/0x830 net/core/dev.c:3880 __dev_queue_xmit+0x16f4/0x3990 net/core/dev.c:4829 dev_queue_xmit include/linux/netdevice.h:3384 [inline] Fixes: `27a01c1969` ("net: fully inline backlog_unlock_irq_restore()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260213120427.2914544-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-17 13:28:10 +01:00
Nikolay Aleksandrov	8b769e311a	net: bridge: mcast: always update mdb_n_entries for vlan contexts syzbot triggered a warning[1] about the number of mdb entries in a context. It turned out that there are multiple ways to trigger that warning today (some got added during the years), the root cause of the problem is that the increase is done conditionally, and over the years these different conditions increased so there were new ways to trigger the warning, that is to do a decrease which wasn't paired with a previous increase. For example one way to trigger it is with flush: $ ip l add br0 up type bridge vlan_filtering 1 mcast_snooping 1 $ ip l add dumdum up master br0 type dummy $ bridge mdb add dev br0 port dumdum grp 239.0.0.1 permanent vid 1 $ ip link set dev br0 down $ ip link set dev br0 type bridge mcast_vlan_snooping 1 ^^^^ this will enable snooping, but will not update mdb_n_entries because in __br_multicast_enable_port_ctx() we check !netif_running $ bridge mdb flush dev br0 ^^^ this will trigger the warning because it will delete the pg which we added above, which will try to decrease mdb_n_entries Fix the problem by removing the conditional increase and always keep the count up-to-date while the vlan exists. In order to do that we have to first initialize it on port-vlan context creation, and then always increase or decrease the value regardless of mcast options. To keep the current behaviour we have to enforce the mdb limit only if the context is port's or if the port-vlan's mcast snooping is enabled. 
[1] ------------[ cut here ]------------ n == 0 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825, CPU#0: syz.4.4607/22043 Modules linked in: CPU: 0 UID: 0 PID: 22043 Comm: syz.4.4607 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline] RIP: 0010:br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline] RIP: 0010:br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825 Code: 41 5f 5d e9 04 7a 48 f7 e8 3f 73 5c f7 90 0f 0b 90 e9 cf fd ff ff e8 31 73 5c f7 90 0f 0b 90 e9 16 fd ff ff e8 23 73 5c f7 90 <0f> 0b 90 e9 60 fd ff ff e8 15 73 5c f7 eb 05 e8 0e 73 5c f7 48 8b RSP: 0018:ffffc9000c207220 EFLAGS: 00010293 RAX: ffffffff8a68042d RBX: ffff88807c6f1800 RCX: ffff888066e90000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888066e90000 R09: 000000000000000c R10: 000000000000000c R11: 0000000000000000 R12: ffff8880303ef800 R13: dffffc0000000000 R14: ffff888050eb11c4 R15: 1ffff1100a1d6238 FS: 00007fa45921b6c0(0000) GS:ffff8881256f5000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa4591f9ff8 CR3: 0000000081df2000 CR4: 00000000003526f0 Call Trace: <TASK> br_mdb_flush_pgs net/bridge/br_mdb.c:1525 [inline] br_mdb_flush net/bridge/br_mdb.c:1544 [inline] br_mdb_del_bulk+0x5e2/0xb20 net/bridge/br_mdb.c:1561 rtnl_mdb_del+0x48a/0x640 net/core/rtnetlink.c:-1 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6967 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa45839aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fa45921b028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fa458615fa0 RCX: 00007fa45839aeb9 RDX: 0000000000000000 RSI: 00002000000000c0 RDI: 0000000000000004 RBP: 00007fa458408c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa458616038 R14: 00007fa458615fa0 R15: 00007fff0b59fae8 </TASK> Fixes: `b57e8d870d` ("net: bridge: Maintain number of MDB entries in net_bridge_mcast_port") Reported-by: syzbot+d5d1b7343531d17bd3c5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/aYrWbRp83MQR1ife@debil/T/#t Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://patch.msgid.link/20260213070031.1400003-2-nikolay@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-17 13:00:14 +01:00
Allison Henderson	da29e453dc	net/rds: rds_sendmsg should not discard payload_len Commit `3db6e0d172` ("rds: use RCU to synchronize work-enqueue with connection teardown") modifies rds_sendmsg to avoid enqueueing work while a teardown is in progress. However, it also changed the return value of rds_sendmsg to that of rds_send_xmit instead of the payload_len. This means the user may incorrectly receive errno values when it should have simply received a payload of 0 while the peer attempts a reconnection. So this patch corrects the teardown handling code to only use the out error path in that case, thus restoring the original payload_len return value. Fixes: `3db6e0d172` ("rds: use RCU to synchronize work-enqueue with connection teardown") Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260213035409.1963391-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-17 12:03:57 +01:00
Linus Torvalds	011af61b9f	- 9p/xen racy double-free fix - track 9p RPC waiting time as IO Merge tag '9p-for-7.0-rc1' of https://github.com/martinetd/linux Pull 9p updates from Dominique Martinet: - 9p/xen racy double-free fix - track 9p RPC waiting time as IO * tag '9p-for-7.0-rc1' of https://github.com/martinetd/linux: 9p/xen: protect xen_9pfs_front_free against concurrent calls 9p: Track 9P RPC waiting time as IO wait: Introduce io_wait_event_killable()	2026-02-15 10:24:46 -08:00
Stefano Garzarella	6a997f38bd	vsock: prevent child netns mode switch from local to global A "local" namespace can change its `child_ns_mode` sysctl to "global", allowing nested namespaces to access global CIDs. This can be exploited by an unprivileged user who gained CAP_NET_ADMIN through a user namespace. Prevent this by rejecting writes that attempt to set `child_ns_mode` to "global" when the current namespace's mode is "local". Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-3-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
Stefano Garzarella	9dd391493a	vsock: fix child netns mode initialization When a new network namespace is created, vsock_net_init() correctly initializes the namespace's mode by reading the parent's `child_ns_mode` via vsock_net_child_mode(). However, the `child_ns_mode` of the new namespace was always hardcoded to VSOCK_NET_MODE_GLOBAL, regardless of its own mode. This means that if a parent namespace has `child_ns_mode` set to "local", the child namespace correctly gets mode "local", but its `child_ns_mode` is reset to "global". As a result, further nested namespaces will incorrectly get mode "global" instead of inheriting "local", breaking the expected propagation of the mode through nested namespaces. Fix this by initializing `child_ns_mode` to the namespace's own mode, so the setting propagates correctly through all levels of nesting. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-2-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
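The propagation problem described in the vsock entry above can be modelled with a two-field struct. The sketch below is a toy model only: `struct ns_model`, `enum ns_mode` and both init functions are hypothetical names invented for illustration and do not mirror the kernel's data structures.

```c
#include <assert.h>

/* Hypothetical model of a namespace: its own mode and the mode it hands
 * to namespaces created beneath it. */
enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

struct ns_model {
	enum ns_mode mode;
	enum ns_mode child_mode;
};

/* Pre-fix behaviour as described: the child's own mode is inherited, but
 * its child_mode is always reset to "global". */
static void ns_init_hardcoded(struct ns_model *child, const struct ns_model *parent)
{
	assert(child && parent);
	child->mode = parent->child_mode;
	child->child_mode = NS_MODE_GLOBAL;
}

/* Post-fix behaviour as described: child_mode starts out equal to the
 * namespace's own mode, so the setting carries through nesting. */
static void ns_init_inherited(struct ns_model *child, const struct ns_model *parent)
{
	assert(child && parent);
	child->mode = parent->child_mode;
	child->child_mode = child->mode;
}
```

Starting from a parent whose `child_mode` is local, two levels of `ns_init_hardcoded()` produce a grandchild in global mode, while two levels of `ns_init_inherited()` keep the grandchild local.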
Kuniyuki Iwashima	8244f959e2	ipv6: Fix out-of-bound access in fib6_add_rt2node(). syzbot reported out-of-bound read in fib6_add_rt2node(). [0] When IPv6 route is created with RTA_NH_ID, struct fib6_info does not have the trailing struct fib6_nh. The cited commit started to check !iter->fib6_nh->fib_nh_gw_family to ensure that rt6_qualify_for_ecmp() will return false for iter. If iter->nh is not NULL, rt6_qualify_for_ecmp() returns false anyway. Let's check iter->nh before reading iter->fib6_nh and avoid OOB read. [0]: BUG: KASAN: slab-out-of-bounds in fib6_add_rt2node+0x349c/0x3500 net/ipv6/ip6_fib.c:1142 Read of size 1 at addr ffff8880384ba6de by task syz.0.18/5500 CPU: 0 UID: 0 PID: 5500 Comm: syz.0.18 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xba/0x230 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 fib6_add_rt2node+0x349c/0x3500 net/ipv6/ip6_fib.c:1142 fib6_add_rt2node_nh net/ipv6/ip6_fib.c:1363 [inline] fib6_add+0x910/0x18c0 net/ipv6/ip6_fib.c:1531 __ip6_ins_rt net/ipv6/route.c:1351 [inline] ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3957 inet6_rtm_newroute+0x268/0x19e0 net/ipv6/route.c:5660 rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:6958 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 
net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f9316b9aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd8809b678 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f9316e15fa0 RCX: 00007f9316b9aeb9 RDX: 0000000000000000 RSI: 0000200000004380 RDI: 0000000000000003 RBP: 00007f9316c08c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f9316e15fac R14: 00007f9316e15fa0 R15: 00007f9316e15fa0 </TASK> Allocated by task 5499: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __do_kmalloc_node mm/slub.c:5657 [inline] __kmalloc_noprof+0x40c/0x7e0 mm/slub.c:5669 kmalloc_noprof include/linux/slab.h:961 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] fib6_info_alloc+0x30/0xf0 net/ipv6/ip6_fib.c:155 ip6_route_info_create+0x142/0x860 net/ipv6/route.c:3820 ip6_route_add+0x49/0x1b0 net/ipv6/route.c:3949 inet6_rtm_newroute+0x268/0x19e0 net/ipv6/route.c:5660 rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:6958 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg 
net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: `bbf4a17ad9` ("ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF") Reported-by: syzbot+707d6a5da1ab9e0c6f9d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/698cbfba.050a0220.2eeac1.009d.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://patch.msgid.link/20260211175133.3657034-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:24:28 -08:00
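The ordering issue in the fib6_add_rt2node() commit above can be illustrated with a small userspace sketch. The structure below is a hypothetical stand-in for a route object whose trailing nexthop storage exists only when no external nexthop object is attached; the names (`route_sketch`, `route_sketch_has_own_gateway`) are illustrative and not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the trailing per-route nexthop. */
struct nh_sketch {
    unsigned char gw_family;   /* non-zero when a gateway is configured */
};

struct route_sketch {
    void *nh;                  /* non-NULL: route uses an external nexthop object */
    struct nh_sketch own_nh[]; /* trailing storage, allocated only when nh is NULL */
};

/* Allocate a zeroed route, with or without the trailing nexthop storage. */
struct route_sketch *route_sketch_alloc(bool with_trailing_nh)
{
    size_t sz = sizeof(struct route_sketch);

    if (with_trailing_nh)
        sz += sizeof(struct nh_sketch);
    return calloc(1, sz);
}

/* Test the pointer first: && short-circuits, so own_nh is never read
 * for routes that were allocated without the trailing storage. */
bool route_sketch_has_own_gateway(const struct route_sketch *rt)
{
    return !rt->nh && rt->own_nh[0].gw_family != 0;
}
```

Swapping the two operands of `&&` would reintroduce the out-of-bounds read for routes allocated without trailing storage, which is the class of bug KASAN flagged in the report.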
Qanux	6db8b56eed	ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() On the receive path, __ioam6_fill_trace_data() uses trace->nodelen to decide how much data to write for each node. It trusts this field as-is from the incoming packet, with no consistency check against trace->type (the 24-bit field that tells which data items are present). A crafted packet can set nodelen=0 while setting type bits 0-21, causing the function to write ~100 bytes past the allocated region (into skb_shared_info), which corrupts adjacent heap memory and leads to a kernel panic. Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to derive the expected nodelen from the type field, and use it: - in ioam6_iptunnel.c (send path, existing validation) to replace the open-coded computation; - in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose nodelen is inconsistent with the type field, before any data is written. Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to 0xff1ffc00). Fixes: `9ee11f0fff` ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:24:05 -08:00
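The consistency check described in the IOAM commit above can be sketched as follows. This is a userspace approximation, not the kernel helper: the short-field mask is the value quoted in the commit message, while the wide-field mask (bits 8-10, 8 octets each) and the host-order layout with bit 0 as the most significant bit of a 32-bit word are assumptions based on RFC 9197. The `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-7 and 11-21 of the trace type: 4-octet fields (mask from the commit). */
#define SKETCH_MASK_SHORT_FIELDS 0xff1ffc00u
/* Bits 8-10 of the trace type: 8-octet fields (assumption, per RFC 9197). */
#define SKETCH_MASK_WIDE_FIELDS  0x00e00000u

/* Portable population count: clears the lowest set bit on each iteration. */
static unsigned sketch_popcount32(uint32_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Expected per-node data length, in 4-octet units, derived from the type field. */
unsigned sketch_compute_nodelen(uint32_t type)
{
    return sketch_popcount32(type & SKETCH_MASK_SHORT_FIELDS) +
           2 * sketch_popcount32(type & SKETCH_MASK_WIDE_FIELDS);
}

/* A receiver would drop the packet when this returns 0. */
int sketch_nodelen_is_consistent(uint32_t type, unsigned nodelen)
{
    return nodelen == sketch_compute_nodelen(type);
}

/* The crafted case from the commit message: type bits 0-21 set, nodelen = 0. */
void sketch_ioam_demo(void)
{
    uint32_t type = 0xfffffc00u;                 /* bits 0-21 set */

    assert(sketch_compute_nodelen(type) == 25);  /* 25 * 4 = 100 octets */
    assert(!sketch_nodelen_is_consistent(type, 0));
}
```

Under these assumptions the crafted type yields 19 short fields and 3 wide fields, i.e. 100 octets per node, which lines up with the "~100 bytes" overflow the commit message reports.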
Linus Torvalds	a353e7260b	virtio,vhost,vdpa: features, fixes - in order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably - dma alignment fixes for non cache coherent systems Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmmO9rYPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpBzYH/2wUPo3T8/CKGFjF7QSPzgL/UI2NhnP8iSm4 btg1zVnrWmJK6vVIwnf5UsG8dFKsMcp/BEGCewTmIddNM2wEeSul0kKDXtIzrK/U jdA9bJrUKLMeU7IFKne1Fip/yE+5nkWJttWXXyVRJtOJrYxZlkWfqSns3qYcPvsG g7HXvF6tmici5uoKdRCLqHtQCWsvpnvTD5A7qoZAlEUjlQCDKKmuukpN9oK5UYLl 9uUOgPQAJaxIwx1C4uP7L+AwbLUcN/+MtrvQRNz+sFpP3sN9oXeDJKBpNQp109NB JGk1sUsINL+54Cmdd5RwZ9T1vBJyRDrdWRDy1yHj95LildaPfh0= =pnob -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: 
refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ...	2026-02-13 12:02:18 -08:00
Jeremy Kerr	a6a9bc544b	net: mctp: ensure our nlmsg responses are initialised Syed Faraz Abrar (@farazsth98) from Zellic, and Pumpkin (@u1f383) from DEVCORE Research Team working with Trend Micro Zero Day Initiative report that an RTM_GETNEIGH will return uninitialised data in the pad bytes of the ndmsg data. Ensure we're initialising the netlink data to zero, in the link, addr and neigh response messages. Fixes: `831119f887` ("mctp: Add neighbour netlink interface") Fixes: `06d2f4c583` ("mctp: Add netlink route management") Fixes: `583be982d9` ("mctp: Add device handling and netlink interface") Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260209-dev-mctp-nlmsg-v1-1-f1e30c346a43@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-12 18:35:45 -08:00
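The zero-before-fill pattern the MCTP commit above relies on can be sketched in userspace C. The header layout here is a hypothetical struct with explicit pad fields, loosely modelled on a netlink neighbour header; it is not the kernel's definition, and `neigh_hdr_sketch_fill` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response header with explicit pad fields. */
struct neigh_hdr_sketch {
    uint8_t  family;
    uint8_t  pad1;
    uint16_t pad2;
    int32_t  ifindex;
    uint16_t state;
    uint8_t  flags;
    uint8_t  type;
};

/* Zero the whole header first, then set only the fields we have values for.
 * Anything left unset (pad bytes, flags, type) reaches the reader as zero
 * rather than as stale buffer contents. */
void neigh_hdr_sketch_fill(struct neigh_hdr_sketch *hdr, uint8_t family,
                           int32_t ifindex, uint16_t state)
{
    memset(hdr, 0, sizeof(*hdr));
    hdr->family = family;
    hdr->ifindex = ifindex;
    hdr->state = state;
}

/* Start from a dirty buffer to show that no stale bytes survive. */
void neigh_hdr_sketch_demo(void)
{
    struct neigh_hdr_sketch hdr;

    memset(&hdr, 0xAA, sizeof(hdr));
    neigh_hdr_sketch_fill(&hdr, 45, 3, 0x80);
    assert(hdr.pad1 == 0 && hdr.pad2 == 0);
}
```

Without the `memset`, the pad fields in this model would still hold the 0xAA pattern, which is the userspace analogue of leaking uninitialised memory in a response.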
Linus Torvalds	7449f86baf	NFS Client Updates for Linux 7.0 New Features: * Use an LRU list for returning unused delegations * Introduce a KConfig option to disable NFS v4.0 and make NFS v4.1 the default Bugfixes: * NFS/localio: Handle short writes by retrying * NFS/localio: prevent direct reclaim recursion into NFS via nfs_writepages * NFS/localio: use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit * NFS/localio: remove -EAGAIN handling in nfs_local_doio() * pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN * fs/nfs: Fix a readdir slow-start regression * SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Other Cleanups and Improvements: * A few other NFS/localio cleanups * Various other delegation handling cleanups from Christoph * Unify security_inode_listsecurity() calls * Improvements to NFSv4 lease handling * Clean up SUNRPC _debug fields when CONFIG_SUNRPC_DEBUG is not set -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmmORX0ACgkQ18tUv7Cl QOtFORAAyCwTst5iEPRJ9rKZ/Kl39zHbA/QUn3CmmVkGlOBj0j7mWRyU5X0vlIQ9 mUF3Ikm1XYpsxPTKBEELVumPkggT2nfsFx5518BrpRTODibzc/CZ10/z7q4qarvI UhdFlt9SRG4RhhOdAaThF6XVUsRSwGwVZo/YyYemCc/evjNVyXa0wfwbDl9l4Nzr 1Sxt2/zeq3Eu4IfrxQpFM+0UuSScmVODSe8Jm4GnmlU/Q7x+onW35IvyuzTkgDwG 8PAeH4b5uADY9VWnTHpvr1fQNnBoEw8b4qr9a7AXQKRIcPGMvgKkdK+f6hOh1cEs +O+L4+uixo7QXudnWC27brZSyHwDIVVaJGPF/kNv4O2GKDyEcbsHtQv2G1+1+PtR FCtRFGpLq2pZxb9SY/s73FKp6a8bd81FAtzAL7iYU+2FDtvEDKss1nG6sQNG1+Z4 G8rI79PoimR4I6Jr5hk4sl8pM8wJVLZdcW+ytrEKl9FC+rFDrP9lVzHYArTFgIky N/IjEflejRfZ9bYIZ9/CYnFZC3Htrm8K9zerCRDsf96tvhxkX8FZM8tuZpHEMIbx Cx8XKCk+ubqLIF2mT+FKOc5T6CUmMiGRNagLkx0h0mbvRSI8HTpgQZGrbYkMk0Hs abUhvH73pRi0LRvkzPHfcNaZ7Y/mFBYfwBMwTUWJzh6CEgXnpks= =CwZq -----END PGP SIGNATURE----- Merge tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "New Features: - Use an LRU list for returning unused delegations - Introduce a KConfig option to disable NFS v4.0 and make 
NFS v4.1 the default Bugfixes: - NFS/localio: - Handle short writes by retrying - Prevent direct reclaim recursion into NFS via nfs_writepages - Use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit - Remove -EAGAIN handling in nfs_local_doio() - pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN - fs/nfs: Fix a readdir slow-start regression - SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Other cleanups and improvements: - A few other NFS/localio cleanups - Various other delegation handling cleanups from Christoph - Unify security_inode_listsecurity() calls - Improvements to NFSv4 lease handling - Clean up SUNRPC _debug fields when CONFIG_SUNRPC_DEBUG is not set" * tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (60 commits) SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path nfs: nfs4proc: Convert comma to semicolon SUNRPC: Change list definition method sunrpc: rpc_debug and others are defined even if CONFIG_SUNRPC_DEBUG unset NFSv4: limit lease period in nfs4_set_lease_period() NFSv4: pass lease period in seconds to nfs4_set_lease_period() nfs: unify security_inode_listsecurity() calls fs/nfs: Fix readdir slow-start regression pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN NFS: fix delayed delegation return handling NFS: simplify error handling in nfs_end_delegation_return NFS: fold nfs_abort_delegation_return into nfs_end_delegation_return NFS: remove the delegation == NULL check in nfs_end_delegation_return NFS: use bool for the issync argument to nfs_end_delegation_return NFS: return void from ->return_delegation NFS: return void from nfs4_inode_make_writeable NFS: Merge CONFIG_NFS_V4_1 with CONFIG_NFS_V4 NFS: Add a way to disable NFS v4.0 via KConfig NFS: Move sequence slot operations into minorversion operations NFS: Pass a struct nfs_client to nfs4_init_sequence() ...	2026-02-12 17:49:33 -08:00
Linus Torvalds	311aa68319	RDMA v7.0 merge window Usual smallish cycle: - Various code improvements in irdma, rtrs, qedr, ocrdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, irdma - Robustness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support for different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Fix for a remotely triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaY44vgAKCRCFwuHvBreF YfiZAP91cMZfogN7r1FMD75xDZu55dI3Jvy8OaixyRxlWLGPcQEAjritdL0o7fZp YrD1OXNS/1XG//rPBVw7xj+54Aa8hAU= =AVcu -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "Usual smallish cycle. The NFS biovec work to push it down into RDMA instead of indirecting through a scatterlist is pretty nice to see, been talked about for a long time now. 
- Various code improvements in irdma, rtrs, qedr, ocrdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, irdma - Robustness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support for different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Fix for a remotely triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits) RDMA/mlx5: Implement DMABUF export ops RDMA/uverbs: Add DMABUF object type and operations RDMA/uverbs: Support external FD uobjects RDMA/siw: Fix potential NULL pointer dereference in header processing RDMA/umad: Reject negative data_len in ib_umad_write IB/core: Extend rate limit support for RC QPs RDMA/mlx5: Support rate limit only for Raw Packet QP RDMA/bnxt_re: Report QP rate limit in debugfs RDMA/bnxt_re: Report packet pacing capabilities when querying device RDMA/bnxt_re: Add support for QP rate limiting MAINTAINERS: Drop RDMA files from Hyper-V section RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc svcrdma: use bvec-based RDMA read/write API RDMA/core: add rdma_rw_max_sge() helper for SQ sizing RDMA/core: add MR support for bvec-based RDMA operations RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations RDMA/core: add bio_vec based RDMA read/write API RDMA/irdma: Use kvzalloc for paged memory DMA address array RDMA/rxe: Fix race condition in QP timer handlers RDMA/mana_ib: Add device‑memory support ...	2026-02-12 17:05:20 -08:00
Linus Torvalds	136114e0ab	mm.git review status for linus..mm-nonmm-stable Total patches: 107 Reviews/patch: 1.07 Reviewed rate: 67% - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" from Heming Zhao saves disk space by teaching ocfs2 to reclaim suballocator block group space. - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in various places. - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size. - The 7 patch series "kallsyms: Prevent invalid access when showing module buildid" from Petr Mladek cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces. - The 3 patch series "Address page fault in ima_restore_measurement_list()" from Harshit Mogalapalli fixes a kexec-related crash that can occur when booting the second-stage kernel on x86. - The 6 patch series "kho: ABI headers and Documentation updates" from Mike Rapoport updates the kexec handover ABI documentation. - The 4 patch series "Align atomic storage" from Finn Thain adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh. - The 2 patch series "kho: clean up page initialization logic" from Pratyush Yadav simplifies the page initialization logic in kho_restore_page(). - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves several things out of kernel.h and into more appropriate places. - The 7 patch series "don't abuse task_struct.group_leader" from Oleg Nesterov removes the usage of ->group_leader when it is "obviously unnecessary". - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin adds some infrastructure improvements to the live update orchestrator. 
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww= =mmsH -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of 
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...	2026-02-12 12:13:01 -08:00
Linus Torvalds	2831fa8b8b	NFSD 7.0 Release Notes Neil Brown and Jeff Layton contributed a dynamic thread pool sizing mechanism for NFSD. The sunrpc layer now tracks minimum and maximum thread counts per pool, and NFSD adjusts running thread counts based on workload: idle threads exit after a timeout when the pool exceeds its minimum, and new threads spawn automatically when all threads are busy. Administrators control this behavior via the nfsdctl netlink interface. Rick Macklem, FreeBSD NFS maintainer, generously contributed server- side support for the POSIX ACL extension to NFSv4, as specified in draft-ietf-nfsv4-posix-acls. This extension allows NFSv4 clients to get and set POSIX access and default ACLs using native NFSv4 operations, eliminating the need for sideband protocols. The feature is gated by a Kconfig option since the IETF draft has not yet been ratified. Chuck Lever delivered numerous improvements to the xdrgen tool. Error reporting now covers parsing, AST transformation, and invalid declarations. Generated enum decoders validate incoming values against valid enumerator lists. New features include pass-through line support for embedding C directives in XDR specifications, 16-bit integer types, and program number definitions. Several code generation issues were also addressed. When an administrator revokes NFSv4 state for a filesystem via the unlock_fs interface, ongoing async COPY operations referencing that filesystem are now cancelled, with CB_OFFLOAD callbacks notifying affected clients. The remaining patches in this pull request are clean-ups and minor optimizations. Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.0 NFSD development cycle. 
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmmJ8kAACgkQM2qzM29m f5ejCQ//RdoWNgN1VZdNoUrh1tm1Fhi1YN/RJS26G25OxgTBc3/qtGxrpW+ZAW6+ mIAJ2bT/l66741drki4/x6WJU4OMI/4mJxrLd0WCb1POaeRQWnL1MdzNY+IP/QZv 3DgcTv1T6FKE7pFmAqW0nFPCgaK+vlR+fo4uJognbB6+hZB3HlrLkfeZOWMAmchC y3U6nzrtP+IljAtdzKZ120E+LHp0PtTbJwPCPSt3/FR/dkA0DcjnOS9jybIYlJOu 0ByX24BcrW/c3rJUdL8lL4G7gsPWjdARqczFiN8sufI9Q3zlHOxtYdUT7BNjd+04 jcSKLlAXwcbNcK2f54B/QFKmNxllvoHLB3wo2hfEPig4LQELuxcUHYxmmD4vNKen lp6zmaLq3PiRGlew6eLRFxRxbdLds+9l0xjXV+J+rtQmjppXdXUoVNMm+D+tD6bF T5TUq4WNCGJIrpkR7pdF7uMD51s8fphvaDxOCjhSi3WHAtZAhOR8HFUU97qddM34 KqF6Gph3tN/C4oNb8kKvzxBRpRhHIzKHZbreiu5fZr9pPe9IRBHnn/Dg4p/yYQcw K3/y1EnKrIlprfbFFkY1LzNFpf309uoZTVzwBcMfSJVsFgUqWD7KHJ/rmCJQ/pS6 k0+YLRoUmtUHDYk2QNlstlt7r6FwA0d2GjT8n7viGoNQ3PA7rJQ= =hqla -----END PGP SIGNATURE----- Merge tag 'nfsd-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: "Neil Brown and Jeff Layton contributed a dynamic thread pool sizing mechanism for NFSD. The sunrpc layer now tracks minimum and maximum thread counts per pool, and NFSD adjusts running thread counts based on workload: idle threads exit after a timeout when the pool exceeds its minimum, and new threads spawn automatically when all threads are busy. Administrators control this behavior via the nfsdctl netlink interface. Rick Macklem, FreeBSD NFS maintainer, generously contributed server- side support for the POSIX ACL extension to NFSv4, as specified in draft-ietf-nfsv4-posix-acls. This extension allows NFSv4 clients to get and set POSIX access and default ACLs using native NFSv4 operations, eliminating the need for sideband protocols. The feature is gated by a Kconfig option since the IETF draft has not yet been ratified. Chuck Lever delivered numerous improvements to the xdrgen tool. Error reporting now covers parsing, AST transformation, and invalid declarations. 
Generated enum decoders validate incoming values against valid enumerator lists. New features include pass-through line support for embedding C directives in XDR specifications, 16-bit integer types, and program number definitions. Several code generation issues were also addressed. When an administrator revokes NFSv4 state for a filesystem via the unlock_fs interface, ongoing async COPY operations referencing that filesystem are now cancelled, with CB_OFFLOAD callbacks notifying affected clients. The remaining patches in this pull request are clean-ups and minor optimizations. Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.0 NFSD development cycle" * tag 'nfsd-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (45 commits) NFSD: Add POSIX ACL file attributes to SUPPATTR bitmasks NFSD: Add POSIX draft ACL support to the NFSv4 SETATTR operation NFSD: Add support for POSIX draft ACLs for file creation NFSD: Add support for XDR decoding POSIX draft ACLs NFSD: Refactor nfsd_setattr()'s ACL error reporting NFSD: Do not allow NFSv4 (N)VERIFY to check POSIX ACL attributes NFSD: Add nfsd4_encode_fattr4_posix_access_acl NFSD: Add nfsd4_encode_fattr4_posix_default_acl NFSD: Add nfsd4_encode_fattr4_acl_trueform_scope NFSD: Add nfsd4_encode_fattr4_acl_trueform Add RPC language definition of NFSv4 POSIX ACL extension NFSD: Add a Kconfig setting to enable support for NFSv4 POSIX ACLs xdrgen: Implement pass-through lines in specifications nfsd: cancel async COPY operations when admin revokes filesystem state nfsd: add controls to set the minimum number of threads per pool nfsd: adjust number of running nfsd threads based on activity sunrpc: allow svc_recv() to return -ETIMEDOUT and -EBUSY sunrpc: split new thread creation into a separate function sunrpc: introduce the concept of a minimum number of threads per pool sunrpc: track the max number of requested threads in a pool ...	2026-02-12 08:23:53 -08:00
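The thread pool sizing policy summarised in the NFSD pull request above (idle threads exit after a timeout when the pool exceeds its minimum, new threads spawn when all threads are busy) can be sketched as two small predicates. This is a hypothetical userspace model: the struct, field and function names are illustrative, and capping growth at the tracked maximum is an assumption drawn from the statement that sunrpc tracks minimum and maximum thread counts per pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of one thread pool. */
struct pool_sketch {
    unsigned min_threads; /* pool never shrinks below this */
    unsigned max_threads; /* pool never grows above this (assumed cap) */
    unsigned running;     /* threads currently alive */
    unsigned busy;        /* threads currently handling a request */
};

/* Spawn a new thread when every running thread is busy and there is headroom. */
bool pool_sketch_should_spawn(const struct pool_sketch *p)
{
    return p->busy == p->running && p->running < p->max_threads;
}

/* An idle thread whose wait timed out exits only if the pool is above its minimum. */
bool pool_sketch_idle_thread_should_exit(const struct pool_sketch *p,
                                         bool idle_timed_out)
{
    return idle_timed_out && p->running > p->min_threads;
}

/* A saturated pool below its cap grows; a pool at its minimum does not shrink. */
void pool_sketch_demo(void)
{
    struct pool_sketch p = { .min_threads = 2, .max_threads = 4,
                             .running = 2, .busy = 2 };

    assert(pool_sketch_should_spawn(&p));
    assert(!pool_sketch_idle_thread_should_exit(&p, true));
}
```

Keeping the two decisions separate mirrors the description in the pull request, where growth is driven by saturation and shrinkage by idle timeouts.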
Linus Torvalds	37a93dd5c4	Networking changes for 7.0 Core & protocols ---------------- - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helpers and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that it is now functionally complete, and allow user space to enable it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, decreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now correctly interpret such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. - Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. 
- Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performance with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when the underlying low-level interface comes back online. Driver API ---------- - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers -------------- - Add Huawei hinic3 PF ethernet driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifying the RX ring count retrieval. Also remove the legacy fallback path. 
- Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM 
capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmmMum4SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkDnMP/3bpHAGj+gylTid3Xsj0TjJ8AkPsQs+W uvSMiCB1TvGCTD9kK36Vr+qPoIgJY10UxYMMjt5Gs0A9TvGDDfYnUOVoUIkfkWCH grqSdp6dVkyaJVfyLEcuOVQQG2HwEnhC4c3ZOhOxaKNAnsLCP142lYsMR9ktGRuA 4vDGtz1+y7t8qBk/lyfXDM71KRrtq0HWJZIhmhz8QXTBsgPDfSejbTPNxXQOJoeO sKeArsHr/Cmvf89ZtLZ63vbfr4BKDm4PeXqPYR3PrQs2Yu6I1EK4lehygTY2yE2O I3MEPlvpa/tiVLxqXNNwEFbYIkMPY6FXS9x05hTxNZM65A6aB3vvdkqPVnVmAlXE f+4PYg9paI13lbzZOeQbGfZ5HgPpzQvnginaaX6s9Fp12K3Ll1FkwWdUznFWhzVn 5LSrGyecR00CdKJByTIw9JGg/1ptz5a57pa8OQmcKRx3WhQ1XeV5TIJQF4QcPgHw ApyjmeGDTQMQMzha1fsaVr+i6BK2zgZvKK9uGDTX90xn2JUw/M75tyOlsTtGlnuM sZgj0KVGQlG2wLwBB/+D4S9Oi9YlPG00rkCs0E4jk5C/G4NBmMgpEPQg6azkb57h Uiy0paohxfwcZ3qbGA9In091ClGqIwOiCBaq+uXRq1ro88Neo6PWkjz5ItNrsD8t Ttgd5AVAQyPT =O31Y -----END PGP SIGNATURE----- Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates 
from Paolo Abeni: "Core & protocols: - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helpers and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that it is now functionally complete, also allow user-space to enable it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, decreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now correctly interpret such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. - Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. 
- Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performance with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when the underlying low-level interface comes back online. Driver API: - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers: - Add Huawei hinic3 PF Ethernet driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches. - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifying the RX ring count retrieval. Also remove the legacy fallback path. 
- Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM 
capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature" * tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits) bnge/bng_re: Add a new HSI net: macb: Fix tx/rx malfunction after phy link down and up af_unix: Fix memleak of newsk in unix_stream_connect(). 
net: ti: icssg-prueth: Add optional dependency on HSR net: dsa: add basic initial driver for MxL862xx switches net: mdio: add unlocked mdiodev C45 bus accessors net: dsa: add tag format for MxL862xx switches dt-bindings: net: dsa: add MaxLinear MxL862xx selftests: drivers: net: hw: Modify toeplitz.c to poll for packets octeontx2-pf: Unregister devlink on probe failure net: renesas: rswitch: fix forwarding offload statemachine ionic: Rate limit unknown xcvr type messages tcp: inet6_csk_xmit() optimization tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() ipv6: use np->final in inet6_sk_rebuild_header() ipv6: add daddr/final storage in struct ipv6_pinfo net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup() ...	2026-02-11 19:31:52 -08:00
Paolo Abeni	83310d6133	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-11 15:14:35 +01:00
Kuniyuki Iwashima	6884028cd7	af_unix: Fix memleak of newsk in unix_stream_connect(). When prepare_peercred() fails in unix_stream_connect(), unix_release_sock() is not called for newsk, and the memory is leaked. Let's move prepare_peercred() before unix_create1(). Fixes: `fd0a109a0f` ("net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260207232236.2557549-1-kuniyu@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-11 13:01:13 +01:00
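The ordering change described in the af_unix fix above can be illustrated outside the kernel. The following is a minimal user-space sketch, assuming hypothetical stand-ins (`prepare_cred`, `create_sock`, `release_sock_stub`, `connect_fixed`) that are not kernel APIs: the fallible preparation step runs before the allocation, so the early-failure path has nothing to release.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; none of these are kernel APIs. */
struct peer_cred { int pid; };
struct sock_stub { int id; };

static int live_socks; /* number of stubs allocated and not yet released */

/* Fallible preparation step (plays the role of prepare_peercred()). */
static int prepare_cred(struct peer_cred *c, bool fail)
{
	if (fail)
		return -1;
	c->pid = 1234;
	return 0;
}

/* Allocation step (plays the role of unix_create1()). */
static struct sock_stub *create_sock(void)
{
	struct sock_stub *s = malloc(sizeof(*s));

	if (s) {
		s->id = 1;
		live_socks++;
	}
	return s;
}

static void release_sock_stub(struct sock_stub *s)
{
	free(s);
	live_socks--;
}

/* Fixed ordering: fail early, before anything has been allocated. */
static int connect_fixed(bool fail_cred)
{
	struct peer_cred cred;
	struct sock_stub *newsk;

	if (prepare_cred(&cred, fail_cred))
		return -1; /* nothing to clean up on this path */

	newsk = create_sock();
	if (!newsk)
		return -1;

	/* A real connect path would hand newsk to the peer here. */
	release_sock_stub(newsk);
	return 0;
}
```

Calling `connect_fixed(true)` returns an error while `live_socks` stays at zero, which is the property the reordering is meant to provide in this simplified model.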
Daniel Golle	85ee987429	net: dsa: add tag format for MxL862xx switches Add proprietary special tag format for the MaxLinear MXL862xx family of switches. While using the same Ethertype as MaxLinear's GSW1xx switches, the actual tag format differs significantly, hence we need a dedicated tag driver for that. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/c64e6ddb6c93a4fac39f9ab9b2d8bf551a2b118d.1770433307.git.daniel@makrotopia.org Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-11 11:27:57 +01:00
Eric Dumazet	97d7ae6e14	tcp: inet6_csk_xmit() optimization After prior patches, inet6_csk_xmit() can reuse inet->cork.fl.u.ip6 if __sk_dst_check() returns a valid dst. Otherwise call inet6_csk_route_socket() to refresh inet->cork.fl.u.ip6 content and get a new dst. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:50 -08:00
Eric Dumazet	a6eee39cc2	tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() As explained in commit `85d05e2817` ("ipv6: change inet6_sk_rebuild_header() to use inet->cork.fl.u.ip6"): TCP v6 spends a good amount of time rebuilding a fresh fl6 at each transmit in inet6_csk_xmit()/inet6_csk_route_socket(). TCP v4 caches the information in inet->cork.fl.u.ip4 instead. After this patch, passive TCP ipv6 flows have correctly initialized inet->cork.fl.u.ip6 structure. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:50 -08:00
Eric Dumazet	19bdb267f7	tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() Instead of using private @fl6 and @final variables use respectively inet->cork.fl.u.ip6 and np->final. As explained in commit `85d05e2817` ("ipv6: change inet6_sk_rebuild_header() to use inet->cork.fl.u.ip6"): TCP v6 spends a good amount of time rebuilding a fresh fl6 at each transmit in inet6_csk_xmit()/inet6_csk_route_socket(). TCP v4 caches the information in inet->cork.fl.u.ip4 instead. After this patch, active TCP ipv6 flows have correctly initialized inet->cork.fl.u.ip6 structure. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:50 -08:00
Eric Dumazet	969a20198b	ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 Convert inet6_csk_route_socket() to use np->final instead of an automatic variable to get rid of a stack canary. Convert inet6_csk_xmit() and inet6_csk_update_pmtu() to use inet->cork.fl.u.ip6 instead of @fl6 automatic variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:50 -08:00
Eric Dumazet	4e6c91cf60	ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() Get rid of @fl6 and &final variables in ip6_datagram_dst_update(). Use instead inet->cork.fl.u.ip6 and np->final so that a stack canary is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:49 -08:00
Eric Dumazet	3d3f075e80	ipv6: use np->final in inet6_sk_rebuild_header() Instead of using an automatic variable, use np->final to get rid of the stack canary in inet6_sk_rebuild_header(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:49 -08:00
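The idea behind the inet->cork.fl.u.ip6 series above (keep the flow description in the socket and rebuild it only when the cached route is no longer valid) can be sketched in plain C. This is a simplified model, not kernel code: `struct conn`, `route_refresh()` and `conn_xmit()` are hypothetical names, and the `dst_valid` flag merely stands in for the result of a check such as __sk_dst_check().

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified flow description; a real flowi6 carries many more fields. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Hypothetical connection object, not a kernel structure. */
struct conn {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	struct flow_key cached_fl; /* lives in the object, not on the stack */
	bool dst_valid;            /* stand-in for a cached-route validity check */
	unsigned int rebuilds;     /* how often the key had to be rebuilt */
};

/* Slow path: rebuild the flow key and mark the cached route as valid. */
static void route_refresh(struct conn *c)
{
	c->cached_fl = (struct flow_key){ c->saddr, c->daddr, c->sport, c->dport };
	c->dst_valid = true;
	c->rebuilds++;
}

/* Transmit path: reuse the cached key unless the route was invalidated. */
static const struct flow_key *conn_xmit(struct conn *c)
{
	if (!c->dst_valid)
		route_refresh(c);
	return &c->cached_fl;
}
```

In this model, two back-to-back calls to `conn_xmit()` rebuild the key once; clearing `dst_valid` forces exactly one more rebuild on the next transmit.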
Jakub Kicinski	792aaea994	netfilter pull request nf-next-26-02-06 Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next The following patchset contains Netfilter updates for net-next: 1) Fix net-next-only use-after-free bug in nf_tables rbtree set: Expired elements cannot be released right away after unlink anymore because there is no guarantee that the binary-search blob is going to be updated. Spotted by syzkaller. 2) Fix esoteric bug in nf_queue with udp fraglist gro, broken since 6.11. Patch 3 extends the nfqueue selftest for this. 4) Use dedicated slab for flowtable entries, currently the -512 cache is used, which is wasteful. From Qingfang Deng. 5) Recent net-next update extended existing test for ip6ip6 tunnels, add the required /config entry. Test still passed by accident because the previous tests network setup gets re-used, so also update the test so it will fail in case the ip6ip6 tunnel interface cannot be added. 
6) Fix 'nft get element mytable myset { 1.2.3.4 }' on big endian platforms, this was broken since code was added in v5.1. 7) Fix nf_tables counter reset support on 32bit platforms, where counter reset may cause huge values to appear due to wraparound. Broken since reset feature was added in v6.11. From Anders Grahn. 8-11) update nf_tables rbtree set type to detect partial overlaps. This will eventually speed up nftables userspace: at this time userspace does a netlink dump of the set content which slows down incremental updates on interval sets. From Pablo Neira Ayuso. * tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_set_rbtree: validate open interval overlap netfilter: nft_set_rbtree: validate element belonging to interval netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval netfilter: nft_counter: fix reset of counters on 32bit archs netfilter: nft_set_hash: fix get operation on big endian selftests: netfilter: add IPV6_TUNNEL to config netfilter: flowtable: dedicated slab for flow entry selftests: netfilter: nft_queue.sh: add udp fraglist gro test case netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation netfilter: nft_set_rbtree: don't gc elements on insert ==================== Link: https://patch.msgid.link/20260206153048.17570-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:25:38 -08:00
Geliang Tang	e5e2e43002	mptcp: allow overridden write_space to be invoked Future extensions with psock will override their own sk->sk_write_space callback. This patch ensures that the overridden sk_write_space can be invoked by MPTCP. INDIRECT_CALL is used to keep the default path optimised. Note that sk->sk_write_space was never called directly with MPTCP sockets, so changing it to sk_stream_write_space in the init, and using it from mptcp_write_space() is not supposed to change the current behaviour. This patch is shared early to ease discussions around future RFC and avoid confusions with this "fix" that is needed for different future extensions. Suggested-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Gang Yan <yangang@kylinos.cn> Signed-off-by: Gang Yan <yangang@kylinos.cn> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260206-net-next-mptcp-write_space-override-v2-1-e0b12be818c6@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 19:54:21 -08:00
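The mptcp commit above relies on INDIRECT_CALL to keep the default callback path cheap while still honouring an overridden callback. The sketch below shows the general idea in user-space C, assuming a GCC or Clang toolchain for `__builtin_expect`. The macro `CALL_LIKELY` and the `sk_stub` type are illustrative inventions modeled loosely on that idea; they are not the kernel's definitions.

```c
/* Hypothetical socket stub with an overridable callback. */
struct sk_stub {
	void (*write_space)(struct sk_stub *sk);
	int default_calls;
	int override_calls;
};

static void default_write_space(struct sk_stub *sk)
{
	sk->default_calls++;
}

/* Plays the role of a callback installed by some future extension. */
static void psock_write_space(struct sk_stub *sk)
{
	sk->override_calls++;
}

/*
 * Illustrative macro (an assumption, not the kernel's definition):
 * if the pointer equals the expected target, call that target directly
 * so the compiler can emit a direct call; otherwise fall back to the
 * indirect call through the pointer.
 */
#define CALL_LIKELY(f, likely_f, ...) \
	(__builtin_expect((f) == (likely_f), 1) ? \
		likely_f(__VA_ARGS__) : (f)(__VA_ARGS__))

static void notify_write_space(struct sk_stub *sk)
{
	CALL_LIKELY(sk->write_space, default_write_space, sk);
}
```

With the default callback installed, `notify_write_space()` takes the direct-call branch; once `write_space` is replaced, the same call site reaches the override without any change to the caller.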
Linus Torvalds	0923fd0419	Locking updates for v6.20: Lock debugging: - Implement compiler-driven static analysis locking context checking, using the upcoming Clang 22 compiler's context analysis features. (Marco Elver) We removed Sparse context analysis support, because prior to removal even a defconfig kernel produced 1,700+ context tracking Sparse warnings, the overwhelming majority of which are false positives. On an allmodconfig kernel the number of false positive context tracking Sparse warnings grows to over 5,200... On the plus side of the balance actual locking bugs found by Sparse context analysis is also rather ... sparse: I found only 3 such commits in the last 3 years. So the rate of false positives and the maintenance overhead is rather high and there appears to be no active policy in place to achieve a zero-warnings baseline to move the annotations & fixers to developers who introduce new code. Clang context analysis is more complete and more aggressive in trying to find bugs, at least in principle. Plus it has a different model to enabling it: it's enabled subsystem by subsystem, which results in zero warnings on all relevant kernel builds (as far as our testing managed to cover it). Which allowed us to enable it by default, similar to other compiler warnings, with the expectation that there are no warnings going forward. This enforces a zero-warnings baseline on clang-22+ builds. (Which are still limited in distribution, admittedly.) Hopefully the Clang approach can lead to a more maintainable zero-warnings status quo and policy, with more and more subsystems and drivers enabling the feature. Context tracking can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default disabled), but this will generate a lot of false positives. ( Having said that, Sparse support could still be added back, if anyone is interested - the removal patch is still relatively straightforward to revert at this stage. 
) Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng) - Add support for Atomic<i8/i16/bool> and replace most Rust native AtomicBool usages with Atomic<bool> - Clean up LockClassKey and improve its documentation - Add missing Send and Sync trait implementation for SetOnce - Make ARef Unpin as it is supposed to be - Add __rust_helper to a few Rust helpers as a preparation for helper LTO - Inline various lock related functions to avoid additional function calls. WW mutexes: - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz) Misc fixes and cleanups: - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd Bergmann) - locking/local_lock: Include more missing headers (Peter Zijlstra) - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap) - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir Duberstein) Signed-off-by: Ingo Molnar <mingo@kernel.org> Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Lock debugging: - Implement compiler-driven static analysis locking context checking, using the 
upcoming Clang 22 compiler's context analysis features (Marco Elver) We removed Sparse context analysis support, because prior to removal even a defconfig kernel produced 1,700+ context tracking Sparse warnings, the overwhelming majority of which are false positives. On an allmodconfig kernel the number of false positive context tracking Sparse warnings grows to over 5,200... On the plus side of the balance actual locking bugs found by Sparse context analysis is also rather ... sparse: I found only 3 such commits in the last 3 years. So the rate of false positives and the maintenance overhead is rather high and there appears to be no active policy in place to achieve a zero-warnings baseline to move the annotations & fixers to developers who introduce new code. Clang context analysis is more complete and more aggressive in trying to find bugs, at least in principle. Plus it has a different model to enabling it: it's enabled subsystem by subsystem, which results in zero warnings on all relevant kernel builds (as far as our testing managed to cover it). Which allowed us to enable it by default, similar to other compiler warnings, with the expectation that there are no warnings going forward. This enforces a zero-warnings baseline on clang-22+ builds (Which are still limited in distribution, admittedly) Hopefully the Clang approach can lead to a more maintainable zero-warnings status quo and policy, with more and more subsystems and drivers enabling the feature. Context tracking can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default disabled), but this will generate a lot of false positives. ( Having said that, Sparse support could still be added back, if anyone is interested - the removal patch is still relatively straightforward to revert at this stage. 
) Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng) - Add support for Atomic<i8/i16/bool> and replace most Rust native AtomicBool usages with Atomic<bool> - Clean up LockClassKey and improve its documentation - Add missing Send and Sync trait implementation for SetOnce - Make ARef Unpin as it is supposed to be - Add __rust_helper to a few Rust helpers as a preparation for helper LTO - Inline various lock related functions to avoid additional function calls WW mutexes: - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz) Misc fixes and cleanups: - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd Bergmann) - locking/local_lock: Include more missing headers (Peter Zijlstra) - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap) - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir Duberstein)" * tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits) locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK rcu: Mark lockdep_assert_rcu_helper() __always_inline compiler-context-analysis: Remove __assume_ctx_lock from initializers tomoyo: Use scoped init guard crypto: Use scoped init guard kcov: Use scoped init guard compiler-context-analysis: Introduce scoped init guards cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers seqlock: fix scoped_seqlock_read kernel-doc tools: Update context analysis macros in compiler_types.h rust: sync: Replace `kernel::c_str!` with C-Strings rust: sync: Inline various lock related methods rust: helpers: Move #define __rust_helper out of atomic.c rust: wait: Add __rust_helper to helpers rust: time: Add __rust_helper to helpers rust: task: Add __rust_helper to helpers rust: sync: Add __rust_helper to helpers rust: refcount: Add __rust_helper to helpers rust: rcu: Add __rust_helper to helpers rust: processor: Add __rust_helper to helpers ...	2026-02-10 12:28:44 -08:00
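Several commits in the locking pull above mention scoped guards. As a rough illustration of the underlying mechanism, the sketch below uses the GCC/Clang `cleanup` variable attribute to release a lock automatically when a scope is left. `struct fake_lock`, `SCOPED_LOCK` and `bump_counter()` are hypothetical names for this example and do not reproduce the kernel's guard macros.

```c
/* Toy lock used only for illustration; it provides no real mutual exclusion. */
struct fake_lock {
	int held;
	int acquisitions;
};

static void fake_lock_acquire(struct fake_lock *l)
{
	l->held = 1;
	l->acquisitions++;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void fake_lock_release(struct fake_lock **lp)
{
	(*lp)->held = 0;
}

/*
 * Hypothetical guard macro: take the lock now and release it whenever
 * the enclosing scope is left, including through early returns.
 */
#define SCOPED_LOCK(l) \
	struct fake_lock *scoped_guard_ \
		__attribute__((cleanup(fake_lock_release))) = \
		(fake_lock_acquire(l), (l))

static int bump_counter(struct fake_lock *l, int *counter, int limit)
{
	SCOPED_LOCK(l);

	if (*counter >= limit)
		return -1; /* lock is released on this early return */

	(*counter)++;
	return 0;          /* and on the normal return */
}
```

Because the release is tied to scope exit, both return paths in `bump_counter()` leave the lock unheld, which is the kind of invariant a static context checker can then verify.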
Linus Torvalds	f17b474e36	bpf-next-7.0 Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. 
This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to 
bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...	2026-02-10 11:26:21 -08:00
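One item in the bpf pull above speeds up BTF type lookup by sorting the types and using binary search. The general technique can be sketched with the C standard library. This is only an illustration under assumed names (`struct type_entry`, `sort_types()`, `find_type_id()`); the actual BTF code uses its own data layout and is not shown here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical name-to-id table entry; real BTF records look different. */
struct type_entry {
	const char *name;
	unsigned int id;
};

/* Order entries by name so that bsearch() can be used afterwards. */
static int cmp_entry(const void *a, const void *b)
{
	const struct type_entry *x = a;
	const struct type_entry *y = b;

	return strcmp(x->name, y->name);
}

/* One-time O(n log n) preparation step. */
static void sort_types(struct type_entry *tab, size_t n)
{
	qsort(tab, n, sizeof(*tab), cmp_entry);
}

/* O(log n) lookup by name; returns -1 when the name is not present. */
static long find_type_id(const struct type_entry *tab, size_t n,
			 const char *name)
{
	struct type_entry key = { name, 0 };
	const struct type_entry *hit;

	hit = bsearch(&key, tab, n, sizeof(*tab), cmp_entry);
	return hit ? (long)hit->id : -1;
}
```

The table must be sorted with the same comparison function that the lookup uses; a linear scan would avoid the sort but costs O(n) per lookup, which is the trade-off the sorted layout removes.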
Linus Torvalds	13d83ea9d8	Crypto library updates for 7.0 - Add support for verifying ML-DSA signatures. ML-DSA (Module-Lattice-Based Digital Signature Algorithm) is a recently-standardized post-quantum (quantum-resistant) signature algorithm. It was known as Dilithium pre-standardization. The first use case in the kernel will be module signing. But there are also other users of RSA and ECDSA signatures in the kernel that might want to upgrade to ML-DSA eventually. - Improve the AES library: - Make the AES key expansion and single block encryption and decryption functions use the architecture-optimized AES code. Enable these optimizations by default. - Support preparing an AES key for encryption-only, using about half as much memory as a bidirectional key. - Replace the existing two generic implementations of AES with a single one. - Simplify how Adiantum message hashing is implemented. Remove the "nhpoly1305" crypto_shash in favor of direct lib/crypto/ support for NH hashing, and enable optimizations by default. Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library updates from Eric Biggers: - Add support for verifying ML-DSA signatures. ML-DSA (Module-Lattice-Based Digital Signature Algorithm) is a recently-standardized post-quantum (quantum-resistant) signature algorithm. It was known as Dilithium pre-standardization. The first use case in the kernel will be module signing. But there are also other users of RSA and ECDSA signatures in the kernel that might want to upgrade to ML-DSA eventually. - Improve the AES library: - Make the AES key expansion and single block encryption and decryption functions use the architecture-optimized AES code. 
Enable these optimizations by default. - Support preparing an AES key for encryption-only, using about half as much memory as a bidirectional key. - Replace the existing two generic implementations of AES with a single one. - Simplify how Adiantum message hashing is implemented. Remove the "nhpoly1305" crypto_shash in favor of direct lib/crypto/ support for NH hashing, and enable optimizations by default. * tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (53 commits) lib/crypto: mldsa: Clarify the documentation for mldsa_verify() slightly lib/crypto: aes: Drop 'volatile' from aes_sbox and aes_inv_sbox lib/crypto: aes: Remove old AES en/decryption functions lib/crypto: aesgcm: Use new AES library API lib/crypto: aescfb: Use new AES library API crypto: omap - Use new AES library API crypto: inside-secure - Use new AES library API crypto: drbg - Use new AES library API crypto: crypto4xx - Use new AES library API crypto: chelsio - Use new AES library API crypto: ccp - Use new AES library API crypto: x86/aes-gcm - Use new AES library API crypto: arm64/ghash - Use new AES library API crypto: arm/ghash - Use new AES library API staging: rtl8723bs: core: Use new AES library API net: phy: mscc: macsec: Use new AES library API chelsio: Use new AES library API Bluetooth: SMP: Use new AES library API crypto: x86/aes - Remove the superseded AES-NI crypto_cipher lib/crypto: x86/aes: Add AES-NI optimization ...	2026-02-10 08:31:09 -08:00
Vladimir Oltean	c22ba07c82	net: dsa: eliminate local type for tc policers David Yang is saying that struct flow_action_entry in include/net/flow_offload.h has gained new fields and DSA's struct dsa_mall_policer_tc_entry, derived from that, isn't keeping up. This structure is passed to drivers and they are completely oblivious to the values of fields they don't see. This has happened before, and almost always the solution was to make the DSA layer thinner and use the upstream data structures. Here, the reason why we didn't do that is because struct flow_action_entry :: police is an anonymous structure. That is easily enough fixable, just name those fields "struct flow_action_police" and reference them from DSA. Make the according transformations to the two users (sja1105 and felix): "rate_bytes_per_sec" -> "rate_bytes_ps". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Co-developed-by: David Yang <mmyangfl@gmail.com> Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260206075427.44733-1-mmyangfl@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 15:30:11 +01:00
Jiayuan Chen	81b84de32b	xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path icmp_route_lookup() performs multiple route lookups to find a suitable route for sending ICMP error messages, with special handling for XFRM (IPsec) policies. The lookup sequence is: 1. First, lookup output route for ICMP reply (dst = original src) 2. Pass through xfrm_lookup() for policy check 3. If blocked (-EPERM) or dst is not local, enter "reverse path" 4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec which reverses the original packet's flow (saddr<->daddr swapped) 5. If fl4_dec.saddr is local (we are the original destination), use __ip_route_output_key() for output route lookup 6. If fl4_dec.saddr is NOT local (we are a forwarding node), use ip_route_input() to simulate the reverse packet's input path 7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr (original packet's source) as destination. If this address becomes local between the initial check and ip_route_input() call (e.g., due to concurrent "ip addr add"), ip_route_input() returns a LOCAL route with dst.output set to ip_rt_bug. This route is then used for ICMP output, causing dst_output() to call ip_rt_bug(), triggering a WARN_ON: ------------[ cut here ]------------ WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1 Call Trace: <TASK> ip_push_pending_frames+0x202/0x240 icmp_push_reply+0x30d/0x430 __icmp_send+0x1149/0x24f0 ip_options_compile+0xa2/0xd0 ip_rcv_finish_core+0x829/0x1950 ip_rcv+0x2d7/0x420 __netif_receive_skb_one_core+0x185/0x1f0 netif_receive_skb+0x90/0x450 tun_get_user+0x3413/0x3fb0 tun_chr_write_iter+0xe4/0x220 ... Fix this by checking rt2->rt_type after ip_route_input(). If it's RTN_LOCAL, the route cannot be used for output, so treat it as an error. The reproducer requires kernel modification to widen the race window, making it unsuitable as a selftest. 
It is available at: https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81 Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/ Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev Fixes: `8b7817f3a9` ("[IPSEC]: Add ICMP host relookup support") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 15:06:11 +01:00
Felix Maurer	aae9d6b616	hsr: Implement more robust duplicate discard for HSR The HSR duplicate discard algorithm had even more basic problems than the described for PRP in the previous patch. It relied only on the last received sequence number to decide if a new frame should be forwarded to any port. This does not work correctly in any case where frames are received out of order. The linked bug report claims that this can even happen with perfectly fine links due to the order in which incoming frames are processed (which can be unexpected on multi-core systems). The issue also occasionally shows up in the HSR selftests. The main reason is that the sequence number that was last forwarded to the master port may have skipped a number which will in turn never be delivered to the host. As the problem (we accidentally skip over a sequence number that has not been received but will be received in the future) is similar to PRP, we can apply a similar solution. The duplicate discard algorithm based on the "sparse bitmap" works well for HSR if it is extended to track one bitmap for each port (A, B, master, interlink). To do this, change the sequence number blocks to contain a flexible array member as the last member that can keep chunks for as many bitmaps as we need. This design makes it easy to reuse the same algorithm in a potential PRP RedBox implementation. The duplicate discard algorithm functions are modified to deal with sequence number blocks of different sizes and to correctly use the array of bitmap chunks. There is a notable speciality for HSR: the port type has a special port type NONE with value 0. This leads to the number of port types being 5 instead of actually 4. To save memory, remove the NONE port from the bitmap (by subtracting 1) when setting up the block buffer and when accessing the bitmap chunks in the array. Removing the old algorithm allows us to get rid of a few fields that are not needed any more: time_out and seq_out for each port. 
We can also remove some functions that were only necessary for the previous duplicate discard algorithm. The removal of seq_out is possible despite its previous usage in hsr_register_frame_in: it was used to prevent updates to time_in when "invalid" sequence numbers were received. With the new duplicate discard algorithm, time_in has no relevance for the expiry of sequence numbers anymore. They will expire based on the timestamps in the sequence number blocks after at most 400ms. There is no need that a node "re-registers" to "resume communication": after 400ms, all sequence numbers are accepted again. Also, according to the IEC 62439-3:2021, all nodes are supposed to send no traffic for 500ms after boot to lead exactly to this expiry of seen sequence numbers. time_in is still used for pruning nodes from the node table after no traffic has been received for 60sec. Pruning is only needed if the node is really gone and has not been sending any traffic for that period. seq_out was also used to report the last incoming sequence number from a node through netlink. I am not sure how useful this value is to userspace at all, but added getting it from the sequence number blocks. This number can be outdated after node merging until a new block has been added. Update the KUnit test for the PRP duplicate discard so that the node allocation matches and expectations on the removed fields are removed. Reported-by: Yoann Congal <yoann.congal@smile.fr> Closes: https://lore.kernel.org/netdev/7d221a07-8358-4c0b-a09c-3b029c052245@smile.fr/ Signed-off-by: Felix Maurer <fmaurer@redhat.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/36dc3bc5bdb7e68b70bb5ef86f53ca95a3f35418.1770299429.git.fmaurer@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 12:02:29 +01:00
Felix Maurer	415e636751	hsr: Implement more robust duplicate discard for PRP The PRP duplicate discard algorithm does not work reliably with certain link faults. Especially with packet loss on one link, the duplicate discard algorithm drops valid packets which leads to packet loss on the PRP interface where the link fault should in theory be perfectly recoverable by PRP. This happens because the algorithm opens the drop window on the lossy link, covering received and lost sequence numbers. If the other, non-lossy link receives the duplicate for a lost frame, it is within the drop window of the lossy link and therefore dropped. Since IEC 62439-3:2012, a node has one sequence number counter for frames it sends, instead of one sequence number counter for each destination. Therefore, a node can not expect to receive contiguous sequence numbers from a sender. A missing sequence number can be totally normal (if the sender intermittently communicates with another node) or mean a frame was lost. The algorithm, as previously implemented in commit `05fd00e5e7` ("net: hsr: Fix PRP duplicate detection"), was part of IEC 62439-3:2010 (HSRv0/PRPv0) but was removed with IEC 62439-3:2012 (HSRv1/PRPv1). Since that, no algorithm is specified but up to implementers. It should be "designed such that it never rejects a legitimate frame, while occasional acceptance of a duplicate can be tolerated" (IEC 62439-3:2021). For the duplicate discard algorithm, this means that 1) we need to track the sequence numbers individually to account for non-contiguous sequence numbers, and 2) we should always err on the side of accepting a duplicate than dropping a valid frame. The idea of the new algorithm is to store the seen sequence numbers in a bitmap. To keep the size of the bitmap in control, we store it as a "sparse bitmap" where the bitmap is split into blocks and not all blocks exist at the same time. 
The sparse bitmap is implemented using an xarray that keeps the references to the individual blocks and a backing ring buffer that stores the actual blocks. New blocks are initialized in the buffer and added to the xarray as needed when new frames arrive. Existing blocks are removed in two conditions: 1. The block found for an arriving sequence number is old and therefore not relevant to the duplicate discard algorithm anymore, i.e., it has been added more than the entry forget time ago. In this case, the block is removed from the xarray and marked as forgotten (by setting its timestamp to 0). 2. Space is needed in the ring buffer for a new block. In this case, the block is removed from the xarray, if it hasn't already been forgotten (by 1.). Afterwards, the new block is initialized in its place. This has the nice property that we can reliably track sequence numbers on low traffic situations (where they expire based on their timestamp) and more quickly forget sequence numbers in high traffic situations before they potentially wrap over and repeat before they are expired. When nodes are merged, the blocks are merged as well. The timestamp of a merged block is set to the minimum of the two timestamps to never keep around a seen sequence number for too long. The bitmaps are or'd to mark all seen sequence numbers as seen. All of this still happens under seq_out_lock, to prevent concurrent access to the blocks. The KUnit test for the algorithm is updated as well. The updates are done in a way to match the original intends pretty closely. Currently, there is much knowledge about the actual algorithm baked into the tests (especially the expectations) which may need some redesign in the future. 
Reported-by: Steffen Lindner <steffen.lindner@de.abb.com> Signed-off-by: Felix Maurer <fmaurer@redhat.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Steffen Lindner <steffen.lindner@de.abb.com> Link: https://patch.msgid.link/8ce15a996099df2df5b700969a39e7df400e8dbb.1770299429.git.fmaurer@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 12:02:28 +01:00
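The sparse-bitmap idea described in the PRP commit above can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: the block size, ring size, forget time, and all names (`seq_table`, `seq_seen_before`) are assumptions made for illustration, a linear search stands in for the xarray, and locking and node merging are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS  64u   /* sequence numbers per block (assumed) */
#define RING_BLOCKS 8u    /* blocks that exist at the same time (assumed) */
#define FORGET_MS   400u  /* entry forget time in ms (assumed) */

struct seq_block {
	uint64_t bitmap;    /* one bit per seen sequence number */
	uint64_t added_ms;  /* 0 marks a free or forgotten block */
	uint16_t index;     /* seq / BLOCK_BITS */
};

struct seq_table {
	struct seq_block ring[RING_BLOCKS];
	unsigned int next;  /* next ring slot to recycle */
};

/* Linear search; the commit describes an xarray for this lookup. */
static struct seq_block *find_block(struct seq_table *t, uint16_t index)
{
	for (unsigned int i = 0; i < RING_BLOCKS; i++)
		if (t->ring[i].added_ms && t->ring[i].index == index)
			return &t->ring[i];
	return NULL;
}

/* Returns true if seq was already seen, i.e. the frame is a duplicate. */
bool seq_seen_before(struct seq_table *t, uint16_t seq, uint64_t now_ms)
{
	uint16_t index = (uint16_t)(seq / BLOCK_BITS);
	uint64_t bit = 1ull << (seq % BLOCK_BITS);
	struct seq_block *b = find_block(t, index);

	assert(now_ms != 0); /* 0 is reserved as the "forgotten" marker */

	/* Case 1: the block is older than the forget time, so drop it. */
	if (b && now_ms - b->added_ms > FORGET_MS) {
		b->added_ms = 0;
		b = NULL;
	}
	/* Case 2: reuse the next ring slot, evicting whatever lives there. */
	if (!b) {
		b = &t->ring[t->next];
		t->next = (t->next + 1) % RING_BLOCKS;
		memset(b, 0, sizeof(*b));
		b->index = index;
		b->added_ms = now_ms;
	}
	if (b->bitmap & bit)
		return true;
	b->bitmap |= bit;
	return false;
}
```

Because each sequence number has its own bit, a frame that arrives late (for example 11 after 12) is still accepted, which is the property the old drop-window approach lacked.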
Jiayuan Chen	ae88a5d2f2	net: atm: fix crash due to unvalidated vcc pointer in sigd_send() Reproducer available at [1]. The ATM send path (sendmsg -> vcc_sendmsg -> sigd_send) reads the vcc pointer from msg->vcc and uses it directly without any validation. This pointer comes from userspace via sendmsg() and can be arbitrarily forged: int fd = socket(AF_ATMSVC, SOCK_DGRAM, 0); ioctl(fd, ATMSIGD_CTRL); // become ATM signaling daemon struct msghdr msg = { .msg_iov = &iov, ... }; *(unsigned long *)(buf + 4) = 0xdeadbeef; // fake vcc pointer sendmsg(fd, &msg, 0); // kernel dereferences 0xdeadbeef In normal operation, the kernel sends the vcc pointer to the signaling daemon via sigd_enq() when processing operations like connect(), bind(), or listen(). The daemon is expected to return the same pointer when responding. However, a malicious daemon can send arbitrary pointer values. Fix this by introducing find_get_vcc() which validates the pointer by searching through vcc_hash (similar to how sigd_close() iterates over all VCCs), and acquires a reference via sock_hold() if found. Since struct atm_vcc embeds struct sock as its first member, they share the same lifetime. Therefore using sock_hold/sock_put is sufficient to keep the vcc alive while it is being used. Note that there may be a race with sigd_close() which could mark the vcc with various flags (e.g., ATM_VF_RELEASED) after find_get_vcc() returns. However, sock_hold() guarantees the memory remains valid, so this race only affects the logical state, not memory safety.
[1]: https://gist.github.com/mrpre/1ba5949c45529c511152e2f4c755b0f3 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+1f22cb1769f249df9fa0@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69039850.a70a0220.5b2ed.005d.GAE@google.com/T/ Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260205095501.131890-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 11:24:47 +01:00
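The validation approach described for find_get_vcc() can be illustrated outside the kernel: a pointer-sized value received from an untrusted peer is treated as an opaque cookie, compared against the objects that really exist, and only a match yields a usable, reference-counted pointer. The sketch below is a userspace analogy under stated assumptions: `struct conn`, `conn_table`, `find_get_conn()` and `put_conn()` are hypothetical names, a plain array stands in for `vcc_hash`, an integer stands in for the sock_hold()/sock_put() counter, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct conn {
	int refcount;  /* stands in for the sock_hold()/sock_put() counter */
	int id;
};

#define MAX_CONNS 4
struct conn *conn_table[MAX_CONNS];  /* the objects we really own */

/*
 * Compare the untrusted cookie against known objects without ever
 * dereferencing it. On a match, take a reference and return the trusted
 * pointer from the table; otherwise report failure.
 */
struct conn *find_get_conn(uintptr_t cookie)
{
	for (size_t i = 0; i < MAX_CONNS; i++) {
		struct conn *c = conn_table[i];

		if (c && (uintptr_t)c == cookie) {
			c->refcount++;
			return c;
		}
	}
	return NULL;  /* forged or stale value */
}

/* Drop the reference taken by find_get_conn(). */
void put_conn(struct conn *c)
{
	assert(c->refcount > 0);
	c->refcount--;
}
```

A forged value such as 0xdeadbeef simply fails the lookup, so the caller can return an error instead of dereferencing it.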
Linus Torvalds	d16738a4e7	The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future. 
-----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY 7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03 +0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3 +fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ /Dr2+aOhwYw3UD6djn3u94M9 =nWGh -----END PGP SIGNATURE----- Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. 
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future" * tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits) doc: Add housekeeping documentation kthread: Document kthread_affine_preferred() kthread: Comment on the purpose and placement of kthread_affine_node() call kthread: Honour kthreads preferred affinity after cpuset changes sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management kthread: Include kthreadd to the managed affinity list kthread: Include unbound kthreads in the managed affinity list kthread: Refine naming of affinity related fields PCI: Remove superfluous HK_TYPE_WQ check sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() cpuset: Remove cpuset_cpu_is_isolated() timers/migration: Remove superfluous cpuset isolation test cpuset: Propagate cpuset isolation update to timers through housekeeping cpuset: Propagate cpuset isolation update to workqueue through housekeeping PCI: Flush PCI probe workqueue on cpuset isolated partition change sched/isolation: Flush vmstat workqueues on cpuset isolated partition change sched/isolation: Flush memcg workqueues on cpuset isolated partition change cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset ...	2026-02-09 19:57:30 -08:00
Daniel Hodges	dd2fdc3504	SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Commit `5940d1cf9f` ("SUNRPC: Rebalance a kref in auth_gss.c") added a kref_get(&gss_auth->kref) call to balance the gss_put_auth() done in gss_release_msg(), but forgot to add a corresponding kref_put() on the error path when kstrdup_const() fails. If service_name is non-NULL and kstrdup_const() fails, the function jumps to err_put_pipe_version which calls put_pipe_version() and kfree(gss_msg), but never releases the gss_auth reference. This leads to a kref leak where the gss_auth structure is never freed. Add a forward declaration for gss_free_callback() and call kref_put() in the err_put_pipe_version error path to properly release the reference taken earlier. Fixes: `5940d1cf9f` ("SUNRPC: Rebalance a kref in auth_gss.c") Cc: stable@vger.kernel.org Signed-off-by: Daniel Hodges <git@danielhodges.dev> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-09 16:39:50 -05:00
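The leak pattern fixed above is general: once a reference has been taken, every later error path must drop it. The following userspace sketch shows that shape only. All names (`struct auth`, `msg_alloc()`, `auth_put()`) are hypothetical, a plain integer stands in for the kref, and the `fail_dup` parameter exists solely to simulate the string duplication failing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct auth {
	int refs;  /* stand-in for a kref */
};

struct msg {
	struct auth *auth;
	char *service_name;
};

void auth_put(struct auth *a)
{
	assert(a->refs > 0);
	a->refs--;
}

/* fail_dup != 0 simulates the string duplication failing. */
struct msg *msg_alloc(struct auth *a, const char *service_name, int fail_dup)
{
	struct msg *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;

	a->refs++;  /* reference normally dropped by msg_release() */
	m->auth = a;

	if (service_name) {
		size_t n = strlen(service_name) + 1;

		m->service_name = fail_dup ? NULL : malloc(n);
		if (!m->service_name)
			goto err_put_auth;
		memcpy(m->service_name, service_name, n);
	}
	return m;

err_put_auth:
	auth_put(a);  /* without this, the reference taken above leaks */
	free(m);
	return NULL;
}

void msg_release(struct msg *m)
{
	auth_put(m->auth);
	free(m->service_name);
	free(m);
}
```

With the `auth_put()` call removed from the error label, a failed `msg_alloc()` would leave `refs` permanently one too high, which is the kind of imbalance the commit message describes.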
Chenguang Zhao	afb24505ff	SUNRPC: Change list definition method The LIST_HEAD macro can both define a linked list and initialize it in one step. To simplify code, we replace the separate operations of linked list definition and manual initialization with the LIST_HEAD macro. Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-09 14:24:19 -05:00
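For readers unfamiliar with the macro, the two styles look like this. The snippet is a userspace re-creation of a small part of the kernel's list API (`LIST_HEAD_INIT`, `LIST_HEAD`, `INIT_LIST_HEAD`, `list_empty`) so that it compiles on its own; the two wrapper functions and the variable name `pending` are made up for illustration and are not taken from the SUNRPC code.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Before: definition and initialization are two separate steps. */
int old_style_is_empty(void)
{
	struct list_head pending;

	INIT_LIST_HEAD(&pending);
	return list_empty(&pending);
}

/* After: LIST_HEAD() defines and initializes the list in one step. */
int new_style_is_empty(void)
{
	LIST_HEAD(pending);

	return list_empty(&pending);
}
```

Both functions end up with an empty, self-referencing list head; the second form just removes the window in which the variable exists uninitialized.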
Linus Torvalds	698749164a	audit/stable-7.0 PR 20260203 -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCuoQUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMRFhAAntv4vmqRciFI4oEqxi5X8wmYmzc9 BUQV2XXcfO63IOHdGrXmYHByx3+mZZddAPpYMTqrzA0p2NCqi4svCVwspUHwUcTY btl+xlppBJpBtUL5pmLiP6Q4u+zURYCwuA/OKfxuKa5Frm8D3kbkd5MpxJS15Mev qqEhLT0aj6/rjQpYVwOFGMwehKfE7iuyc8XTBaetvUKHW38sj18ANSpLnN5bmiuE 3lz252kCjyDoOsu+vO0Saa8Rv8lVDjlSMn6mYr4L2fVygYwFDg2Gj7+bmB6LGYy9 YyIm6P+b23E8GOltEObpvrz8ItPR7nvKNiDMEeP1eqGzQ/Mc5OqqljVaNMNPmP+s XN/jZt02XePKXlje+C08620mDVeIYp35TK1bY2/HrYMqySE0wwO1iSyBI4ftPFtu CteM8XA8oH49pspFWbEKCHmtFFGxDVjfVM7YrHeDc+qw2tJjZ7R1GRk5hadP1Ou7 emxGLb6jfejT6NMNU8rM2RVmQNs1jcFh+8lHvDgqqQmaCXJd3AEgr+Om5w9kZ6fJ FyEkh0f9HuZdcEn8tqWaIwCAZXTzECOThj6hhxZGiG9xXFXza1eIXxq7VtIE6fdO ATAJ6cpcj0LIQmE7QWntS1NloPkWD1OSiWLNis0AgCiN97oRTk0oFJ5q7fD0bLUa HGAQqd4OJKmNciw= =pFeD -----END PGP SIGNATURE----- Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Improve the NETFILTER_PKT audit records Add source and destination ports to the NETFILTER_PKT audit records while also consolidating a lot of the code into a new, singular audit_log_nf_skb() function. This new approach to structuring the NETFILTER_PKT record generation should eliminate some unnecessary overhead when audit is not built into the kernel. - Update the audit syscall classifier code Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the audit code which classifies syscalls into categories of operations, e.g. "read" or "change attributes". - Move the syscall classifier declarations into audit_arch.h Shuffle around some header file declarations to resolve some sparse warnings. 
* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: move the compat_xxx_class[] extern declarations to audit_arch.h audit: add missing syscalls to read class audit: include source and destination ports to NETFILTER_PKT audit: add audit_log_nf_skb helper function audit: add fchmodat2() to change attributes class	2026-02-09 10:13:03 -08:00
Ilya Dryomov	8356b4b110	libceph: adapt ceph_x_challenge_blob hashing and msgr1 message signing The existing approach where ceph_x_challenge_blob is encrypted with the client's secret key and then the digest derived from the ciphertext is used for the test doesn't work with CEPH_CRYPTO_AES256KRB5 because the confounder randomizes the ciphertext: the client and the server get two different ciphertexts and therefore two different digests. msgr1 signatures are affected the same way: a digest derived from the ciphertext for the message's "sigblock" is what becomes a signature and the two sides disagree on the expected value. For CEPH_CRYPTO_AES256KRB5 (and potential future encryption schemes), switch to HMAC-SHA256 function keyed in the same way as the existing encryption. For CEPH_CRYPTO_AES, everything is preserved as is. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-02-09 12:29:22 +01:00
Ilya Dryomov	b7cc142dba	libceph: add support for CEPH_CRYPTO_AES256KRB5 This is based on AES256-CTS-HMAC384-192 crypto algorithm per RFC 8009 (i.e. Kerberos 5, hence the name) with custom-defined key usage numbers. The implementation allows a given key to have/be linked to between one and three usage numbers. The existing CEPH_CRYPTO_AES remains in place and unchanged. The usage_slot parameter that needed to be added to ceph_crypt() and its wrappers is simply ignored there. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-02-09 12:29:22 +01:00
Ilya Dryomov	6cec0b61aa	libceph: introduce ceph_crypto_key_prepare() In preparation for bringing in a new encryption scheme/key type, decouple decoding or cloning the key from allocating required crypto API objects and setting them up. The rationale is that a) in some cases a shallow clone is sufficient and b) ceph_crypto_key_prepare() may grow additional parameters that would be inconvenient to provide at the point the key is originally decoded. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-02-09 12:29:22 +01:00
Ilya Dryomov	0ee8bccf73	libceph: generalize ceph_x_encrypt_offset() and ceph_x_encrypt_buflen() - introduce the notion of a data offset for ceph_x_encrypt_offset() to allow for e.g. confounder to be prepended before the encryption header in the future. For CEPH_CRYPTO_AES, the data offset is 0 (i.e. nothing is prepended). - adjust ceph_x_encrypt_buflen() accordingly and make it account for PKCS#7 padding that is used by CEPH_CRYPTO_AES precisely instead of just always adding 16. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-02-09 12:29:21 +01:00
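The difference between accounting for PKCS#7 padding precisely and always adding 16 is simple arithmetic. The sketch below shows only that calculation for a 16-byte block size; the function names are hypothetical, and it deliberately ignores the encryption header and data offset that the real ceph_x_encrypt_buflen() also has to account for.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_SIZE 16u

/* Exact PKCS#7 size: padding is always added, 1 to AES_BLOCK_SIZE bytes. */
size_t pkcs7_padded_len(size_t len)
{
	return len + (AES_BLOCK_SIZE - len % AES_BLOCK_SIZE);
}

/* Conservative estimate in the "always add 16" style. */
size_t padded_len_upper_bound(size_t len)
{
	return len + AES_BLOCK_SIZE;
}
```

For a 17-byte input the exact size is 32 bytes while the conservative estimate is 33; the two agree only when the input is already a multiple of the block size.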
Ilya Dryomov	ac431d597a	libceph: define and enforce CEPH_MAX_KEY_LEN When decoding the key, verify that the key material would fit into a fixed-size buffer in process_auth_done() and generally has a sane length. The new CEPH_MAX_KEY_LEN check replaces the existing check for a key with no key material which is a) not universal since CEPH_CRYPTO_NONE has to be excluded and b) doesn't provide much value since a smaller than needed key is just as invalid as no key -- this has to be handled elsewhere anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-02-09 12:29:21 +01:00
Tetsuo Handa	4efa91a285	xfrm: always flush state and policy upon NETDEV_UNREGISTER event syzbot is reporting that "struct xfrm_state" refcount is leaking. unregister_netdevice: waiting for netdevsim0 to become free. Usage count = 2 ref_tracker: netdev@ffff888052f24618 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4400 [inline] netdev_tracker_alloc include/linux/netdevice.h:4412 [inline] xfrm_dev_state_add+0x3a5/0x1080 net/xfrm/xfrm_device.c:316 xfrm_state_construct net/xfrm/xfrm_user.c:986 [inline] xfrm_add_sa+0x34ff/0x5fa0 net/xfrm/xfrm_user.c:1022 xfrm_user_rcv_msg+0x58e/0xc00 net/xfrm/xfrm_user.c:3507 netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2550 xfrm_netlink_rcv+0x71/0x90 net/xfrm/xfrm_user.c:3529 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x8c8/0xdd0 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa5d/0xc30 net/socket.c:2592 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2646 __sys_sendmsg+0x16d/0x220 net/socket.c:2678 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f This is because commit `d77e38e612` ("xfrm: Add an IPsec hardware offloading API") implemented xfrm_dev_unregister() as no-op despite xfrm_dev_state_add() from xfrm_state_construct() acquires a reference to "struct net_device". I guess that that commit expected that NETDEV_DOWN event is fired before NETDEV_UNREGISTER event fires, and also assumed that xfrm_dev_state_add() is called only if (dev->features & NETIF_F_HW_ESP) != 0. Sabrina Dubroca identified steps to reproduce the same symptoms as below. 
echo 0 > /sys/bus/netdevsim/new_device dev=$(ls -1 /sys/bus/netdevsim/devices/netdevsim0/net/) ip xfrm state add src 192.168.13.1 dst 192.168.13.2 proto esp \ spi 0x1000 mode tunnel aead 'rfc4106(gcm(aes))' $key 128 \ offload crypto dev $dev dir out ethtool -K $dev esp-hw-offload off echo 0 > /sys/bus/netdevsim/del_device Like these steps indicate, the NETIF_F_HW_ESP bit can be cleared after xfrm_dev_state_add() acquired a reference to "struct net_device". Also, xfrm_dev_state_add() does not check for the NETIF_F_HW_ESP bit when acquiring a reference to "struct net_device". Commit `03891f820c` ("xfrm: handle NETDEV_UNREGISTER for xfrm device") re-introduced the NETDEV_UNREGISTER event to xfrm_dev_event(), but that commit for unknown reason chose to share xfrm_dev_down() between the NETDEV_DOWN event and the NETDEV_UNREGISTER event. I guess that that commit missed the behavior in the previous paragraph. Therefore, we need to re-introduce xfrm_dev_unregister() in order to release the reference to "struct net_device" by unconditionally flushing state and policy. Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Fixes: `d77e38e612` ("xfrm: Add an IPsec hardware offloading API") Cc: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-02-09 10:28:05 +01:00
Alice Mikityanska	1676ebba39	net/ipv6: Remove jumbo_remove step from TX path Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove unnecessary steps from the GSO TX path, that used to check and remove HBH. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-5-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:12 -08:00
Alice Mikityanska	81be30c1f5	net/ipv6: Drop HBH for BIG TCP on RX side Complementary to the previous commit, stop inserting HBH when building BIG TCP GRO SKBs. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-4-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:12 -08:00
Alice Mikityanska	741d069aa4	net/ipv6: Drop HBH for BIG TCP on TX side BIG TCP IPv6 inserts a hop-by-hop extension header to indicate the real IPv6 payload length when it doesn't fit into the 16-bit field in the IPv6 header itself. While it helps tools parse the packet, it also requires every driver that supports TSO and BIG TCP to remove this 8-byte extension header. It might not sound that bad until we try to apply it to tunneled traffic. Currently, the drivers don't attempt to strip HBH if skb->encapsulation = 1. Moreover, trying to do so would require dissecting different tunnel protocols and making corresponding adjustments on case-by-case basis, which would slow down the fastpath (potentially also requiring adjusting checksums in outer headers). At the same time, BIG TCP IPv4 doesn't insert any extra headers and just calculates the payload length from skb->len, significantly simplifying implementing BIG TCP for tunnels. Stop inserting HBH when building BIG TCP GSO SKBs. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-3-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:12 -08:00
Alice Mikityanska	b2936b4fd5	net/ipv6: Introduce payload_len helpers The next commits will transition away from using the hop-by-hop extension header to encode packet length for BIG TCP. Add wrappers around ip6->payload_len that return the actual value if it's non-zero, and calculate it from skb->len if payload_len is set to zero (and a symmetrical setter). The new helpers are used wherever the surrounding code supports the hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h). No behavioral change in this commit. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:03 -08:00
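The getter and setter described in this commit can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel helpers: `struct toy_pkt` is a toy stand-in for an skb positioned at the IPv6 header, the function names are hypothetical, and the setter's rule of storing zero for lengths that do not fit in 16 bits is inferred from the description of a "symmetrical setter".

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>  /* htons(), ntohs() */

#define IPV6_HDR_LEN 40u

/* Toy stand-in for an skb holding one IPv6 packet. */
struct toy_pkt {
	uint32_t len;          /* total bytes, starting at the IPv6 header */
	uint16_t payload_len;  /* header field, network byte order */
};

/* Getter: use the header field unless it is zero, then derive from len. */
uint32_t toy_payload_len(const struct toy_pkt *p)
{
	uint32_t len = ntohs(p->payload_len);

	return len ? len : p->len - IPV6_HDR_LEN;
}

/* Setter: lengths that do not fit in 16 bits are stored as zero. */
void toy_set_payload_len(struct toy_pkt *p, uint32_t len)
{
	assert(p->len == len + IPV6_HDR_LEN);
	p->payload_len = len <= UINT16_MAX ? htons((uint16_t)len) : 0;
}
```

With these two wrappers a caller never needs to know whether the packet is an ordinary one or a BIG TCP one whose payload exceeds 65535 bytes.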
Matthieu Baerts (NGI0)	136f1e168f	mptcp: fix kdoc warnings The following warnings were visible: $ ./scripts/kernel-doc -Wall -none \ net/mptcp/ include/net/mptcp.h include/uapi/linux/mptcp.h \ include/trace/events/mptcp.h Warning: net/mptcp/token.c:108 No description found for return value of 'mptcp_token_new_request' Warning: net/mptcp/token.c:151 No description found for return value of 'mptcp_token_new_connect' Warning: net/mptcp/token.c:246 No description found for return value of 'mptcp_token_get_sock' Warning: net/mptcp/token.c:298 No description found for return value of 'mptcp_token_iter_next' Warning: net/mptcp/protocol.c:4431 No description found for return value of 'mptcp_splice_read' Warning: include/uapi/linux/mptcp_pm.h:13 missing initial short description on line: enum mptcp_event_type Address all of them: either by using the 'Return:' keyword, or by adding a missing initial short description. The MPTCP CI will soon report issues with kdoc to avoid introducing new issues and being flagged by the Netdev CI. Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260205-net-mptcp-misc-fixes-6-19-rc8-v2-3-c2720ce75c34@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:35:06 -08:00
Matthieu Baerts (NGI0)	364a7084df	mptcp: pm: in-kernel: clarify mptcp_pm_remove_anno_addr() The variable 'ret' was used, but it was not clear what it was, and probably led to an issue [1]. Rename it to 'announced' to avoid confusion. While at it, remove the returned value of the helper: it is only used in one place, and the returned value is not used. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/606 [1] Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260205-net-mptcp-misc-fixes-6-19-rc8-v2-2-c2720ce75c34@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:35:06 -08:00
Matthieu Baerts (NGI0)	d191101dee	mptcp: pm: in-kernel: always set ID as avail when rm endp Syzkaller managed to find a combination of actions that was generating this warning: WARNING: net/mptcp/pm_kernel.c:1074 at __mark_subflow_endp_available net/mptcp/pm_kernel.c:1074 [inline], CPU#1: syz.7.48/2535 WARNING: net/mptcp/pm_kernel.c:1074 at mptcp_pm_nl_fullmesh net/mptcp/pm_kernel.c:1446 [inline], CPU#1: syz.7.48/2535 WARNING: net/mptcp/pm_kernel.c:1074 at mptcp_pm_nl_set_flags_all net/mptcp/pm_kernel.c:1474 [inline], CPU#1: syz.7.48/2535 WARNING: net/mptcp/pm_kernel.c:1074 at mptcp_pm_nl_set_flags+0x5de/0x640 net/mptcp/pm_kernel.c:1538, CPU#1: syz.7.48/2535 Modules linked in: CPU: 1 UID: 0 PID: 2535 Comm: syz.7.48 Not tainted 6.18.0-03987-gea5f5e676cf5 #17 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 25.10 PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:__mark_subflow_endp_available net/mptcp/pm_kernel.c:1074 [inline] RIP: 0010:mptcp_pm_nl_fullmesh net/mptcp/pm_kernel.c:1446 [inline] RIP: 0010:mptcp_pm_nl_set_flags_all net/mptcp/pm_kernel.c:1474 [inline] RIP: 0010:mptcp_pm_nl_set_flags+0x5de/0x640 net/mptcp/pm_kernel.c:1538 Code: 89 c7 e8 c5 8c 73 fe e9 f7 fd ff ff 49 83 ef 80 e8 b7 8c 73 fe 4c 89 ff be 03 00 00 00 e8 4a 29 e3 fe eb ac e8 a3 8c 73 fe 90 <0f> 0b 90 e9 3d ff ff ff e8 95 8c 73 fe b8 a1 ff ff ff eb 1a e8 89 RSP: 0018:ffffc9001535b820 EFLAGS: 00010287 netdevsim0: tun_chr_ioctl cmd 1074025677 RAX: ffffffff82da294d RBX: 0000000000000001 RCX: 0000000000080000 RDX: ffffc900096d0000 RSI: 00000000000006d6 RDI: 00000000000006d7 netdevsim0: linktype set to 823 RBP: ffff88802cdb2240 R08: 00000000000104ae R09: ffffffffffffffff R10: ffffffff82da27d4 R11: 0000000000000000 R12: 0000000000000000 R13: ffff88801246d8c0 R14: ffffc9001535b8b8 R15: ffff88802cdb1800 FS: 00007fc6ac5a76c0(0000) GS:ffff8880f90c8000(0000) knlGS:0000000000000000 netlink: 'syz.3.50': attribute type 5 has an invalid length. 
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 netlink: 1232 bytes leftover after parsing attributes in process `syz.3.50'. CR2: 0000200000010000 CR3: 0000000025b1a000 CR4: 0000000000350ef0 Call Trace: <TASK> mptcp_pm_set_flags net/mptcp/pm_netlink.c:277 [inline] mptcp_pm_nl_set_flags_doit+0x1d7/0x210 net/mptcp/pm_netlink.c:282 genl_family_rcv_msg_doit+0x117/0x180 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x3a8/0x3f0 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x16d/0x240 net/netlink/af_netlink.c:2550 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x3e9/0x4c0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x4ab/0x5b0 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:718 [inline] __sock_sendmsg+0xc9/0xf0 net/socket.c:733 ____sys_sendmsg+0x272/0x3b0 net/socket.c:2608 ___sys_sendmsg+0x2de/0x320 net/socket.c:2662 __sys_sendmsg net/socket.c:2694 [inline] __do_sys_sendmsg net/socket.c:2699 [inline] __se_sys_sendmsg net/socket.c:2697 [inline] __x64_sys_sendmsg+0x110/0x1a0 net/socket.c:2697 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xed/0x360 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fc6adb66f6d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fc6ac5a6ff8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fc6addf5fa0 RCX: 00007fc6adb66f6d RDX: 0000000000048084 RSI: 00002000000002c0 RDI: 000000000000000e RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 netlink: 'syz.5.51': attribute type 2 has an invalid length. 
R13: 00007fff25e91fe0 R14: 00007fc6ac5a7ce4 R15: 00007fff25e920d7 </TASK> The actions that caused that seem to be: - Create an MPTCP endpoint for address A without any flags - Create a new MPTCP connection from address A - Remove the MPTCP endpoint: the corresponding subflows will be removed - Recreate the endpoint with the same ID, but with the subflow flag - Change the same endpoint to add the fullmesh flag In this case, msk->pm.local_addr_used has been kept to 0 as expected, but the corresponding bit in msk->pm.id_avail_bitmap was still unset after having removed the endpoint, causing the splat later on. When removing an endpoint, the corresponding endpoint ID was only marked as available for "signal" types with an announced address, plus all "subflow" types, but not the other types like an endpoint corresponding to the initial subflow. In these cases, re-creating an endpoint with the same ID didn't signal/create anything. Here, adding the fullmesh flag was creating the splat when calling __mark_subflow_endp_available() from mptcp_pm_nl_fullmesh(), because msk->pm.local_addr_used was set to 0 while the ID was marked as used. To fix this issue, the corresponding bit in msk->pm.id_avail_bitmap can always be set as available when removing an MPTCP in-kernel endpoint. In other words, moving the call to __set_bit() to do it in all cases, except for "subflow" types where this bit is handled in a dedicated helper. Note: instead of adding a new spin_(un)lock_bh that would be taken in all cases, do all the actions requiring the spin lock under the same block. This modification potentially fixes another issue reported by syzbot, see [1]. But without a reproducer or more details about what exactly happened before, it is hard to confirm. 
Fixes: `e255683c06` ("mptcp: pm: re-using ID of unused removed ADD_ADDR") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/606 Reported-by: syzbot+f56f7d56e2c6e11a01b6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/68fcfc4a.050a0220.346f24.02fb.GAE@google.com [1] Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260205-net-mptcp-misc-fixes-6-19-rc8-v2-1-c2720ce75c34@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:35:06 -08:00
Eric Dumazet	a14d931790	ipv6: do not use skb_header_pointer() in icmpv6_filter() Prefer pskb_may_pull() to avoid a stack canary in raw6_local_deliver(). Note: skb->head can change, hence we reload ip6h pointer in ipv6_raw_deliver() $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-86 (-86) Function old new delta raw6_local_deliver 780 694 -86 Total: Before=24889784, After=24889698, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260205211909.4115285-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:34:20 -08:00
Eric Dumazet	a35b6e4863	tcp: inline tcp_filter() This helper is already (auto)inlined from IPv4 TCP stack. Make it an inline function to benefit IPv6 as well. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19) Function old new delta tcp_v6_rcv 3448 3478 +30 __pfx_tcp_filter 16 - -16 tcp_filter 33 - -33 Total: Before=24891904, After=24891885, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:12:11 -08:00
Oliver Hartkopp	abf981bb8d	net: skb: allow up to 8 skb extension ids The skb extension ids range from 0 .. 7 to fit their bits as flags into a single byte. The ids are automatically enumerated in enum skb_ext_id in skbuff.h, where SKB_EXT_NUM is defined as the last value. When having 8 skb extension ids (0 .. 7), SKB_EXT_NUM becomes 8 which is a valid value for SKB_EXT_NUM. Fixes: `96ea3a1e2d` ("can: add CAN skb extension infrastructure") Link: https://lore.kernel.org/netdev/aXoMqaA7b2CqJZNA@strlen.de/ Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260205-skb_ext-v1-1-9ba992ccee8b@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:07:24 -08:00
Eric Dumazet	2214aab268	net_sched: sch_fq: rework fq_gc() to avoid stack canary Using kmem_cache_free_bulk() in fq_gc() was not optimal. 1) It needs an array. 2) It is only saving cpu cycles for large batches. The automatic array forces a stack canary, which is expensive. In practice fq_gc was finding zero, one or two flows at most per round. Remove the array, use kmem_cache_free(). This makes fq_enqueue() smaller and faster. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-79 (-79) Function old new delta fq_enqueue 1629 1550 -79 Total: Before=24886583, After=24886504, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260204190034.76277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:03:44 -08:00
Qiliang Yuan	7acee67a6b	netns: optimize netns cleaning by batching unhash_nsid calls Currently, unhash_nsid() scans the entire system for each netns being killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as __peernet2id() also performs a linear search in the IDR. Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move unhash_nsid() out of the per-netns loop in cleanup_net() to perform a single-pass traversal over survivor namespaces. Identify dying peers by an 'is_dying' flag, which is set under net_rwsem write lock after the netns is removed from the global list. This batches the unhashing work and eliminates the O(L_dying_net) multiplier. To minimize the impact on struct net size, 'is_dying' is placed in an existing hole after 'hash_mix' in struct net. Use a restartable idr_get_next() loop for iteration. This avoids the unsafe modification issue inherent to idr_for_each() callbacks and allows dropping the nsid_lock to safely call sleepy rtnl_net_notifyid(). Clean up redundant nsid_lock and simplify the destruction loop now that unhashing is centralized. Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204074854.3506916-1-realwujing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:01:31 -08:00
Amery Hung	0be08389c7	bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}() properly by switching to bpf_selem_unlink_nofail(). Both functions iterate their own RCU-protected list of selems and call bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when both map_free() and destroy() fail to remove a selem from b->list (extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(), also switch to hlist_for_each_entry_rcu() since we no longer iterate local_storage->list under local_storage->lock. bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths so reuse_now should always be false. Remove it from the argument and hardcode it. Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com	2026-02-06 14:47:59 -08:00
Amery Hung	3417dffb58	bpf: Remove unused percpu counter from bpf_local_storage_map_free Percpu locks have been removed from cgroup and task local storage. Now that all local storage no longer use percpu variables as locks preventing recursion, there is no need to pass them to bpf_local_storage_map_free(). Remove the argument from the function. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com	2026-02-06 14:29:18 -08:00
Amery Hung	403e935f91	bpf: Convert bpf_selem_unlink to failable To prepare changing both bpf_local_storage_map_bucket::lock and bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Open code bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since unlink_map and unlink_storage must be done together after all the necessary locks are acquired. For bpf_local_storage_map_free(), ignore the return from bpf_selem_unlink() for now. A later patch will allow it to unlink selems even when failing to acquire locks. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com	2026-02-06 14:28:59 -08:00
Amery Hung	fd103ffc57	bpf: Convert bpf_selem_link_map to failable To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_link_map() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com	2026-02-06 14:28:55 -08:00
Amery Hung	0ccef7079e	bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage A later bpf_local_storage refactor will acquire all locks before performing any update. To reduce the number of locks that need to be taken in bpf_local_storage_map_update(), determine the bucket based on the local_storage an selem belongs to instead of the selem pointer. Currently, when a new selem needs to be created to replace the old selem in bpf_local_storage_map_update(), locks of both buckets need to be acquired to prevent racing. This can be simplified if the two selems belong to the same bucket so that only one bucket needs to be locked. Therefore, instead of hashing the selem, hash the local_storage pointer the selem belongs to. Performance wise, this is slightly better as update now requires locking one bucket. It should not change the level of contention on one bucket as the pointers to local storages of selems in a map are just as unique as pointers to selems. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com	2026-02-06 14:28:43 -08:00
Pablo Neira Ayuso	648946966a	netfilter: nft_set_rbtree: validate open interval overlap Open intervals do not have an end element, in particular an open interval at the end of the set is hard to validate because it lacks the end element, and interval validation relies on such an end element to perform the checks. This patch adds a new flag field to struct nft_set_elem, this is not an issue because this is a temporary object that is allocated on the stack from the insert/deactivate path. This flag field is used to specify that this is the last element in this add/delete command. The last flag is used, in combination with the start element cookie, to check if there is a partial overlap, eg. Already exists: 255.255.255.0-255.255.255.254 Add interval: 255.255.255.0-255.255.255.255 ~~~~~~~~~~~~~ start element overlap Basically, the idea is to check for an existing end element in the set if there is an overlap with an existing start element. However, the last open interval can come in any position in the add command, the corner case can get a bit more complicated: Already exists: 255.255.255.0-255.255.255.254 Add intervals: 255.255.255.0-255.255.255.255,255.255.255.0-255.255.255.254 ~~~~~~~~~~~~~ start element overlap To catch this overlap, annotate that the new start element is a possible overlap, then report the overlap if the next element is another start element that confirms that the previous element is an open interval at the end of the set. For deletions, do not update the start cookie when deleting an open interval, otherwise this can trigger spurious EEXIST when adding new elements. Unfortunately, there is no NFT_SET_ELEM_INTERVAL_OPEN flag which would make it easier to detect open interval overlaps. Fixes: `7c84d41416` ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso	782f268812	netfilter: nft_set_rbtree: validate element belonging to interval The existing partial overlap detection does not check if the elements belong to the interval, eg. add element inet x y { 1.1.1.1-2.2.2.2, 4.4.4.4-5.5.5.5 } add element inet x y { 1.1.1.1-5.5.5.5 } => this should fail: ENOENT Similar situation occurs with deletions: add element inet x y { 1.1.1.1-2.2.2.2, 4.4.4.4-5.5.5.5} delete element inet x y { 1.1.1.1-5.5.5.5 } => this should fail: ENOENT This currently works via mitigation by nft in userspace, which is performing the overlap detection before sending the elements to the kernel. This requires a previous netlink dump of the set content which slows down incremental updates on interval sets, because a netlink set content dump is needed. This patch extends the existing overlap detection to track the most recent start element that already exists. The pointer to the existing start element is stored as a cookie (no pointer dereference is ever possible). If the end element is added and it already exists, then check that the existing end element is adjacent to the already existing start element. Similar logic applies to element deactivation. This patch also annotates the timestamp to identify if start cookie comes from an older batch, in such case reset it. Otherwise, a failing create element command leaves the start cookie in place, resulting in bogus error reporting. There are still a few more corner cases of overlap detection related to the open interval that are addressed in follow up patches. This addresses an early design mistake where an interval is expressed as two elements, using the NFT_SET_ELEM_INTERVAL_END flag, instead of the more recent NFTA_SET_ELEM_KEY_END attribute that pipapo already uses. Fixes: `7c84d41416` ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso	4780ec142c	netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets Userspace provides an optimized representation in case intervals are adjacent, where the end element is omitted. The existing partial overlap detection logic skips anonymous set checks on start elements for this reason. However, it is possible to add overlapping intervals to this anonymous set where two start elements are the same, eg. A-B, A-C where C < B. start end A B start end A C Restore the check on overlapping start elements to report an overlap. Fixes: `c9e6978e27` ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso	7f9203f41a	netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval Userspace adds a non-matching null element to the kernel for historical reasons. This null element is added when the set is populated with elements. Inclusion of this element is conditional, therefore, userspace needs to dump the set content to check for its presence. If the NLM_F_CREATE flag is turned on, this becomes an issue because the kernel bogusly reports EEXIST. Add a special case to ignore NLM_F_CREATE in this case, therefore, re-adding the null element never fails. Fixes: `c016c7e45d` ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:36:07 +01:00
Anders Grahn	1e13f27e06	netfilter: nft_counter: fix reset of counters on 32bit archs nft_counter_reset() calls u64_stats_add() with a negative value to reset the counter. This will work on 64bit archs, hence the negative value added will wrap as a 64bit value which then can wrap the stat counter as well. On 32bit archs, the added negative value will wrap as a 32bit value and _not_ wrapping the stat counter properly. In most cases, this would just lead to a very large 32bit value being added to the stat counter. Fix by introducing u64_stats_sub(). Fixes: `4a1d3acd6e` ("netfilter: nft_counter: Use u64_stats_t for statistic.") Signed-off-by: Anders Grahn <anders.grahn@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:55 +01:00
Florian Westphal	2f635adbe2	netfilter: nft_set_hash: fix get operation on big endian tests/shell/testcases/packetpath/set_match_nomatch_hash_fast fails on big endian with: Error: Could not process rule: No such file or directory reset element ip test s { 244.147.90.126 } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Fatal: Cannot fetch element "244.147.90.126" ... because the wrong bucket is searched, jhash() and jhash_1word() are not interchangeable on big endian. Fixes: `3b02b0adc2` ("netfilter: nft_set_hash: fix lookups with fixed size hash on big endian") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:55 +01:00
Qingfang Deng	2a441a9aac	netfilter: flowtable: dedicated slab for flow entry The size of `struct flow_offload` has grown beyond 256 bytes on 64-bit kernels (currently 280 bytes) because of the `flow_offload_tunnel` member added recently. So kmalloc() allocates from the kmalloc-512 slab, causing significant memory waste per entry. Introduce a dedicated slab cache for flow entries to reduce memory footprint. Results in a reduction from 512 bytes to 320 bytes per entry on x86_64 kernels. Signed-off-by: Qingfang Deng <dqfext@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:55 +01:00
Florian Westphal	207b3ebacb	netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation Ulrich reports a regression with nfqueue: If an application did not set the 'F_GSO' capability flag and a gso packet with an unconfirmed nf_conn entry is received, all packets are now dropped instead of queued, because the check happens after skb_gso_segment(). In that case, we did have exclusive ownership of the skb and its associated conntrack entry. The elevated use count is due to skb_clone happening via skb_gso_segment(). Move the check so that it is performed vs. the aggregated packet. Then, annotate the individual segments except the first one so we can do a 2nd check at reinject time. For the normal case, where userspace does in-order reinjects, this avoids packet drops: first reinjected segment continues traversal and confirms entry, remaining segments observe the confirmed entry. While at it, simplify nf_ct_drop_unconfirmed(): We only care about unconfirmed entries with a refcnt > 1, there is no need to special-case dying entries. This only happens with UDP. With TCP, the only unconfirmed packet will be the TCP SYN, those aren't aggregated by GRO. Next patch adds a udpgro test case to cover this scenario. Reported-by: Ulrich Weber <ulrich.weber@gmail.com> Fixes: `7d8dc1c7be` ("netfilter: nf_queue: drop packets with cloned unconfirmed conntracks") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:55 +01:00
Florian Westphal	35f83a7552	netfilter: nft_set_rbtree: don't gc elements on insert During insertion we can queue up expired elements for garbage collection. In case of later abort, the commit hook will never be called. Packet path and 'get' requests will find free'd elements in the binary search blob: nft_set_ext_key include/net/netfilter/nf_tables.h:800 [inline] nft_array_get_cmp+0x1f6/0x2a0 net/netfilter/nft_set_rbtree.c:133 __inline_bsearch include/linux/bsearch.h:15 [inline] bsearch+0x50/0xc0 lib/bsearch.c:33 nft_rbtree_get+0x16b/0x400 net/netfilter/nft_set_rbtree.c:169 nft_setelem_get net/netfilter/nf_tables_api.c:6495 [inline] nft_get_set_elem+0x420/0xaa0 net/netfilter/nf_tables_api.c:6543 nf_tables_getsetelem+0x448/0x5e0 net/netfilter/nf_tables_api.c:6632 nfnetlink_rcv_msg+0x8ae/0x12c0 net/netfilter/nfnetlink.c:290 Also, when we insert an element that triggers -EEXIST, and that insertion happens to also zap a timed-out entry, we end up with same issue: Neither commit nor abort hook is called. Fix this by removing gc api usage during insertion. The blamed commit also removes concurrency of the rbtree with the packet path, so we can now safely rb_erase() the element and move it to a new expired list that can be reaped in the commit hook before building the next blob iteration. This also avoids the need to rebuild the blob in the abort path: Expired elements seen during insertion attempts are kept around until a transaction passes. Reported-by: syzbot+d417922a3e7935517ef6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d417922a3e7935517ef6 Fixes: `7e43e0a114` ("netfilter: nft_set_rbtree: translate rbtree to array for binary search") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:41 +01:00
Votokina Victoria	c9efde1e53	nfc: hci: shdlc: Stop timers and work before freeing context llc_shdlc_deinit() purges SHDLC skb queues and frees the llc_shdlc structure while its timers and state machine work may still be active. Timer callbacks can schedule sm_work, and sm_work accesses SHDLC state and the skb queues. If teardown happens in parallel with a queued/running work item, it can lead to UAF and other shutdown races. Stop all SHDLC timers and cancel sm_work synchronously before purging the queues and freeing the context. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `4a61cd6687` ("NFC: Add an shdlc llc module to llc core") Signed-off-by: Votokina Victoria <Victoria.Votokina@kaspersky.com> Link: https://patch.msgid.link/20260203113158.2008723-1-Victoria.Votokina@kaspersky.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 18:46:20 -08:00
Eric Dumazet	c89477ad79	inet: RAW sockets using IPPROTO_RAW MUST drop incoming ICMP Yizhou Zhao reported that simply having one RAW socket on protocol IPPROTO_RAW (255) was dangerous. socket(AF_INET, SOCK_RAW, 255); A malicious incoming ICMP packet can set the protocol field to 255 and match this socket, leading to FNHE cache changes. inner = IP(src="192.168.2.1", dst="8.8.8.8", proto=255)/Raw("TEST") pkt = IP(src="192.168.1.1", dst="192.168.2.1")/ICMP(type=3, code=4, nexthopmtu=576)/inner "man 7 raw" states: A protocol of IPPROTO_RAW implies enabled IP_HDRINCL and is able to send any IP protocol that is specified in the passed header. Receiving of all IP protocols via IPPROTO_RAW is not possible using raw sockets. Make sure we drop these malicious packets. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/netdev/20251109134600.292125-1-zhaoyz24@mails.tsinghua.edu.cn/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260203192509.682208-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 12:36:49 -08:00
Daniel Hodges	6a65c0cb0f	tipc: fix RCU dereference race in tipc_aead_users_dec() tipc_aead_users_dec() calls rcu_dereference(aead) twice: once to store in 'tmp' for the NULL check, and again inside the atomic_add_unless() call. Use the already-dereferenced 'tmp' pointer consistently, matching the correct pattern used in tipc_aead_users_inc() and tipc_aead_users_set(). Fixes: `fc1b6d6de2` ("tipc: introduce TIPC encryption & authentication") Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Hodges <hodgesd@meta.com> Link: https://patch.msgid.link/20260203145621.17399-1-git@danielhodges.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 12:36:31 -08:00
Jakub Kicinski	a182a62ff7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc9). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c `3125fc1701` ("net: spacemit: k1-emac: fix jumbo frame support") `f66086798f` ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:54:08 -08:00
Davide Caratti	a90f6dcefc	net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueue Currently we are registering one dynamic lockdep key for each allocated qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects packets to another device while the root lock is acquired [1]. Since dynamic keys are a limited resource, we can save them at least for qdiscs that are not meant to acquire the root lock in the traffic path, or to carry traffic at all, like: - clsact - ingress - noqueue Don't register dynamic keys for the above schedulers, so that we hit MAX_LOCKDEP_KEYS later in our tests. [1] https://github.com/multipath-tcp/mptcp_net-next/issues/451 Changes in v2: - change ordering of spin_lock_init() vs. lockdep_register_key() (Jakub Kicinski) Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:32:45 -08:00
Eric Dumazet	85d05e2817	ipv6: change inet6_sk_rebuild_header() to use inet->cork.fl.u.ip6 TCP v6 spends a good amount of time rebuilding a fresh fl6 at each transmit in inet6_csk_xmit()/inet6_csk_route_socket(). TCP v4 caches the information in inet->cork.fl.u.ip4 instead. This patch is a first step converting IPv6 to the same strategy: Before this patch inet6_sk_rebuild_header() only validated/rebuilt a dst. Automatic variable @fl6 content was lost. After this patch inet6_sk_rebuild_header() also initializes inet->cork.fl.u.ip6, which can be reused in the future. This makes inet6_sk_rebuild_header() very similar to inet_sk_rebuild_header(). Also remove the EXPORT_SYMBOL_GPL(), inet6_sk_rebuild_header() is not called from any module. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260204163035.4123817-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:24:10 -08:00
Eric Dumazet	22c1264415	tcp: move __reqsk_free() out of line Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113) Function old new delta __reqsk_free - 114 +114 sock_edemux 18 82 +64 inet_csk_listen_start 233 264 +31 __pfx___reqsk_free - 16 +16 __pfx_reqsk_queue_alloc 16 - -16 __pfx_reqsk_free 16 - -16 reqsk_queue_alloc 46 - -46 tcp_req_err 272 177 -95 reqsk_fastopen_remove 348 253 -95 cookie_bpf_check 157 62 -95 cookie_tcp_reqsk_alloc 387 290 -97 cookie_v4_check 1568 1465 -103 reqsk_free 105 - -105 cookie_v6_check 1519 1412 -107 sock_gen_put 187 78 -109 sock_pfree 212 82 -130 tcp_try_fastopen 1818 1683 -135 tcp_v4_rcv 3478 3294 -184 reqsk_put 306 90 -216 tcp_get_cookie_sock 551 318 -233 tcp_v6_rcv 3404 3141 -263 tcp_conn_request 2677 2384 -293 Total: Before=24887415, After=24885302, chg -0.01% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:06 -08:00
Eric Dumazet	7d2064eb73	net: get rid of net/core/request_sock.c After DCCP removal, this file was not needed any more. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
Eric Dumazet	a90765c6f6	tcp: move reqsk_fastopen_remove to net/ipv4/tcp_fastopen.c This function belongs to TCP stack, not to net/core/request_sock.c We get rid of the now empty request_sock.c in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
Eric Dumazet	d5c5391554	inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c Only called once from inet_csk_listen_start(), it can be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
Shigeru Yoshida	bbf4a17ad9	ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF syzbot reported a kernel BUG in fib6_add_rt2node() when adding an IPv6 route. [0] Commit `f72514b3c5` ("ipv6: clear RA flags when adding a static route") introduced logic to clear RTF_ADDRCONF from existing routes when a static route with the same nexthop is added. However, this causes a problem when the existing route has a gateway. When RTF_ADDRCONF is cleared from a route that has a gateway, that route becomes eligible for ECMP, i.e. rt6_qualify_for_ecmp() returns true. The issue is that this route was never added to the fib6_siblings list. This leads to a mismatch between the following counts: - The sibling count computed by iterating fib6_next chain, which includes the newly ECMP-eligible route - The actual siblings in fib6_siblings list, which does not include that route When a subsequent ECMP route is added, fib6_add_rt2node() hits BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings) because the counts don't match. Fix this by only clearing RTF_ADDRCONF when the existing route does not have a gateway. Routes without a gateway cannot qualify for ECMP anyway (rt6_qualify_for_ecmp() requires fib_nh_gw_family), so clearing RTF_ADDRCONF on them is safe and matches the original intent of the commit. [0]: kernel BUG at net/ipv6/ip6_fib.c:1217! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 6010 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 RIP: 0010:fib6_add_rt2node+0x3433/0x3470 net/ipv6/ip6_fib.c:1217 [...] 
Call Trace: <TASK> fib6_add+0x8da/0x18a0 net/ipv6/ip6_fib.c:1532 __ip6_ins_rt net/ipv6/route.c:1351 [inline] ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3946 ipv6_route_ioctl+0x35c/0x480 net/ipv6/route.c:4571 inet6_ioctl+0x219/0x280 net/ipv6/af_inet6.c:577 sock_do_ioctl+0xdc/0x300 net/socket.c:1245 sock_ioctl+0x576/0x790 net/socket.c:1366 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: `f72514b3c5` ("ipv6: clear RA flags when adding a static route") Reported-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cb809def1baaac68ab92 Tested-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260204095837.1285552-1-syoshida@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 08:38:40 -08:00
Eric Dumazet	7a4cd71fa4	net: add vlan_get_protocol_offset_inline() helper skb_protocol() is bloated, and forces slow stack canaries in many fast paths. Add vlan_get_protocol_offset_inline() which deals with the non-vlan common cases. __vlan_get_protocol_offset() is now out of line. It returns a vlan_type_depth struct to avoid stack canaries in callers. struct vlan_type_depth { __be16 type; u16 depth; }; $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320) Function old new delta vlan_get_protocol_dgram 61 59 -2 __pfx_skb_protocol 16 - -16 __vlan_get_protocol_offset 307 273 -34 tap_get_user 1374 1207 -167 ip_md_tunnel_xmit 1625 1452 -173 tap_sendmsg 940 753 -187 netif_skb_features 1079 866 -213 netem_enqueue 3017 2800 -217 vlan_parse_protocol 271 50 -221 tso_start 567 344 -223 fq_dequeue 1908 1685 -223 skb_network_protocol 434 205 -229 ip6_tnl_xmit 2639 2409 -230 br_dev_queue_push_xmit 474 236 -238 skb_protocol 258 - -258 packet_parse_headers 621 357 -264 __ip6_tnl_rcv 1306 1039 -267 skb_csum_hwoffload_help 515 224 -291 ip_tunnel_xmit 2635 2339 -296 sch_frag_xmit_hook 1582 1233 -349 bpf_skb_ecn_set_ce 868 457 -411 IP6_ECN_decapsulate 1297 768 -529 ip_tunnel_rcv 2121 1489 -632 ipip6_rcv 2572 1922 -650 Total: Before=24892803, After=24886483, chg -0.03% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 16:33:52 +01:00
Oliver Hartkopp	3ffecc14ec	can: gw: use can_gw_hops instead of sk_buff::csum_start As CAN skbs don't use IP checksums the skb->csum_start variable was used to store the can-gw CAN frame time-to-live counter together with skb->ip_summed set to CHECKSUM_UNNECESSARY. Remove the 'hack' using the skb->csum_start variable and move the content to can_skb_ext::can_gw_hops of the CAN skb extensions. The module parameter 'max_hops' has been reduced to a single byte to fit can_skb_ext::can_gw_hops as the maximum value to be stored is 6. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-6-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:40 +01:00
Oliver Hartkopp	9f10374bb0	can: remove private CAN skb headroom infrastructure This patch removes struct can_skb_priv which was stored at skb->head and the can_skb_reserve() helper which was used to shift skb->head. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-5-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:40 +01:00
Oliver Hartkopp	5a9229dbb4	can: move ifindex to CAN skb extensions When routing CAN frames over different CAN interfaces the interface index skb->iif is overwritten with every single hop. To prevent sending a CAN frame back to its originating (first) incoming CAN interface another ifindex variable is needed, which was stored in can_skb_priv::ifindex. Move the can_skb_priv::ifindex content to can_skb_ext::can_iif. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-3-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:40 +01:00
Oliver Hartkopp	96ea3a1e2d	can: add CAN skb extension infrastructure To remove the private CAN bus skb headroom infrastructure 8 bytes need to be stored in the skb. The skb extensions are a common pattern and an easy and efficient way to hold private data travelling along with the skb. We only need the skb_ext_add() and skb_ext_find() functions to allocate and access CAN specific content as the skb helpers to copy/clone/free skbs automatically take care of skb extensions and their final removal. This patch introduces the complete CAN skb extensions infrastructure: - add struct can_skb_ext in new file include/net/can.h - add include/net/can.h in MAINTAINERS - add SKB_EXT_CAN to skbuff.c and skbuff.h - select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled - check for existing CAN skb extensions in can_rcv() in af_can.c - add CAN skb extensions allocation at every skb_alloc() location - duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops) - introduce can_skb_ext_add() and can_skb_ext_find() helpers The patch also corrects an indentation issue in the original code from 2018: Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:39 +01:00
Oliver Hartkopp	d4fb6514ff	can: use skb hash instead of private variable in headroom The can_skb_priv::skbcnt variable is used to identify CAN skbs in the RX path, analogous to skb->hash. As the skb hash is not filled in CAN skbs, move the private skbcnt value to skb->hash and set skb->sw_hash accordingly. The skb->hash is a value used for RPS to identify skbs. Use it as intended. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-1-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:39 +01:00
Andrew Fasano	f41c5d1510	netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate() nft_map_catchall_activate() has an inverted element activity check compared to its non-catchall counterpart nft_mapelem_activate() and compared to what is logically required. nft_map_catchall_activate() is called from the abort path to re-activate catchall map elements that were deactivated during a failed transaction. It should skip elements that are already active (they don't need re-activation) and process elements that are inactive (they need to be restored). Instead, the current code does the opposite: it skips inactive elements and processes active ones. Compare the non-catchall activate callback, which is correct: nft_mapelem_activate(): if (nft_set_elem_active(ext, iter->genmask)) return 0; /* skip active, process inactive */ With the buggy catchall version: nft_map_catchall_activate(): if (!nft_set_elem_active(ext, genmask)) continue; /* skip inactive, process active */ The consequence is that when a DELSET operation is aborted, nft_setelem_data_activate() is never called for the catchall element. For NFT_GOTO verdict elements, this means nft_data_hold() is never called to restore the chain->use reference count. Each abort cycle permanently decrements chain->use. Once chain->use reaches zero, DELCHAIN succeeds and frees the chain while catchall verdict elements still reference it, resulting in a use-after-free. This is exploitable for local privilege escalation from an unprivileged user via user namespaces + nftables on distributions that enable CONFIG_USER_NS and CONFIG_NF_TABLES. Fix by removing the negation so the check matches nft_mapelem_activate(): skip active elements, process inactive ones. Fixes: `628bd3e49c` ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Andrew Fasano <andrew.fasano@nist.gov> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-05 08:36:59 +01:00
Gerd Rausch	9d27a0fb12	net/rds: Trigger rds_send_ping() more than once Even though a peer may have already received a non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past, the current peer may not. Therefore it is important to initiate another rds_send_ping() after a re-connect to any peer: It is unknown at that time if we're still talking to the same instance of RDS kernel modules on the other side. Otherwise, the peer may just operate on a single lane ("c_npaths == 0"), not knowing that more lanes are available. However, if "c_with_sport_idx" is supported, we also need to check that the connection we accepted on lane#0 meets the proper source port modulo requirement, as we fan out: Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX" is asynchronous, initially we have no choice but to accept an incoming connection (via "accept") in the first slot ("cp_index == 0") for backwards compatibility. But that very connection may have come from a different lane with "cp_index != 0", since the peer thought that we already understood and handled "c_with_sport_idx" properly, as indicated by a previous exchange before a module was reloaded. In short: If a module gets reloaded, we recover from that, but do not allow a downgrade to support fewer lanes. Downgrades would require us to merge messages from separate lanes, which is rather tricky with the current RDS design. Each lane has its own sequence number space and all messages would need to be re-sequenced as we merge, all while handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-9-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:39 -08:00
Gerd Rausch	a1f53d5fb6	net/rds: Use the first lane until RDS_EXTHDR_NPATHS arrives Instead of just blocking the sender until "c_npaths" is known (it gets updated upon the receipt of a MPRDS PONG message), simply use the first lane (cp_index#0). But just using the first lane isn't enough. As soon as we enqueue messages on a different lane, we'd run the risk of out-of-order delivery of RDS messages. Earlier messages enqueued on "cp_index == 0" could be delivered later than more recent messages enqueued on "cp_index > 0", mostly because of possible head of line blocking issues causing the first lane to be slower. To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the time we're about to fan-out to more lanes. Then we delay the transmission of messages enqueued on other lanes with "cp_index > 0" until cp_index#0 caught up with the delivery of new messages (from "cp_send_queue") as well as in-flight messages (from "cp_retrans") that haven't been acknowledged yet by the receiver. We also add a new counter "mprds_catchup_tx0_retries" to keep track of how many times "rds_send_xmit" had to suspend activities, because it was waiting for the first lane to catch up. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-8-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:39 -08:00
Allison Henderson	9d30ad8a8b	net/rds: Update struct rds_statistics to use u64 instead of uint64_t Quick clean up to avoid checkpatch errors when adding members to this struct (Prefer kernel type 'u64' over 'uint64_t'). No functional changes added. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-7-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Håkon Bugge	b89fc7c252	net/rds: Clear reconnect pending bit When canceling the reconnect worker, care must be taken to reset the reconnect-pending bit. If the reconnect worker has not yet been scheduled before it is canceled, the reconnect-pending bit will stay on forever. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Gerd Rausch	aa0cd656f0	net/rds: Kick-start TCP receiver after accept In cases where the server (the node with the higher IP-address) in an RDS/TCP connection is overwhelmed it is possible that the socket that was just accepted is chock-full of messages, up to the limit of what the socket receive buffer permits. Subsequently, "rds_tcp_data_ready" won't be called anymore, because there is no more space to receive additional messages. Nor was it called prior to the point of calling "rds_tcp_set_callbacks", because the "sk_data_ready" pointer didn't even point to "rds_tcp_data_ready" yet. We fix this by simply kick-starting the receive-worker for all cases where the socket state is neither "TCP_CLOSE_WAIT" nor "TCP_CLOSE". Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-5-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Gerd Rausch	826c1004d4	net/rds: rds_tcp_conn_path_shutdown must not discard messages RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, rds_tcp_write_space() goes on to discard prior messages from the send queue. Which is fine, for as long as the receiver never throws any messages away. The dequeuing of messages in RDS/TCP is done either from the "sk_data_ready" callback pointing to rds_tcp_data_ready() (the most common case), or from the receive worker pointing to rds_tcp_recv_path() which is called for as long as the connection is "RDS_CONN_UP". However, as soon as rds_conn_path_drop() is called for whatever reason, including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR", and rds_tcp_restore_callbacks() ends up restoring the callbacks and thereby disabling message receipt. So messages already acknowledged to the sender were dropped. Furthermore, the "->shutdown" callback was always called with an invalid parameter ("RCV_SHUTDOWN | SEND_SHUTDOWN == 3"), instead of the correct pre-increment value ("SHUT_RDWR == 2"). inet_shutdown() returns "-EINVAL" in such cases, rendering this call a NOOP. So we change rds_tcp_conn_path_shutdown() to do the proper "->shutdown(SHUT_WR)" call in order to signal EOF to the peer and make it transition to "TCP_CLOSE_WAIT" (RFC 793). This should make the peer also enter rds_tcp_conn_path_shutdown() and do the same. This allows us to dequeue all messages already received and acknowledged to the peer. We do so, until we know that the receive queue no longer has data (skb_queue_empty()) and that we couldn't have any data in flight anymore, because the socket transitioned to any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT", "LAST_ACK", or "CLOSE" (RFC 793). However, if we do just that, we suddenly see duplicate RDS messages being delivered to the application. So what gives? 
Turns out that with MPRDS and its multitude of backend connections, retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace the dequeuing of their original counterparts. And the duplicate check implemented in rds_recv_local() only discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set. Rather curious, because a duplicate is a duplicate; it shouldn't matter which copy is looked at and delivered first. To avoid this entire situation, we simply make the sender discard messages from the send-queue right from within rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would have done, were it called in time or still called. This makes sure that we no longer have messages that we know the receiver already dequeued sitting in our send-queue, and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco. Now we got rid of the duplicate RDS message delivery, but we still run into cases where RDS messages are dropped. This time it is due to the delayed setting of the socket-callbacks in rds_tcp_accept_one() via either rds_tcp_reset_callbacks() or rds_tcp_set_callbacks(). By the time rds_tcp_accept_one() gets there, the socket may already have transitioned into state "TCP_CLOSE_WAIT", but rds_tcp_state_change() was never called. Subsequently, "->shutdown(SHUT_WR)" did not happen either. So the peer ends up getting stuck in state "TCP_FIN_WAIT2". We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK", or "TCP_CLOSE" and drop the freshly accepted socket in that case. This problem is observable by running "rds-stress --reset" frequently on either of the two sides of a RDS connection, or both while other "rds-stress" processes are exchanging data. Those "rds-stress" processes reported out-of-sequence errors, with the expected sequence number being smaller than the one actually received (due to the dropped messages). 
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-4-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Gerd Rausch	a20a699255	net/rds: Encode cp_index in TCP source port Upon "sendmsg", RDS/TCP selects a backend connection based on a hash calculated from the source-port ("RDS_MPATH_HASH"). However, "rds_tcp_accept_one" accepts connections in the order they arrive, which is non-deterministic. Therefore the mapping of the sender's "cp->cp_index" to that of the receiver changes if the backend connections are dropped and reconnected. However, connection state that's preserved across reconnects (e.g. "cp_next_rx_seq") relies on that sender<->receiver mapping to never change. So we make sure that client and server of the TCP connection have the exact same "cp->cp_index" across reconnects by encoding "cp->cp_index" in the lower three bits of the client's TCP source port. A new extension "RDS_EXTHDR_SPORT_IDX" is introduced, that allows the server to tell the difference between clients that do the "cp->cp_index" encoding, and legacy clients that pick source ports randomly. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Shamir Rabinovitch	46f257ee69	net/rds: new extension header: rdma bytes Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for an RDMA initiator to exchange RDMA byte counts with its target. Currently, RDMA operations cannot precisely account for how many bytes a peer just transferred via RDMA, which limits per-connection statistics and future policy (e.g., monitoring or rate/cgroup accounting of RDMA traffic). In this patch we expand rds_message_add_extension to accept multiple extensions, and add a new flag to the RDS header: RDS_FLAG_EXTHDR_EXTENSION, along with a new extension to the RDS header: rds_ext_header_rdma_bytes. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Eric Dumazet	acd21dd2da	net_sched: sch_fq: tweak unlikely() hints in fq_dequeue() After `076433bd78` ("net_sched: sch_fq: add fast path for mostly idle qdisc") we need to remove one unlikely() because q->internal holds all the fast path packets. skb = fq_peek(&q->internal); if (unlikely(skb)) { q->internal.qlen--; Calling INET_ECN_set_ce() is very unlikely. These changes allow fq_dequeue_skb() to be (auto)inlined, thus making fq_dequeue() faster. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 2/2 grow/shrink: 0/1 up/down: 283/-269 (14) Function old new delta INET_ECN_set_ce - 267 +267 __pfx_INET_ECN_set_ce - 16 +16 __pfx_fq_dequeue_skb 16 - -16 fq_dequeue_skb 103 - -103 fq_dequeue 1685 1535 -150 Total: Before=24886569, After=24886583, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260203214716.880853-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:43:40 -08:00
Randy Dunlap	a34b0e4e21	net/iucv: clean up iucv kernel-doc warnings Fix numerous (many) kernel-doc warnings in iucv.[ch]: - convert function documentation comments to a common (kernel-doc) look, even for static functions (without "/*") - use matching parameter and parameter description names - use better wording in function descriptions (Jakub & AI) - remove duplicate kernel-doc comments from the header file (Jakub) Examples: Warning: include/net/iucv/iucv.h:210 missing initial short description on line: iucv_unregister Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not described in 'iucv_unregister' Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not described in 'iucv_message_send2way' Warning: net/iucv/iucv.c:727 missing initial short description on line: * iucv_cleanup_queue Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20260203075248.1177869-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:39:58 -08:00
Eric Dumazet	309dd99421	tcp: split tcp_check_space() in two parts tcp_check_space() is fat and not inlined. Move its slow path in (out of line) __tcp_check_space() and make tcp_check_space() an inline function for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126) Function old new delta __tcp_check_space - 521 +521 tcp_rcv_established 1860 1916 +56 tcp_rcv_state_process 3342 3384 +42 tcp_event_new_data_sent 248 286 +38 tcp_data_snd_check 71 106 +35 __pfx___tcp_check_space - 16 +16 __pfx_tcp_check_space 16 - -16 tcp_check_space 566 - -566 Total: Before=24896373, After=24896499, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:37:06 -08:00
Eric Dumazet	7c1db78ff7	tcp: move tcp_rbtree_insert() to tcp_output.c tcp_rbtree_insert() is primarily used from tcp_output.c In tcp_input.c, only (slow path) tcp_collapse() uses it. Move it to tcp_output.c to allow its (auto)inlining to improve TCP tx fast path. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 4/1 up/down: 445/-115 (330) Function old new delta tcp_connect 4277 4478 +201 tcp_event_new_data_sent 162 248 +86 tcp_send_synack 780 862 +82 tcp_fragment 1185 1261 +76 tcp_collapse 1524 1409 -115 Total: Before=24896043, After=24896373, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203045110.3499713-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:36:50 -08:00
Eric Dumazet	59b5e7f47c	tcp: use __skb_push() in __tcp_transmit_skb() We trust MAX_TCP_HEADER to be large enough. Using the inlined version of skb_push() trades 8 bytes of text for better performance of TCP TX fast path. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8) Function old new delta __tcp_transmit_skb 3181 3189 +8 Total: Before=24896035, After=24896043, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203044226.3489941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:36:27 -08:00
Jakub Kicinski	333225e1e9	Some more changes, including pulls from drivers: - ath drivers: small features/cleanups - rtw drivers: mostly refactoring for rtw89 RTL8922DE support - mac80211: use hrtimers for CAC to avoid too long delays - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support Merge tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Some more changes, including pulls from drivers: - ath drivers: small features/cleanups - rtw drivers: mostly refactoring for rtw89 RTL8922DE support - mac80211: use hrtimers for CAC to avoid too long delays - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support * tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits) wifi: brcmsmac: phy: Remove unreachable error handling code wifi: mac80211: Add eMLSR/eMLMR action frame parsing support wifi: mac80211: add initial UHR support wifi: cfg80211: add initial UHR support wifi: ieee80211: add some initial UHR definitions wifi: mac80211: use wiphy_hrtimer_work for CAC timeout wifi: mac80211: correct 
ieee80211-{s1g/eht}.h include guard comments wifi: ath12k: clear stale link mapping of ahvif->links_map wifi: ath12k: Add support TX hardware queue stats wifi: ath12k: Add support RX PDEV stats wifi: ath12k: Fix index decrement when array_len is zero wifi: ath12k: support OBSS PD configuration for AP mode wifi: ath12k: add WMI support for spatial reuse parameter configuration dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property wifi: ath11k: add usecase firmware handling based on device compatible wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump() wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg() wifi: ath10k: snoc: support powering on the device via pwrseq wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE wifi: rtw89: pci: restore LDO setting after device resume ... ==================== Link: https://patch.msgid.link/20260204121143.181112-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:31:05 -08:00
David Laight	b582090005	mptcp: Change some dubious min_t(int, ...) to min() There are two: min_t(int, xxx, mptcp_wnd_end(msk) - msk->snd_nxt); Both mptcp_wnd_end(msk) and msk->snd_nxt are u64, their difference (aka the window size) might be limited to 32 bits - but that isn't knowable from this code. So checks being added to min_t() detect the potential discard of significant bits. Provided the 'avail_size' and return of mptcp_check_allowed_size() are changed to an unsigned type (size_t matches the type the caller uses) both min_t() can be changed to min(). Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> [ wrapped too long lines when declaring mptcp_check_allowed_size() ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-6-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Matthieu Baerts (NGI0)	d7e712b66f	mptcp: pm: align endpoint flags size with the NL specs The MPTCP Netlink specs describe the 'flags' as a u32 type. Internally, a u8 type was used. Using a u8 is currently fine, because only the first 5 bits are used. But there is also no reason to deviate from the specs and stick to a u8, especially because there is a hole of 3 bytes after it in both the mptcp_pm_local and mptcp_pm_addr_entry structures. Also, setting it to a u32 will allow future flags, just in case. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-5-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	2002286e68	trace: mptcp: add mptcp_rcvbuf_grow tracepoint Similar to tcp, provide a new tracepoint to better understand mptcp_rcv_space_adjust() behavior, which presents many artifacts. Note that the used format string is so long that I preferred wrap it, contrary to guidance for quoted strings. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-4-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	5c4dcc52c6	mptcp: consolidate rcv space init MPTCP uses several calls of the mptcp_rcv_space_init() helper to initialize the receive space, with a catch-up call in mptcp_rcv_space_adjust(). Drop all the other strictly not needed invocations and move constant fields initialization at socket init/reset time. This removes a bit of complexity from mptcp DRS code. No functional changes intended. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-3-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	70274765fe	mptcp: fix receive space timestamp initialization MPTCP initializes the receive buffer stamp in mptcp_rcv_space_init(), using the provided subflow stamp. This helper is invoked in several places; for passive sockets, space init happens at clone time. In that scenario, MPTCP ends up accessing the subflow stamp before its initialization, leading to essentially random timing for the first receive buffer auto-tune event, as the timestamp for a newly created subflow is not refreshed there. Fix the issue by moving the stamp initialization out of the mentioned helper, to the data transfer start, and always using a fresh timestamp. Fixes: `013e3179db` ("mptcp: fix rcv space initialization") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-2-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	6b32939350	mptcp: do not account for OoO in mptcp_rcvbuf_grow() MPTCP-level OoOs are physiological when multiple subflows are active concurrently and will not cause retransmissions nor are caused by drops. Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf slowly drifting towards tcp_rmem[2]. Remove such accounting. Note that subflows will still account for TCP-level OoO when the MPTCP-level rcvbuf is propagated. This also closes a subtle and very unlikely race condition with rcvspace init; active sockets with user-space holding the msk-level socket lock, could complete such initialization in the receive callback, after that the first OoO data reaches the rcvbuf and potentially triggering a divide by zero Oops. Fixes: `e118cdc34d` ("mptcp: rcvbuf auto-tuning improvement") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-1-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:08 -08:00
Arnd Bergmann	e25dbf561e	vmw_vsock: bypass false-positive Wnonnull warning with gcc-16 The gcc-16.0.1 snapshot produces a false-positive warning that turns into a build failure with CONFIG_WERROR: In file included from arch/x86/include/asm/string.h:6, from net/vmw_vsock/vmci_transport.c:10: In function 'vmci_transport_packet_init', inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2: arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull] 150 \| #define memcpy(t, f, n) __builtin_memcpy(t, f, n) \| ^~~~~~~~~~~~~~~~~~~~~~~~~ net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 \| memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); \| ^~~~~~ arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy' net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 \| memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); \| ^~~~~~ This seems relatively harmless, and it is so far the only instance of this warning I have found. The __vmci_transport_send_control_pkt function is called either with wait=NULL or with one of the type values that pass 'wait' into memcpy() here, but not from the same caller. Replacing the memcpy with a struct assignment is otherwise the same but avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:40:31 -08:00
Kuniyuki Iwashima	c26b098bf4	bpf: Don't check sk_fullsock() in bpf_skc_to_unix_sock(). AF_UNIX does not use TCP_NEW_SYN_RECV nor TCP_TIME_WAIT and checking sk->sk_family is sufficient. Let's remove sk_fullsock() and use sk_is_unix() in bpf_skc_to_unix_sock(). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260203213442.682838-3-kuniyu@google.com	2026-02-04 09:36:06 -08:00
Paolo Abeni	5c2c3c38be	net: gro: fix outer network offset The udp GRO complete stage assumes that all the packets inserted the RX have the `encapsulation` flag zeroed. Such assumption is not true, as a few H/W NICs can set such flag when H/W offloading the checksum for an UDP encapsulated traffic, the tun driver can inject GSO packets with UDP encapsulation and the problematic layout can also be created via a veth based setup. Due to the above, in the problematic scenarios, udp4_gro_complete() uses the wrong network offset (inner instead of outer) to compute the outer UDP header pseudo checksum, leading to csum validation errors later on in packet processing. Address the issue always clearing the encapsulation flag at GRO completion time. Such flag will be set again as needed for encapsulated packets by udp_gro_complete(). Fixes: `5ef31ea5d0` ("net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/562638dbebb3b15424220e26a180274b387e2a88.1770032084.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-03 19:23:41 -08:00
Eric Dumazet	f613e8b4af	net: add proper RCU protection to /proc/net/ptype Yin Fengwei reported an RCU stall in ptype_seq_show() and provided a patch. Real issue is that ptype_seq_next() and ptype_seq_show() violate RCU rules. ptype_seq_show() runs under rcu_read_lock(), and reads pt->dev to get device name without any barrier. At the same time, concurrent writers can remove a packet_type structure (which is correctly freed after an RCU grace period) and clear pt->dev without an RCU grace period. Define ptype_iter_state to carry a dev pointer along seq_net_private: struct ptype_iter_state { struct seq_net_private p; struct net_device *dev; // added in this patch }; We need to record the device pointer in ptype_get_idx() and ptype_seq_next() so that ptype_seq_show() is safe against concurrent pt->dev changes. We also need to add full RCU protection in ptype_seq_next(). (Missing READ_ONCE() when reading list.next values) Many thanks to Dong Chenchen for providing a repro. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Fixes: `1d10f8a1f4` ("net-procfs: show net devices bound packet types") Fixes: `c353e8983e` ("net: introduce per netns packet chains") Reported-by: Yin Fengwei <fengwei_yin@linux.alibaba.com> Reported-by: Dong Chenchen <dongchenchen2@huawei.com> Closes: https://lore.kernel.org/netdev/CANn89iKRRKPnWjJmb-_3a=sq+9h6DvTQM4DBZHT5ZRGPMzQaiA@mail.gmail.com/T/#m7b80b9fc9b9267f90e0b7aad557595f686f9c50d Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Tested-by: Yin Fengwei <fengwei_yin@linux.alibaba.com> Link: https://patch.msgid.link/20260202205217.2881198-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-03 19:20:30 -08:00
David Corvaglia	6dfa3df797	net: bridge: use sysfs_emit instead of sprintf Replace sprintf with sysfs_emit in sysfs show() methods as outlined in Documentation/filesystems/sysfs.rst. sysfs_emit is preferred to sprintf in sysfs show() methods as it is safer with buffer handling. Signed-off-by: David Corvaglia <david@corvaglia.dev> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0100019c1fc2bcc3-bc9ca2f1-22d7-4250-8441-91e4af57117b-000000@email.amazonses.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-03 19:19:42 -08:00
Oleg Nesterov	f3951e93d4	netclassid: use thread_group_leader(p) in update_classid_task() Cleanup and preparation to simplify planned future changes. Link: https://lkml.kernel.org/r/aXY_4NSP094-Cf-2@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Christian König <christian.koenig@amd.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-03 08:21:26 -08:00
Frederic Weisbecker	662ff1aac8	net: Keep ignoring isolated cpuset change RPS cpumask can be overridden through sysfs/sysctl. The boot-defined isolated CPUs are then excluded from that cpumask. However, HK_TYPE_DOMAIN will soon integrate cpuset isolated CPUs updates, and the RPS infrastructure needs more thought to be able to propagate such changes and synchronize against them. Keep handling only what was passed through "isolcpus=" for now. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: netdev@vger.kernel.org	2026-02-03 15:23:33 +01:00
Chia-Yu Chang	8ae3e8e6ce	tcp: accecn: enable AccECN Enable Accurate ECN negotiation and request for incoming and outgoing connection by setting sysctl_tcp_ecn: +==============+===========================================+ \| \| Highest ECN variant (Accurate ECN, ECN, \| \| tcp_ecn \| or no ECN) to be negotiated & requested \| \| +---------------------+---------------------+ \| \| Incoming connection \| Outgoing connection \| +==============+=====================+=====================+ \| 0 \| No ECN \| No ECN \| \| 1 \| ECN \| ECN \| \| 2 \| ECN \| No ECN \| +--------------+---------------------+---------------------+ \| 3 \| Accurate ECN \| Accurate ECN \| \| 4 \| Accurate ECN \| ECN \| \| 5 \| Accurate ECN \| No ECN \| +==============+=====================+=====================+ Refer Documentation/networking/ip-sysctl.rst for more details. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-15-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	4fa4ac5e58	tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info Add 2-bit tcpi_ecn_mode field within tcp_info to indicate which ECN mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN, or ECN_MODE_PENDING. This is done by utilizing available bits from tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits). Also, an extra 24-bit tcpi_options2 field is identified to represent newer options and connection features, as all 8 bits of tcpi_options field have been used. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	1247fb19ca	tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST Detect spurious retransmission of a previously sent ACK carrying the AccECN option after the second retransmission. Since this might be caused by the middlebox dropping ACK with options it does not recognize, disable the sending of the AccECN option in all subsequent ACKs. This patch follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field (accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was sent with duplicate SACK info. Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl: (TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and persistently sends AccECN option once it fits into TCP option space. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	4024081feb	tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation Based on specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in AccECN mode MUST NOT set ECT on any packet for the rest of the connection, if it has received or sent at least one valid SYN or Acceptable SYN/ACK with (AE,CWR,ECE) = (0,0,0) during the handshake. In addition, a host in AccECN mode that is feeding back the IP-ECN field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the latest valid SYN or acceptable SYN/ACK to arrive. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-11-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	f326f1f17f	tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK For Accurate ECN, the first SYN/ACK sent by the TCP server shall set the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the capability negotiation. However, if the TCP server needs to retransmit such a SYN/ACK (for example, because it did not receive an ACK acknowledging its SYN/ACK, or received a second SYN requesting AccECN support), the TCP server retransmits the SYN/ACK without the AccECN option. This is because the SYN/ACK may be lost due to congestion, or a middlebox may block the AccECN option. Furthermore, if this retransmission also times out, to expedite connection establishment, the TCP server should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the AccECN option, while maintaining AccECN feedback mode. This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	f1eaea5585	tcp: add TCP_SYNACK_RETRANS synack_type Before this patch, retransmitted SYN/ACK did not have a specific synack_type; however, the upcoming patch needs to distinguish between retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this patch introduces a new synack_type (TCP_SYNACK_RETRANS). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	3ae62b8b4a	tcp: accecn: retransmit downgraded SYN in AccECN negotiation Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it SHOULD attempt to negotiate the use of AccECN at least one more time by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on the first retransmitted SYN (using the usual retransmission time-outs). If this first retransmission also fails to be acknowledged, in deployment scenarios where AccECN path traversal might be problematic, the TCP Client SHOULD send subsequent retransmissions of the SYN with the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-8-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	e68c28f22f	tcp: disable RFC3168 fallback identifier for CC modules When AccECN is not successfully negotiated for a TCP flow, it falls back by default to classic ECN (RFC3168). However, an L4S service will fall back to non-ECN. This patch enables a congestion control module to control whether it should not fall back to classic ECN after unsuccessful AccECN negotiation. A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this behavior expected by the CA. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	100f946b8d	tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers Two flags for the congestion control (CC) module are added in this patch, both related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN) defines that the CC expects to negotiate AccECN functionality using the ECE, CWR and AE flags in the TCP header. Second, during ECN negotiation, ECT(0) in the IP header is used. This patch enables CC to control whether ECT(0) or ECT(1) should be used on a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the expected ECT value in the IP header by the CA when not-yet initialized for the connection. The detailed AccECN negotiation can be found in IETF RFC9768. Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Ilpo Järvinen	ab4c8b6f7f	gro: flushing when CWR is set negatively affects AccECN As AccECN may keep CWR bit asserted due to different interpretation of the bit, flushing with GRO because of CWR may effectively disable GRO until AccECN counter field changes such that CWR-bit becomes 0. There is no harm done from not immediately forwarding the CWR'ed segment with RFC3168 ECN. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Ilpo Järvinen	7885ce0147	tcp: try to avoid safer when ACKs are thinned Add newly acked pkts EWMA. When ACK thinning occurs, select between safer and unsafe cep delta in AccECN processing based on it. If the number of packets ACKed per ACK tends to be large, don't conservatively assume ACE field overflow. This patch uses the existing 2-byte holes in the rx group for new u16 variables without creating more holes. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; /* 2744 12 */ /* XXX 4 bytes hole, try to pack */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */ [...] /* size: 3264, cachelines: 51, members: 177 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; /* 2744 12 */ u16 pkts_acked_ewma; /* 2756 2 */ /* XXX 2 bytes hole, try to pack */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */ [...] /* size: 3264, cachelines: 51, members: 178 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
David Yang	8cdb2cc9a1	net: dsa: tag_yt921x: add priority support Required by DCB/QoS support of the switch driver, since the rx packets will have non-zero priorities. Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260131021854.3405036-3-mmyangfl@gmail.com Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:09:31 +01:00
David Yang	a63daf73a5	net: dsa: tag_yt921x: clarify priority and code fields Packet priority is part of the tag, and the priority and code fields are used by tx and rx. Make revisions to reflect the facts. Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260131021854.3405036-2-mmyangfl@gmail.com Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:09:31 +01:00
Cosmin Ratiu	29903edf04	devlink: Refactor devlink_rate_nodes_check devlink_rate_nodes_check() was used to verify there are no devlink rate nodes created when switching the esw mode. Rate management code is about to become more complex, so refactor this function: - remove unused param 'mode'. - add a new 'rate_filter' param. - rename to devlink_rates_check(). - expose devlink_rate_is_node() to be used as a rate filter. This makes it more usable from multiple places, so use it from those places as well. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260128112544.1661250-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 20:05:51 -08:00
Cosmin Ratiu	0061b5199d	devlink: Reverse locking order for nested instances Commit [1] defined the locking expectations for nested devlink instances: the nested-in devlink instance lock needs to be acquired before the nested devlink instance lock. The code handling devlink rels was architected with that assumption in mind. There are no actual users of double locking yet but that is about to change in the upcoming patches in the series. Code operating on nested devlink instances will require also obtaining the nested-in instance lock, but such code may already be called from a variety of places with the nested devlink instance lock. Then, there's no way to acquire the nested-in lock other than making sure that all callers acquire it first. Reversing the nested lock order allows incrementally acquiring the nested-in instance lock when needed (perhaps even a chain of locks up to the root) without affecting any caller. The only affected use of nesting is devlink_nl_nested_fill(), which iterates over nested devlink instances with the RCU lock, without locking them, so there's no possibility of deadlock. So this commit just updates a comment regarding the nested locks. [1] commit `c137743bce` ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260128112544.1661250-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 20:05:51 -08:00
Geliang Tang	22f3bd9bf8	mptcp: implement .splice_read This patch implements .splice_read interface of mptcp struct proto_ops as mptcp_splice_read() with reference to tcp_splice_read(). Corresponding to __tcp_splice_read(), __mptcp_splice_read() is defined, invoking mptcp_read_sock() instead of tcp_read_sock(). mptcp_splice_read() is almost the same as tcp_splice_read(), except for sock_rps_record_flow(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-4-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:32 -08:00
Geliang Tang	2d85088d46	tcp: export tcp_splice_state Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h so that they can be used by MPTCP in the next patch. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:32 -08:00
Geliang Tang	250d9766a9	mptcp: implement .read_sock Current in-kernel TCP sockets -- i.e. from nvme_tcp_try_recv() -- need to call .read_sock interface of struct proto_ops, but it's not implemented in MPTCP. This patch implements it with reference to __tcp_read_sock() and __mptcp_recvmsg_mskq(). Corresponding to tcp_recv_skb(), a new helper for MPTCP named mptcp_recv_skb() is added to peek a skb from sk->sk_receive_queue. Compared with __mptcp_recvmsg_mskq(), mptcp_read_sock() uses sk->sk_rcvbuf as the max read length. The LISTEN status is checked before the while loop, and mptcp_recv_skb() and mptcp_cleanup_rbuf() are invoked after the loop. In the loop, all flags checks for __mptcp_recvmsg_mskq() are removed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-2-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:31 -08:00
Geliang Tang	436510df0c	mptcp: add eat_recv_skb helper This patch extracts the free skb related code in __mptcp_recvmsg_mskq() into a new helper mptcp_eat_recv_skb(). This new helper will be used in the next patch. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-1-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:31 -08:00
Eric Dumazet	b409a7f717	ipv6: colocate inet6_cork in inet_cork_full All inet6_cork users also use one inet_cork_full. Reduce number of parameters and increase data locality. This saves ~275 bytes of code on x86_64. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:30 -08:00
Eric Dumazet	fe8570186f	ipv4: use dst4_mtu() instead of dst_mtu() When we expect an IPv4 dst, use dst4_mtu() instead of dst_mtu() to save some code space. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	b40f0130a2	ipv6: use dst6_mtu() instead of dst_mtu() When we expect an IPv6 dst, use dst6_mtu() instead of dst_mtu() to save some code space. Due to current dst6_mtu() implementation, only convert users in IPv6 stack. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	94a150bf14	ipv6: use SKB_DROP_REASON_PKT_TOO_BIG in ip6_xmit() When a too big packet is dropped, use SKB_DROP_REASON_PKT_TOO_BIG. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	b5b1b676a3	ipv6: use __skb_push() in ip6_xmit() ip6_xmit() makes sure there is enough headroom in the skb, so it can use __skb_push() instead of the out-of-line skb_push(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	2855e49254	ipv6: add some unlikely()/likely() clauses in ip6_output.c 1) daddr is unlikely to be a multicast address in ip6_finish_output2(). 2) ip6_finish_output_gso_slowpath_drop() should not be called often. 3) ip6_fragment() should not be called often. 4) opt is unlikely to be set. 5) ip6_xmit() and ip6_forward() mostly send packets that are not too big. 6) Most __ip6_make_skb() calls are for UDP packets, not ICMPV6 ones. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	1bc46dd209	ipv6: pass proto by value to ipv6_push_nfrag_opts() and ipv6_push_frag_opts() With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing a pointer to an automatic variable. Change these exported functions to return 'u8 proto' instead of void. - ipv6_push_nfrag_opts() - ipv6_push_frag_opts() For instance, replace ipv6_push_frag_opts(skb, opt, &proto); with: proto = ipv6_push_frag_opts(skb, opt, proto); Note that even after this change, ip6_xmit() has to use a stack canary because of @first_hop variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:28 -08:00
Daniel Hodges	74d9391e88	tipc: use kfree_sensitive() for session key material The rx->skey field contains a struct tipc_aead_key with GCM-AES encryption keys used for TIPC cluster communication. Using plain kfree() leaves this sensitive key material in freed memory pages where it could potentially be recovered. Switch to kfree_sensitive() to ensure the key material is zeroed before the memory is freed. Fixes: `1ef6f7c939` ("tipc: add automatic session key exchange") Signed-off-by: Daniel Hodges <hodgesd@meta.com> Link: https://patch.msgid.link/20260131180114.2121438-1-hodgesd@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:46:51 -08:00
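
The zero-before-free idea behind the kfree_sensitive() change in the tipc commit above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `demo_key`, `secure_zero()` and `free_sensitive()` are hypothetical names invented here, not kernel APIs, and the struct is a stand-in rather than the real struct tipc_aead_key.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a key container (not struct tipc_aead_key). */
struct demo_key {
	size_t keylen;
	unsigned char key[32];
};

/*
 * Hypothetical helper: clear n bytes through a volatile pointer so the
 * compiler cannot drop the stores as dead writes before free().
 */
static void secure_zero(void *p, size_t n)
{
	volatile unsigned char *vp = p;

	while (n--)
		*vp++ = 0;
}

/*
 * Hypothetical userspace analogue of the wipe-then-free pattern:
 * the key bytes are overwritten before the memory goes back to the
 * allocator. A NULL pointer is accepted and ignored.
 */
static void free_sensitive(struct demo_key *k)
{
	if (!k)
		return;
	secure_zero(k, sizeof(*k));
	free(k);
}
```

A plain free() would hand the buffer back with the key bytes still in place, which is the exposure the commit message describes; wiping first narrows that window.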

83428 Commits