linux

Commit Graph

Author	SHA1	Message	Date
Qi Tang	24dd586bb4	net/smc: fix double-free of smc_spd_priv when tee() duplicates splice pipe buffer smc_rx_splice() allocates one smc_spd_priv per pipe_buffer and stores the pointer in pipe_buffer.private. The pipe_buf_operations for these buffers used .get = generic_pipe_buf_get, which only increments the page reference count when tee(2) duplicates a pipe buffer. The smc_spd_priv pointer itself was not handled, so after tee() both the original and the cloned pipe_buffer share the same smc_spd_priv *. When both pipes are subsequently released, smc_rx_pipe_buf_release() is called twice against the same object: 1st call: kfree(priv) sock_put(sk) smc_rx_update_cons() [correct] 2nd call: kfree(priv) sock_put(sk) smc_rx_update_cons() [UAF] KASAN reports a slab-use-after-free in smc_rx_pipe_buf_release(), which then escalates to a NULL-pointer dereference and kernel panic via smc_rx_update_consumer() when it chases the freed priv->smc pointer: BUG: KASAN: slab-use-after-free in smc_rx_pipe_buf_release+0x78/0x2a0 Read of size 8 at addr ffff888004a45740 by task smc_splice_tee_/74 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 print_report+0xce/0x650 kasan_report+0xc6/0x100 smc_rx_pipe_buf_release+0x78/0x2a0 free_pipe_info+0xd4/0x130 pipe_release+0x142/0x160 __fput+0x1c6/0x490 __x64_sys_close+0x4f/0x90 do_syscall_64+0xa6/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> BUG: kernel NULL pointer dereference, address: 0000000000000020 RIP: 0010:smc_rx_update_consumer+0x8d/0x350 Call Trace: <TASK> smc_rx_pipe_buf_release+0x121/0x2a0 free_pipe_info+0xd4/0x130 pipe_release+0x142/0x160 __fput+0x1c6/0x490 __x64_sys_close+0x4f/0x90 do_syscall_64+0xa6/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Kernel panic - not syncing: Fatal exception Beyond the memory-safety problem, duplicating an SMC splice buffer is semantically questionable: smc_rx_update_cons() would advance the consumer cursor twice for the same data, corrupting receive-window accounting. A refcount on smc_spd_priv could fix the double-free, but the cursor-accounting issue would still need to be addressed separately. The .get callback is invoked by both tee(2) and splice_pipe_to_pipe() for partial transfers; both will now return -EFAULT. Users who need to duplicate SMC socket data must use a copy-based read path. Fixes: `9014db202c` ("smc: add support for splice()") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Link: https://patch.msgid.link/20260318064847.23341-1-tpluszz77@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-20 18:59:30 -07:00
Jiayuan Chen	6d5e453836	net/smc: fix NULL dereference and UAF in smc_tcp_syn_recv_sock() Syzkaller reported a panic in smc_tcp_syn_recv_sock() [1]. smc_tcp_syn_recv_sock() is called in the TCP receive path (softirq) via icsk_af_ops->syn_recv_sock on the clcsock (TCP listening socket). It reads sk_user_data to get the smc_sock pointer. However, when the SMC listen socket is being closed concurrently, smc_close_active() sets clcsock->sk_user_data to NULL under sk_callback_lock, and then the smc_sock itself can be freed via sock_put() in smc_release(). This leads to two issues: 1) NULL pointer dereference: sk_user_data is NULL when accessed. 2) Use-after-free: sk_user_data is read as non-NULL, but the smc_sock is freed before its fields (e.g., queued_smc_hs, ori_af_ops) are accessed. The race window looks like this (the syzkaller crash [1] triggers via the SYN cookie path: tcp_get_cookie_sock() -> smc_tcp_syn_recv_sock(), but the normal tcp_check_req() path has the same race): CPU A (softirq) CPU B (process ctx) tcp_v4_rcv() TCP_NEW_SYN_RECV: sk = req->rsk_listener sock_hold(sk) /* No lock on listener */ smc_close_active(): write_lock_bh(cb_lock) sk_user_data = NULL write_unlock_bh(cb_lock) ... smc_clcsock_release() sock_put(smc->sk) x2 -> smc_sock freed! tcp_check_req() smc_tcp_syn_recv_sock(): smc = user_data(sk) -> NULL or dangling smc->queued_smc_hs -> crash! Note that the clcsock and smc_sock are two independent objects with separate refcounts. TCP stack holds a reference on the clcsock, which keeps it alive, but this does NOT prevent the smc_sock from being freed. Fix this by using RCU and refcount_inc_not_zero() to safely access smc_sock. Since smc_tcp_syn_recv_sock() is called in the TCP three-way handshake path, taking read_lock_bh on sk_callback_lock is too heavy and would not survive a SYN flood attack. Using rcu_read_lock() is much more lightweight. - Set SOCK_RCU_FREE on the SMC listen socket so that smc_sock freeing is deferred until after the RCU grace period. This guarantees the memory is still valid when accessed inside rcu_read_lock(). - Use rcu_read_lock() to protect reading sk_user_data. - Use refcount_inc_not_zero(&smc->sk.sk_refcnt) to pin the smc_sock. If the refcount has already reached zero (close path completed), it returns false and we bail out safely. Note: smc_hs_congested() has a similar lockless read of sk_user_data without rcu_read_lock(), but it only checks for NULL and accesses the global smc_hs_wq, never dereferencing any smc_sock field, so it is not affected. Reproducer was verified with mdelay injection and smc_run, the issue no longer occurs with this patch applied. [1] https://syzkaller.appspot.com/bug?extid=827ae2bfb3a3529333e9 Fixes: `8270d9c210` ("net/smc: Limit backlog connections") Reported-by: syzbot+827ae2bfb3a3529333e9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67eaf9b8.050a0220.3c3d88.004a.GAE@google.com/T/ Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260312092909.48325-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-16 19:31:28 -07:00
Linus Torvalds	b9c8fc2cae	Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: bnxt_en: fix deleting of Ntuple filters - eth: wan: farsync: fix use-after-free bugs caused by unfinished tasklets - eth: xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - eth: gve: fix incorrect buffer cleanup for QPL - eth: team: avoid NETDEV_CHANGEMTU event when unregistering slave - eth: usb: validate USB endpoints Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmmgYU4SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkLBgQAINazHstJ0DoDkvmwXapRSN0Ffauyd46 oX6nfeWOT3BzZbAhZHtGgCSs4aULifJWMevtT7pq7a7PgZwMwfa47BugR1G/u5UE hCqalNjRTB/U2KmFk6eViKSacD4FvUIAyAMOotn1aEdRRAkBIJnIW/o/ZR9ZUkm0 5+UigO64aq57+FOc5EQdGjYDcTVdzW12iOZ8ZqwtSATdNd9aC+gn3voRomTEo+Fm kQinkFEPAy/YyHGmfpC/z87/RTgkYLpagmsT4ZvBJeNPrIRvFEibSpPNhuzTzg81 /BW5M8sJmm3XFiTiRp6Blv+0n6HIpKjAZMHn5c9hzX9cxPZQ24EjkXEex9ClaxLd OMef79rr1HBwqBTpIlK7xfLKCdT5Iex88s8HxXRB/Psqk9pVP469cSoK6cpyiGiP I+4WT0wn9ukTiu/yV2L2byVr1sanlu54P+UBYJpDwqq3lZ1ngWtkJ+SY369jhwAS FYIBmUSKhmWz3FEULaGpgPy4m9Fl/fzN8IFh2Buoc/Puq61HH7MAMjRty2ZSFTqj gbHrRhlkCRqubytgjsnCDPLoJF4ZYcXtpo/8ogG3641H1I+dN+DyGGVZ/ioswkks My1ds0rKqA3BHCmn+pN/qqkuopDCOB95dqOpgDqHG7GePrpa/FJ1guhxexsCd+nL Run2RcgDmd+d =HBOu -----END PGP SIGNATURE----- Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Eric Dumazet	858d2a4f67	tcp: fix potential race in tcp_v6_syn_recv_sock() Code in tcp_v6_syn_recv_sock() after the call to tcp_v4_syn_recv_sock() is done too late. After tcp_v4_syn_recv_sock(), the child socket is already visible from TCP ehash table and other cpus might use it. Since newinet->pinet6 is still pointing to the listener ipv6_pinfo bad things can happen as syzbot found. Move the problematic code in tcp_v6_mapped_child_init() and call this new helper from tcp_v4_syn_recv_sock() before the ehash insertion. This allows the removal of one tcp_sync_mss(), since tcp_v4_syn_recv_sock() will call it with the correct context. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+937b5bbb6a815b3e5d0b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69949275.050a0220.2eeac1.0145.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260217161205.2079883-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 14:02:19 -08:00
D. Wythe	df31a6b0a3	Revert "net/smc: Introduce TCP ULP support" This reverts commit `d7cd421da9`. As reported by Al Viro, the TCP ULP support for SMC is fundamentally broken. The implementation attempts to convert an active TCP socket into an SMC socket by modifying the underlying `struct file`, dentry, and inode in-place, which violates core VFS invariants that assume these structures are immutable for an open file, creating a risk of use after free errors and general system instability. Given the severity of this design flaw and the fact that cleaner alternatives (e.g., LD_PRELOAD, BPF) exist for legacy application transparency, the correct course of action is to remove this feature entirely. Fixes: `d7cd421da9` ("net/smc: Introduce TCP ULP support") Link: https://lore.kernel.org/netdev/Yus1SycZxcd+wHwz@ZenIV/ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260128055452.98251-1-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-30 19:21:51 -08:00
Geert Uytterhoeven	861111b698	net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT If CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n: net/smc/smc_hs_bpf.c: In function ‘bpf_smc_hs_ctrl_init’: include/linux/bpf.h:2068:50: error: statement with no effect [-Werror=unused-value] 2068 \| #define register_bpf_struct_ops(st_ops, type) ({ (void *)(st_ops); 0; }) \| ^~~~~~~~~~~~~~~~ net/smc/smc_hs_bpf.c:139:16: note: in expansion of macro ‘register_bpf_struct_ops’ 139 \| return register_bpf_struct_ops(&bpf_smc_hs_ctrl_ops, smc_hs_ctrl); \| ^~~~~~~~~~~~~~~~~~~~~~~ While this compile error is caused by a bug in <linux/bpf.h>, none of the code in net/smc/smc_hs_bpf.c becomes effective if CONFIG_BPF_JIT is not enabled. Hence add a dependency on BPF_JIT. While at it, add the missing newline at the end of the file. Fixes: `15f295f556` ("net/smc: bpf: Introduce generic hook for handshake flow") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/988c61e5fea280872d81b3640f1f34d0619cfbbf.1764843951.git.geert@linux-m68k.org	2025-12-04 11:07:18 -08:00
Heiko Carstens	c940be4c7c	net: Remove KMSG_COMPONENT macro The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel message catalog" from 2008 [1] which never made it upstream. The macro was added to s390 code to allow for an out-of-tree patch which used this to generate unique message ids. Also this out-of-tree patch doesn't exist anymore. The pattern of how the KMSG_COMPONENT macro is used can also be found at some non s390 specific code, for whatever reasons. Besides adding an indirection it is unused. Remove the macro in order to get rid of a pointless indirection. Replace all users with the string it defines. In all cases this leads to a simple replacement like this: - #define KMSG_COMPONENT "af_iucv" - #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt + #define pr_fmt(fmt) "af_iucv: " fmt [1] https://lwn.net/Articles/292650/ Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Sidraya Jayagond <sidraya@linux.ibm.com> Link: https://patch.msgid.link/20251126140705.1944278-1-hca@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-28 19:20:27 -08:00
Jakub Kicinski	c99ebb6132	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c `96a9178a29` ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") `61b7ade9ba` ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-13 12:35:38 -08:00
D. Wythe	ec33f2e5a2	net/smc: fix mismatch between CLC header and proposal The current CLC proposal message construction uses a mix of `ini->smc_type_v1/v2` and `pclc_base->hdr.typev1/v2` to decide whether to include optional extensions (IPv6 prefix extension for v1, and v2 extension). This leads to a critical inconsistency: when `smc_clc_prfx_set()` fails - for example, in IPv6-only environments with only link-local addresses, or when the local IP address and the outgoing interface’s network address are not in the same subnet. As a result, the proposal message is assembled using the stale `ini->smc_type_v1` value—causing the IPv6 prefix extension to be included even though the header indicates v1 is not supported. The peer then receives a malformed CLC proposal where the header type does not match the payload, and immediately resets the connection. The fix ensures consistency between the CLC header flags and the actual payload by synchronizing `ini->smc_type_v1` with `pclc_base->hdr.typev1` when prefix setup fails. Fixes: `8c3dca341a` ("net/smc: build and send V2 CLC proposal") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20251107024029.88753-1-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-10 17:52:09 -08:00
D. Wythe	15f295f556	net/smc: bpf: Introduce generic hook for handshake flow The introduction of IPPROTO_SMC enables eBPF programs to determine whether to use SMC based on the context of socket creation, such as network namespaces, PID and comm name, etc. As a subsequent enhancement, to introduce a new generic hook that allows decisions on whether to use SMC or not at runtime, including but not limited to local/remote IP address or ports. User can write their own implememtion via bpf_struct_ops now to choose whether to use SMC or not before TCP 3rd handshake to be comleted. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com	2025-11-10 11:19:41 -08:00
Kees Cook	85cb0757d7	net: Convert proto_ops connect() callbacks to use sockaddr_unsized Update all struct proto_ops connect() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:32 -08:00
Kees Cook	0e50474fa5	net: Convert proto_ops bind() callbacks to use sockaddr_unsized Update all struct proto_ops bind() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:32 -08:00
Halil Pasic	8f736087e5	net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by giving up and going the way of a TCP fallback. This was reasonable before the sizes of the allocations there were compile time constants and reasonably small. But now those are actually configurable. So instead of giving up, keep retrying with half of the requested size unless we dip below the old static sizes -- then give up! In terms of numbers that means we give up when it is certain that we at best would end up allocating less than 16 send WR buffers or less than 48 recv WR buffers. This is to avoid regressions due to having fewer buffers compared the static values of the past. Please note that SMC-R is supposed to be an optimisation over TCP, and falling back to TCP is superior to establishing an SMC connection that is going to perform worse. If the memory allocation fails (and we propagate -ENOMEM), we fall back to TCP. Preserve (modulo truncation) the ratio of send/recv WR buffer counts. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-3-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 13:31:43 +01:00
Halil Pasic	aef3cdb47b	net/smc: make wr buffer count configurable Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those get replaced with lgr->max_send_wr and lgr->max_recv_wr respective. Please note that although with the default sysctl values qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained but can not be assumed to be generally true any more. I see no downside to that, but my confidence level is rather modest. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 13:31:43 +01:00
Dust Li	f0773d0b41	smc: rename smc_find_ism_store_rc to reflect broader usage The function smc_find_ism_store_rc() is used to record the reason why a suitable device (either ISM or RDMA) could not be found. However, its name suggests it is ISM-specific, which is misleading. Rename it to better reflect its actual usage. No functional changes. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251023020012.69609-1-dust.li@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 18:48:03 -07:00
Wang Liang	f584239a9e	net/smc: fix general protection fault in __smc_diag_dump The syzbot report a crash: Oops: general protection fault, probably for non-canonical address 0xfbd5a5d5a0000003: 0000 [#1] SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xdead4ead00000018-0xdead4ead0000001f] CPU: 1 UID: 0 PID: 6949 Comm: syz.0.335 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:smc_diag_msg_common_fill net/smc/smc_diag.c:44 [inline] RIP: 0010:__smc_diag_dump.constprop.0+0x3ca/0x2550 net/smc/smc_diag.c:89 Call Trace: <TASK> smc_diag_dump_proto+0x26d/0x420 net/smc/smc_diag.c:217 smc_diag_dump+0x27/0x90 net/smc/smc_diag.c:234 netlink_dump+0x539/0xd30 net/netlink/af_netlink.c:2327 __netlink_dump_start+0x6d6/0x990 net/netlink/af_netlink.c:2442 netlink_dump_start include/linux/netlink.h:341 [inline] smc_diag_handler_dump+0x1f9/0x240 net/smc/smc_diag.c:251 __sock_diag_cmd net/core/sock_diag.c:249 [inline] sock_diag_rcv_msg+0x438/0x790 net/core/sock_diag.c:285 netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x5a7/0x870 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg net/socket.c:729 [inline] ____sys_sendmsg+0xa95/0xc70 net/socket.c:2614 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2668 __sys_sendmsg+0x16d/0x220 net/socket.c:2700 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x4e0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> The process like this: (CPU1) \| (CPU2) ---------------------------------\|------------------------------- inet_create() \| // init clcsock to NULL \| sk = sk_alloc() \| \| // unexpectedly change clcsock \| inet_init_csk_locks() \| \| // add sk to hash table \| smc_inet_init_sock() \| smc_sk_init() \| smc_hash_sk() \| \| // traverse the hash table \| smc_diag_dump_proto \| __smc_diag_dump() \| // visit wrong clcsock \| smc_diag_msg_common_fill() // alloc clcsock \| smc_create_clcsk \| sock_create_kern \| With CONFIG_DEBUG_LOCK_ALLOC=y, the smc->clcsock is unexpectedly changed in inet_init_csk_locks(). The INET_PROTOSW_ICSK flag is no need by smc, just remove it. After removing the INET_PROTOSW_ICSK flag, this patch alse revert commit `6fd27ea183` ("net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC") to avoid casting smc_sock to inet_connection_sock. Reported-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f775be4458668f7d220e Tested-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Fixes: `d25a92ccae` ("net/smc: Introduce IPPROTO_SMC") Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20251017024827.3137512-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 17:46:06 -07:00
Julian Ruess	a612dbe8d0	dibs: Move event handling to dibs layer Add defines for all event types and subtypes an ism device is known to produce as it can be helpful for debugging purposes. Introduces a generic 'struct dibs_event' and adopt ism device driver and smc-d client accordingly. Tolerate and ignore other type and subtype values to enable future device extensions. SMC-D and ISM are now independent. struct ism_dev can be moved to drivers/s390/net/ism.h. Note that in smc, the term 'ism' is still used. Future patches could replace that with 'dibs' or 'smc-d' as appropriate. Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	cc21191b58	dibs: Move data path to dibs layer Use struct dibs_dmb instead of struct smc_dmb and move the corresponding client tables to dibs_dev. Leave driver specific implementation details like sba in the device drivers. Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a single client, but a dibs device can have dmbs for more than one client. Trigger dibs clients via dibs_client_ops->handle_irq(), when data is received into a dmb. For dibs_loopback replace scheduling an smcd receive tasklet with calling dibs_client_ops->handle_irq(). For loopback devices attach_dmb(), detach_dmb() and move_data() need to access the dmb tables, so move those to dibs_dev_ops in this patch as well. Remove remaining definitions of smc_loopback as they are no longer required, now that everything is in dibs_loopback. Note that struct ism_client and struct ism_dev are still required in smc until a follow-on patch moves event handling to dibs. (Loopback does not use events). Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	719c3b67bb	dibs: Move query_remote_gid() to dibs_dev_ops Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback dibs_devices. And call it in smc dibs_client. Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	92a0f7bb08	dibs: Move vlan support to dibs_dev_ops It can be debated how much benefit definition of vlan ids for dibs devices brings, as the dmbs are accessible only by a single peer anyhow. But ism provides vlan support and smcd exploits it, so move it to dibs layer as an optional feature. smcd_loopback simply ignores all vlan settings, do the same in dibs_loopback. SMC-D and ISM have a method to use the invalid VLAN ID 1FFF (ISM_RESERVED_VLANID), to indicate that both communication peers support routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	05e68d8ded	dibs: Local gid for dibs devices Define a uuid_t GID attribute to identify a dibs device. SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that need to be sent via the SMC protocol. Because the smc code uses integers, network endianness and host endianness need to be considered. Avoid this in the dibs layer by using uuid_t byte arrays. Future patches could change SMC to use uuid_t. For now conversion helper functions are introduced. ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this: _________________________________________ \| 64 Bit ISM-vPCI GID \| 00000000_00000000 \| ----------------------------------------- If interpreted as UUID [1], this would be interpreted as the UIID variant, that is reserved for NCS backward compatibility. So it will not collide with UUIDs that were generated according to the standard. smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to dibs loopback. A temporary change to smc_lo_query_rgid() is required, that will be moved to dibs_loopback with a follow-on patch. Provide gid of a dibs device as sysfs read-only attribute. Link: https://datatracker.ietf.org/doc/html/rfc4122 [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Julian Ruess	8047373498	dibs: Create class dibs Create '/sys/class/dibs' to represent multiple kinds of dibs devices in sysfs. Show s390/ism devices as well as dibs_loopback devices. Show attribute fabric_id using dibs_ops.get_fabric_id(). This can help users understand which dibs devices are connected to the same fabric in different systems and which dibs devices are loopback devices (fabric_id 0xffff) Instead of using the same name as the pci device, give the ism devices their own readable names based on uid or fid from the HW definition. smc_loopback was never visible in sysfs. dibs_loopback is now represented as a virtual device. For the SMC feature "software defined pnet-id" either the ib device name or the PCI-ID (actually the parent device name) can be used for SMC-R entries. Mimic this behaviour for SMC-D, and check the parent device name as well. So device name or PCI-ID can be used for ism and device name can be used for dibs-loopback. Note that this: IB_DEVICE_NAME_MAX - 1 == smc_pnet_policy.[SMC_PNETID_IBNAME].len is the length of smcd_name. Future SW-pnetid cleanup patches to could use a meaningful define, but that would touch too much unrelated code here. Examples: --------- ism before: > ls /sys/bus/pci/devices/0000:00:00.0/0000:00:00.0 uevent ism now: > ls /sys/bus/pci/devices/0000:00:00.0/dibs/ism30 device -> ../../../0000:00:00.0/ fabric_id subsystem -> ../../../../../class/dibs/ uevent dibs loopback: > ls /sys/devices/virtual/dibs/lo/ fabric_id subsystem -> ../../../../class/dibs/ uevent dibs class: > ls -l /sys/class/dibs/ ism30 -> ../../devices/pci0000:00/0000:00:00.0/dibs/ism30/ lo -> ../../devices/virtual/dibs/lo/ For comparison: > ls -l /sys/class/net/ enc8410 -> ../../devices/qeth/0.0.8410/net/enc8410/ ens1693 -> ../../devices/pci0001:00/0001:00:00.0/net/ens1693/ lo -> ../../devices/virtual/net/lo/ Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-10-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Julian Ruess	845c334a01	dibs: Move struct device to dibs_dev Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev in smc_lo_dev_remove(). Replace smcd->ops->get_dev(smcd) by using dibs->dev directly. An alternative design would be to embed dibs_dev as a field in ism_dev and do the same for other dibs device driver specific structs. However that would have the disadvantage that each dibs device driver needs to allocate dibs_dev and each dibs device driver needs a different device release function. The advantage would be that ism_dev and other device driver specific structs would be covered by device reference counts. Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	69baaac936	dibs: Define dibs_client_ops and dibs_dev_ops Move the device add() and remove() functions from ism_client to dibs_client_ops and call add_dev()/del_dev() for ism devices and dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for the smc_dibs_client. This is the first step to handle ism and loopback devices alike (as dibs devices) in the smc dibs client. Define dibs_dev->ops and move smcd_ops->get_chid to dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for why this needs to be in the same patch as dibs_client_ops->add_dev(). The following changes contain intermediate steps, that will be obsoleted by follow-on patches, once more functionality has been moved to dibs: Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches will change SMC-D to directly use dibs_ops instead of smcd_ops. In smcd_register_dev() it is now necessary to identify a dibs_loopback device before smcd_dev and smcd_ops->get_chid() are available. So provide dibs_dev_ops->get_fabric_id() in this patch and evaluate it in smc_ism_is_loopback(). Call smc_loopback_init() in smcd_register_dev() and call smc_loopback_exit() in smcd_unregister_dev() to handle the functionality that is still in smc_loopback. Follow-on patches will move all smc_loopback code to dibs_loopback. In smcd_[un]register_dev() use only ism device name, this will be replaced by dibs device name by a follow-on patch. End of changes with intermediate parts. Allocate an smcd event workqueue for all dibs devices, although dibs_loopback does not generate events. Use kernel memory instead of devres memory for smcd_dev and smcd->conn. Since commit `a72178cfe8` ("net/smc: Fix dependency of SMC on ISM") an ism device and its driver can have a longer lifetime than the smc module, so smc should not rely on devres to free its resources [1]. It is now the responsibility of the smc client to free smcd and smcd->conn for all dibs devices, ism devices as well as loopback. Call client->ops->del_dev() for all existing dibs devices in dibs_unregister_client(), so all device related structures can be freed in the client. When dibs_unregister_client() is called in the context of smc_exit() or smc_core_reboot_event(), these functions have already called smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets going_away. This is done a second time in smcd_unregister_dev(). This is analogous to how smcr is handled in these functions, by calling first smc_lgrs_shutdown() and then smc_ib_unregister_client() > smc_ib_remove_dev(), so leave it that way. It may be worth investigating, whether smc_lgrs_shutdown() is still required or useful. Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback device exists or not. Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	d324a2ca3f	dibs: Register smc as dibs_client Formally register smc as dibs client. Functionality will be moved by follow-on patches from ism_client to dibs_client until eventually ism_client can be removed. As DIBS is only a shim layer without any dependencies, we can depend SMC on DIBS without adding indirect dependencies. A follow-on patch will remove dependency of SMC on ISM. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-5-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	a4997e17d1	net/smc: Decouple sf and attached send_buf in smc_loopback Before this patch there was the following assumption in smc_loopback.c>smc_lo_move_data(): sf (signalling flag) == 0 : data is already in an attached target dmb sf == 1 : data is not yet in the target dmb This is true for the 2 callers in smc client smcd_cdc_msg_send() : sf=1 smcd_tx_rdma_writes() : sf=0 but should not be a general assumption. Add a bool to struct smc_buf_desc to indicate whether an SMC-D sndbuf_desc is an attached buffer. Don't call move_data() for attached send_buffers, because it is not necessary. Move the data in smc_lo_move_data() if len != 0 and signal when requested. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250918110500.1731261-3-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	884eee8e43	net/smc: Remove error handling of unregister_dmb() smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and then unconditionally frees buf_desc. Remove the cleaning up of fields of buf_desc in smc_ism_unregister_dmb(), because it is not helpful. This removes the only usage of ISM_ERROR from the smc module. So move it to drivers/s390/net/ism.h. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Marco Crivellari	27ce71e1ce	net: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 17:40:30 -07:00
Marco Crivellari	5fd8bb982e	net: replace use of system_wq with system_percpu_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 17:40:30 -07:00
Kuniyuki Iwashima	0b0e4d51c6	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_vlan_by_tcpsk(). smc_vlan_by_tcpsk() fetches sk_dst_get(sk)->dev before RTNL and passes it to netdev_walk_all_lower_dev(), which is illegal. Also, smc_vlan_by_tcpsk_walk() does not require RTNL at all. Let's use __sk_dst_get(), dst_dev_rcu(), and netdev_walk_all_lower_dev_rcu(). Note that the returned value of smc_vlan_by_tcpsk() is not used in the caller. Fixes: `0cfdd8f92c` ("smc: connection and link group creation") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	235f81045c	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_clc_prfx_match(). smc_clc_prfx_match() is called from smc_listen_work() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Note that the returned value of smc_clc_prfx_match() is not used in the caller. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	935d783e5d	smc: Use __sk_dst_get() and dst_dev_rcu() in in smc_clc_prfx_set(). smc_clc_prfx_set() is called during connect() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dev_dst_rcu() under rcu_read_lock() after kernel_getsockname(). Note that the returned value of smc_clc_prfx_set() is not used in the caller. While at it, we change the 1st arg of smc_clc_prfx_set[46]_rcu() not to touch dst there. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Kuniyuki Iwashima	3d3466878a	smc: Fix use-after-free in __pnet_find_base_ndev(). syzbot reported use-after-free of net_device in __pnet_find_base_ndev(), which was called during connect(). [0] smc_pnet_find_ism_resource() fetches sk_dst_get(sk)->dev and passes down to pnet_find_base_ndev(), where RTNL is held. Then, UAF happened at __pnet_find_base_ndev() when the dev is first used. This means dev had already been freed before acquiring RTNL in pnet_find_base_ndev(). While dev is going away, dst->dev could be swapped with blackhole_netdev, and the dev's refcnt by dst will be released. We must hold dev's refcnt before calling smc_pnet_find_ism_resource(). Also, smc_pnet_find_roce_resource() has the same problem. Let's use __sk_dst_get() and dst_dev_rcu() in the two functions. [0]: BUG: KASAN: use-after-free in __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 Read of size 1 at addr ffff888036bac33a by task syz.0.3632/18609 CPU: 1 UID: 0 PID: 18609 Comm: syz.0.3632 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 pnet_find_base_ndev net/smc/smc_pnet.c:946 [inline] smc_pnet_find_ism_by_pnetid net/smc/smc_pnet.c:1103 [inline] smc_pnet_find_ism_resource+0xef/0x390 net/smc/smc_pnet.c:1154 smc_find_ism_device net/smc/af_smc.c:1030 [inline] smc_find_proposal_devices net/smc/af_smc.c:1115 [inline] __smc_connect+0x372/0x1890 net/smc/af_smc.c:1545 smc_connect+0x877/0xd90 net/smc/af_smc.c:1715 __sys_connect_file net/socket.c:2086 [inline] __sys_connect+0x313/0x440 net/socket.c:2105 __do_sys_connect net/socket.c:2111 [inline] __se_sys_connect net/socket.c:2108 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2108 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f47cbf8eba9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f47ccdb1038 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f47cc1d5fa0 RCX: 00007f47cbf8eba9 RDX: 0000000000000010 RSI: 0000200000000280 RDI: 000000000000000b RBP: 00007f47cc011e19 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f47cc1d6038 R14: 00007f47cc1d5fa0 R15: 00007ffc512f8aa8 </TASK> The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888036bacd00 pfn:0x36bac flags: 0xfff00000000000(node=0\|zone=1\|lastcpupid=0x7ff) raw: 00fff00000000000 ffffea0001243d08 ffff8880b863fdc0 0000000000000000 raw: ffff888036bacd00 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as freed page last allocated via order 2, migratetype Unmovable, gfp_mask 0x446dc0(GFP_KERNEL_ACCOUNT\|__GFP_ZERO\|__GFP_NOWARN\|__GFP_RETRY_MAYFAIL\|__GFP_COMP), pid 16741, tgid 16741 (syz-executor), ts 343313197788, free_ts 380670750466 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851 prep_new_page mm/page_alloc.c:1859 [inline] get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416 ___kmalloc_large_node+0x5f/0x1b0 mm/slub.c:4317 __kmalloc_large_node_noprof+0x18/0x90 mm/slub.c:4348 __do_kmalloc_node mm/slub.c:4364 [inline] __kvmalloc_node_noprof+0x6d/0x5f0 mm/slub.c:5067 alloc_netdev_mqs+0xa3/0x11b0 net/core/dev.c:11812 tun_set_iff+0x532/0xef0 drivers/net/tun.c:2775 __tun_chr_ioctl+0x788/0x1df0 drivers/net/tun.c:3085 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:598 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f page last free pid 18610 tgid 18608 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1395 [inline] __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895 free_large_kmalloc+0x13a/0x1f0 mm/slub.c:4820 device_release+0x99/0x1c0 drivers/base/core.c:-1 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 netdev_run_todo+0xd2e/0xea0 net/core/dev.c:11513 rtnl_unlock net/core/rtnetlink.c:157 [inline] rtnl_net_unlock include/linux/rtnetlink.h:135 [inline] rtnl_dellink+0x537/0x710 net/core/rtnetlink.c:3563 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6946 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x82f/0x9e0 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Memory state around the buggy address: ffff888036bac200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff888036bac300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888036bac380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac400: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Fixes: `0afff91c6f` ("net/smc: add pnetid support") Fixes: `1619f77058` ("net/smc: add pnetid support for SMC-D and ISM") Reported-by: syzbot+ea28e9d85be2f327b6c6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c237c7.050a0220.3c6139.0036.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Mahanta Jambigi	010fe36ad2	net/smc: Remove unused argument from 2 SMC functions The smc argument is not used in both smc_connect_ism_vlan_setup() & smc_connect_ism_vlan_cleanup(). Hence removing it. Signed-off-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250910063125.2112577-1-mjambigi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-14 11:55:31 -07:00
Jakub Kicinski	5ef04a7b06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h `c51613fa27` ("net: add sk->sk_drop_counters") `5d6b58c932` ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 13:33:00 -07:00
Alexandra Winter	c975e1dfcc	net/smc: Improve log message for devices w/o pnetid Explicitly state in the log message, when a device has no pnetid. "with pnetid" and "has pnetid" was misleading for devices without pnetid. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250901145842.1718373-3-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 08:57:04 -07:00
Mahanta Jambigi	cc282f73bc	net/smc: Remove validation of reserved bits in CLC Decline message Currently SMC code is validating the reserved bits while parsing the incoming CLC decline message & when this validation fails, its treated as a protocol error. As a result, the SMC connection is terminated instead of falling back to TCP. As per RFC7609[1] specs we shouldn't be validating the reserved bits that is part of CLC message. This patch fixes this issue. CLC Decline message format can viewed here[2]. [1] https://datatracker.ietf.org/doc/html/rfc7609#page-92 [2] https://datatracker.ietf.org/doc/html/rfc7609#page-105 Fixes: `8ade200c26` ("net/smc: add v2 format of CLC decline message") Signed-off-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250902082041.98996-1-mjambigi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 17:01:07 -07:00
James Flowers	d250f14f5f	net/smc: Replace use of strncpy on NUL-terminated string with strscpy strncpy is deprecated for use on NUL-terminated strings, as indicated in Documentation/process/deprecated.rst. strncpy NUL-pads the destination buffer and doesn't guarantee the destination buffer will be NUL terminated. Signed-off-by: James Flowers <bold.zone2373@fastmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250901030512.80099-1-bold.zone2373@fastmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-02 14:05:49 -07:00
Liu Jian	ba1e9421cf	net/smc: fix one NULL pointer dereference in smc_ib_is_sg_need_sync() BUG: kernel NULL pointer dereference, address: 00000000000002ec PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 28 UID: 0 PID: 343 Comm: kworker/28:1 Kdump: loaded Tainted: G OE 6.17.0-rc2+ #9 NONE Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 Workqueue: smc_hs_wq smc_listen_work [smc] RIP: 0010:smc_ib_is_sg_need_sync+0x9e/0xd0 [smc] ... Call Trace: <TASK> smcr_buf_map_link+0x211/0x2a0 [smc] __smc_buf_create+0x522/0x970 [smc] smc_buf_create+0x3a/0x110 [smc] smc_find_rdma_v2_device_serv+0x18f/0x240 [smc] ? smc_vlan_by_tcpsk+0x7e/0xe0 [smc] smc_listen_find_device+0x1dd/0x2b0 [smc] smc_listen_work+0x30f/0x580 [smc] process_one_work+0x18c/0x340 worker_thread+0x242/0x360 kthread+0xe7/0x220 ret_from_fork+0x13a/0x160 ret_from_fork_asm+0x1a/0x30 </TASK> If the software RoCE device is used, ibdev->dma_device is a null pointer. As a result, the problem occurs. Null pointer detection is added to prevent problems. Fixes: `0ef69e7884` ("net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu") Signed-off-by: Liu Jian <liujian56@huawei.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20250828124117.2622624-1-liujian56@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-02 10:51:16 +02:00
D. Wythe	d9cef55ed4	net/smc: fix UAF on smcsk after smc_listen_out() BPF CI testing report a UAF issue: [ 16.446633] BUG: kernel NULL pointer dereference, address: 000000000000003 0 [ 16.447134] #PF: supervisor read access in kernel mod e [ 16.447516] #PF: error_code(0x0000) - not-present pag e [ 16.447878] PGD 0 P4D 0 [ 16.448063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPT I [ 16.448409] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:1 Tainted: G OE 6.13.0-rc3-g89e8a75fda73-dirty #4 2 [ 16.449124] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODUL E [ 16.449502] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/201 4 [ 16.450201] Workqueue: smc_hs_wq smc_listen_wor k [ 16.450531] RIP: 0010:smc_listen_work+0xc02/0x159 0 [ 16.452158] RSP: 0018:ffffb5ab40053d98 EFLAGS: 0001024 6 [ 16.452526] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 000000000000030 0 [ 16.452994] RDX: 0000000000000280 RSI: 00003513840053f0 RDI: 000000000000000 0 [ 16.453492] RBP: ffffa097808e3800 R08: ffffa09782dba1e0 R09: 000000000000000 5 [ 16.453987] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0978274640 0 [ 16.454497] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa09782d4092 0 [ 16.454996] FS: 0000000000000000(0000) GS:ffffa097bbc00000(0000) knlGS:000000000000000 0 [ 16.455557] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003 3 [ 16.455961] CR2: 0000000000000030 CR3: 0000000102788004 CR4: 0000000000770ef 0 [ 16.456459] PKRU: 5555555 4 [ 16.456654] Call Trace : [ 16.456832] <TASK > [ 16.456989] ? __die+0x23/0x7 0 [ 16.457215] ? page_fault_oops+0x180/0x4c 0 [ 16.457508] ? __lock_acquire+0x3e6/0x249 0 [ 16.457801] ? exc_page_fault+0x68/0x20 0 [ 16.458080] ? asm_exc_page_fault+0x26/0x3 0 [ 16.458389] ? smc_listen_work+0xc02/0x159 0 [ 16.458689] ? smc_listen_work+0xc02/0x159 0 [ 16.458987] ? lock_is_held_type+0x8f/0x10 0 [ 16.459284] process_one_work+0x1ea/0x6d 0 [ 16.459570] worker_thread+0x1c3/0x38 0 [ 16.459839] ? __pfx_worker_thread+0x10/0x1 0 [ 16.460144] kthread+0xe0/0x11 0 [ 16.460372] ? __pfx_kthread+0x10/0x1 0 [ 16.460640] ret_from_fork+0x31/0x5 0 [ 16.460896] ? __pfx_kthread+0x10/0x1 0 [ 16.461166] ret_from_fork_asm+0x1a/0x3 0 [ 16.461453] </TASK > [ 16.461616] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE) ] [ 16.462134] CR2: 000000000000003 0 [ 16.462380] ---[ end trace 0000000000000000 ]--- [ 16.462710] RIP: 0010:smc_listen_work+0xc02/0x1590 The direct cause of this issue is that after smc_listen_out_connected(), newclcsock->sk may be NULL since it will releases the smcsk. Therefore, if the application closes the socket immediately after accept, newclcsock->sk can be NULL. A possible execution order could be as follows: smc_listen_work \| userspace ----------------------------------------------------------------- lock_sock(sk) \| smc_listen_out_connected() \| \| \- smc_listen_out \| \| \| \- release_sock \| \| \|- sk->sk_data_ready() \| \| fd = accept(); \| close(fd); \| \- socket->sk = NULL; /* newclcsock->sk is NULL now */ SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk)) Since smc_listen_out_connected() will not fail, simply swapping the order of the code can easily fix this issue. Fixes: `3b2dec2603` ("net/smc: restructure client and server code in af_smc") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/20250818054618.41615-1-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:27:16 -07:00
Jakub Kicinski	af2d6148d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml `880d43ca9a` ("netlink: specs: clean up spaces in brackets") `af52020fc5` ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c `a44312d58e` ("net: phy: Don't register LEDs for genphy") `f0f2b992d8` ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c `5fde0fcbd7` ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") `ea045a0de3` ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c `ae3264a25a` ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") `a8594c956c` ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-17 11:00:33 -07:00
Kuniyuki Iwashima	60ada4fe64	smc: Fix various oops due to inet_sock type confusion. syzbot reported weird splats [0][1] in cipso_v4_sock_setattr() while freeing inet_sk(sk)->inet_opt. The address was freed multiple times even though it was read-only memory. cipso_v4_sock_setattr() did nothing wrong, and the root cause was type confusion. The cited commit made it possible to create smc_sock as an INET socket. The issue is that struct smc_sock does not have struct inet_sock as the first member but hijacks AF_INET and AF_INET6 sk_family, which confuses various places. In this case, inet_sock.inet_opt was actually smc_sock.clcsk_data_ready(), which is an address of a function in the text segment. $ pahole -C inet_sock vmlinux struct inet_sock { ... struct ip_options_rcu * inet_opt; /* 784 8 / $ pahole -C smc_sock vmlinux struct smc_sock { ... void (clcsk_data_ready)(struct sock ); / 784 8 */ The same issue for another field was reported before. [2][3] At that time, an ugly hack was suggested [4], but it makes both INET and SMC code error-prone and hard to change. Also, yet another variant was fixed by a hacky commit `98d4435efc` ("net/smc: prevent NULL pointer dereference in txopt_get"). Instead of papering over the root cause by such hacks, we should not allow non-INET socket to reuse the INET infra. Let's add inet_sock as the first member of smc_sock. [0]: kvfree_call_rcu(): Double-freed call. rcu_head 000000006921da73 WARNING: CPU: 0 PID: 6718 at mm/slab_common.c:1956 kvfree_call_rcu+0x94/0x3f0 mm/slab_common.c:1955 Modules linked in: CPU: 0 UID: 0 PID: 6718 Comm: syz.0.17 Tainted: G W 6.16.0-rc4-syzkaller-g7482bb149b9f #0 PREEMPT Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : kvfree_call_rcu+0x94/0x3f0 mm/slab_common.c:1955 lr : kvfree_call_rcu+0x94/0x3f0 mm/slab_common.c:1955 sp : ffff8000a03a7730 x29: ffff8000a03a7730 x28: 00000000fffffff5 x27: 1fffe000184823d3 x26: dfff800000000000 x25: ffff0000c2411e9e x24: ffff0000dd88da00 x23: ffff8000891ac9a0 x22: 00000000ffffffea x21: ffff8000891ac9a0 x20: ffff8000891ac9a0 x19: ffff80008afc2480 x18: 00000000ffffffff x17: 0000000000000000 x16: ffff80008ae642c8 x15: ffff700011ede14c x14: 1ffff00011ede14c x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011ede14c x10: 0000000000ff0100 x9 : 5fa3c1ffaf0ff000 x8 : 5fa3c1ffaf0ff000 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a03a7078 x4 : ffff80008f766c20 x3 : ffff80008054d360 x2 : 0000000000000000 x1 : 0000000000000201 x0 : 0000000000000000 Call trace: kvfree_call_rcu+0x94/0x3f0 mm/slab_common.c:1955 (P) cipso_v4_sock_setattr+0x2f0/0x3f4 net/ipv4/cipso_ipv4.c:1914 netlbl_sock_setattr+0x240/0x334 net/netlabel/netlabel_kapi.c:1000 smack_netlbl_add+0xa8/0x158 security/smack/smack_lsm.c:2581 smack_inode_setsecurity+0x378/0x430 security/smack/smack_lsm.c:2912 security_inode_setsecurity+0x118/0x3c0 security/security.c:2706 __vfs_setxattr_noperm+0x174/0x5c4 fs/xattr.c:251 __vfs_setxattr_locked+0x1ec/0x218 fs/xattr.c:295 vfs_setxattr+0x158/0x2ac fs/xattr.c:321 do_setxattr fs/xattr.c:636 [inline] file_setxattr+0x1b8/0x294 fs/xattr.c:646 path_setxattrat+0x2ac/0x320 fs/xattr.c:711 __do_sys_fsetxattr fs/xattr.c:761 [inline] __se_sys_fsetxattr fs/xattr.c:758 [inline] __arm64_sys_fsetxattr+0xc0/0xdc fs/xattr.c:758 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x180 arch/arm64/kernel/entry-common.c:879 el0t_64_sync_handler+0x84/0x12c arch/arm64/kernel/entry-common.c:898 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 [1]: Unable to handle kernel write to read-only memory at virtual address ffff8000891ac9a8 KASAN: probably user-memory-access in range [0x0000000448d64d40-0x0000000448d64d47] Mem abort info: ESR = 0x000000009600004e EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x0e: level 2 permission fault Data abort info: ISV = 0, ISS = 0x0000004e, ISS2 = 0x00000000 CM = 0, WnR = 1, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000207144000 [ffff8000891ac9a8] pgd=0000000000000000, p4d=100000020f950003, pud=100000020f951003, pmd=0040000201000781 Internal error: Oops: 000000009600004e [#1] SMP Modules linked in: CPU: 0 UID: 0 PID: 6946 Comm: syz.0.69 Not tainted 6.16.0-rc4-syzkaller-g7482bb149b9f #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : kvfree_call_rcu+0x31c/0x3f0 mm/slab_common.c:1971 lr : add_ptr_to_bulk_krc_lock mm/slab_common.c:1838 [inline] lr : kvfree_call_rcu+0xfc/0x3f0 mm/slab_common.c:1963 sp : ffff8000a28a7730 x29: ffff8000a28a7730 x28: 00000000fffffff5 x27: 1fffe00018b09bb3 x26: 0000000000000001 x25: ffff80008f66e000 x24: ffff00019beaf498 x23: ffff00019beaf4c0 x22: 0000000000000000 x21: ffff8000891ac9a0 x20: ffff8000891ac9a0 x19: 0000000000000000 x18: 00000000ffffffff x17: ffff800093363000 x16: ffff80008052c6e4 x15: ffff700014514ecc x14: 1ffff00014514ecc x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700014514ecc x10: 0000000000000001 x9 : 0000000000000001 x8 : ffff00019beaf7b4 x7 : ffff800080a94154 x6 : 0000000000000000 x5 : ffff8000935efa60 x4 : 0000000000000008 x3 : ffff80008052c7fc x2 : 0000000000000001 x1 : ffff8000891ac9a0 x0 : 0000000000000001 Call trace: kvfree_call_rcu+0x31c/0x3f0 mm/slab_common.c:1967 (P) cipso_v4_sock_setattr+0x2f0/0x3f4 net/ipv4/cipso_ipv4.c:1914 netlbl_sock_setattr+0x240/0x334 net/netlabel/netlabel_kapi.c:1000 smack_netlbl_add+0xa8/0x158 security/smack/smack_lsm.c:2581 smack_inode_setsecurity+0x378/0x430 security/smack/smack_lsm.c:2912 security_inode_setsecurity+0x118/0x3c0 security/security.c:2706 __vfs_setxattr_noperm+0x174/0x5c4 fs/xattr.c:251 __vfs_setxattr_locked+0x1ec/0x218 fs/xattr.c:295 vfs_setxattr+0x158/0x2ac fs/xattr.c:321 do_setxattr fs/xattr.c:636 [inline] file_setxattr+0x1b8/0x294 fs/xattr.c:646 path_setxattrat+0x2ac/0x320 fs/xattr.c:711 __do_sys_fsetxattr fs/xattr.c:761 [inline] __se_sys_fsetxattr fs/xattr.c:758 [inline] __arm64_sys_fsetxattr+0xc0/0xdc fs/xattr.c:758 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x180 arch/arm64/kernel/entry-common.c:879 el0t_64_sync_handler+0x84/0x12c arch/arm64/kernel/entry-common.c:898 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: aa1f03e2 52800023 97ee1e8d b4000195 (f90006b4) Fixes: `d25a92ccae` ("net/smc: Introduce IPPROTO_SMC") Reported-by: syzbot+40bf00346c3fe40f90f2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/686d9b50.050a0220.1ffab7.0020.GAE@google.com/ Tested-by: syzbot+40bf00346c3fe40f90f2@syzkaller.appspotmail.com Reported-by: syzbot+f22031fad6cbe52c70e7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/686da0f3.050a0220.1ffab7.0022.GAE@google.com/ Reported-by: syzbot+271fed3ed6f24600c364@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=271fed3ed6f24600c364 # [2] Link: https://lore.kernel.org/netdev/99f284be-bf1d-4bc4-a629-77b268522fff@huawei.com/ # [3] Link: https://lore.kernel.org/netdev/20250331081003.1503211-1-wangliang74@huawei.com/ # [4] Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20250711060808.2977529-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-14 17:12:42 -07:00
Easwar Hariharan	4814f9110e	net/smc: convert timeouts to secs_to_jiffies() Commit `b35108a51c` ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-09 19:25:01 -07:00
Eric Dumazet	935b67675a	net: make sk->sk_rcvtimeo lockless Followup of commit `285975dd67` ("net: annotate data-races around sk->sk_{rcv\|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-23 17:05:12 -07:00
Eric Dumazet	3169e36ae1	net: make sk->sk_sndtimeo lockless Followup of commit `285975dd67` ("net: annotate data-races around sk->sk_{rcv\|snd}timeo"). Remove lock_sock()/release_sock() from sock_set_sndtimeo(), and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-23 17:05:11 -07:00
Eric Dumazet	c51da3f7a1	net: remove sock_i_uid() Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-23 17:04:03 -07:00

1 2 3 4 5 ...

919 Commits